Introduction
Earlier versions of Pega Platform™ adopted certain third-party software products, including Hazelcast, as embedded services to deliver client data and stabilize the computing resources of Pega Platform applications. Beginning with Pega Platform version 8.8, the software became fully microservice-oriented. Provisioning third-party solutions as services separate from Pega Platform has several key benefits, including greater security, easier maintenance, improved agility and scalability, better performance and stability, and a modernized deployment. For details of these benefits, see Externalization of services in your deployment.
Pega acknowledges, however, that you may face challenges with externalization. For Hazelcast, Pega will continue to offer support for embedded Hazelcast in 24.2. This extension applies to embedded Hazelcast only; the other third-party software products (Cassandra, Kafka, and Elasticsearch) must still be externalized for all clients using release 24.2.
There may be exceptions to the support for embedded Hazelcast. Clients have reported a number of issues to Pega's support organization whose root cause, after investigation, turned out to be the embedded Hazelcast software. Those issues were resolved by externalizing Hazelcast.
If you are encountering issues like those described below, Pega strongly recommends that you externalize Hazelcast to improve your operational stability, even though embedded Hazelcast is still supported.
Symptoms
A number of different symptoms can point to a problem with embedded Hazelcast. The following descriptions cover some common situations, but there can be others.
1. Hazelcast partition-lost exceptions
Errors:
- Node (IP: [11.111.1.111]:1111) is not responding. Check the logs on [11.111.1.111]:1111 for more information
- partition was lost: com.hazelcast.partition.PartitionLostEvent{partitionId=178, lostBackupCount=5, eventSource=[11.111.1.111]:1111}
- com.hazelcast.core.OperationTimeoutException: QueryPartitionOperation got rejected before execution due to not starting within the operation-call-timeout of: 60000 ms
2. “Split-brain” or cluster fracturing issue
In highly available clustered environments, you might notice that certain nodes in your cluster cannot see one another. A node may appear as if it is the only active node, and the Cluster Management page does not show all the nodes in your Pega deployment. Upon inspection, you determine that some nodes have formed a separate cluster. This error condition is sometimes referred to as split-brain syndrome or cluster fracturing.
For details on this situation, see Split-Brain syndrome and Cluster Fracturing.
Note that several of the symptoms described below can be caused by “split brain.”
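If you suspect cluster fracturing, logging Hazelcast membership changes on each node can make regroupings visible in the node logs. The following Java sketch uses the standard Hazelcast 3.x MembershipListener API; it is illustrative only, is not part of Pega Platform, and assumes you already have a reference to the embedded HazelcastInstance.

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MemberAttributeEvent;
import com.hazelcast.core.MembershipEvent;
import com.hazelcast.core.MembershipListener;

// Illustrative listener that logs every membership change so that
// cluster regroupings ("split brain") stand out in the logs.
public class ClusterChangeLogger implements MembershipListener {

    @Override
    public void memberAdded(MembershipEvent event) {
        System.out.println("Member joined: " + event.getMember()
                + ", cluster size now " + event.getMembers().size());
    }

    @Override
    public void memberRemoved(MembershipEvent event) {
        System.out.println("Member left: " + event.getMember()
                + ", cluster size now " + event.getMembers().size());
    }

    @Override
    public void memberAttributeChanged(MemberAttributeEvent event) {
        // Not relevant for split-brain diagnosis; required by the 3.x interface.
    }

    // 'instance' is assumed to be a reference to the embedded HazelcastInstance.
    public static void register(HazelcastInstance instance) {
        instance.getCluster().addMembershipListener(new ClusterChangeLogger());
    }
}

Each join or leave event is logged with the resulting cluster size, so frequent regroupings become easy to correlate with the symptoms below.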
3. Admin Studio showing inconsistent status for nodes, listeners, agents, or other general information
A node appears to be healthy and users can log in to it directly; however, the node does not appear on the Admin Studio (Cluster Management) page. This type of inconsistent information can also occur for listeners, agents, and other background processes that are normally visible on the Admin Studio page.
In a Pega deployment that includes embedded Hazelcast, the same JVM hosts both Pega Platform and the Hazelcast execution engine. This means that if the Pega Platform application consumes all the system memory or CPU (for example, a badly designed report that tries to display all two million rows from a database table), Hazelcast is affected by the same issue.
If the JVM is busy with Pega work, Hazelcast "heartbeats" are missed, which triggers a cluster regrouping, and the node can disappear from the Admin Studio page. Whenever this occurs and a node is no longer part of the cluster, exceptions occur. Pega relies on Hazelcast clustering technology, and if nodes frequently leave or join the cluster, the stability of the entire cluster is disrupted. (A heartbeat-tuning sketch follows the error examples below.)
Errors:
1. Could not connect to: /123.000.0.15:5701. Reason: SocketTimeoutException[null]
2023-08-30 23:54:22,306 [68c.cached.thread-11] [ ] [ ] (cp.TcpIpConnectionErrorHandler) WARN - [123.000.15.248]:5701 [bf3b51f9c00147c8a548e86931a7f68c] [3.12.10] Removing connection to endpoint [123.000.0.15]:5701 Cause => java.net.SocketTimeoutException {null}, Error-Count: 35
2023-08-30 23:54:22,406 [68c.cached.thread-14] [ ] [ ] (nio.tcp.TcpIpConnector) INFO - [123.000.15.248]:5701 [bf3b51f9c00147c8a548e86931a7f68c] [3.12.10] Connecting to /123.000.0.15:5701, timeout: 10000, bind-any: true
2. java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_372]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_372]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_372]
at sun.nio.ch.IOUtil.read(IOUtil.java:197) ~[?:1.8.0_372]
3. Could not connect to: /123.000.15.247:5701. Reason: SocketException[Connection refused to address /123.000.15.247:5701]
4. Unable to fetch running listeners from nodes [util-i-03e4b00000a8566e4, util-i-0c7cb5e61b0000fb8]
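As a stopgap, the Hazelcast heartbeat tolerance can be raised so that a briefly saturated JVM is not immediately removed from the cluster. The sketch below sets the standard Hazelcast property hazelcast.max.no.heartbeat.seconds programmatically on a member Config. This is illustrative only: how, and whether, such a property can be applied to your Pega deployment depends on your configuration, and raising the timeout only masks the underlying resource contention.

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HeartbeatTuningExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Tolerate up to 300 seconds without a heartbeat before a member
        // is considered unreachable and dropped from the cluster.
        config.setProperty("hazelcast.max.no.heartbeat.seconds", "300");

        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
        System.out.println("Cluster members: " + member.getCluster().getMembers());
    }
}

A longer tolerance reduces spurious regroupings during short CPU or memory spikes, but it also delays detection of genuinely failed nodes, so it is a trade-off rather than a fix.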
Solution
First, it is important to be on the latest version of Pega Platform.
Workarounds exist for all of these issues; usually a full cluster restart is recommended. However, this is not a permanent solution, as the problems can recur.
The permanent solution is to externalize Hazelcast by switching to a client-server topology. This allows Hazelcast to run in a separate JVM, which is much more stable, as the sketch below illustrates. For details on the externalization process, see External Hazelcast in your Deployment.
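For reference, here is what a client-server topology looks like in plain Hazelcast terms: the application connects as a lightweight client to Hazelcast members that run in their own JVMs. The sketch uses the standard Hazelcast Java client API with a placeholder server address; in practice, Pega performs the externalization through its documented configuration, not through hand-written client code.

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;

public class ExternalHazelcastClientExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = new ClientConfig();

        // Placeholder address of an externally managed Hazelcast member;
        // the client JVM stays separate from the server JVMs.
        clientConfig.getNetworkConfig().addAddress("hazelcast-server.example.com:5701");

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);

        // The client uses the same data-structure APIs as an embedded member.
        client.getMap("exampleMap").put("key", "value");
    }
}

Because the cluster members run in separate JVMs, a memory or CPU problem in the application JVM no longer destabilizes the Hazelcast cluster, which is the stability benefit that externalization provides.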