Applies to Pega Platform™ versions 7.3 through 8.3.1
For Pega Platform versions later than 8.3.1, see the following prerequisite articles: Third-party externalized services FAQs and External Hazelcast in your deployment.
This document is the first in a series that includes the following companion documents:
Split-Brain Syndrome and cluster fracturing FAQs
Troubleshooting Hazelcast cluster management
Starting with Pega Platform™ version 7.1.7, Hazelcast was offered to improve the performance of internode communication. In Pega 7.1.9, the near-instantaneous System Pulse feature replaced the older and much slower distribution of rule and data updates through database tables. Since then, many features have been rewritten to use the Hazelcast-backed distributed operations service.
Since Pega 7.1.7, many issues have been reported and many questions raised about Hazelcast and its notoriously talkative logging. Most of these issues can be prevented by applying best practices. Other issues can be resolved by understanding how to troubleshoot them.
This document focuses on Pega 7.3 and later releases of the Pega Platform. For Pega 7.1.9 through Pega 7.2.2, stability hotfixes are available that upgrade these releases to newer versions of Hazelcast.
Hazelcast Editions supported
Best practices
Port range
Node ID
PR_SYS_STATUSNODES
Network Address Translation (NAT)
Network Interface Controller (NIC)
Number of cores
Graceful shutdown
Concepts and terminology
Split-Brain Syndrome and Cluster Fracturing
Master node
Hazelcast Interceptor
Clock Drift
Settings
Common cluster settings
Clock Drift settings
Hazelcast Internet Control Message Protocol (ICMP) settings
Encryption settings
Security settings
Related content
Hazelcast Editions supported
Pega supports the Hazelcast Editions shown in the table below.
Hotfixes are available or planned for the following releases:
Pega 7.4: HFix-46618 (Hazelcast 3.10 EE Perpetual License) and HFix-48345 (Alerts)
Pega 7.3.1: HFix-46681 (Hazelcast 3.10 EE Perpetual License)
Pega 7.3: HFix-46682 (Hazelcast 3.10 EE Perpetual License)
Pega 7.2.2: HFix-47749 (Hazelcast 3.10 EE Perpetual License)
Pega Platform Release | Hazelcast Edition |
---|---|
Pega 7.2.2, 7.2.1, 7.2 | 3.4.1 Community Edition (CE) |
Pega 7.4, 7.3.1, 7.3 | 3.8 Community Edition (CE) |
Pega 8.1 | 3.10 Enterprise Edition (EE) |
Pega 8.2 | 3.10.4 Enterprise Edition (EE) |
Pega 8.3 | 3.11 Enterprise Edition (EE) |
Best practices
For successful cluster management, follow the guidelines in this section:
- Port range
- Node ID
- PR_SYS_STATUSNODES
- Network Address Translation (NAT)
- Network Interface Controller (NIC)
- Number of cores
- Graceful shutdown
Port range
By default, a Pega node uses port range 5701 to 5800 for Hazelcast. In an environment where a different range is required, use the prconfig.xml property cluster/hazelcast/ports to set the range.
prconfig.xml example
<env name="cluster/hazelcast/ports" value="5701-5750" />
DSS example
Setting Purpose: prconfig/cluster/hazelcast/ports/default
Value: 5701-5750
Owning Ruleset: Pega-Engine
When to set the port range
Set the port range when multiple environments run on the same host (for example, QA and DEV) and their ports would otherwise conflict, or when the default ports are already in use or blocked for any reason.
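As a sketch of the QA and DEV case, each environment could be given its own non-overlapping slice of the default range. The split shown here (5701-5750 and 5751-5800) is an assumption chosen for illustration, not a recommended value.

```xml
<!-- prconfig.xml of the DEV environment (hypothetical split of the default range) -->
<env name="cluster/hazelcast/ports" value="5701-5750" />

<!-- prconfig.xml of the QA environment on the same host -->
<env name="cluster/hazelcast/ports" value="5751-5800" />
```

Because the two ranges do not overlap, nodes from one environment should not try to bind a port that the other environment is entitled to use.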
Node ID
Each Pega node is identified with a Node ID that must be unique in the cluster. If the same Node ID is already used in the cluster, the node fails to start.
Use this setting to more easily identify nodes and their purposes at a glance. A Node ID is generated by default based on certain system setting values. However, as a best practice, set the Node ID manually to reflect the node’s intended purpose.
To set the Node ID, use the JVM argument identification.nodeid as shown in the following examples.
-Didentification.nodeid=SearchNode1
-Didentification.nodeid=BackgroundProcessing1
PR_SYS_STATUSNODES
Pega nodes are registered into the table pr_sys_statusnodes at startup time. This table holds information such as the node IP address, node name, and other information.
When a node joins the cluster, a list of cluster-member candidates is loaded from the pr_sys_statusnodes table. The node then tries to establish a connection with the candidates to form a cluster.
Network Address Translation (NAT)
If Pega nodes are running behind a network address translation (NAT), they might not see each other. To ensure communication among the nodes, the system administrator should set the public address to the defined address on NAT. This configuration is mainly used when running in private VMs or Docker containers. To set the public address, use the following prconfig.xml setting:
identification/cluster/public/address
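A minimal sketch of this setting, written in the same format as the other prconfig.xml examples in this document. The address 203.0.113.10 is a placeholder from a documentation-only address block; substitute the address that your NAT exposes for the node.

```xml
<!-- Public address that other nodes use to reach this node through NAT.
     203.0.113.10 is a placeholder, not a real or recommended address. -->
<env name="identification/cluster/public/address" value="203.0.113.10" />
```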
Network Interface Controller (NIC)
If you have multiple Network Interface Controllers (NICs) in your clustered environment, use the cluster/hazelcast/interface setting to specify the IP address that you want the node to communicate on. This forces Hazelcast to refer to the correct NIC. Avoid the issue described in the Example problem scenario.
The setting should be one IP address.
prconfig examples
<env name="cluster/hazelcast/interface" value="10.3.10.4"/>
DSS examples
This note was offered by Yas Ito as a Comment on this Support Document.
Setting Purpose: prconfig/cluster/hazelcast/interface
Value: 10.3.10.4
Owning Ruleset: Pega-Engine
Example problem scenario
According to the Hazelcast documentation, Other network configurations, you can specify an IP address range by using a wildcard (*) or a hyphenated range in the last octet of the IP address, for example, 192.168.1.* or 192.168.1.100-110.
However, the value *.*.*.* is not supported in cluster/hazelcast/interface. With this value, the nodes of the cluster are not able to pick up the correct IP address.
Solution: Use wildcards only in the trailing octets of an IP address, for example, 192.168.*.*.
Number of cores
Hazelcast recommends having at least 8 CPU cores. Having a low number of cores can cause instability in the cluster because threads might start blocking each other. The number of Hazelcast threads is printed in the log, as shown in the following sample:
2019-04-11 03:20:24,829 [ ip-10-123-2-41] (tor.impl.OperationExecutorImpl) INFO -[10.123.2.41]:5701 [49d9b0e8c5fa8b21c4ef8d490df72708] [3.10] Starting 2 partition threads and 3 generic threads (1 dedicated for priority tasks)
The line above shows that too few threads were started. For additional guidance, see the Hazelcast IMDG 3.11 Deployment and Operations Guide, the section Basic Optimization Recommendations, which includes the following guidelines:
- 8 cores per Hazelcast server instance
- Minimum of 8 GB RAM per Hazelcast member (if not using the High-Density Memory Store)
- Dedicated NIC per Hazelcast member
- Linux—any distribution
- All member nodes should run within the same subnet
- All member nodes should be attached to the same network switch
Graceful shutdown
Gracefully shut down Pega nodes to avoid losing Hazelcast data partitions. During a graceful shutdown, the data from the node that is shutting down is automatically migrated to the other nodes.
Do not use the kill -9 command. This command stops all processes immediately, with the following consequences:
- No clean shutdown of socket connections
- No cleanup of temporary files
- No time to inform sub-processes that the node is going away
- No time for the node to reset its terminal characteristics
- Stops processes that are running on the node even if those processes are performing work; no clean exit occurs. Processing stops in mid-stream.
Concepts and terminology
This article and its companion articles assume that you have read and understand the Pega Help topics under Managing your system, particularly the topics for Node configuration > Multi-node systems > Cluster deployment and High availability > Configuring nodes for high availability > Cluster management.
Split-Brain Syndrome and Cluster Fracturing
Several scenarios can lead to nodes in the cluster being unaware of one another, causing the cluster to split into several smaller clusters of nodes instead of one large one.
To understand Split-Brain Syndrome and how to prevent it, detect it, and recover from it, see the related support document, Split-Brain Syndrome and cluster fracturing FAQs.
Master node
Hazelcast does not have a centralized master node as many other distributed operation technologies have. However, it does maintain an implicit master node whose responsibility is to keep the other nodes up to date with the latest membership information. The first node to start, that is, the oldest node, is always considered the master node. When the master node leaves the cluster, the remaining nodes begin a mastership process to nominate a new master node. If a Split-Brain scenario develops, two master nodes will be present. When two fractured clusters merge, the master node of the cluster with fewer nodes yields to (merges into) the master node of the larger cluster.
Hazelcast Interceptor
Because Hazelcast behaves in a fail-fast manner, external traffic from other sources can make Hazelcast unstable. One example is a DDoS attack on an inbound Hazelcast port. Another is a security tool that attempts to breach the port; the resulting flood of traffic causes poor performance. In these cases, you can use the Hazelcast interceptor to deny-list IP addresses that Hazelcast should ignore. This helps Hazelcast filter traffic before attempting to consume it, which improves communication performance in the event of third-party traffic to its inbound port. For more information, see Security settings.
Clock Drift
Hazelcast operates outside of ‘time’, but Pega application operations can be severely impacted if the time of each node is not aligned. When clock times begin to drift, an alert is generated in the logs and through PDC. Ensure that systems are running clock synchronization software such as Network Time Protocol (NTP). In addition, Hazelcast might detect delays in traffic. Again, although Hazelcast operates outside of ‘time’, it does pay attention to the time it takes for traffic to propagate between nodes. If Hazelcast detects that traffic is taking abnormally long, it sends warning messages to the PegaCluster logs.
For example, Hazelcast reports inter-node traffic delays caused by system-wide processing such as Java heap garbage collection (GC). With larger Java heaps, garbage collection might pause your application for tens of seconds (even minutes for very large heaps), badly affecting application performance and response times.
Settings
Some settings in the Pega configuration file (prconfig.xml) relate directly to Hazelcast. The following sections describe the most important of these settings:
- Common cluster settings
- Clock Drift settings
- Hazelcast Internet Control Message Protocol (ICMP) settings
- Encryption settings
- Security settings
Common cluster settings
The following settings are frequently used for cluster management with Hazelcast.
Introduced in this release | Setting name | Prconfig value | Description | Default value or values | Example value or values |
---|---|---|---|---|---|
7.3.0 | Cluster Name | identification/cluster/name | The name for the cluster of nodes. Nodes will only join other nodes that share the same cluster name. | PRPC | PRPC |
7.3.0 | Cluster Protocol | identification/cluster/protocol | The operating protocol for the cluster | Hazelcast | Hazelcast |
7.3.0 | Cluster Members | initialization/cluster/members | A static list of cluster member IP addresses, separated by commas | Not Applicable | <IP 1>, <IP 2>, …. |
7.3.0 | Cluster Public Address | identification/cluster/public/address | The public IP address for the cluster | Not Applicable | <n.n.n.n> |
7.4.0 | Hazelcast Outbound Ports | cluster/hazelcast/outboundPortRange | The configured range of outbound ephemeral ports of Hazelcast | Values within the valid range from 5801 to 5900 | 5801, 5802, 5803-5810 |
7.3.0 | Hazelcast Interface | cluster/hazelcast/interface | A list of valid network interfaces for Hazelcast | Not Applicable | <n.n.n.n> |
7.3.0 | Cluster Ports | initialization/cluster/ports | A range of inbound ports (Hazelcast selects one for use.) | Values within the valid range from 5701 to 5800 | 5701, 5702, 5703-5710 |
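The following fragment sketches how several of these settings might appear together in prconfig.xml, using the same env format as the earlier examples. The cluster name, member addresses, and outbound range are hypothetical values chosen for illustration, and the exact value syntax each setting accepts is assumed from the example column above.

```xml
<!-- Hypothetical cluster name; only nodes that share this name join each other -->
<env name="identification/cluster/name" value="PegaDevCluster" />

<!-- Static, comma-separated list of member IP addresses (placeholder addresses) -->
<env name="initialization/cluster/members" value="10.3.10.4,10.3.10.5" />

<!-- Outbound port range, assumed to accept the same range syntax as the inbound ports -->
<env name="cluster/hazelcast/outboundPortRange" value="5801-5810" />
```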
Clock Drift settings
The following Clock Drift settings were introduced in the Pega Platform release indicated.
Introduced in this release | Setting name | Prconfig value | Description | Default value | Example value |
---|---|---|---|---|---|
7.3.0 | Clock Drift Threshold | alerts/cluster/clockdeltathreshold | The maximum allowed difference between any two clocks in the cluster, in seconds | 10 seconds | 10 seconds |
7.3.0 | Clock Drift Sampling Rate | alerts/cluster/clocksamplerateminutes | The frequency at which the clocks in the cluster are sampled, in minutes | 10 minutes | 10 minutes |
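As an illustration, a site that wants earlier warning of drift might tighten both values. The numbers below are hypothetical, and the sketch assumes that each setting takes a bare integer in the unit named in its description (seconds and minutes, respectively).

```xml
<!-- Alert when any two node clocks differ by more than 5 seconds (hypothetical value) -->
<env name="alerts/cluster/clockdeltathreshold" value="5" />

<!-- Sample the cluster clocks every 5 minutes (hypothetical value) -->
<env name="alerts/cluster/clocksamplerateminutes" value="5" />
```

Tighter thresholds do not replace clock synchronization; they only surface drift sooner in the logs and in PDC.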
Hazelcast Internet Control Message Protocol (ICMP) settings
The Internet Control Message Protocol (ICMP) is a supporting protocol in the Internet protocol suite. Used by network devices, including routers, ICMP sends error messages and operational information indicating, for example, that a requested service is not available or that a host or router could not be reached. ICMP differs from transport protocols such as TCP and UDP in that it is not typically used to exchange data between systems, nor is it regularly employed by end-user network applications except for some diagnostic tools like ping and traceroute.
The Hazelcast Ping Failure Detector relies on ICMP. To prevent ping failures, consider adjusting the ICMP properties in your Hazelcast declarative configuration file.
To understand the scenarios for which you might need to adjust the ICMP settings, see Hazelcast Failure Detector Configuration.
Here are some examples of the Hazelcast ICMP settings:
Introduced in this release | Setting name | Prconfig value | Description | Default value | Example value |
---|---|---|---|---|---|
7.4.0 | Hazelcast ICMP Enabled | hazelcast/icmp/enabled | Enables ICMP ping detector for Hazelcast. ICMP pings are used to determine which nodes are still alive. | false | true |
7.4.0 | Hazelcast ICMP Parallel Mode | hazelcast/icmp/parallel/mode | Sets ICMP detector to parallel mode. | false | true |
7.4.0 | Hazelcast ICMP Timeout | hazelcast/icmp/timeout | The amount of time to wait before declaring a ping failed, in milliseconds | 1000 ms | 1000 ms |
7.4.0 | Hazelcast ICMP Max Attempts | hazelcast/max/attempts | Maximum number of ping attempts before suspecting a member | 3 | 3 |
7.4.0 | Hazelcast ICMP Interval | hazelcast/icmp/interval | Time in milliseconds between each ping | 1000 ms | 1000 ms |
7.4.0 | Hazelcast ICMP TTL | hazelcast/icmp/ttl | Maximum number of hops for an ICMP packet sent by Hazelcast or 0 for the default | 0 | 0 |
7.4.0 | Hazelcast ICMP Fail Fast | hazelcast/icmp/failfastonstartup | If set, Hazelcast will fail to start if any ICMP requirement is not met. | false | false |
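A sketch of how the ICMP detector might be switched on through prconfig.xml, using the setting names exactly as listed in the table. The sketch assumes that the timing settings take a bare number of milliseconds; the values shown are simply the documented defaults made explicit.

```xml
<!-- Turn on the ICMP ping failure detector (disabled by default) -->
<env name="hazelcast/icmp/enabled" value="true" />

<!-- Wait up to 1000 ms for each ping reply (assumed to be a bare millisecond count) -->
<env name="hazelcast/icmp/timeout" value="1000" />

<!-- Send a ping every 1000 ms -->
<env name="hazelcast/icmp/interval" value="1000" />

<!-- Suspect a member after 3 failed attempts; name copied as listed in the table -->
<env name="hazelcast/max/attempts" value="3" />
```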
Encryption settings
Hazelcast lets you protect the privacy of communication between nodes by enabling encryption. Encryption is based on Java Cryptography Architecture (JCA).
The following Encryption settings were introduced in the Pega Platform release indicated.
Introduced in this release | Setting name | Prconfig value | Description | Default value | Example value |
---|---|---|---|---|---|
7.3.0 | Cluster Keystore File Path | cluster/encryption/keystore/path | Location of the keystore for cluster encryption on disk | Not Applicable | /home/cluster-keystore.jks |
7.3.0 | Cluster Keystore Password | cluster/encryption/keystore/password | An encrypted keystore password | Not Applicable | |
7.3.0 | Cluster Truststore File Path | cluster/encryption/truststore/path | Location of the truststore for cluster encryption on disk | Not Applicable | /home/cluster-truststore.jks |
7.3.0 | Cluster Truststore Password | cluster/encryption/truststore/password | An encrypted truststore password | Not Applicable | |
7.3.0 | Cluster Supported Key Manager Algorithm | cluster/encryption/keymanager/algorithm | Key manager algorithm | X509 | X509 |
7.3.0 | Cluster Supported Trust Manager | cluster/encryption/trustmanager/algorithm | Trust manager algorithm | X509 | X509 |
7.3.0 | Cluster Supported Encryption Protocol | cluster/encryption/protocol | Encryption protocol for the cluster | TLSv1.2 | TLSv1.2 See SA-67911. |
7.3.0 | Cluster Encrypter Custom Class | cluster/encryption/customclass | Name of custom class used to decrypt key/truststore passwords | com.example.mydecryptor | com.example.mydecryptor |
7.3.0 | Cluster Encryption Keystore | cluster/encryption/keystorename | Name of the keystore file if the file is in the database | cluster-keystore.jks | cluster-keystore.jks |
7.3.0 | Cluster Encryption Truststore | cluster/encryption/truststorename | Name of the truststore file if the truststore is in the database | cluster-truststore.jks | cluster-truststore.jks |
7.3.0 | Cluster Encryption Enabled | cluster/encryption/enabled | Enables encryption for the cluster | false | true |
7.3.0 | Cluster SSL Context Factory Class | cluster/encryption/ssl/factory/class | Class for creating an SSL context | | com.hazelcast.examples.MySSLContextFactory |
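The fragment below sketches a file-based encryption setup assembled from the example values in the table. It is an illustration rather than a complete or verified configuration: the keystore and truststore passwords are set in the same way, but the table gives no example value for them, so they are left out here.

```xml
<!-- Enable encrypted communication between cluster nodes -->
<env name="cluster/encryption/enabled" value="true" />

<!-- Keystore and truststore on disk (example paths from the table) -->
<env name="cluster/encryption/keystore/path" value="/home/cluster-keystore.jks" />
<env name="cluster/encryption/truststore/path" value="/home/cluster-truststore.jks" />

<!-- Protocol listed as the default; stated explicitly for clarity -->
<env name="cluster/encryption/protocol" value="TLSv1.2" />

<!-- cluster/encryption/keystore/password and cluster/encryption/truststore/password
     also need encrypted values, which are not shown in this sketch -->
```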
Security settings
The security settings control the Hazelcast Interceptor, which is described in Concepts and terminology.
The following Security settings were introduced in the Pega Platform release indicated.
Introduced in this release | Setting name | Prconfig value | Description | Default value | Example value |
---|---|---|---|---|---|
7.4.1 (not in 8.1) | Cluster Socket Interceptor | cluster/network/interceptor/enabled | Allows the interception of network traffic for analysis, for example, to prevent foreign traffic from inundating communication operations | false | false |
7.4.1 (not in 8.1) | Cluster Intruder DenyList | cluster/network/interceptor/denylistaddresses | A list of addresses for the cluster to deny communications with | <IP 1>,<IP 2> . . | <IP 1>,<IP 2> . . |
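A sketch of enabling the interceptor with a deny list, for releases where these settings are available. The two addresses are placeholders from a documentation-only address block, and the comma-separated format is assumed from the example column above.

```xml
<!-- Enable the socket interceptor (disabled by default) -->
<env name="cluster/network/interceptor/enabled" value="true" />

<!-- Addresses whose traffic the cluster should refuse (placeholder addresses) -->
<env name="cluster/network/interceptor/denylistaddresses" value="203.0.113.25,203.0.113.26" />
```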
Related content
Enabling encrypted communication between nodes
Managing Hazelcast client-server mode for Pega Platform
Deploying Hazelcast Management Center
Configuring client-server mode for Hazelcast on Pega Platform