Managing clusters with Hazelcast

Support Doc

MaryCarbonara

Posted: 16 May 2022 12:46 EDT
Last activity: 14 Aug 2023 11:43 EDT

Applies to Pega Platform™ versions 7.3 through 8.3.1

For information about Pega Platform versions later than 8.3.1, see the prerequisite articles Third-party externalized services FAQs and External Hazelcast in your deployment.

This document is the first in a series that includes the following companion documents:

Updates to Hazelcast support

Split-Brain Syndrome and cluster fracturing FAQs

Troubleshooting Hazelcast cluster management

Starting with Pega Platform™ version 7.1.7, Hazelcast was offered to improve performance of internode communication. In Pega 7.1.9, the near-instantaneous System Pulse feature replaced the older and much slower distribution of rule and data updates through database tables. Since then, many features have been rewritten to use the Hazelcast-backed distributed operations service.

Since Pega 7.1.7, many issues have been reported and many questions raised about Hazelcast and its notoriously talkative logging. Most of these issues can be prevented by applying best practices. Other issues can be resolved by understanding how to troubleshoot them.

This document focuses on Pega 7.3 and later releases of the Pega Platform. For Pega 7.1.9 through Pega 7.2.2, stability hotfixes are available to upgrade these releases to newer versions of Hazelcast.

Hazelcast Editions supported
Best practices
Port range
Node ID
PR_SYS_STATUSNODES
Network Address Translation (NAT)
Network Interface Controller (NIC)
Number of cores
Graceful shutdown
Concepts and terminology
Split-Brain Syndrome and Cluster Fracturing
Master node
Hazelcast Interceptor
Clock Drift
Settings
Common cluster settings
Clock Drift settings
Hazelcast Internet Control Message Protocol (ICMP) settings
Encryption settings
Security settings
Related content

Hazelcast Editions supported

Pega supports the Hazelcast Editions shown in the table below.

Hotfixes are available or planned for the following releases:

Pega 7.4: HFix-46618 (Hazelcast 3.10 EE Perpetual License) and HFix-48345 (Alerts)
Pega 7.3.1: HFix-46681 (Hazelcast 3.10 EE Perpetual License)
Pega 7.3: HFix-46682 (Hazelcast 3.10 EE Perpetual License)
Pega 7.2.2: HFix-47749 (Hazelcast 3.10 EE Perpetual License)

Pega Platform Releases and Hazelcast Editions supported
Pega Platform Release	Hazelcast Edition
Pega 7.2.2, 7.2.1, 7.2	3.4.1 Community Edition (CE)
Pega 7.4, 7.3.1, 7.3	3.8 Community Edition (CE)
Pega 8.1	3.10 Enterprise Edition (EE)
Pega 8.2	3.10.4 Enterprise Edition (EE)
Pega 8.3	3.11 Enterprise Edition (EE)

Best practices

For successful cluster management, follow the guidelines in this section:

Port range
Node ID
PR_SYS_STATUSNODES
Network Address Translation (NAT)
Network Interface Controller (NIC)
Number of cores
Graceful shutdown

Port range

By default, a Pega node uses port range 5701 to 5800 for Hazelcast. In an environment where a different range is required, use the prconfig.xml property cluster/hazelcast/ports to set the range.

prconfig.xml example

<env name="cluster/hazelcast/ports" value="5701-5750" />

DSS example

Setting Purpose: prconfig/cluster/hazelcast/ports/default
Value: 5701-5750
Owning Ruleset: Pega-Engine

When to set the port range

Set the port range when multiple environments run on the same host (for example, QA and DEV) and would otherwise conflict, or when the default ports are already in use or blocked for any reason.
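The port-conflict check described above can be sketched in Python. This is a minimal sketch and not part of Pega Platform or Hazelcast: the function names parse_ports and conflicting_ports are hypothetical, and the accepted formats (a single range such as 5701-5750, or a comma-separated mix such as 5701, 5702, 5703-5710) are assumed from the example values in this document.

```python
def parse_ports(spec: str) -> set[int]:
    """Expand a spec such as "5701-5750" or "5701, 5702, 5703-5710" into a set of ports."""
    ports: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(bound) for bound in part.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {part}")
            ports.update(range(low, high + 1))
        else:
            ports.add(int(part))
    return ports


def conflicting_ports(spec_a: str, spec_b: str) -> list[int]:
    """Return the sorted ports that two environments on one host would both claim."""
    return sorted(parse_ports(spec_a) & parse_ports(spec_b))


# Hypothetical QA and DEV ranges on the same host.
print(conflicting_ports("5701-5750", "5751-5800"))      # []
print(conflicting_ports("5701-5750", "5740-5760")[:3])  # [5740, 5741, 5742]
```

An empty result suggests that the two ranges can coexist on one host; it does not show that the ports are free of other processes or open in the firewall.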

Node ID

Each Pega node is identified with a Node ID that must be unique in the cluster. If the same Node ID is already used in the cluster, the node fails to start.

Use this setting to more easily identify nodes and their purposes at a glance. A Node ID is generated by default based on certain system setting values. However, as a best practice, set the Node ID manually to reflect the node’s intended purpose.

To set the Node ID, use the JVM argument identification.nodeid as shown in the following examples.

-Didentification.nodeid=SearchNode1

-Didentification.nodeid=BackgroundProcessing1
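Because a duplicate Node ID prevents a node from starting, it can help to check a planned set of Node IDs before deployment. The following Python sketch is illustrative only: the helper names are hypothetical, and it assumes that Node IDs are compared as exact, case-sensitive strings, which this document does not confirm.

```python
from collections import Counter


def duplicate_node_ids(node_ids):
    """Return the Node IDs that appear more than once, in sorted order."""
    counts = Counter(node_ids)
    return sorted(node_id for node_id, count in counts.items() if count > 1)


def jvm_argument(node_id):
    """Build the JVM argument in the form shown in the examples above."""
    return f"-Didentification.nodeid={node_id}"


# Hypothetical plan: SearchNode1 was assigned twice by mistake.
planned = ["SearchNode1", "BackgroundProcessing1", "SearchNode1"]
print(duplicate_node_ids(planned))            # ['SearchNode1']
print(jvm_argument("BackgroundProcessing1"))  # -Didentification.nodeid=BackgroundProcessing1
```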

PR_SYS_STATUSNODES

Pega nodes are registered into the table pr_sys_statusnodes at startup time. This table holds information such as the node IP address, node name, and other information.

When a node joins the cluster, a list of cluster-member candidates is loaded from the pr_sys_statusnodes table. The node then tries to establish a connection with the candidates to form a cluster.

Do not truncate the pr_sys_statusnodes table while ANY cluster nodes are up and running! Doing so removes the information needed for a newly started node to discover other nodes that are already running. Consequently, the newly started node forms a new cluster instead of joining the cluster that is already running.
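The effect of truncating the table can be illustrated with a toy model in Python. This is a simplified sketch of the behavior described above, not Pega's discovery code: the function start_node, the list standing in for pr_sys_statusnodes, and the IP addresses are all hypothetical.

```python
def start_node(new_node, status_table, running_nodes):
    """Model one startup: read candidates, try to connect, then register."""
    candidates = [address for address in status_table if address != new_node]
    reachable = [address for address in candidates if address in running_nodes]
    if new_node not in status_table:
        status_table.append(new_node)
    running_nodes.add(new_node)
    return "joined existing cluster" if reachable else "formed new cluster"


# Normal case: the third node finds the first two in the table and joins them.
table = ["10.0.0.1", "10.0.0.2"]
running = {"10.0.0.1", "10.0.0.2"}
print(start_node("10.0.0.3", table, running))  # joined existing cluster

# Table truncated while two nodes are still running: no candidates are found.
table.clear()
running = {"10.0.0.1", "10.0.0.2"}
print(start_node("10.0.0.3", table, running))  # formed new cluster
```

In the second case the two original nodes are still running, but the new node cannot learn about them, which is how a second cluster comes into existence.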

Network Address Translation (NAT)

If Pega nodes are running behind network address translation (NAT), they might not see each other. To ensure communication among the nodes, the system administrator should set the public address of each node to the address that is defined for it on the NAT. This configuration is mainly used when running in private VMs or Docker containers. To set the public address, use the following prconfig.xml setting:

identification/cluster/public/address

See Common cluster settings.

Network Interface Controller (NIC)

If you have multiple Network Interface Controllers (NICs) in your clustered environment, use the cluster/hazelcast/interface setting to specify the IP address that you want the node to communicate on. This forces Hazelcast to refer to the correct NIC. Avoid the issue described in the Example problem scenario.

The setting should be one IP address.

See Common cluster settings.

prconfig examples

<env name="cluster/hazelcast/interface" value="10.3.10.4"/>

DSS examples

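Caution: A DSS is a database-stored configuration, so setting the interface through a DSS affects all nodes. This note was offered by Yas Ito as a Comment on this Support Document.

Setting Purpose: prconfig/cluster/hazelcast/interface
Value: 10.3.10.4
Owning Ruleset: Pega-Engine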

Example problem scenario

According to the Hazelcast documentation, Other network configurations, you can specify a set of IP addresses by using a wildcard (*) or a range in the last octet of the IP address, for example, 192.168.1.* or 192.168.1.100-110.

However, if you specify *.*.*.* as an IP address in cluster/hazelcast/interface, this value is not supported. The nodes of the cluster are not able to pick up the correct IP address.

Solution: Use wildcards only in the trailing octets of an IP address, for example, 192.168.*.*.
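The guidance above can be expressed as a small validation sketch in Python. It is not Hazelcast's own matching code: the functions split_pattern and matches are hypothetical, and the rule that wildcards and ranges may appear only in trailing octets is taken from this document's solution rather than from the Hazelcast implementation.

```python
def split_pattern(pattern):
    """Validate an interface pattern and return its four octet parts."""
    parts = pattern.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four octets: {pattern}")
    if all(part == "*" for part in parts):
        raise ValueError("*.*.*.* is not supported")
    seen_flexible = False
    for part in parts:
        if part == "*" or "-" in part:
            seen_flexible = True
        elif seen_flexible:
            raise ValueError(f"wildcards must be in the trailing octets: {pattern}")
    return parts


def matches(pattern, address):
    """Return True if the IPv4 address fits the pattern."""
    octets = address.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected four octets: {address}")
    for part, octet in zip(split_pattern(pattern), octets):
        value = int(octet)
        if part == "*":
            continue
        if "-" in part:
            low, high = (int(bound) for bound in part.split("-", 1))
            if not low <= value <= high:
                return False
        elif int(part) != value:
            return False
    return True


print(matches("192.168.1.*", "192.168.1.7"))          # True
print(matches("192.168.1.100-110", "192.168.1.111"))  # False
```

Calling split_pattern("*.*.*.*") raises ValueError, which mirrors the unsupported value in the problem scenario.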

Number of cores

Hazelcast recommends having at least 8 CPU cores. Having a low number of cores can cause instability in the cluster because threads might start blocking each other. The number of Hazelcast threads is printed in the log, as shown in the following sample:

2019-04-11 03:20:24,829 [ ip-10-123-2-41] (tor.impl.OperationExecutorImpl) INFO -[10.123.2.41]:5701 [49d9b0e8c5fa8b21c4ef8d490df72708] [3.10] Starting 2 partition threads and 3 generic threads (1 dedicated for priority tasks)

The line above shows that too few threads were started. For additional guidance, see the Hazelcast IMDG 3.11 Deployment and Operations Guide, the section Basic Optimization Recommendations, which includes the following guidelines:

8 cores per Hazelcast server instance
Minimum of 8 GB RAM per Hazelcast member (if not using the High-Density Memory Store)
Dedicated NIC per Hazelcast member
Linux—any distribution
All member nodes should run within the same subnet
All member nodes should be attached to the same network switch
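To find the thread counts in a log file, the startup line shown above can be parsed with a regular expression. The following Python sketch assumes that the message keeps the wording of the sample line. The threshold MIN_PARTITION_THREADS is a hypothetical value chosen to echo the 8-core recommendation; neither Pega nor Hazelcast documents it as a limit.

```python
import re

THREAD_LINE = re.compile(
    r"Starting (\d+) partition threads and (\d+) generic threads "
    r"\((\d+) dedicated for priority tasks\)"
)


def thread_counts(log_line):
    """Return (partition, generic, priority) thread counts, or None if absent."""
    match = THREAD_LINE.search(log_line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


sample = (
    "2019-04-11 03:20:24,829 [ ip-10-123-2-41] (tor.impl.OperationExecutorImpl) "
    "INFO -[10.123.2.41]:5701 [49d9b0e8c5fa8b21c4ef8d490df72708] [3.10] "
    "Starting 2 partition threads and 3 generic threads (1 dedicated for priority tasks)"
)

MIN_PARTITION_THREADS = 8  # assumption: mirrors the 8-core recommendation above

partition, generic, priority = thread_counts(sample)
print(partition, generic, priority)  # 2 3 1
print("too few" if partition < MIN_PARTITION_THREADS else "ok")  # too few
```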

Graceful shutdown

Gracefully shut down Pega nodes to avoid losing Hazelcast data partitions. During a graceful shutdown, the data from the node that is shutting down is automatically migrated to the other nodes.

Do not use a kill -9 command! This command stops all processes on the node immediately, with the following negative consequences:

No clean shutdown of socket connections
No cleanup of temporary files
No time to inform sub-processes that the node is going away
No time for the node to reset its terminal characteristics
Stops processes that are running on the node even if those processes are performing work; no clean exit occurs. Processing stops in mid-stream.

Concepts and terminology

This article and its companion articles assume that you read and understand the Pega Help topics under Managing your system, particularly the topics for Node configuration > Multi-node systems > Cluster deployment and High availability > Configuring nodes for high availability > Cluster management.

Split-Brain Syndrome and Cluster Fracturing

Several scenarios can lead to nodes in the cluster being unaware of one another, causing the cluster to split into several smaller clusters of nodes instead of one large one.

To understand Split-Brain Syndrome and how to prevent it, detect it, and recover from it, see the related support document, Split-Brain Syndrome and cluster fracturing FAQs.

Master node

Hazelcast does not have a centralized master node as many other distributed operation technologies have. However, it does maintain an implicit master node whose responsibility it is to keep the other nodes up to date with the latest membership information. The first node to start, that is, the oldest node, is always considered the master node. When the master node leaves the cluster, the remaining nodes begin a mastership process to nominate a new master node. If a Split-Brain scenario develops, two master nodes will be present. When two fractured clusters merge, the master node of the smaller cluster yields to (merges into) the master node of the larger cluster.
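The mastership rules described above can be modeled in a few lines of Python. This is a toy model for illustration, not Hazelcast code: the Member class, the joined_at field, and the merge function are hypothetical, and the behavior when both clusters have the same size is not covered by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    joined_at: int  # hypothetical start order; lower means older


def master_of(cluster):
    """The oldest member acts as the implicit master."""
    return min(cluster, key=lambda member: member.joined_at)


def merge(cluster_a, cluster_b):
    """Merge two fractured clusters; the smaller one yields to the larger."""
    smaller, larger = sorted((cluster_a, cluster_b), key=len)
    return larger + smaller, master_of(larger)


big = [Member("node2", 2), Member("node3", 3), Member("node4", 4)]
small = [Member("node1", 1)]
merged, master = merge(big, small)
print(master.name)  # node2
```

In this model node1 is the oldest member overall, but node2 remains master after the merge because the smaller cluster yields to the larger one.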

Hazelcast Interceptor

Because Hazelcast behaves in a fail-fast manner, it is possible for external traffic from other sources to cause Hazelcast instability. One example might be a DDoS attack that takes place on an inbound Hazelcast port. Other examples include security tools that might attempt to breach the port; a flood of traffic will cause poor performance. In these cases, the Hazelcast interceptor can be used with a deny list of IP addresses that Hazelcast should ignore. This helps Hazelcast filter traffic before attempting to consume it, leading to better communication performance in the event of third-party traffic to its inbound port. For more information, see Security settings.

Clock Drift

Hazelcast operates outside of ‘time’, but Pega application operations can be severely impacted if the time of each node is not aligned. When clock times begin to drift, an alert is generated in the logs and through PDC. Ensure that systems are running clock synchronization software such as Network Time Protocol (NTP). In addition, Hazelcast might detect delays in traffic. Again, although Hazelcast operates outside of ‘time’, it does pay attention to the time it takes for traffic to propagate between nodes. If Hazelcast detects that traffic is taking abnormally long, it sends warning messages to the PegaCluster logs.

For example, Hazelcast reports inter-node traffic delays caused by system-wide processing such as Java heap garbage collection (GC). In case of larger Java heaps, garbage collection might cause your application to pause for tens of seconds (even minutes for large heaps), badly affecting your application performance and response times.

Settings

Some settings in the Pega configuration file (prconfig.xml) relate directly to Hazelcast. The important settings fall into the following groups:

Common cluster settings
Clock Drift settings
Hazelcast Internet Control Message Protocol (ICMP) settings
Encryption settings
Security settings

Common cluster settings

The following settings are frequently used for cluster management with Hazelcast.

Common cluster settings for Pega Platform versions
Introduced in this release	Setting name	Prconfig value	Description	Default value or values	Example value or values
7.3.0	Cluster Name	identification/cluster/name	The name for the cluster of nodes. Nodes will only join other nodes that share the same cluster name.	PRPC	PRPC
7.3.0	Cluster Protocol	identification/cluster/protocol	The operating protocol for the cluster	Hazelcast	Hazelcast
7.3.0	Cluster Members	initialization/cluster/members	A static list of cluster member IP addresses, separated by commas. Disables automatic discovery.	Not Applicable (user defined)	<IP 1>, <IP 2>, ….
7.3.0	Cluster Public Address	identification/cluster/public/address	The public IP address for the cluster. See Network Address Translation (NAT).	Not Applicable (user defined)	<n.n.n.n>
7.4.0	Hazelcast Outbound Ports	cluster/hazelcast/outboundPortRange	The configured range of outbound ephemeral ports of Hazelcast	Values within the valid range from 5801 to 5900	5801, 5802, 5803-5810
7.3.0	Hazelcast Interface	cluster/hazelcast/interface	A list of valid network interfaces for Hazelcast. See Network Interface Controller (NIC).	Not Applicable	<n.n.n.n>
7.3.0	Cluster Ports	initialization/cluster/ports	A range of inbound ports (Hazelcast selects one for use.)	Values within the valid range from 5701 to 5800	5701, 5702, 5703-5710
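Settings such as those in the table above are written to prconfig.xml as env elements, in the form shown in the Port range and NIC examples. The following Python sketch generates such entries from a dictionary. The helper env_entry is hypothetical, and the setting names and values are copied from the examples in this document.

```python
import xml.etree.ElementTree as ET


def env_entry(name, value):
    """Serialize one prconfig.xml <env> element with name and value attributes."""
    element = ET.Element("env", {"name": name, "value": value})
    return ET.tostring(element, encoding="unicode")


settings = {
    "identification/cluster/name": "PRPC",
    "cluster/hazelcast/interface": "10.3.10.4",
    "cluster/hazelcast/ports": "5701-5750",
}
for name, value in settings.items():
    print(env_entry(name, value))
```

Building the elements with an XML library rather than string concatenation means that any special characters in a value are escaped correctly.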

Clock Drift settings

The following Clock Drift settings were introduced in the Pega Platform release indicated.

Clock Drift settings
Introduced in this release	Setting name	Prconfig value	Description	Default value	Example value
7.3.0	Clock Drift Threshold	alerts/cluster/clockdeltathreshold	The maximum allowed difference between any two clocks in the cluster, in seconds	10 seconds	10 seconds
7.3.0	Clock Drift Sampling Rate	alerts/cluster/clocksamplerateminutes	The frequency at which the clocks in the cluster are sampled, in minutes	10 minutes	10 minutes
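The threshold in the table above applies to the difference between any two clocks in the cluster. The following Python sketch shows one way to evaluate that condition for a set of clock readings. It is an illustration, not Pega's alerting code: the function drifting_pairs and the sample readings are hypothetical, and it assumes that a difference strictly greater than the threshold counts as drift.

```python
from itertools import combinations


def drifting_pairs(clock_readings, threshold_seconds=10.0):
    """Return node pairs whose clocks differ by more than the threshold."""
    pairs = []
    ordered = sorted(clock_readings.items())
    for (node_a, time_a), (node_b, time_b) in combinations(ordered, 2):
        delta = abs(time_a - time_b)
        if delta > threshold_seconds:
            pairs.append((node_a, node_b, delta))
    return pairs


# Hypothetical readings, in seconds, taken at the same instant on each node.
readings = {"node1": 1000.0, "node2": 1004.0, "node3": 1015.5}
print(drifting_pairs(readings))  # [('node1', 'node3', 15.5), ('node2', 'node3', 11.5)]
```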

Hazelcast Internet Control Message Protocol (ICMP) settings

The Internet Control Message Protocol (ICMP) is a supporting protocol in the Internet protocol suite. Used by network devices, including routers, ICMP sends error messages and operational information indicating, for example, that a requested service is not available or that a host or router could not be reached. ICMP differs from transport protocols such as TCP and UDP in that it is not typically used to exchange data between systems, nor is it regularly employed by end-user network applications except for some diagnostic tools like ping and traceroute.

The Hazelcast Ping Failure Detector relies on ICMP. To prevent ping failures, consider adjusting the ICMP properties in your Hazelcast declarative configuration file.

To understand the scenarios for which you might need to adjust the ICMP settings, see Hazelcast Failure Detector Configuration.

Here are some examples of the Hazelcast ICMP settings:

Hazelcast ICMP settings
Introduced in this release	Setting name	Prconfig value	Description	Default value	Example value
7.4.0	Hazelcast ICMP Enabled	hazelcast/icmp/enabled	Enables ICMP ping detector for Hazelcast. ICMP pings are used to determine which nodes are still alive.	false	true
7.4.0	Hazelcast ICMP Parallel Mode	hazelcast/icmp/parallel/mode	Sets ICMP detector to parallel mode.	false	true
7.4.0	Hazelcast ICMP Timeout	hazelcast/icmp/timeout	The amount of time to wait before declaring a ping failed, in milliseconds	1000 ms	1000 ms
7.4.0	Hazelcast ICMP Max Attempts	hazelcast/max/attempts	Max ping attempts before suspecting a member	3	3
7.4.0	Hazelcast ICMP Interval	hazelcast/icmp/interval	Time in milliseconds between each ping	1000 ms	1000 ms
7.4.0	Hazelcast ICMP TTL	hazelcast/icmp/ttl	Maximum number of hops for an ICMP packet sent by Hazelcast or 0 for the default	0	0
7.4.0	Hazelcast ICMP Fail Fast	hazelcast/icmp/failfastonstartup	If set, Hazelcast will fail to start if any ICMP requirement is not met.	false	false
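The way the timeout and max attempts settings interact can be illustrated with a simplified model in Python. This is not the Hazelcast Ping Failure Detector itself: the function is_suspected is hypothetical, and the sketch assumes that only consecutive failed pings count toward the limit.

```python
def is_suspected(round_trips_ms, timeout_ms=1000, max_attempts=3):
    """Return True once max_attempts pings in a row fail or exceed the timeout.

    round_trips_ms holds one entry per ping: the round-trip time in
    milliseconds, or None when no reply arrived.
    """
    consecutive_failures = 0
    for round_trip in round_trips_ms:
        failed = round_trip is None or round_trip > timeout_ms
        consecutive_failures = consecutive_failures + 1 if failed else 0
        if consecutive_failures >= max_attempts:
            return True
    return False


print(is_suspected([20, None, 1500, 30, None]))  # False
print(is_suspected([20, None, 1500, None]))      # True
```

Under this model, raising the timeout or the number of attempts makes the detector more tolerant of short network stalls, at the cost of slower detection of a member that is really gone.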

Encryption settings

Hazelcast can encrypt communication between cluster members to meet privacy requirements. Encryption is based on Java Cryptography Architecture (JCA).

The following Encryption settings were introduced in the Pega Platform release indicated.

Encryption settings for clustered environments
Introduced in this release	Setting name	Prconfig value	Description	Default value / Example value
7.3.0	Cluster Keystore File Path	cluster/encryption/keystore/path	Location of the keystore for cluster encryption on disk	Default: Not Applicable (user defined); example: /home/cluster-keystore.jks
7.3.0	Cluster Keystore Password	cluster/encryption/keystore/password	An encrypted keystore password	Not Applicable (user defined)
7.3.0	Cluster Truststore File Path	cluster/encryption/truststore/path	Location of the truststore for cluster encryption on disk	Default: Not Applicable (user defined); example: /home/cluster-truststore.jks
7.3.0	Cluster Truststore Password	cluster/encryption/truststore/password	An encrypted truststore password	Not Applicable (user defined)
7.3.0	Cluster Supported Key Manager Algorithm	cluster/encryption/keymanager/algorithm	Key manager algorithm	X509
7.3.0	Cluster Supported Trust Manager	cluster/encryption/trustmanager/algorithm	Trust manager algorithm	X509
7.3.0	Cluster Supported Encryption Protocol	cluster/encryption/protocol	Encryption protocol for the cluster	TLSv1.2 (see SA-67911)
7.3.0	Cluster Encrypter Custom Class	cluster/encryption/customclass	Name of custom class used to decrypt key/truststore passwords	com.example.mydecryptor
7.3.0	Cluster Encryption Keystore	cluster/encryption/keystorename	Name of the keystore file if the file is in the database	cluster-keystore.jks
7.3.0	Cluster Encryption Truststore	cluster/encryption/truststorename	Name of the truststore file if the truststore is in the database	cluster-truststore.jks
7.3.0	Cluster Encryption Enabled	cluster/encryption/enabled	Enables encryption for the cluster	Default: false; example: true
7.3.0	Cluster SSL Context Factory Class	cluster/encryption/ssl/factory/class	Class for creating an SSL context	com.hazelcast.examples.MySSLContextFactory

Security settings

The security settings control the Hazelcast Interceptor. For background, see Hazelcast Interceptor.

The following Security settings were introduced in the Pega Platform release indicated.

Security settings for clustered environments
Introduced in this release	Setting name	Prconfig value	Description	Default value	Example value
7.4.1 (not in 8.1)	Cluster Socket Interceptor	cluster/network/interceptor/enabled	Allows the interception of network traffic for analysis, for example, to prevent foreign traffic from inundating communication operations	false	false
7.4.1 (not in 8.1)	Cluster Intruder DenyList	cluster/network/interceptor/denylistaddresses	A list of addresses for the cluster to deny communications with	<IP 1>,<IP 2> . .	<IP 1>,<IP 2> . .
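The deny list in the table above is a comma-separated list of IP addresses. The following Python sketch shows how such a list could be parsed and checked against an incoming address. It only illustrates the idea and is not the interceptor's implementation: the function names and the sample addresses are hypothetical.

```python
import ipaddress


def parse_deny_list(value):
    """Parse a comma-separated list of IP addresses into a set."""
    return {
        ipaddress.ip_address(item.strip())
        for item in value.split(",")
        if item.strip()
    }


def is_denied(remote_address, deny_list):
    """Return True if the remote address is on the deny list."""
    return ipaddress.ip_address(remote_address) in deny_list


# Hypothetical deny list value, using documentation-only address ranges.
deny_list = parse_deny_list("203.0.113.7, 198.51.100.23")
print(is_denied("203.0.113.7", deny_list))  # True
print(is_denied("10.3.10.4", deny_list))    # False
```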

Related content

Managing Hazelcast client-server mode for Pega Platform

Deploying Hazelcast Management Center

Configuring client-server mode for Hazelcast on Pega Platform

15May2022 This Support Document has been migrated from Pega Documentation:
https://docs.pega.com/pega-services-troubleshooting/managing-clusters-hazelcast
24July2023 Updated all links and created anchor links to address FDBK-103132 (SDoc feature not available in May 2022). Added links to the latest Pega Documentation.

Posted: 9 Dec 2022 1:22 EST

Yas_ITO

PEGA

replied to MaryCarbonara

@MaryCarbonara I think the NIC interface should not be set through a DSS (a database-stored configuration), because it affects all of the nodes.

DSS examples

Setting Purpose: prconfig/cluster/hazelcast/interface Value: Owning Ruleset: Pega-Engine

Posted: 26 Jan 2023 12:47 EST

MaryCarbonara replied to Yas_ITO

@Yas_ITO Thanks for offering your Caution note.

I have updated this Support Document as you suggested.

I apologize for my delayed reply to your Comment here. Notifications were not working. Problem is now resolved.

Posted: 25 Jul 2023 2:16 EDT

SUMAN_GUMUDAVELLY

Ford Motor Company

replied to MaryCarbonara

@MaryCarbonara

This is excellent documentation; it reminded me of my almost two-year struggle with Hazelcast. I think you have outlined almost all the information.

However, even after we made sure that all the HFixes were in place, that nodes were shut down gracefully, and that the node status table was accurate, we often saw an exception in AES and then PDC:

Caused by: com.hazelcast.core.OperationTimeoutException: RegistrationOperation invocation failed to complete due to operation-heartbeat-timeout

After updating the uLimit values [on a WebSphere installation], our application was stabilized. Sample values shown here:

[Attachment: uLimit values]

Posted: 28 Jul 2023 13:55 EDT

MaryCarbonara replied to SUMAN_GUMUDAVELLY

@SUMAN_GUMUDAVELLY Thank you for your comment and for identifying that you are running IBM WebSphere Application Server. What version of the Pega Platform are you using?

I believe that your information belongs in the companion SDoc, Troubleshooting Hazelcast cluster management, which contains three topics related to com.hazelcast.core.OperationTimeoutException.

Two topics are red herrings, identified in the section Differential Root Cause Analysis. In this section, first see the table in the subsection, https://support.pega.com/support-doc/troubleshooting-hazelcast-cluster-management#hazelcast-exceptions-misleading.

Then see the following subsections:

https://support.pega.com/support-doc/troubleshooting-hazelcast-cluster-management#hazelcast-logs-clog-server-space

https://support.pega.com/support-doc/troubleshooting-hazelcast-cluster-management#query-partition-operation-fails-to-complete > pertains to Pega Platform 7.3.1

The third topic is in the section Common exceptions and error messages:

https://support.pega.com/support-doc/troubleshooting-hazelcast-cluster-management#operation-timeout-exception

Please confirm that your information belongs in this topic as additional info in What to do. After you confirm this, I will add your information to that SDoc and that section. Thank you.

Posted: 31 Jul 2023 15:16 EDT
Updated: 31 Jul 2023 15:18 EDT

SUMAN_GUMUDAVELLY

Ford Motor Company

replied to MaryCarbonara

@MaryCarbonara Since I represent the Operations Center of Excellence, I've seen these issues in most of the Platform versions. We first noticed them in 7.1.9 and then experienced all sorts of issues in v7.3.1.

Upgrading to Hazelcast EE in 7.4 and backporting the Hazelcast EE HFixes to 7.3.1 helped us get past the issues, but each time there was a peak load or heavy usage of the app, we ran into issues until we increased the uLimit values.

I last saw the Hazelcast issues in Platform version v8.6.4, and we are currently making these increased values the default and part of the run book.

Posted: 3 Aug 2023 20:01 EDT

MaryCarbonara replied to SUMAN_GUMUDAVELLY

@SUMAN_GUMUDAVELLY Thanks for your reply.

I will develop the information in your two comments and add it to the Troubleshooting SDoc, https://support.pega.com/support-doc/troubleshooting-hazelcast-cluster-management.

Thanks for your patience.

Posted: 14 Aug 2023 11:43 EDT

MaryCarbonara replied to SUMAN_GUMUDAVELLY

@SUMAN_GUMUDAVELLY

I have added your Commented scenario to the Troubleshooting Hazelcast SDoc, this section, https://support.pega.com/support-doc/troubleshooting-hazelcast-cluster-management#operation-timeout-exception-registration-operation-invocation-failed.
