Applies to Pega Platform™ versions 7.3 through Infinity '24.
In highly available clustered environments, you might notice that certain nodes in your cluster cannot see one another. One node appears as if it is the only active node. The Cluster Management page does not show all the nodes in your Pega deployment. Upon inspection, you determine that some nodes are in a separate cluster. This error condition is sometimes referred to as Split-Brain Syndrome or cluster fracturing.
What is Split-Brain Syndrome?
How do you detect Split-Brain Syndrome?
How do you detect actions that cause Split-Brain Syndrome?
Common root causes for cluster fracturing
Under-allocation of resources
Network issues
System management issues
How do you prevent cluster fracturing?
What steps can be taken to resolve Split-Brain Syndrome?
Related content
What is Split-Brain Syndrome?
Split-Brain is a state of decomposition in which a single cluster of nodes has separated into multiple clusters of nodes, each operating as if the others no longer exist.
Cluster fracturing is the process by which nodes end up in a Split-Brain state.
How do you detect Split-Brain Syndrome?
In most cases, you should not need to detect or watch for a Split-Brain state. Hazelcast detects these situations and attempts to heal the cluster automatically. In situations where Hazelcast recovers on its own, you might first notice that some remote operations fail; however, service is restored shortly afterward.
In situations where Hazelcast is unable to automatically recover, a Split-Brain merge failure message is reported in the logs.
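If you administer an external Hazelcast cluster, or a test harness where you have direct access to the HazelcastInstance, you can also surface these merge states programmatically through Hazelcast's LifecycleListener API. The following is a minimal sketch, assuming a recent Hazelcast version; the class name and log messages are illustrative and are not part of Pega Platform.

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent;
import com.hazelcast.core.LifecycleListener;

public class MergeStateLogger implements LifecycleListener {

    @Override
    public void stateChanged(LifecycleEvent event) {
        // MERGING, MERGED, and MERGE_FAILED are emitted while Hazelcast
        // attempts to heal a fractured cluster.
        switch (event.getState()) {
            case MERGING:
                System.out.println("Split-Brain detected; merge started: " + event);
                break;
            case MERGED:
                System.out.println("Split-Brain healed; clusters merged: " + event);
                break;
            case MERGE_FAILED:
                System.err.println("Split-Brain merge failed; manual intervention needed: " + event);
                break;
            default:
                // Other lifecycle states (STARTING, SHUTDOWN, and so on) are not relevant here.
                break;
        }
    }

    public static void register(HazelcastInstance hz) {
        hz.getLifecycleService().addLifecycleListener(new MergeStateLogger());
    }
}
```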
How do you detect actions that cause Split-Brain Syndrome?
Numerous actions can cause Split-Brain, including, but not limited to, network outages, GC thrashing, and the over-allocation of hardware resources. Hazelcast includes several APIs for monitoring actions that might lead to a Split-Brain state, such as detecting lost partitions, failed merges, and dropped nodes. Other events that you should consider for monitoring and notification include high memory or CPU usage (or both), GC events, and known network failures.
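The following sketch shows how those listener APIs might be wired together, again assuming a recent Hazelcast version and direct access to the HazelcastInstance (typical only for external Hazelcast or test setups); the class name and messages are placeholders.

```java
import com.hazelcast.cluster.MembershipEvent;
import com.hazelcast.cluster.MembershipListener;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.partition.PartitionLostEvent;

public class ClusterHealthMonitor {

    public static void register(HazelcastInstance hz) {
        // Fires when a partition (and possibly its backups) is lost,
        // which often precedes or accompanies a fracture.
        hz.getPartitionService().addPartitionLostListener((PartitionLostEvent event) ->
                System.err.println("Partition lost: " + event.getPartitionId()
                        + " (lost backups: " + event.getLostBackupCount() + ")"));

        // Fires when members join or drop out of the cluster.
        hz.getCluster().addMembershipListener(new MembershipListener() {
            @Override
            public void memberAdded(MembershipEvent event) {
                System.out.println("Member joined: " + event.getMember());
            }

            @Override
            public void memberRemoved(MembershipEvent event) {
                System.err.println("Member dropped: " + event.getMember());
            }
        });
    }
}
```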
Common root causes for cluster fracturing
Know how to recognize and isolate the following common causes of cluster fracturing:
Under-allocation of resources
- Insufficient disk space: Many operations need disk space, and a node might crash if it runs out.
- High CPU usage: Hazelcast requests must be processed in a reasonable amount of time; otherwise, other nodes in the cluster might incorrectly think that a node has crashed or stopped responding.
- Out of memory (OOM): A lack of available memory leads to frequent garbage collection (GC), which in turn can cause thrashing and, consequently, high CPU usage. (A sampling sketch follows this list.)
- Over-allocated systems: Even on systems with ample resources, resource spikes between Pega nodes and other applications sharing the same VM might lead to OOM, high CPU usage, and other negative conditions.
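To correlate the OOM and GC symptoms above with the times at which the cluster became unstable, you can sample the standard JVM management beans. This is a generic, minimal sketch rather than a Pega-specific tool; in practice you would feed the output into your existing monitoring system.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class JvmPressureSampler {

    public static void sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        // Heap staying close to its maximum for long periods usually means GC thrashing is imminent.
        System.out.printf("Heap used: %d MB of %d MB max%n",
                heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));

        // Rising collection counts and times point at the GC thrashing described above.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("GC %s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample every 30 seconds; adjust the interval to match your monitoring needs.
        while (true) {
            sample();
            Thread.sleep(30_000);
        }
    }
}
```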
Network issues
- High latency: If the latency between nodes and data centers is too high, requests are not processed in sufficient time.
- Network outage: If the connection between two nodes is severed, Hazelcast cannot resume proper communication.
- Domain or firewall issues: Nodes might also fracture because of incorrect firewall or DMZ settings.
System management issues
- Cycling nodes: Frequent node restarts cause excess partition migrations and merges, which lead to excess memory and CPU usage. Data might also be lost if nodes are shut down ungracefully. (A shutdown sketch follows this list.)
- Long-running processes: Long-running processes on a JVM can prevent Hazelcast from processing requests in a reasonable amount of time.
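The data-loss risk from ungraceful shutdowns comes from the distinction Hazelcast draws between a graceful shutdown, which migrates the member's partitions before it leaves, and a hard terminate, which does not. The sketch below illustrates that distinction at the Hazelcast API level only; in a Pega deployment, you stop nodes through the normal Pega or application-server shutdown procedure rather than calling these methods directly.

```java
import com.hazelcast.core.HazelcastInstance;

public class NodeShutdownExample {

    // Graceful: Hazelcast migrates this member's partitions to the remaining
    // members before it leaves, so no owned data is lost.
    public static void stopGracefully(HazelcastInstance hz) {
        hz.getLifecycleService().shutdown();
    }

    // Ungraceful: the member leaves immediately; any partitions without
    // backups elsewhere in the cluster are lost and must be re-created.
    public static void stopAbruptly(HazelcastInstance hz) {
        hz.getLifecycleService().terminate();
    }
}
```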
How do you prevent cluster fracturing?
An event or chain of events led the cluster into the Split-Brain state. How do you prevent this from occurring again? For each of the causal actions described in the previous section, you need to determine the root cause. In most cases, you need to investigate a number of factors to understand what caused the cluster to deteriorate and fracture, because a single event rarely explains the whole picture. Examine both the PegaRULES and PegaCLUSTER logs for this work.
For example, one exception you might see is a TargetNotMemberException. This error indicates that a request was made to a node that is no longer known to the node that raised the exception.
- Why is this node no longer a part of my node's cluster?
- Was the node deliberately removed from the cluster? This exception occurs when a remote request is issued, but the target node dropped out of the cluster before the request could be processed.
- Was the node kicked out of the cluster? To keep the cluster running smoothly, healthy nodes will kick unhealthy nodes out of the cluster.
As you can see, understanding why a cluster fractured is not a simple task. A node might have been terminated, causing repartitioning to occur. A distressed node might not be responding in a timely manner to other nodes and, consequently, is kicked out of the cluster. A node might also be operating in a healthy manner but became separated from the rest of the cluster because of network issues. Altogether, it is important to look at all the possibilities because cluster fracturing is usually the result of several issues.
To determine the root cause of a fractured cluster, ask yourself the following sequence of questions:
- Do the root causes of previous Split-Brain scenarios apply to this issue?
- While examining the logs, can you tell when the cluster began experiencing issues? Does the time factor correlate with other problems (CPU usage, memory, network issues, node cycling, and so on)?
- Using JVM inspection tools, can you tell if there are issues with the nodes themselves?
- Are multiple nodes experiencing issues, or do the logs point to just one or two nodes that are causing instability?
- Are the correct Hazelcast settings and JVM arguments in use? (A configuration-audit sketch follows this list.)
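One way to answer the last question is to dump the effective Hazelcast configuration on each node and compare the results. The sketch below assumes direct access to a HazelcastInstance (typical only for external Hazelcast setups); the property names are standard Hazelcast properties, and no specific values are recommended here.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.HazelcastInstance;

public class ConfigAudit {

    public static void dump(HazelcastInstance hz) {
        Config config = hz.getConfig();

        // Standard Hazelcast properties that govern how quickly an unresponsive
        // member is considered dead; null means the Hazelcast default is in effect.
        System.out.println("heartbeat interval: "
                + config.getProperty("hazelcast.heartbeat.interval.seconds"));
        System.out.println("max no-heartbeat: "
                + config.getProperty("hazelcast.max.no.heartbeat.seconds"));

        // Every node should report the same member list; differences indicate a fracture.
        System.out.println("visible members: " + hz.getCluster().getMembers());
    }
}
```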
What steps can be taken to resolve Split-Brain Syndrome?
First and foremost, verify you are on the latest version of Pega Platform. You should also upgrade to the latest Hazelcast Edition that is available for the Pega Platform release that you are using. Many stability and bug fixes are included in the latest versions of Hazelcast. See Hazelcast Editions supported in Managing clusters with Hazelcast, and Updates to Hazelcast support.
Although embedded Hazelcast is supported through Pega Infinity 24.2, if you run into Split-Brain Syndrome, the best solution is to externalize your Hazelcast setup. For details, see External Hazelcast in your deployment.
Related content
Deploying external Hazelcast on Kubernetes
Deploying external Hazelcast on a Virtual Machine cluster
This Support Doc is part of a series that includes the following companion documents:
Managing clusters with Hazelcast (prerequisite)