8.6 PEGA multinode deployment can take an hour to recover from a node failure during a campaign run

Question

SeanR17369101

Member since 2025

1 post

Credera

Posted: 27 Feb 2025 18:20 EST
Last activity: 31 Mar 2025 8:39 EDT

Solved

We are deploying a non-standard 2-node universal deployment, and while we understand this is not recommended, we are investigating node recovery behavior in failure scenarios.

During our tests, we simulated a failure by deleting a single node while an active campaign was running. As part of the restoration process, we restored the embedded Kafka and Cassandra data before restarting the node. However, we observed inconsistent rejoin times:

In some cases, the node rejoined the cluster within our usual startup time of approximately 15 minutes.
In other cases, it took up to an hour for the node to rejoin.
On one occasion, the remaining active node also crashed, leaving both nodes offline for 15 minutes. Eventually, both nodes came back online and rejoined the cluster.

We would appreciate any insights into why a node might fail to rejoin promptly or cause instability in the remaining node. I can provide relevant logs for further analysis—please let me know which specific logs would be most helpful.

Thanks for your time and support.

Pega Customer Decision Hub 8.6

Pega Customer Decision Hub

Communications and Media

Government

System/Cloud Ops Administrator

Accepted Solution

Posted: 17 Mar 2025 13:55 EDT
Updated: 31 Mar 2025 8:39 EDT

MarijeSchillern

MOD

replied to SeanR17369101

@SeanR17369101

⚠ This is a GenAI-powered answer. All generated answers require validation against the provided references.

Based on your scenario, I can recommend specific logs that would be most valuable for diagnosing these issues.

For a comprehensive analysis of node rejoining problems in your Pega 8.6 deployment with embedded Kafka and Cassandra, please collect the following logs from both nodes:

Cassandra Logs - Found in the work/Catalina/localhost/prweb/Cassandra*log* directory. These logs are critical for identifying errors and warnings related to the embedded Cassandra service, which could explain the delayed rejoin times. Pay special attention to any replication issues, seed node communication problems, or resource constraints.
PegaRULES Log - Primary application log. This contains debugging information about system errors and exceptions that may be affecting node recovery and performance during your campaign runs. Look for any Kafka or Cassandra connectivity issues.
PegaALERT Log - Performance and threshold alerts. This log captures diagnostic messages for failures and system events that exceed performance thresholds, which can help identify bottlenecks causing slow rejoin times.
PegaCLUSTER Log - Cluster management information. This provides crucial information about the setup and runtime behavior of your cluster, which is essential for understanding node recovery dynamics and communication issues between nodes.
Kafka Server Logs - For embedded Kafka instances. Since your scenario involves campaign runs, which rely heavily on Kafka for message processing, these logs could reveal issues with Kafka topic replication, partition management, or consumer group rebalancing that might be contributing to the instability.
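As a first pass over the logs listed above, a rough keyword count can show which area (Cassandra, Kafka, or resource pressure) is noisiest. The following is a minimal sketch only: the keyword groups and the function name `count_keyword_hits` are assumptions chosen for illustration, not strings or tooling that Pega, Cassandra, or Kafka are guaranteed to use, so adjust the patterns to what your logs actually contain.

```python
import re
from collections import Counter

# Hypothetical keyword groups (assumptions); tune them to your own log content.
KEYWORDS = {
    "cassandra": re.compile(r"replicat|seed|gossip", re.IGNORECASE),
    "kafka": re.compile(r"rebalanc|partition|under-replicated", re.IGNORECASE),
    "resources": re.compile(r"OutOfMemoryError|GC overhead|timed? ?out", re.IGNORECASE),
}


def count_keyword_hits(lines):
    """Count how many log lines match each keyword group."""
    hits = Counter()
    for line in lines:
        for group, pattern in KEYWORDS.items():
            if pattern.search(line):
                hits[group] += 1
    return hits
```

To use it, pass any iterable of lines, for example an open file handle for one of the log files, and compare the counts between a fast rejoin and a slow one.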

For the most effective analysis, please collect these logs from:

The node that was deleted and later rejoined (focusing on the periods of slow rejoining)
The remaining active node (especially in the case where it also crashed)
Both nodes, for the period when they were offline together for 15 minutes
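To narrow each log to the periods described above, the lines can be filtered by timestamp before sharing them. This is a sketch under stated assumptions: it assumes every log entry begins with a timestamp shaped like `2025-02-27 18:20:05,123`, which may not match your logging configuration, and `lines_in_window` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each entry: "2025-02-27 18:20:05,123".
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # number of characters in the assumed timestamp


def lines_in_window(lines, start, end):
    """Yield log lines whose leading timestamp falls within [start, end].

    Lines without a parseable timestamp (for example stack-trace
    continuations) follow the decision made for the most recent
    timestamped line.
    """
    keep = False
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
            keep = start <= ts <= end
        except ValueError:
            pass  # continuation line: reuse the previous decision
        if keep:
            yield line
```

For instance, with `start` set to the moment the node was deleted and `end = start + timedelta(hours=1)`, the same window can be applied to each log on both nodes so the events line up.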

Additionally, if you have access to JVM monitoring data or heap dumps from when the issues occurred, these could also provide valuable insights into potential resource constraints affecting the recovery process.

The fact that the remaining node crashed on one occasion after the first node was deleted suggests there might be resource contention or improper handling of partition rebalancing during campaign operations. The logs should help reveal whether this is related to Kafka topic replication, Cassandra data synchronization, or another underlying issue.
