Question

Credera
US
Last activity: 31 Mar 2025 8:39 EDT
Pega 8.6 multinode deployment can take up to an hour to recover from a node failure during a campaign run
We are deploying a non-standard 2-node universal deployment, and while we understand this is not recommended, we are investigating node recovery behavior in failure scenarios.
During our tests, we simulated a failure by deleting a single node while a campaign was actively running. Before restarting the node, we restored its embedded Kafka and Cassandra data. However, we observed inconsistent rejoin times:
- In some cases, the node rejoined the cluster within our usual startup time of approximately 15 minutes.
- In other cases, it took up to an hour for the node to rejoin.
- On one occasion, the remaining active node also crashed, leaving both nodes offline for 15 minutes. Eventually, both nodes came back online and rejoined the cluster.
We would appreciate any insights into why a node might fail to rejoin promptly or cause instability in the remaining node. We can provide relevant logs for further analysis; please let us know which specific logs would be most helpful.
Thanks for your time and support.