Question
Link Group
AU
Last activity: 7 Aug 2023 10:33 EDT
Pega nodes are dropping off the cluster after 30-odd minutes
Hi Team,
We have 6 nodes.
A & B: User (internal users)
C & D: User (external users/public facing)
E: Batchprocessing
F: Stream nodes
After a restart, all 6 nodes join the cluster and I can see their entries in pr_sys_statusnodes. However, I see only 2 nodes in the Admin portal after startup, and after some time the nodes no longer appear in either the pr_sys_statusnodes table or Admin Studio.
The network team has confirmed that ports 5701-5800, which are used for Hazelcast cluster communication, are open on the host machines.
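To double-check the port claim from each node rather than relying on the firewall rules alone, a small TCP probe can help. This is only a minimal sketch: the helper names `is_port_open` and `open_ports` are made up for illustration, and a "closed" result can also mean that nothing is listening on that port at the moment, not necessarily that a firewall is blocking it.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def open_ports(host: str, ports, timeout: float = 1.0) -> list:
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    return [p for p in ports if is_port_open(host, p, timeout)]
```

For example, running `open_ports("10.0.0.5", range(5701, 5801))` from node A against node B (the IP address is a placeholder) lists the ports in the 5701-5800 range that currently accept connections.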
Pega Upgrade versions: 843
App server: Tomcat 9.0.41
All 6 VMs are hosted on Azure cloud and run Windows OS.
Any lead or help will be greatly appreciated. Thank you.
Accepted Solution
Updated: 3 May 2022 5:32 EDT
Link Group
AU
@BukhariSahebS Hi All, the issue was that the nodes were in different timezones from the DB timezone. Once all the timezones were brought in sync, the issue was resolved.
Soprasteria
CH
Hi,
Could you sum up the connection pool sizes of all the Tomcat nodes and compare the total against the DB connection pool size?
Ex: If the DB connection pool size is set to 500 and there are 6 Tomcat nodes, each with a connection pool size of 100, the nodes can together request up to 600 connections. This may lead to instability in the connection pool, and nodes may be detached from the cluster without any apparent reason.
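The arithmetic in the example above can be sketched as follows. The function name `pool_headroom` and the node labels are illustrative only; the numbers are the ones from the example (a DB limit of 500 and 6 nodes with a pool size of 100 each).

```python
def pool_headroom(db_max_connections: int, node_pool_sizes) -> int:
    """Return spare DB connections; a negative value means the pools are oversubscribed."""
    return db_max_connections - sum(node_pool_sizes)


# Example figures from the post: DB limit 500, six Tomcat nodes with 100 connections each.
node_pools = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100, "F": 100}
headroom = pool_headroom(500, node_pools.values())
print(headroom)  # -100: the nodes could together ask for 100 more connections than the DB allows
```

A negative headroom does not guarantee failures, since the pools are only exhausted under load, but it is a quick sanity check before digging into the logs.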
Link Group
AU
Thank you for your response.
We are not running the application. As soon as the app server starts up, this behaviour is observed even if we keep the application idle for, say, 30 minutes. The pool size would only come into the picture if there were contention on the thread pool. I don't see such errors in the logs, nor do I see threads waiting on the DB.
Lantiqx
IN
@BukhariSahebS I also see that your cluster gets disturbed once the final node (6) joins the cluster, where, based on the logs, it is unable to get some pool connections. You can check that by turning off the last node to see if the other 5 are OK (a kind of trial and error) and inspect this issue further.
I still wonder why all the other nodes also drop if one fails to join.
Capgemini
IN
To quickly check and bring the cluster up and running, as Konal mentioned, you can shut down all 6 nodes (and reconfigure the node types onto the nodes that will be up), then restart a single node and see if the cluster is healthy.
If the environment is stable with one node running, add one node at a time and repeat the same process.
Once the environment is up and running, we can try to debug the possible changes that have gone into the nodes (DSS changes, hotfix updates, infra upgrades or updates).
If the cluster was stable for a long time and the issue started recently, then check whether any deployment or infra update caused it. If there is a DB backup from before any such activity, we can restore the DB and check if the environment is stable.
Accenture
AU
Thanks, this helped us. We are running a cluster of Windows nodes and Azure DB. Our nodes were in our local timezone, but Azure DB was UTC (and cannot be changed). After changing our node timezones to UTC, our issues were fixed.
I suspect the issue was that the heartbeat timestamp, written using the DB timezone, was being compared with the timestamp of each node, which was in a different timezone. Since the timestamps were always off by several hours, node entries were being removed from pr_sys_statusnodes.
Other databases like Oracle have the ability to set the DB timezone, but Azure DB is always locked to UTC. I guess this means that if you run off Azure DB, or any DB that's always UTC, then your nodes need to be UTC too, which is a little annoying because your logs are then always in UTC.
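The suspected mechanism can be illustrated with a small sketch. This is not Pega's actual heartbeat logic: the function `looks_stale`, the 5-minute threshold, and the UTC+10 node offset are all assumptions chosen for illustration. It only shows how comparing two wall-clock timestamps taken at the same instant, one in UTC and one in local time, makes a fresh heartbeat look hours old.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness threshold; the real value used by the platform is not known here.
STALE_AFTER = timedelta(minutes=5)


def looks_stale(heartbeat_wall: datetime, node_now_wall: datetime,
                stale_after: timedelta = STALE_AFTER) -> bool:
    """Compare two naive wall-clock timestamps, ignoring which timezone produced them."""
    return abs(node_now_wall - heartbeat_wall) > stale_after


# One and the same instant, seen through different clocks.
instant = datetime(2022, 5, 3, 9, 0, tzinfo=timezone.utc)
db_wall = instant.replace(tzinfo=None)            # DB writes UTC wall-clock time
local_tz = timezone(timedelta(hours=10))          # assumed node timezone (UTC+10)
node_wall_local = instant.astimezone(local_tz).replace(tzinfo=None)
node_wall_utc = instant.astimezone(timezone.utc).replace(tzinfo=None)

print(looks_stale(db_wall, node_wall_local))  # True: a fresh heartbeat appears 10 hours off
print(looks_stale(db_wall, node_wall_utc))    # False: with the node on UTC the clocks agree
```

Under these assumptions, moving the nodes to UTC removes the constant offset, which matches the fix described above.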