Question
XYZ
IT
Last activity: 15 Sep 2023 6:55 EDT
Progressive deterioration of CPU usage across web nodes
Hello all,
My team and I are experiencing CPU problems, and after months of analysis we still haven't figured out what is causing them.
Let me explain a bit of our application architecture:
We currently have two main applications; let's call them X and Y.
Application X is used mostly by REST services: we have a front end that invokes Pega's APIs. It is built on Pega Platform.
Application Y is built on top of the CS framework and application X.
Now let's go to the problem:
For the last couple of months we have been having issues with CPU spikes across web nodes.
After a restart everything works fine, but a couple of days later we start to notice the first CPU spikes on just one or two nodes. As the days pass, the nodes that were first to show spikes show much bigger spikes, and other nodes start to show degradation.
The latest evidence shows that application Y, the one built on Customer Service, is causing the spikes.
What we have done so far:
- Checked events in PDC one minute before and one minute after the spikes -> no luck so far; I even had help from an LSA, who confirmed that the events in PDC are not related to the spikes
- Thread dump: there was some evidence of slow JDBC messages, but it doesn't seem to be strictly related to the CPU spikes
- Checked heap: heap values were within normal ranges; no indicators of a memory leak
- Checked the load balancer to see if traffic is unevenly distributed
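For the thread dump step, one common way to tie a dump to a CPU spike is to match per-thread CPU figures from the operating system (for example from `top -H`) against the native thread ids in the dump. Below is a minimal sketch of that matching, not something we have in place. It assumes HotSpot-style `jstack` lines containing `nid=0x...` (some JDK versions may print the id in decimal, which the sketch also accepts). The function names, thread names, and sample values are hypothetical.

```python
import re

# Matches the thread name and native thread id in a HotSpot-style jstack line,
# e.g.: "http-nio-8080-exec-7" #71 daemon prio=5 ... nid=0x1a2b runnable
THREAD_LINE = re.compile(r'^"(?P<name>[^"]+)".*?\bnid=(?P<nid>0x[0-9a-fA-F]+|\d+)')


def threads_by_native_id(jstack_text):
    """Map native thread id (as int) to thread name from jstack output."""
    result = {}
    for line in jstack_text.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            # int(..., 0) accepts both "0x1a2b" and a plain decimal id
            result[int(match.group("nid"), 0)] = match.group("name")
    return result


def hottest_threads(cpu_by_tid, jstack_text, limit=5):
    """Return (thread name, cpu %) pairs for the busiest OS threads."""
    names = threads_by_native_id(jstack_text)
    ranked = sorted(cpu_by_tid.items(), key=lambda item: item[1], reverse=True)
    return [(names.get(tid, "<not in dump>"), cpu) for tid, cpu in ranked[:limit]]


# Hypothetical sample data, for illustration only (not taken from our nodes).
SAMPLE_DUMP = '''
"http-nio-8080-exec-7" #71 daemon prio=5 os_prio=0 tid=0x00007f3a1c0d1000 nid=0x1a2b runnable [0x00007f39e4bfa000]
   java.lang.Thread.State: RUNNABLE
"GC Thread#0" os_prio=0 tid=0x00007f3a1c02e800 nid=0x1a01 runnable
'''
SAMPLE_CPU = {0x1A2B: 97.3, 0x1A01: 12.0, 0x9999: 1.5}  # OS thread id -> CPU %

if __name__ == "__main__":
    for name, cpu in hottest_threads(SAMPLE_CPU, SAMPLE_DUMP, limit=3):
        print(f"{cpu:6.1f}%  {name}")
```

The dump and the per-thread CPU figures need to be captured at roughly the same moment during a spike for the match to be meaningful.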
I am running out of ideas about what to do next. Does anyone have an idea what else could cause these issues?
Maybe some loops inside the code? But that doesn't make sense, since the degradation of CPU usage happens gradually.
I've attached a screenshot of CPU usage by JVM from October 26th until November 7th, in which you can see the progressive deterioration of the web nodes.
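To put a number on the gradual degradation rather than judging it from a chart, one option is to fit a straight line to each node's daily peak CPU since the last restart and compare the slopes. This is only a sketch under assumptions: it expects one peak CPU percentage per day per node, exported from whatever monitoring is available, and the node names and values in the usage example are made up.

```python
def slope_per_day(daily_peaks):
    """Least-squares slope of daily peak CPU (%) against the day index."""
    n = len(daily_peaks)
    if n < 2:
        raise ValueError("need at least two daily values")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_peaks) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_peaks))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def rank_nodes_by_drift(peaks_by_node):
    """Sort nodes by how fast their daily peak CPU grows, steepest first."""
    slopes = ((node, slope_per_day(peaks)) for node, peaks in peaks_by_node.items())
    return sorted(slopes, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    example = {"web1": [10, 20, 30, 40], "web2": [50, 50, 50, 50]}
    for node, slope in rank_nodes_by_drift(example):
        print(f"{node}: {slope:+.1f} percentage points per day")
```

A slope near zero with occasional outliers would look more like isolated bursts, while a consistently positive slope on the nodes that degraded first would match the pattern described above.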
***Edited by Moderator Marije to add Capability tags***