Applies to Pega Platform™ versions 8.1.6, 8.1.9, 8.6.1, 8.6.2; and Pega Cloud versions 2.19.3 and 2.21.2
Prerequisites
Symptoms
Errors
Explanation
Suggested Approaches for Defining Root Cause
Solutions
Scenarios
Related Content
Prerequisites
Read the following collections of articles to understand application and system performance and the tools and techniques for monitoring the health of your system, especially using Predictive Diagnostic Cloud (PDC):
Predictive Diagnostic Cloud (PDC)
Node status and resources data
Symptoms
Your Pega deployment shows degraded performance, with one or more of the following symptoms:
- Web node shows an unusual increase in the requestor count graph
- Heap use on a Stream service node reaches 97 percent while the system remains responsive and there is no memory leak
- High Swap Usage alert on a web node indicates high system memory use on one node that can degrade performance on other nodes in the cluster
Errors
Performance degradation is evident from indicators of increased requestor counts and increased memory use on web nodes, and increased heap use on Stream service nodes.
Explanation
Unusual increases in the requestor count graph are expected from the load balancer under the following conditions:
- Unusual traffic or a higher number of active users during the session
- Passivation, which allows a requestor to be saved into storage and reactivated later
- A requestor remains in memory and requestor processing continues normally until the page is needed (for example, for read access or for an update) and is reactivated into memory. When the user, thread, or page is inactive, it is passivated, including the associated clipboard pages. If the requestor remains idle, eventually the entire requestor context is passivated.
- Caching from data load after Pega deployment update
Unusual increases in heap use on a Stream service node can occur when load balancing is needed.
A High Swap Usage alert might be a sign that the web node is experiencing memory pressure.
Suggested Approaches for Defining Root Cause
- Use Predictive Diagnostic Cloud (PDC) to gauge resource use.
- Select one node to capture a detailed level of usage metrics.
- Be aware that heap use of 90% is not unusual.
- Obtain a Garbage Collection (GC) graph to check the throughput; anything above 97% is considered good.
- Use the Event Viewer to check for alerts PEGA0001, PEGA0004, PEGA0005, and Connect time. See Event Viewer landing page in Pega Predictive Diagnostic Cloud.
- Understand and use Pega alerts from PDC: List of performance and security alerts in Pega Platform
- Understand and use PDC node health indicators and status thresholds.
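The GC throughput check in the list above can be reproduced by hand from a verbose GC log. The following is a minimal sketch, assuming that throughput means the share of elapsed time that the JVM was not paused for garbage collection; the function name and the sample pause values are hypothetical and are not taken from PDC.

```python
def gc_throughput_percent(pause_seconds, elapsed_seconds):
    """Return the percentage of elapsed time not spent in GC pauses."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_pause = sum(pause_seconds)
    if total_pause > elapsed_seconds:
        raise ValueError("total pause time cannot exceed elapsed time")
    return 100.0 * (1.0 - total_pause / elapsed_seconds)


# Hypothetical sample: three GC pauses observed in a 10-minute window.
pauses = [1.0, 2.0, 3.0]
throughput = gc_throughput_percent(pauses, 600.0)
print(f"GC throughput: {throughput:.1f}%")

# The 97% threshold comes from the guidance above.
print("good" if throughput > 97.0 else "investigate")
```

A value below the threshold suggests that the node spends too much time in garbage collection, which is a reason to look more closely at heap sizing on that node.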
Solutions
Depending on your specific application performance scenario and its root cause, try one or more of the following solutions.
Heap on web node
If your application performance issue is related to the heap on a running node, take one or both actions:
- Reduce the heap
- Increase the number of nodes
See OPS0024: Java heap out of memory.
If a High Swap Usage alert on a web node threatens the health of other nodes in the cluster, take a heap dump on the node before restarting the node. Then verify that the other nodes in the cluster are healthy after the node is restarted.
Heap on Stream service node
If your application performance issue is related to the heap on the Stream service, use Garbage Collection (GC) analysis in PDC to decide how much to increase the stream size.
If unusual heap use on the Stream service node suggests that load balancing is required and memory leak is ruled out, increase the heap storage on the Stream service node to balance the load.
See the following articles:
Verbose logging of garbage collection operations
Garbage collection and memory management
PEGA0028 alert: GC cannot reclaim memory from memory pools
Non-uniform requestor count
If your application performance issue is related to requestor management, perform a detailed analysis of the requestor count.
See the following articles and their related articles:
Tracking system utilization for a requestor session with Performance Analyzer
Tracking rule utilization for a requestor session with Performance Profiler
Generating requestor reports for system-wide usage from the Log-Usage class
Managing Requestor Type data instances
PEGA0030 alert: The number of requestors for the system exceeds limit
Configuring your system for passivation and activation
Scenarios
The scenarios described in this section are typical of how application performance issues can manifest themselves. The solutions provided were successful for clients who reported the application performance scenarios. Both the scenarios and the solutions are examples that might help you troubleshoot your own application performance issues related to web nodes in clustered environments.
Scenario 1 Non-uniform node in requestor count
Scenario 2 Node was non-uniform in heap utilization graph
Scenario 3 Web node needs restart - High Swap Usage Alert received
Scenario 4 Non-uniform web node with increase in requestor count
Scenario 5 Increase in requestor count after update
Cloud Change (CC) request scenarios
Scenario 1 Non-uniform node in requestor count
A spike in the requestor count graph is detected. This can occur when the following conditions exist:
- Unusual traffic or an increased number of active users during the session
- Passivation, which allows a requestor to be saved into storage and reactivated later
- A requestor remains in memory and requestor processing continues normally until the page is needed (for example, for read access or for an update) and is reactivated into memory. When the user, thread, or page is inactive, it is passivated, including the associated clipboard pages. If the requestor remains idle, eventually the entire requestor context is passivated.
To rule out other causes of the spike in the requestor count graph, check the logs, alerts, system performance, and any errors that might affect the application or system during the time frame of the reported requestor spike.
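When comparing nodes, it can help to state explicitly what counts as a non-uniform node. The following is a minimal sketch, assuming that per-node requestor counts have been read from the PDC graph by hand; the node names, the counts, and the ratio of 2.0 are hypothetical and are not PDC thresholds.

```python
from statistics import median


def non_uniform_nodes(requestor_counts, ratio=2.0):
    """Return the nodes whose requestor count exceeds ratio times the cluster median."""
    if not requestor_counts:
        return []
    baseline = median(requestor_counts.values())
    return sorted(
        node for node, count in requestor_counts.items()
        if count > ratio * baseline
    )


# Hypothetical requestor counts for a four-node web tier.
sample = {"web-1": 120, "web-2": 135, "web-3": 410, "web-4": 128}
print(non_uniform_nodes(sample))
```

The median is used as the baseline so that a single spiking node does not inflate the value that it is compared against.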
If a future business requirement calls for a higher requestor count, check the requestor pool configuration parameters (maximum idle requestors, maximum active requestors, and maximum wait) and change the values to match the requirement. See Tuning a requestor pool configuration.
Scenario 2 Node was non-uniform in heap utilization graph
Heap use on one of the Stream service nodes occasionally reaches 97 percent, while the system remains responsive.
Because the root cause was not a memory leak but increased load, 2 GB of Heap memory was added to the Stream service nodes to accommodate the load. The Heap setting was originally 8 GB. With the additional 2 GB, the Heap memory for each Stream service node increased to 10 GB.
See the following articles:
Status parameters of Stream nodes
Scenario 3 Web node needs restart - High Swap Usage Alert received
High system memory usage on one web node can have a negative effect on other nodes in the cluster. Therefore, the node was restarted to avoid a possible outage in the cluster.
Scenario 4 Non-uniform web node with increase in requestor count
PDC shows a spike in requestor count on web nodes.
A detailed check of the values indicates that the spike was a false positive. The Rules Log and the CPU and memory graphs show no deviation. There was no issue in the production environment. The root cause is apparently a defect or configuration issue in the operating environment.
Scenario 5 Increase in requestor count after update
After the Pega environment was updated to Pega Platform version 8.6, the requestor count increased. No new agents had been added to the system. The root cause was determined to be data loading from the cache after the update. Therefore, the cache was cleared and the complete cluster was restarted to return the requestor count to normal.
Cloud Change (CC) request scenarios
Requests for Cloud Changes can be Platform Changes, which affect artifacts in the Pega Cloud environment that are not a part of the Pega Cloud Service Catalog, or Capacity Changes, which affect platform configuration that is not a part of the Pega Cloud Service Catalog.
See Managing your environments in My Pega Cloud portal, which includes Visualizing your application performance.
A Cloud Change will provide the following guidance, according to its type.
Platform Change
Platform Changes include requests to modify files or folders in the Pega Cloud environment, to change platform configuration (for example, kernel parameters) or services, and to install or change software that is not a part of the Pega Cloud Service Catalog. Depending on the nature of the Platform Change request, written authorization by your Security Contact may be required. Provide all the details necessary to implement the change. Supporting Documents Requested: Security Contact approval.
Capacity Change
Capacity Changes include requests to change platform configuration (for example, storage or memory) that is not a part of the Pega Cloud Service Catalog. Depending on the nature of the Capacity Change request, written authorization by your Security Contact may be required. Capacity Changes that fundamentally change the Pega Cloud services delivered, such as increasing memory or adding new nodes, may also require approval by the Pegasystems service delivery team. Provide all the details necessary to implement the change. Supporting Documents Requested: Security Contact approval.
Modify the heap allocation 73GB to 64GB
Add two new web nodes
Add 10% space to database
Increase temp file limit 10GB in Prod1
Modify the heap allocation 73GB to 64GB
A Cloud Change (CC) request asked for the heap allocation for the App tier to be decreased from 73GB to 64GB.
In addition to decreasing the Heap allocation in the App tier, the following changes were made in the Web tier:
- Updated the Xms and Xmx parameters to specify -Xms65536m -Xmx65536m
- Added the following JVM parameters: -XX:+UseStringDeduplication -XX:+PrintAdaptiveSizePolicy -XX:+PrintHeapAtGC
- Changed the value of the parameter PegaFleetMaxReplacedInParallel from 1 to 3
As part of this change, only the web tier was restarted, in a rolling fashion (in batches of three nodes). The remaining nodes were not restarted.
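The Xms and Xmx values in this change are the 64GB heap size expressed in megabytes. The following is a minimal sketch of that conversion, assuming 1 GB = 1024 MB and that the initial and maximum heap sizes are set to the same value, as they were in this change; the function name is hypothetical.

```python
def heap_flags(heap_gb):
    """Build matching -Xms and -Xmx JVM options for a heap size given in whole GB."""
    if not isinstance(heap_gb, int) or heap_gb <= 0:
        raise ValueError("heap_gb must be a positive whole number")
    heap_mb = heap_gb * 1024  # 1 GB = 1024 MB
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


print(heap_flags(64))  # -Xms65536m -Xmx65536m
```

Setting both options to the same value keeps the JVM from resizing the heap at run time.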
Add two new web nodes
In one case, two web tier nodes were added in lower environments first, then added to the production environment for a total of 18 nodes. The node count was updated manually in the production environment to run with 18 web nodes. Later, this change was made permanent. If needed, the rollback plan was to decrease the web tier nodes to 16.
In another case, the web node count was increased from 14 to 16 and all nodes were verified as healthy. Two web nodes were terminated to start the replacement nodes. If needed, the rollback plan was to reduce the web node count to 14.
Add 10% space to database
Alerts indicated that a cloud system was running low on database storage space. To avoid a system outage, Pega support engineers added 600 GiB of space to the database storage, increasing it from 6853 GiB to 7550 GiB.
To avoid future alerts, clean up your database. Monitor database space consumption in PDC. Contact your Pega Account Executive to purchase additional storage.
See Database statistics in Pega Predictive Diagnostic Cloud.
Increase temp file limit 10GB in Prod1
An emergency issue with database query failures was seen in the production environment only. The solution that was provided avoided future query failures.
Solution: Increase the database parameter temp_file_limit to 10GB in the production environment.
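If the database is PostgreSQL (an assumption; the source does not name the database engine), temp_file_limit is read as kilobytes when it is given without a unit. The following minimal sketch converts the 10GB limit to that unit; the function name is hypothetical.

```python
def gb_to_kb(size_gb):
    """Convert a size in GB to kB, using 1 GB = 1024 * 1024 kB."""
    if size_gb < 0:
        raise ValueError("size_gb must not be negative")
    return size_gb * 1024 * 1024


# Assumption: the value is entered without a unit and is therefore read as kB.
print(gb_to_kb(10))  # 10485760
```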
Related Content
Data collected by Pega Predictive Diagnostic Cloud
Managing your high availability cluster