Application performance issues related to nodes in clustered environments

Support Doc

MaryCarbonara

Posted: Jul 1, 2022

Last activity: Jul 1, 2022

Applies to Pega Platform™ versions 8.1.6, 8.1.9, 8.6.1, 8.6.2; and Pega Cloud versions 2.19.3 and 2.21.2

Prerequisites
Symptoms
Errors
Explanation
Suggested Approaches for Defining Root Cause
Solutions
Scenarios
Related Content

Prerequisites

Read the following collections of articles to understand application and system performance and the tools and techniques for monitoring the health of your system, especially using Predictive Diagnostic Cloud (PDC):

Performance

Monitoring system health

Predictive Diagnostic Cloud (PDC)

Node status and resources data

Symptoms

Your Pega deployment experiences degraded performance, indicated by one or more of the following symptoms:

A web node shows an unusual increase in the requestor count graph
Heap use on a Stream service node reaches 97 percent while the system remains responsive and there is no memory leak
A High Swap Usage alert on a web node indicates high system memory use on one node, which can degrade performance on other nodes in the cluster

Errors

Performance degradation is evident from various indicators: increased requestor counts and increased memory use on web nodes, and increased heap use on Stream service nodes.

Explanation

Unusual increases in the requestor count graph are expected from the load balancer under the following conditions:

Unusual traffic or a higher number of active users during the session
Passivation, which allows a requestor to be saved into storage and reactivated later
A requestor remains in memory and requestor processing continues normally until the page is needed (for example, for read access or for an update) and is reactivated into memory. When the user, thread, or page is inactive, it is passivated, including the associated clipboard pages. If the requestor remains idle, eventually the entire requestor context is passivated.
Caching from data load after Pega deployment update

Unusual increases in heap use on a Stream service node can occur when load balancing is needed.

A High Swap Usage alert might be a sign that the web node is experiencing memory pressure.

Suggested Approaches for Defining Root Cause

Use Predictive Diagnostic Cloud (PDC) to gauge resource use.
1. Select one node to capture a detailed level of usage metrics.
2. Note that heap use of 90% is not unusual.
Obtain a Garbage Collection (GC) graph to check the throughput. Anything above 97% is considered good.
Use the Event Viewer to check for alerts PEGA0001, PEGA0005, PEGA0004, and Connect time. See Event Viewer landing page in Pega Predictive Diagnostic Cloud.
Understand and use Pega alerts from PDC: List of performance and security alerts in Pega Platform
Understand and use PDC node health indicators and status thresholds.
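
As a rough illustration of the GC throughput check mentioned above, the following Python sketch computes throughput as the share of runtime not spent in GC pauses and compares it with the 97% guideline. The function names and the sample figures are hypothetical; in practice the values would come from a GC log or the PDC graph.

```python
GOOD_THROUGHPUT_PCT = 97.0  # guideline quoted in this article


def gc_throughput(total_runtime_s: float, gc_pause_s: float) -> float:
    """Return the percentage of runtime the JVM spent outside GC pauses."""
    if total_runtime_s <= 0:
        raise ValueError("total_runtime_s must be positive")
    if not 0 <= gc_pause_s <= total_runtime_s:
        raise ValueError("gc_pause_s must be between 0 and total_runtime_s")
    return 100.0 * (1.0 - gc_pause_s / total_runtime_s)


def is_healthy(total_runtime_s: float, gc_pause_s: float) -> bool:
    """True when throughput is above the 97% guideline."""
    return gc_throughput(total_runtime_s, gc_pause_s) > GOOD_THROUGHPUT_PCT


# Hypothetical sample: one hour of runtime with 54 seconds of GC pauses.
print(round(gc_throughput(3600, 54), 2))  # 98.5
print(is_healthy(3600, 54))               # True
```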

Solutions

Depending on your specific application performance scenario and its root cause, try one or more of the following solutions.

Heap on web node

Heap on Stream service node

Non-uniform requestor count

Heap on web node

If your application performance issue is related to the heap on a running node, take one or both actions:

Reduce the heap
Increase the number of nodes

See OPS0024: Java heap out of memory.

If a High Swap Usage alert on a web node threatens the health of other nodes in the cluster, take a heap dump on the node before restarting the node. Then verify that the other nodes in the cluster are healthy after the node is restarted.

Heap on Stream service node

If your application performance issue is related to heap on the Stream service, increase the heap size of the Stream service node based on Garbage Collection (GC) analysis in PDC.

If unusual heap use on the Stream service node suggests that load balancing is required and a memory leak is ruled out, increase the heap storage on the Stream service node to balance the load.

See the following articles:

Monitoring the Stream service

Verbose logging of garbage collection operations

Garbage collection and memory management

PEGA0028 alert: GC cannot reclaim memory from memory pools

Non-uniform requestor count

If your application performance issue is related to requestor management, perform a detailed analysis of the requestor count.
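
One possible starting point for that analysis is to compare each node's requestor count with the cluster median and flag the outliers. The Python sketch below assumes that per-node counts have already been collected (for example, read from the PDC graph); the node names, the sample counts, and the 50% tolerance are hypothetical and not Pega defaults.

```python
from statistics import median
from typing import Dict, List


def non_uniform_nodes(counts: Dict[str, int], tolerance: float = 0.5) -> List[str]:
    """Return nodes whose count deviates from the cluster median
    by more than `tolerance`, expressed as a fraction of the median."""
    if not counts:
        return []
    mid = median(counts.values())
    if mid == 0:
        # With a zero median, any node that has requestors stands out.
        return sorted(name for name, c in counts.items() if c != 0)
    return sorted(
        name for name, c in counts.items() if abs(c - mid) / mid > tolerance
    )


# Hypothetical requestor counts for a four-node web tier.
sample = {"web-1": 120, "web-2": 115, "web-3": 410, "web-4": 125}
print(non_uniform_nodes(sample))  # ['web-3']
```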

See the following articles and their related articles:

Tracking system utilization for a requestor session with Performance Analyzer

Tracking rule utilization for a requestor session with Performance Profiler

Generating requestor reports for system-wide usage from the Log-Usage class

Managing requestors

Managing requestor pools

Managing Requestor Type data instances

PEGA0030 alert: The number of requestors for the system exceeds limit

Configuring your system for passivation and activation

Scenarios

The scenarios described in this section are typical of how application performance issues can manifest themselves. The solutions provided were successful for clients who reported the application performance scenarios. Both the scenarios and the solutions are examples that might help you troubleshoot your own application performance issues related to web nodes in clustered environments.

Scenario 1 Non-uniform node in requestor count

Scenario 2 Node was non-uniform in heap utilization graph

Scenario 3 Web node needs restart - High Swap Usage Alert received

Scenario 4 Non-uniform web node with increase in requestor count

Scenario 5 Increase in requestor count after update

Cloud Change (CC) request scenarios

Scenario 1 Non-uniform node in requestor count

A spike in the requestor count graph is detected. This can occur when the following conditions exist:

Unusual traffic or an increased number of active users during the session
Passivation, which allows a requestor to be saved into storage and reactivated later
A requestor remains in memory and requestor processing continues normally until the page is needed (for example, for read access or for an update) and is reactivated into memory. When the user, thread, or page is inactive, it is passivated, including the associated clipboard pages. If the requestor remains idle, eventually the entire requestor context is passivated.

To rule out other causes of the spike in the requestor count graph, check the logs, alerts, system performance, and any errors that might have affected the application or system during the time frame of the reported requestor spike.

If you have a future business requirement to increase the requestor count, check the requestor pool configuration parameters (maximum idle requestors, maximum active requestors, and maximum wait) and change the values according to the business requirement. See Tuning a requestor pool configuration.
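
To reason about these three settings before changing them, a small model can help. The Python sketch below is only an illustration: the class, its field names, and the sanity rule that idle requestors should not exceed active requestors are assumptions made for this example, not Pega APIs or documented constraints.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RequestorPoolConfig:
    # Hypothetical field names mirroring the three settings named above.
    max_idle: int
    max_active: int
    max_wait_s: int

    def problems(self) -> List[str]:
        """Return basic sanity issues (assumed rules, for illustration only)."""
        issues = []
        if self.max_active < 1:
            issues.append("max_active must be at least 1")
        if self.max_idle < 0:
            issues.append("max_idle must not be negative")
        if self.max_idle > self.max_active:
            issues.append("max_idle exceeds max_active")
        if self.max_wait_s < 0:
            issues.append("max_wait_s must not be negative")
        return issues


def requests_that_must_wait(peak_concurrent: int, cfg: RequestorPoolConfig) -> int:
    """Estimate how many concurrent requests exceed the active limit at peak."""
    return max(0, peak_concurrent - cfg.max_active)


cfg = RequestorPoolConfig(max_idle=5, max_active=10, max_wait_s=30)
print(cfg.problems())                    # []
print(requests_that_must_wait(14, cfg))  # 4
```

If the expected peak regularly exceeds the active limit, that is a hint that the value might need to be raised to meet the business requirement.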

Scenario 2 Node was non-uniform in heap utilization graph

Heap use on one of the Stream service nodes occasionally reaches 97 percent, while the system remains responsive.

Because the root cause was not a memory leak but increased load, 2 GB of Heap memory was added to the Stream service nodes to accommodate the load. The Heap setting was originally 8 GB. With the additional 2 GB, the Heap memory for each Stream service node increased to 10 GB.

See the following articles:

Status parameters of Stream nodes

Monitoring the Stream service

Operating the Stream service

Scenario 3 Web node needs restart - High Swap Usage Alert received

High system memory usage on one web node can have a negative effect on other nodes in the cluster. Therefore, the node was restarted to avoid a possible outage in the cluster.

Scenario 4 Non-uniform web node with increase in requestor count

PDC shows a spike in requestor count on web nodes.

A close, detailed check of the values indicates that the spike was a false positive. The rules log and the CPU and memory graphs show no deviation. There was no issue in the production environment. The root cause is apparently a defect or configuration issue in the operating environment.
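
The cross-check in this scenario can be sketched as a comparison of each metric against its baseline for the spike window. In the Python example below, the metric names, the sample values, and the 20% deviation threshold are all hypothetical. A spike in requestor count with no matching deviation in CPU or memory is the pattern that pointed to a false positive here.

```python
from typing import Dict, Tuple


def deviation_pct(baseline: float, observed: float) -> float:
    """Percentage change of `observed` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (observed - baseline) / baseline


def deviating_metrics(
    metrics: Dict[str, Tuple[float, float]], threshold_pct: float = 20.0
) -> Dict[str, bool]:
    """Map each metric name to True when it moved more than `threshold_pct`."""
    return {
        name: abs(deviation_pct(base, seen)) > threshold_pct
        for name, (base, seen) in metrics.items()
    }


# Hypothetical (baseline, observed) pairs for the spike window.
window = {
    "requestor_count": (100.0, 300.0),
    "cpu_pct": (40.0, 42.0),
    "memory_pct": (60.0, 61.0),
}
print(deviating_metrics(window))
# {'requestor_count': True, 'cpu_pct': False, 'memory_pct': False}
```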

Scenario 5 Increase in requestor count after update

After the Pega environment was updated to Pega Platform version 8.6, the requestor count increased. No new agents had been added to the system. The root cause was determined to be caching from the data load after the update. Therefore, the cache was cleared, and the complete cluster was restarted to return the requestor count to normal.

Cloud Change (CC) request scenarios

Requests for Cloud Changes can be Platform Changes that affect artifacts in the Pega Cloud environment that are not part of the Pega Cloud Service Catalog, or they can be Capacity Changes to platform configuration that is not part of the Pega Cloud Service Catalog.

See Managing your environments in My Pega Cloud portal, which includes Visualizing your application performance.

A Cloud Change will provide the following guidance, according to its type.

Platform Change

Capacity Change

Platform Change

Platform Changes include requests to modify files or folders in the Pega Cloud environment, change platform configuration (for example, kernel parameters) or services, and any requests to install or change software that is not part of the Pega Cloud Service Catalog. Depending on the nature of the Platform Change request, written authorization by your Security Contact may be required. Provide all the details necessary to implement the change. Supporting Documents Requested: Security Contact approval.

Capacity Change

Capacity Changes include requests to change platform configuration (for example, storage or memory) that is not part of the Pega Cloud Service Catalog. Depending on the nature of the Capacity Change request, written authorization by your Security Contact may be required. Capacity Changes that fundamentally change the Pega Cloud services delivered, such as increasing memory or adding new nodes, may also require approval by the Pegasystems service delivery team. Provide all the details necessary to implement the change. Supporting Documents Requested: Security Contact approval.

Modify the heap allocation 73GB to 64GB
Add two new web nodes
Add 10% space to database
Increase temp file limit 10GB in Prod1

Modify the heap allocation 73GB to 64GB

A Cloud Change (CC) requested that Heap allocation for the App tier be decreased from 73GB to 64GB.

In addition to decreasing the Heap allocation in the App tier, the following changes were made in the Web tier:

Updated the Xms and Xmx parameters to specify

-Xms65536m -Xmx65536m

Added the additional JVM parameters:

-XX:+UseStringDeduplication -XX:+PrintAdaptiveSizePolicy -XX:+PrintHeapAtGC

Changed the value of parameter PegaFleetMaxReplacedInParallel from 1 to 3

As part of this change, only the web tier was restarted in a rolling fashion (in a batch of three nodes). The remaining nodes were not restarted.
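
For reference, the JVM `m` suffix counts units of 1024 × 1024 bytes, so the 65536m value above is 64 × 1024. The Python sketch below shows that arithmetic and how a rolling restart in batches of three might be grouped. Both helper functions and the node names are hypothetical and are not part of any Pega tooling.

```python
from typing import List, Sequence


def heap_flags(heap_gib: int) -> str:
    """Build matching -Xms/-Xmx flags for a fixed-size heap given in GiB."""
    if heap_gib <= 0:
        raise ValueError("heap_gib must be positive")
    heap_mib = heap_gib * 1024
    return f"-Xms{heap_mib}m -Xmx{heap_mib}m"


def rolling_batches(nodes: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split nodes into consecutive restart batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(nodes[i:i + batch_size]) for i in range(0, len(nodes), batch_size)]


print(heap_flags(64))  # -Xms65536m -Xmx65536m

# Hypothetical web tier of seven nodes, restarted three at a time.
web_nodes = [f"web-{n}" for n in range(1, 8)]
for batch in rolling_batches(web_nodes, 3):
    print(batch)
```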

Add two new web nodes

In one case, two web tier nodes were added in lower environments first, then added to the production environment for a total of 18 nodes. The node count was updated manually in the production environment to run with 18 web nodes. Later, this change was made permanent. If needed, the rollback plan was to decrease the web tier nodes to 16.

In another case, the web node count was increased from 14 to 16 and all nodes were verified as healthy. Two web nodes were terminated to start the replacement nodes. If needed, the rollback plan was to reduce the web node count to 14.

Add 10% space to database

Alerts indicated that a cloud system was running low on database storage space. To avoid a system outage, Pega support engineers added approximately 700 GiB (about 10 percent) of space to database storage, increasing it from 6853 GiB to 7550 GiB.

To avoid future alerts, clean up your database. Monitor database space consumption in PDC. Contact your Pega Account Executive to purchase additional storage.

See Database statistics in Pega Predictive Diagnostic Cloud.

Increase temp file limit 10GB in Prod1

An emergency issue with database query failures was seen in the production environment only. The solution provided prevented further query failures.

Solution: Increase the database parameter temp_file_limit to 10GB in the production environment.
