Applies to Pega Platform versions 7.2 through 8.5
Pega Ping is a service that verifies the health of a web node, that is, a node started with -DNodeType=WebUser.
Load balancers can be configured to use this ping service to check a node's health. For example, F5 BIG-IP can use the URL when creating a health monitor, and AWS Elastic Load Balancer can reference the URL in its health check.
In releases prior to Pega Platform version 8.2, Pega Ping is a REST service that performs health checks synchronously (on demand) and returns the status when a request arrives.
In Pega releases prior to Pega Platform version 8.2, checking the health of a Pega node rapidly and reliably is hindered by the following limitations:
- Health Ping service times out even though the node is healthy. This causes the nodes to restart multiple times and destabilizes the entire cluster.
- Health Ping service does not report unhealthy behaviors such as Out of Memory (OOM). OOM might still be raised in third-party code or code that is not handled by Pega Ping node health monitoring.
- Health Ping service checks the health of web node processing only, that is, nodes started with -DNodeType=WebUser. It does not consider the BackgroundProcessing, Stream (including DSM), BIX, Search, and Universal node types.
- Remote tracing of REST services interferes with ping service execution times.
In Pega Platform version 8.2 and later releases, the Pega Ping service is improved to run health checks asynchronously and periodically.
To benefit from improvements to the Pega Ping service, upgrade to the latest Pega Platform release, at minimum Pega Platform version 8.2.
See Keeping current with Pega.
Node health monitoring features provided by Pega Platform version 8.2 and later releases
Frequently asked questions
How can I verify the health of a Pega node?
What does a typical Pega Ping response look like?
How do I get active browser requestor count?
What is the activity that is run by the Pega Ping service?
My node is reporting 'unhealthy'. What do I need to do?
My ping service is returning 500 status (unhealthy), but reviewing the ping JSON or PegaRULES logs does not help me. Whom do I contact?
My ping service health check displays N_CriticalErrorNotification. What does this mean? What do I need to do?
Do the node health checks catch all Out of Memory errors?
What are the typical issues with the Pega Ping service in releases prior to Pega 8.2?
How can I verify the health of a Pega node?
Point your browser to the following URL:
http://<hostName>:<port>/<contextName>/PRRestService/monitor/pingservice/ping
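For example, with a hypothetical host and port and the common prweb context root, the URL looks like this:
http://myhost:8080/prweb/PRRestService/monitor/pingservice/ping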
If the node is healthy, the URL returns a response code of 200.
If the node is unhealthy, the URL returns a response code of 500. See the related questions: I see status code 500: What additional artifacts do I need to collect? and I see status codes indicating a problem, but I do not see any error in the PegaRules log file. Why?
What does a typical Pega Ping response look like?
Pega Platform version 8.2 and later releases
In Pega Platform version 8.2 and later releases, the Pega Ping response looks like this example:
{
  "node_id": "myCustomNodeId",
  "node_type": [ "WebUser", "Stream" ],
  "health": [
    {
      "test_name": "Streamservice-Check",
      "status": "success",
      "last_reported_time": "2018-07-30T20:37:29.656"
    },
    {
      "test_name": "HTML-Stream-Check",
      "status": "success",
      "last_reported_time": "2018-07-30T20:37:29.656"
    }
  ],
  "status": "healthy"
}
Releases earlier than Pega Platform version 8.2
In releases earlier than Pega Platform version 8.2, the Pega Ping response looks like this example:
{
  "duration": "201.293172",
  "count": "-1"
}
How do I get active browser requestor count?
Pega Platform version 7.3.1 and earlier releases
In Pega 7.3.1 and earlier releases, the ping service returned the number of active requestors present on a particular node. Because ping is a synchronous API, computing the requestor count caused performance issues.
Therefore, returning the requestor count was disabled in these releases by setting the disableActiveUserCount Dynamic System Setting (DASS) to true:
Ruleset: Pega-RULES
Setting name: disableActiveUserCount
Setting value: true
Pega Platform version 7.4 and later releases and a Pega Cloud Services environment
In Pega Platform version 7.4 and later releases, you can count the number of active browser requestors by using this REST service:
/PRRestService/monitor/v1/sessions/browser
This REST service returns results only if you enable the maximum limit of concurrent browser sessions and the environment is Pega Cloud Services.
Set a positive value for the cluster/requestors/browser/maxactive setting in the prconfig.xml file. Example:
<env name="cluster/requestors/browser/maxactive" value="200"/>
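With the limit enabled, the requestor-count endpoint follows the same URL pattern as the ping service; for example, with the same hypothetical host, port, and context root as above:
http://myhost:8080/prweb/PRRestService/monitor/v1/sessions/browser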
What is the activity that is run by the Pega Ping service?
For releases prior to Pega Platform version 8.2, the Pega Ping service is a REST service, pingService, in the monitor service package, which runs the activity pzGetHealthStatus.
In Pega Platform version 8.2 and later releases, the Pega Ping service does not use the REST infrastructure, and no activity is run. The engine handles ping requests without a requestor context.
My node is reporting 'unhealthy'. What do I need to do?
Pega Platform version 8.2 and later releases
In Pega Platform version 8.2 and later releases, several health checks run to determine the health of a node.
If any of the health checks fail, then the node's health is marked as unhealthy and the URL returns a response code of 500.
The Pega Ping response includes the details on the health checks being run and which check failed.
Look at the ping response body (JSON) to see these details.
Releases prior to Pega Platform version 8.2
In releases prior to Pega Platform version 8.2, you see an exception in the logs such as Timed out borrowing service requestor from requestor pool for service package: monitor, or an exception from running the activity pzGetHealthStatus.
In either case, review the PegaRULES logs, which provide more information.
See Understanding logs and logging messages and Understanding the PegaRULES Log Analyzer.
My ping service is returning 500 status (unhealthy), but reviewing the ping JSON or PegaRULES logs does not help me. Whom do I contact?
Go to My Support Portal to submit a support case (INC) for GCS assistance. See My Support Portal Frequently Asked Questions.
If your environment is a Pega Cloud environment, in My Support Portal, select My Pega Cloud to understand How Pega keeps your environment current.
The support engineer will work with the Product or Service team that owns the service failing the node health check:
- HTML-Stream-Check owned by the Engine-as-a-Service team
- Streamservice-Check owned by the Streaming and Large-scale Processing team
- StaleThreadHealthCheck owned by the Decisioning & Analytics team
- ServiceRegistry-Check owned by the Data Sync and Caching team
My ping service health check displays N_CriticalErrorNotification. What does this mean? What do I need to do?
N_CriticalErrorNotification is reported by a health check notification when a critical error occurs on the node, usually an Out of Memory (OOM) error. You need to determine the root cause of the OOM error. See the answer to the next question.
Do the node health checks catch all Out of Memory errors?
The Pega Ping service returns an unhealthy status for a node when an Out of Memory (OOM) error occurs on the node; usually, an OOM error marks the node as unhealthy.
However, OOM errors occurring in third-party JAR files are not caught by the node health checks. Because of this limitation, the node health checks catch only about 70 to 80 percent of OOM errors.
When OOM occurs, the Pega Ping response looks like this example:
{
  "node_type": [ "WebUser" ],
  "health": [
    {
      "last_reported_time": "2020-08-07T22:27:28.424Z",
      "test_name": "HTML-Stream-Check",
      "status": "success"
    },
    {
      "test_name": "N_CriticalErrorNotification",
      "status": "failure",
      "last_reported_time": "2018-07-30T20:37:29.656"
    }
  ],
  "state": "unhealthy",
  "node_id": "10.150.69.32_envblr85-web-3"
}
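In this example, the N_CriticalErrorNotification check reports a failure, so the node is marked unhealthy even though the HTML-Stream-Check succeeded.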
What are the typical issues with the Pega Ping service in releases prior to Pega Platform version 8.2?
Prior to Pega Platform version 8.2, you might encounter the following issues:
- The service times out borrowing a requestor: Timed out borrowing service requestor from requestor pool for service package.
- The ping service does not report an unhealthy node when an OOM error occurs.
In Pega Platform version 8.2, the Pega Ping service timeout is fixed and, most of the time, OOM errors mark the node as unhealthy. However, OOM errors occurring in third-party JAR files are not caught by the node health checks. Because of this limitation, the node health checks catch only about 70 to 80 percent of OOM errors.
Node health monitoring features provided by Pega Platform version 8.2 and later releases
With Pega Platform version 8.2 and later releases, reliable monitoring and reporting of node health is afforded by the following improvements:
- All node health checks are run asynchronously and periodically. You can keep or adjust the default settings.
- Every health check must be completed within the configured time. The default value is 5 seconds. When a health check exceeds the specified time, the health check fails.
- Results of all health checks are aggregated in one place after they run.
- Each check result has an expiry time. For a particular health check, if the results are not updated within the specified time, for example 60 seconds, then the health check fails and the node is reported as unhealthy. This detects whether there is a problem in the background job itself.
- Each component specifies its own health checks and registers them with the Health monitor component.
- Components can specify the NodeType for which the check needs to run during health check registration.
- Only health checks that are registered for the current node type are picked for processing.
- Engine components can publish health events by specifying the event and the event handler. These results do not expire during result aggregation.
- When a ping request comes from the client, the status of all health checks is aggregated, and the final health status is sent with the JSON response.
- If you encounter an issue with any of the health checks, you can disable those checks using the Data-Admin-System-Setting (DASS) identified in Settings. The disabled health checks are not run in the next cycle of monitoring.
Settings
Keep or adjust the default settings for monitoring the health of Pega system nodes:
Setting Name | Type | Default Value | Description
--- | --- | --- | ---
monitor/health/monitorInterval | prconfig.xml | 15 (seconds) | Health monitor daemon interval in seconds
monitor/health/checkTimeout | prconfig.xml | 5000 (milliseconds) | Health monitor check execution timeout in milliseconds
monitor/health/statusTimeout | prconfig.xml | 120 (seconds) | Health monitor status expiration in seconds
monitor/health/disableChecks | prconfig.xml | None | Disables specific health checks dynamically
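To adjust these values, you can add the settings to the prconfig.xml file in the same <env> format as other prconfig settings. The following sketch shows the timing entries with their default values from the table above:
<env name="monitor/health/monitorInterval" value="15"/>
<env name="monitor/health/checkTimeout" value="5000"/>
<env name="monitor/health/statusTimeout" value="120"/>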
You can create Dynamic System Settings (DSSes) from the prconfig.xml settings by following the procedure in Creating a dynamic system setting.
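For example, to disable a specific check dynamically, you might create a DSS that follows the prconfig-to-DSS naming convention; the check name shown here is taken from the test_name field of the ping response, and the exact value format (a single name versus a comma-separated list) is an assumption for illustration:
Ruleset: Pega-Engine
Setting purpose: prconfig/monitor/health/disableChecks/default
Setting value: Streamservice-Check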