Externalized services (Hazelcast, Kafka and Search) monitoring

Question

Ganesh Sharma

Member since 2020

3 posts

PEGA

Posted: 2 Jan 2024 10:38 EST
Last activity: 27 Feb 2024 4:50 EST

Closed

Solved

How will an application behave in the following scenarios?

Hazelcast is down.
External Kafka nodes are down, or replicas are corrupted.

Have any new DB tables been introduced to monitor Hazelcast (HZ) nodes, with cluster information such as cluster_ID and node information?
If a node is in an unhealthy state and the application is recycled, will that node rejoin the same Hazelcast cluster, or will it behave like a split-brain?
How can stream nodes be validated, for example partitions, leaders, and under-replicated and offline counts?
A cluster restart currently follows a set order (Stream nodes first, followed by Search and Web nodes). Should that same process be followed?
Which monitoring tools are recommended for monitoring Hazelcast and Kafka services for on-premises clients?

***Edited by Moderator Marissa to add Capability tags***

Pega Platform 8.8.3

Pega Platform

System Administration

Communications and Media

Lead System Architect

Accepted Solution

Posted: 26 Feb 2024 9:36 EST
Updated: 27 Feb 2024 4:50 EST

Ganesh Sharma

PEGA

replied to Shakes

@Shakes Sharing the information I have gathered.

Hazelcast

How will the application behave while the Hazelcast nodes are down?

If a single Hazelcast node goes down, restarting it makes the node rejoin the cluster, and you should not see any other issues.
If the complete Hazelcast cluster is down, a complete cluster restart (including the Pega tiers) is required.

Were any new DB tables introduced to monitor externalized Hazelcast nodes with cluster information such as cluster_ID and node information?

No, no DB table has been added to monitor Hazelcast node information. We can use the health endpoints to get node information.
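
As a rough illustration of the health-endpoint approach, the sketch below polls one Hazelcast member's REST health check and lists anything that looks wrong. It assumes the member has the REST API and its health-check endpoint group enabled and serves `/hazelcast/health` on port 5701. The host name, the JSON field names, and the helper names are assumptions to verify against your Hazelcast version; none of them are confirmed in this thread.

```python
import json
from urllib.request import urlopen

# Assumed member address; replace with a real Hazelcast member host.
HEALTH_URL = "http://hazelcast-member.example.internal:5701/hazelcast/health"


def fetch_health(url=HEALTH_URL, timeout=5.0):
    """Fetch the JSON health payload from one member over HTTP."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_problems(payload, expected_size):
    """Return a list of problems in a health payload; an empty list means healthy."""
    problems = []
    # Field names below are assumed from the Hazelcast health-check response.
    if payload.get("nodeState") != "ACTIVE":
        problems.append("node state is %s" % payload.get("nodeState"))
    if payload.get("clusterState") != "ACTIVE":
        problems.append("cluster state is %s" % payload.get("clusterState"))
    if not payload.get("clusterSafe", False):
        problems.append("cluster is not reported as safe")
    size = payload.get("clusterSize", 0)
    if size < expected_size:
        problems.append("cluster size %d is below expected %d" % (size, expected_size))
    return problems
```

A scheduled job could call `find_problems(fetch_health(), expected_size=3)` against each member and raise an alert when the list is not empty. A cluster size below the expected value on some members but not others is one possible hint of a split cluster.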

Will the nodes rejoin the same Hazelcast cluster when unhealthy nodes are restarted, or will we see a split-brain issue?

Yes, the nodes will rejoin the same Hazelcast cluster after a node restart.

Which monitoring tools should be used to watch the Hazelcast server?

We can use the same monitoring tools that are used for monitoring the Pega nodes.

Kafka

How does the application behave when external Kafka nodes are down or replicas are corrupted?

The user experience would be the same as with embedded Kafka. We would see slowness in the application when Kafka is down or has offline replicas. Any new messages to be processed by a queue processor (QP) would be stored in the delayed queue table until the Kafka cluster is back online.

How to validate stream services, such as partitions, leaders, and under-replicated and offline counts?

New Relic dashboards can be created for the Kafka service. MSK also already has many dashboard widgets that can be used for monitoring.
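
Whichever dashboard is used, the counts asked about here come down to a few checks on per-partition metadata. The sketch below runs those checks on plain Python data. The `PartitionInfo` type, the `summarize` function, and the sample values are hypothetical; in practice the metadata would come from the Kafka admin tooling or the broker metrics rather than being typed in by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionInfo:
    topic: str
    partition: int
    leader: Optional[int]  # broker id of the leader, None when there is no leader
    replicas: List[int]    # broker ids assigned to hold this partition
    isr: List[int]         # replicas currently in sync with the leader


def summarize(partitions):
    """Count offline and under-replicated partitions and leaders per broker."""
    # A partition without a leader cannot serve reads or writes.
    offline = sum(1 for p in partitions if p.leader is None)
    # Fewer in-sync replicas than assigned replicas means under-replicated.
    under_replicated = sum(1 for p in partitions if len(p.isr) < len(p.replicas))
    leaders_per_broker = {}
    for p in partitions:
        if p.leader is not None:
            leaders_per_broker[p.leader] = leaders_per_broker.get(p.leader, 0) + 1
    return {
        "partitions": len(partitions),
        "offline": offline,
        "under_replicated": under_replicated,
        "leaders_per_broker": leaders_per_broker,
    }
```

Non-zero `offline` or `under_replicated` values are the ones worth alerting on, and a lopsided `leaders_per_broker` map suggests that leadership is not evenly spread across brokers.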

To do a cluster restart, there is currently an order to be followed. Should that same process be followed?

There is no specific restart order after Pega 8.4; nodes can be restarted in any order. However, to avoid failures, it is advisable to start the nodes that accept traffic last.

What should be the default heap memory configuration for externalized Kafka?

Typically, 1 GB is fine, which is also our default in embedded Kafka. The heap memory can be increased to 2 GB in clusters that have over 1000 partitions. This must be evaluated in a lower environment before applying it to production.
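
For self-managed brokers, the sizing rule above could be captured in a small helper like the following sketch. The 1000-partition threshold and the 1 GB and 2 GB values come from the answer above. The function name is hypothetical, and `KAFKA_HEAP_OPTS` is the environment variable that the standard Kafka start scripts read for heap settings, which is assumed to apply to your distribution.

```python
def kafka_heap_opts(partition_count):
    """Return JVM heap flags: 2 GB above 1000 partitions, otherwise 1 GB."""
    heap_gb = 2 if partition_count > 1000 else 1
    # Setting -Xms equal to -Xmx keeps the heap at a fixed size.
    return "-Xms%dG -Xmx%dG" % (heap_gb, heap_gb)


if __name__ == "__main__":
    # Emit a line that could be added to the broker's environment file.
    print('export KAFKA_HEAP_OPTS="%s"' % kafka_heap_opts(1200))
```

Treat the result as a starting point to validate in a lower environment, as noted above, rather than a value to apply directly to production.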

Note: Once Kafka is externalized, we do not need any stream nodes anymore.

Posted: 10 Jan 2024 11:44 EST
Updated: 27 Feb 2024 4:50 EST

MarijeSchillern

MOD

replied to Ganesh Sharma

@Ganesh Sharma can you confirm that you've first gone through the available documentation?

Externalization of services in your deployment

External Kafka in your deployment

External Elasticsearch in your deployment

External Hazelcast in your deployment

Third-party externalized services Deployment Changes FAQs

Monitoring an embedded stream service > Database

Understanding pr_data_stream tables > Nodes_

Externalisation of Kafka service

In Pega 8.8, if Hazelcast is down, it can impact the application's performance and functionality. Features that rely on Hazelcast, such as distributed sessions, decisioning services, and others, may not function as expected. If the external Kafka nodes are down or there is corruption of replicas, it can disrupt the functioning of Queue Processors and Data Flows, as they rely on Kafka for message processing. This can lead to delays or failures in processing queued items. In both cases, Pega has implemented resilience measures to handle such scenarios. However, it's important to monitor and maintain the health of these services to ensure optimal application performance.

In Pega 8.8, the pr_sys_statusnodes table is used to monitor the status of nodes in a Hazelcast cluster. If a node is in an unhealthy state and the application is recycled, the node should rejoin the same Hazelcast cluster, provided the network settings and cluster configuration have not changed. To validate stream nodes, you can use the pr_data_stream_nodes table which contains information about the Kafka cluster, including the list of all known Stream nodes, topics, data partition distribution, and the current controller node. For cluster restarts, it is recommended to follow the same process of restarting Stream nodes first, followed by Search and Web nodes. For monitoring Hazelcast and Kafka services, Pega provides built-in tools like PDC for system health monitoring. For on-premise clients, additional monitoring tools can be used based on the organization's preference and infrastructure, such as Prometheus, Grafana, or any other tool that supports JMX monitoring.

⚠ This is a GenAI-powered tool. All generated answers require validation against the above provided references.

Please check internally for our HZ and Kafka experts if you need more help with this.

Please mark Accept Solution if you're happy with the provided resources

Posted: 22 Feb 2024 11:16 EST

Ganesh Sharma

PEGA

replied to MarijeSchillern

@MarijeSchillern Yes, I did, and I got my answers from the Pega product team. You can mark this closed.

Posted: 22 Feb 2024 22:39 EST

Shakes

Northbridge

replied to Ganesh Sharma

@Ganesh Sharma Hi Ganesh,

Do you mind sharing the answers here? We are also looking for the same information. Thanks.
