Kafka partition instabilities and slowness
We upgraded from Pega 8.4 to 8.8. The previous version used 20 partitions per Kafka topic. On current versions, Pega has decreased the default number of Kafka partitions per topic from 20 to 6, having identified that 6 partitions provide good throughput for processing. In the rare cases where clients have too many messages per second for a Queue Processor to handle, more than 6 partitions may genuinely be needed; the partition count for a specific QP can then be changed through the activity pxAlterStreamPartitions.
If the number of Kafka partitions is high, it can cause partition replication instabilities. This scenario is known to happen when the total crosses roughly 900-1000 partitions per Stream node. To overcome the issue, decrease the partition count so the total stays below 900, which leaves a safety margin.
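To make the arithmetic concrete, here is a minimal sketch of why the default change matters. The topic count below (60) is purely illustrative, not a Pega figure; the 900 threshold comes from the paragraph above.

```python
def total_partitions(topic_count, partitions_per_topic):
    """Total partitions a Stream node must manage across its topics."""
    return topic_count * partitions_per_topic

def is_partition_count_safe(partitions_per_node, safe_limit=900):
    """Replication instabilities are reported around 900-1000 partitions
    per Stream node; stay below the lower bound to be safe."""
    return partitions_per_node < safe_limit

# Assuming an environment with 60 queue-processor topics (illustrative only):
before = total_partitions(60, 20)   # 1200 partitions -> in the danger zone
after = total_partitions(60, 6)     # 360 partitions  -> well under 900
print(is_partition_count_safe(before), is_partition_count_safe(after))
```

Because the total scales linearly with the per-topic count, dropping from 20 to 6 partitions per topic reduces the overall partition load by 70%.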
The DSS that controls the number of partitions per topic is the following:

Owning Ruleset: Pega-Engine
Setting Purpose: prconfig/dsm/services/stream/pyTopicPartitionsCount/default
Value: 6
Once the DSS is properly set to 6 (or confirmed to be unset, since it defaults to 6 on versions after 8.7), follow the steps below:
- Go to Dev Studio > App Explorer, find the class Pega-DM-DDF-Work and click the leaf icon.
- Find the data flow run "DelayedItemsDataFlowService", open it and click the STOP button. This stops delayed items from moving from the Scheduled column (database) to Kafka (Ready to process); immediate items can still be written to Kafka, so this step only reduces the volume of queued messages.
- Make sure there are no Ready to Process items in the Queue Processors landing page in Admin Studio.
- Bring down both Stream nodes.
- Delete the Kafka-data folders from both Stream servers.
- Bring back one Stream node, wait for it to report NORMAL status, then bring up the second Stream node. This sequence ensures full stability while Kafka sessions and partitions are created and resources are distributed across the Kafka cluster.
- Confirm the total number of partitions by clicking NORMAL for each Stream node in the Stream landing page in Dev Studio. You should see roughly a third of the previous total (6 partitions per topic instead of 20).
- Return to the DelayedItemsDataFlowService data flow (following the same navigation as in the first two steps) and start it again using the Restart button.
- Once done, validate that all Queue Processors are working correctly.
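The file-system part of the procedure (deleting the kafka-data folders while both Stream nodes are down) can be sketched as below. The install paths are assumptions for illustration; substitute your actual Stream node directories, and only ever run this with both nodes fully stopped.

```python
import shutil
from pathlib import Path

def purge_kafka_data(stream_node_roots):
    """Delete the kafka-data folder under each Stream node install root.

    Per the steps above, run this ONLY while both Stream nodes are down;
    Kafka recreates the topics and partitions on the next startup.
    """
    removed = []
    for root in stream_node_roots:
        kafka_data = Path(root) / "kafka-data"  # folder name from the steps above
        if kafka_data.is_dir():
            shutil.rmtree(kafka_data)
            removed.append(str(kafka_data))
    return removed

# Illustrative paths only; use your real install directories:
# purge_kafka_data(["/opt/pega/stream-node-1", "/opt/pega/stream-node-2"])
```

Returning the list of removed paths makes it easy to log exactly what was deleted on each node before bringing them back one at a time.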