Issue
Queue processor (QP) failures in the 'DelayedItemsDataFlowService' have been reported in Pega Infinity '23 and later versions. This issue prevents QP items from being processed, resulting in missed Service Level Agreements (SLAs).
Symptoms and impact
The following symptoms are observed:
- QP items are:
  - Scheduled but not processed by the QP with the delayed option.
  - Not picked up at their scheduled time; they remain stuck in the queue.
- Items associated with the delayed QP remain in the 'Scheduled' state and do not move to the 'Ready to Process' stage even after the scheduled time has passed.
- Logs on affected nodes indicate failures in retrieving items from the database. The following PostgreSQL exceptions are observed:

ERROR: current transaction is aborted, commands ignored until end of transaction block Call getNextException to see other errors in the batch.
Problem #2, SQLState 25P02, Error code 0: org.postgresql.util.PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
Problem #3, SQLState 25P02, Error code 0: org.postgresql.util.PSQLException: ERROR: current transaction is aborted, commands ignored until end of transaction block
Problem #4, SQLState 23505, Error code 0: org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "pr_sys_locks_pk"
  Detail: Key (pzinskey)=(PROCESSRESOURCEREADINESS) already exists
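To confirm these symptoms from the database side, you can check for the lock row named in the exception and for delayed items that are still queued. The following is a minimal diagnostic sketch, assuming direct read access to the Pega data schema through the psycopg2 driver; the connection parameters are placeholders, and 'pr_sys_delayed_queue_items' is assumed to be the delayed-items table in your schema, so verify both before running it:

```python
import psycopg2

# Placeholder connection details; point these at the Pega data schema.
conn = psycopg2.connect(host="db-host", dbname="pega",
                        user="readonly", password="...")

with conn, conn.cursor() as cur:
    # The duplicate-key error in the logs names this pzInsKey directly.
    cur.execute("SELECT pzinskey FROM pr_sys_locks WHERE pzinskey = %s",
                ("PROCESSRESOURCEREADINESS",))
    print("Lock row present:", cur.fetchone() is not None)

    # Assumed default table for delayed QP items; confirm in your schema.
    cur.execute("SELECT count(*) FROM pr_sys_delayed_queue_items")
    print("Delayed items still queued:", cur.fetchone()[0])

conn.close()
```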
Root cause
The issue occurs due to locking conflicts on the 'pr_sys_locks' table: duplicate key violations of its 'pr_sys_locks_pk' primary key constraint prevent the system from retrieving delayed or scheduled items from the database for Kafka processing. These locking conflicts stall the input thread of the data flow and prevent QP items from being processed.
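The exception chain in the logs can be reproduced outside Pega. In PostgreSQL, once one statement in a transaction fails (here, the duplicate-key insert, SQLState 23505), every subsequent statement in that transaction fails with SQLState 25P02 until the transaction is rolled back, which is exactly the pattern in Problems #2 through #4 above. The following sketch demonstrates this against a throwaway temporary table, not the Pega schema; the connection parameters are placeholders:

```python
import psycopg2

conn = psycopg2.connect(host="db-host", dbname="scratch",
                        user="dev", password="...")
cur = conn.cursor()

# A stand-in for pr_sys_locks with the same kind of primary key.
cur.execute("CREATE TEMP TABLE demo_locks (pzinskey text PRIMARY KEY)")
cur.execute("INSERT INTO demo_locks VALUES ('PROCESSRESOURCEREADINESS')")

try:
    # Inserting the same key again raises 23505 and aborts the transaction.
    cur.execute("INSERT INTO demo_locks VALUES ('PROCESSRESOURCEREADINESS')")
except psycopg2.errors.UniqueViolation as e:
    print("First failure, SQLState:", e.pgcode)      # 23505

try:
    # Every later statement now fails with 25P02, matching the
    # "current transaction is aborted" lines in the logs.
    cur.execute("SELECT count(*) FROM demo_locks")
except psycopg2.errors.InFailedSqlTransaction as e:
    print("Follow-on failure, SQLState:", e.pgcode)  # 25P02

conn.rollback()            # Clears the aborted state.
cur.execute("SELECT 1")    # Statements succeed again after the rollback.
print("After rollback:", cur.fetchone()[0])
conn.close()
```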
Solution
This issue is planned to be addressed in a future Pega Platform release. Issues are prioritized based on impact, severity, and capacity. The specific release for the fix has not yet been determined. This section will be updated with release details when the fix for this issue is available.
Workarounds
As a workaround, apply the hotfix that corresponds to your Pega Platform version. These hotfixes prevent the data flow input threads from being lost but do not address the root cause.
- HFIX-C2295 on Pega Platform version 24.1.2
- HFIX-C2326 on Pega Platform version 24.2.0
- HFIX-C2382 on Pega Platform version 24.2.1
- HFIX-C2296 on Pega Platform version 24.2.2
OR
Perform the following actions to restart the 'DelayedItemsDataFlowService':
- From Dev Studio, navigate to the App Explorer and search for 'Pega-DM-DDF-Work' instances.
- Open 'DelayedItemsDataFlowService' from the ID column to access the data flow landing page. The data flow can be stopped and restarted from this page.
- Proceed with the restart and accept the prompt to reset statistics.
OR
In multi-node environments with cluster-wide job schedulers, database locking issues can occur. In such environments, use the 'pzDelayedQueueProcessorSchedule' job scheduler as an alternative to the Data Flow-based approach for processing delayed items.
Perform the following actions to enable the alternative path (a verification sketch follows the list):
- Locate or create the DSS 'queueprocessor/delayeditems/dataflowbased/enabled'.
- In the 'Pega-Engine' ruleset, set the value of this DSS to 'false' to disable the data flow-based processing of delayed items.
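To confirm the DSS took effect, you can look for its instance in the database. This is a sketch under two assumptions not stated in this article: that Data-Admin-System-Settings instances are stored in the 'pr_data_admin' table and that the setting name appears in the instance handle. Verify both against your environment, or simply re-open the DSS in Dev Studio instead:

```python
import psycopg2

conn = psycopg2.connect(host="db-host", dbname="pega",
                        user="readonly", password="...")
with conn, conn.cursor() as cur:
    # Assumed storage table for Data-Admin-System-Settings instances.
    cur.execute(
        "SELECT pzinskey FROM pr_data_admin "
        "WHERE upper(pzinskey) LIKE %s",
        ("%QUEUEPROCESSOR/DELAYEDITEMS/DATAFLOWBASED/ENABLED%",))
    for (handle,) in cur.fetchall():
        print(handle)
conn.close()
```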
If the 'pzInsKey' in the exception identifies a job scheduler, configure the job schedulers to use in-memory locking with the following DSS:
- DSS name: job/schedule/lock/method
- RuleSet: Pega-RulesEngine
- Value: Memory
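Once job schedulers use in-memory locking, lock rows for them should stop accumulating in the locks table. The sketch below, under the same placeholder-connection assumption as above, scans 'pr_sys_locks' for any remaining rows; the LIKE filter is a hypothetical example, so substitute the 'pzInsKey' reported in your own exception:

```python
import psycopg2

conn = psycopg2.connect(host="db-host", dbname="pega",
                        user="readonly", password="...")
with conn, conn.cursor() as cur:
    # Hypothetical filter; use the pzInsKey reported in your exception.
    cur.execute("SELECT pzinskey FROM pr_sys_locks "
                "WHERE upper(pzinskey) LIKE %s",
                ("%PZDELAYEDQUEUEPROCESSORSCHEDULE%",))
    rows = cur.fetchall()
    print("Remaining scheduler lock rows:", rows if rows else "none")
conn.close()
```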
Before deploying the change to production, test it in a lower environment and confirm that the job schedulers:
- Execute as scheduled
- Run only on the intended nodes
- Do not log any exceptions (a log-scan sketch follows)
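For the last checklist item, a simple log scan can flag the SQLStates seen in this issue. This sketch only assumes a readable PegaRULES log file; the path is a placeholder for your node's log location:

```python
import re
from pathlib import Path

LOG = Path("/var/log/pega/PegaRULES.log")  # placeholder path
pattern = re.compile(r"SQLState (25P02|23505)")

# Collect every log line that mentions the SQLStates from this issue.
hits = [line for line in LOG.read_text(errors="ignore").splitlines()
        if pattern.search(line)]

print(f"{len(hits)} matching exception line(s) found")
for line in hits[:20]:  # show at most the first 20
    print(line)
```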