Contributed by Michael Cunanan and Iolanda Da Costa Martins
Symptom
Decision Management data flows that read from Cassandra and write to database tables throw many PEGA0005 alerts during the batch data ingestion process. The data flows used in the ingestion process have a source shape referencing a Cassandra data set and a destination shape referencing a database data set. The example scenario provides more detail.
Example Scenario
Your marketing campaign application has four (4) files to be ingested within six (6) hours before the start of the business day.
Your application uses simple Decision Management data flows to update two database tables that contain 65-80M records. The INSERT queries on these tables are very slow. The overall data ingestion process takes seven (7) to eight (8) hours.
You are using about 400GB size of actual database storage capacity. Your application uses only 10 percent of the database capacity out of 4TB with the database running on an m5.2x large instance. To support the IOPS requirement for the ingestion process, you need to provision 6TB database storage with m5.4x large instance.
Increasing the database volume size to minimally support the IOPS requirement for INSERT queries, reducing the thread count to 2 or 1, and increasing the batch size to 6000, provides temporarily performance improvement but does not fully support your data ingestion requirements. Changing the database volume and allocating more storage is expensive.
Explanation
Within your marketing campaign application, the Interaction processing is becoming slow, and the Interaction History (IH) Summary displays the wrong records. This issue impedes your marketing campaigns and prevents strategies from writing offers. Your strategies cannot write offers because live campaigns are affected by the aggregated dataflow not starting after setting materialization. The aggregation does not occur because an IH Summary 4-day rule is present in the IH Summary but not in the IH Summary Landing page.
In this case, the problem occurs because the system did not prevent aggregations for withdrawn IH summary data sets.
Details
Investigation found that the application was reaching the database throughput limit, causing IOPS throttling and slow INSERT queries fired from the database. In this case, it was caused by withdrawn IH DADA records that were still included in the IH data sources and saved to the Withdrawn dataset.
Starting with Pega Platform 8.6, the system no longer allows aggregations to happen for withdrawn IH summary data sets.
Environments
The problem was reported for the following environments:
- Pega Platform 8.7.3 in Pega Cloud® services 2.23
- Pega Platform 8.4.4
Solutions
Best practice: Keep current with Pega
Update to Pega Platform version 8.8 or a later release when it is available.
If you cannot update your Pega deployment and need to continue using a Pega Platform release earlier than 8.8, remove withdrawn data sets from IH Summaries. See the Pega Customer Decision Hub 8.6 and Pega Marketing 8.6 article, Removing withdrawn data sets from interaction history summaries.
Suggested Approach
As an alternative to resolving the database performance issue, complete the following steps to run INSERT queries only (not UPSERT queries):
- Create a subscriber class and a subscriber spine class that are mapped to different tables with same table definition.
- Truncate the spine table.
- Run INSERT queries on the spine table only and at end of dataflow run only: Use an activity to change the database table instance for the customer class to point to the spine table.
- Perform Steps 1 through 3 daily after the data ingestion process has completed.
By doing this, the spine table becomes the customer table, and the customer table becomes the spine table for next day.
Continue this switching procedure every night.
Result: The records being inserted should include the entire set for records for the subscriber or accounts table that you need.
This suggested approach ensures that the system is running INSERT queries only and not UPSERT queries during data ingestion that negatively affects database IOPS. The data ingestion process will run faster, causing less negative impact on database performance.
This issue will be addressed in a future patch release.
Watch the Pega Support Center Pega Platform Resolved Issues. For the future patch release, search for the ISSUE# or descriptive keywords. Watch this Support Document for update information.
Related content
PEGA0005 alert: Query time exceeds limit
Removing withdrawn data sets from interaction history summaries
Troubleshooting interaction history
Ingesting data from Kafka topics into the Decision Data Store