Question
Adqura
IN
Last activity: 8 Sep 2021 10:20 EDT
How do Materialized IH Summary datasets behave in a Production environment?
Hi all,
We plan to use Materialized IH Summary datasets in our application. As per the requirement, we need to read the complete set of IH records (Time Period set to 'All time') with appropriate filter conditions and aggregations, so that each dataset returns a smaller number of IH records. There will be around 50 datasets in the application, and in one strategy we import around 20 datasets to check different contention rules.
In the Production environment, all outbound communications will be done as part of batch execution and all inbound communications will be done via real-time API calls. There will also be some real-time API calls to capture responses from customers. Hence, while a batch is running, we will also be receiving real-time API calls. Since we plan to use Materialized IH Summary datasets, they will be re-aggregated each time a new IH record is inserted.
I have some questions about performance in the Production environment:
- While the batch is running, dataset aggregation will be happening in parallel. How will this impact batch run performance?
- What sort of performance impact would a growing IH cause when using an "Import Interaction History Summary" shape in strategies? (As mentioned above, we use around 20 datasets in a strategy.)
Thanks in advance,
Kiran
Pegasystems Inc.
US
In answer to your questions:
(1) It will affect batch performance to some extent, though the cost should be lower than reading from IH directly to aggregate, as was previously the case.
(2) The goal of IH Summaries is to minimize the cost of a growing IH. To maximize performance, though, see if you can minimize the number of distinct datasets. It is far better to have one dataset covering two use cases, even if that requires further aggregation in the strategy, than to have two distinct datasets.
For example, if you have one dataset with keys Issue, Group, and Name for 30 days, and a second dataset which only needs aggregates on Issue for 30 days, it is better to create one dataset on Issue, Group, and Name covering both use cases. In the strategy, take the aggregates for Issue, Group, and Name, then group by and aggregate on Issue for the second use case.
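To make that rollup concrete, here is a minimal Python sketch (the records, field names, and counts are purely illustrative, not Pega strategy code) of grouping the finer-grained summary down to Issue level:

```python
from collections import defaultdict

# Illustrative rows from a single materialized summary keyed on
# Issue, Group and Name over the last 30 days (hypothetical data).
summary_rows = [
    {"Issue": "Retention", "Group": "Cards", "Name": "GoldCard", "Accepts": 12},
    {"Issue": "Retention", "Group": "Cards", "Name": "PlatCard", "Accepts": 5},
    {"Issue": "Sales",     "Group": "Loans", "Name": "CarLoan",  "Accepts": 9},
]

# The second use case only needs aggregates per Issue, so instead of a
# second dataset we group the finer-grained rows by Issue and sum.
by_issue = defaultdict(int)
for row in summary_rows:
    by_issue[row["Issue"]] += row["Accepts"]

print(dict(by_issue))  # {'Retention': 17, 'Sales': 9}
```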
Note that in 8.6+ we have made it easier to reduce the number of datasets: each aggregate in each dataset can now have its own time range, rather than only allowing one time range for all aggregates per dataset. Let me know if this answer helps.
thanks,
Infosys Limited
US
@Gareth Collins I have a couple of questions related to aggregation. We are on version 8.4.
Currently we have multiple datasets defined (actionaggregatelast30days, 80 days, 180 days), and we have materialized them.
Let's assume we launch a new aggregate dataset for the last 365 days in production. Here are my questions:
- Will it aggregate new transactions only, or will it look at the previous IH records' aggregation and roll them over into it?
- Does it always look at new records for aggregation, or does it also read decision results at the Cassandra level?
- When we did not materialize one dataset's aggregation, why did the statistics restart on the IH materialization page?
Pegasystems Inc.
NL
@LokeshA6534 20 aggregation datasets are WAY too many. This will definitely degrade performance for both real-time and batch decisions. What drives this number?
In terms of changes: when you add or change a summary, the system figures out the shortest time period it needs to replay from IH in order to recalculate the changed summaries. The datasets which were not changed will ignore these replayed records; only the changed ones will consume the replayed data. E.g., in your example you added a summary which needs 365 days' worth of data, so 365 days of data will be replayed and only this dataset will consume it.
If a dataset is not materialized, then when reading the summary the system will simply read the data from IH on request and roll it up according to the aggregation logic.
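As a rough illustration of that on-request path, here is a sketch in plain Python, under the assumption of a simple count aggregate (field names and records are hypothetical, not the actual engine code):

```python
from datetime import datetime, timedelta
from collections import defaultdict

def read_summary_on_request(ih_records, days, now):
    """Non-materialized path: filter raw IH by the time window,
    then roll up at read time (illustrative, not engine code)."""
    cutoff = now - timedelta(days=days)
    counts = defaultdict(int)
    for rec in ih_records:
        if rec["OutcomeTime"] >= cutoff:
            counts[(rec["Issue"], rec["Outcome"])] += 1
    return dict(counts)

# Hypothetical raw IH records.
ih = [
    {"Issue": "Sales", "Outcome": "Accepted", "OutcomeTime": datetime(2021, 9, 1)},
    {"Issue": "Sales", "Outcome": "Rejected", "OutcomeTime": datetime(2021, 6, 1)},
]
# Only the record inside the 30-day window is rolled up.
print(read_summary_on_request(ih, days=30, now=datetime(2021, 9, 8)))
```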
See other entries related to this topic - https://collaborate.pega.com/question/ih-summaries-materialized-data-set-lifecycle.
I hope my answer makes sense.
Cheers,
Ionut
Infosys Limited
US
@rusui Thank you.
The 20 datasets were not my point; that was someone else :)
Just to let you know, we keep IH for only 30 days in Cassandra.
If we launch aggregation for 365 days on day 31, I am assuming it will start aggregating the last 30 days. Will it read from Cassandra or from the IH table?
We have 7 million IH records every day, so in total there will be 7 million * 30 days of records.
Pegasystems Inc.
NL
@LokeshA6534 yes, it will aggregate from the earliest available date in IH, but no earlier than 365 days ago. In your case it will aggregate the last 30 days.
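In other words, the effective replay start is the later of the summary's window start and the earliest data still available in IH; a minimal sketch with illustrative dates:

```python
from datetime import datetime, timedelta

now = datetime(2021, 9, 8)
earliest_in_ih = now - timedelta(days=30)   # IH retention is 30 days
window_start = now - timedelta(days=365)    # the new summary's time period

# The replay cannot start earlier than the data that still exists.
replay_from = max(earliest_in_ih, window_start)
print(replay_from)  # only the last 30 days can actually be aggregated
```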
It is good you're not considering 20 summaries :)
Cheers,
Ionut
Updated: 7 Sep 2021 12:26 EDT
Infosys Limited
US
@rusui Thank you so much. Where does it read IH from in order to aggregate the summary on the Cassandra side? From the RDBMS, or from Cassandra only?
Does it run as a daemon?
Do you see any reason to expect performance issues while aggregating this large volume of data, and if so, how do we mitigate them in a production environment with live traffic reading at the same time?
Just as an FYI, we recently launched a 365-day aggregation including one new attribute which had never been aggregated for any summary rule, and afterwards we saw performance degrade significantly, along with a lot of DDS read alerts for real-time traffic.
Overall strategy execution time increased from 70 ms to 110 ms; after removing this 365-day dataset call from the suppression rule, it came back down to 70 ms.
Pegasystems Inc.
NL
@LokeshA6534 IH data is replayed from the RDBMS and the aggregate is stored in Cassandra. I can see that re-creating the 365-day aggregate out of 210 million records may come at a runtime cost. The DDS alerts indicate that the Cassandra cluster is under stress; if that is indeed the case, consider scaling it out. What version are you on?
The materialization of IH summaries runs on the Dataflow tier, and indeed it is a background job.
You can check how far the replay has progressed under Decisioning / Decisions / Data Sources: expand the summary and you can see how many records it has processed and up to what timestamp.
Gokul Ramalingam
Pegasystems Inc.
US
When a new time-period interval (dataset) is defined, will it trigger the aggregation calculation for all the existing aggregation datasets?
E.g., if we have 180-day and 90-day aggregations set up and materialized, and we now add a new 120-day dataset, will it recalculate all the intervals or just the new one?
Pegasystems Inc.
NL
@Gokul the rule is that recalculations are done ONLY for the changed summaries. Once changes are detected, the Aggregation service calculates the longest period of time, across all the changed summaries, that it needs to replay. It unwinds the aggregation tracker to that position and replays. The summaries which were not changed will simply ignore the data.
In your example you have 90- and 180-day summaries and you now add a 120-day one. The change-detection logic determines it needs to unwind the aggregation tracker by 120 days; it does that and starts to replay IH from 120 days ago. The 90- and 180-day summaries will ignore these records because they were not changed.
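A minimal sketch of that rule (illustrative Python, not the actual Aggregation service; the summary names are made up): the replay window is the longest time period among the changed summaries only, and only changed summaries consume the replayed records:

```python
def replay_window_days(summaries, changed):
    """Unwind the tracker by the longest window among CHANGED summaries only."""
    return max(days for name, days in summaries.items() if name in changed)

# Hypothetical summaries: two existing ones plus a newly added 120-day one.
summaries = {"sum90": 90, "sum180": 180, "sum120": 120}
changed = {"sum120"}

days = replay_window_days(summaries, changed)
print(days)  # 120 -> replay IH from 120 days ago

# During replay, only the changed summaries consume the replayed records;
# the unchanged 90- and 180-day summaries ignore them.
consumers = [name for name in summaries if name in changed]
print(consumers)  # ['sum120']
```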
Makes sense?