Discussion
Pegasystems Inc.
AU
Last activity: 23 Sep 2020 9:09 EDT
Data Page "Aggregate Sources" option: De-duplicating records that exist in multiple systems [LSA Data Excellence]
How best to de-duplicate data when records for the same entity are retrieved from multiple Systems of Record?
For List-structure Data Pages, the Response Data Transform for each Source in your "Aggregate Sources" configuration should append to .pxResults those records it can't filter out without knowing about other data sources. The Post Load Processing Activity* of the Data Page is the right place to perform a sweep of all records added to .pxResults from all Systems of Record and apply de-duplication rules. Over and above removing duplicates, this may include:
- Merging data into a "single view" where Data Source X owns properties A, B and C of Customer Z, whilst Data Source Y owns properties D, E and F of Z.
- Sorting the de-duplicated, merged data into a more natural order. By default, .pxResults is ordered by the sequence in which the Sources in the Aggregate Sources configuration are contacted.
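To make the sweep concrete, here is a minimal sketch of the de-duplicate, merge and sort logic in plain Python. It is not Pega rule syntax, only an illustration of what the Post Load Processing Activity has to achieve. The record shape, the `customer_id` key, the `source` tag on each record and the `OWNERSHIP` table are all assumptions made up for this example.

```python
# Hypothetical ownership table: which source is authoritative for which properties.
OWNERSHIP = {
    "X": ("A", "B", "C"),
    "Y": ("D", "E", "F"),
}

def post_load_sweep(px_results, key="customer_id"):
    """De-duplicate, merge and sort records collated from several sources."""
    merged = {}
    for record in px_results:
        # One merged entry per entity, however many sources returned it.
        entity = merged.setdefault(record[key], {key: record[key]})
        # Copy only the properties that this record's source owns.
        for prop in OWNERSHIP.get(record["source"], ()):
            if prop in record:
                entity[prop] = record[prop]
    # Impose a natural order instead of "order the sources were contacted".
    return sorted(merged.values(), key=lambda r: r[key])

# Illustrative input: Z1 is known to both sources, Z2 only to X.
px_results = [
    {"source": "X", "customer_id": "Z2", "A": 1, "B": 2, "C": 3},
    {"source": "X", "customer_id": "Z1", "A": 4, "B": 5, "C": 6, "D": "stale"},
    {"source": "Y", "customer_id": "Z1", "D": 7, "E": 8, "F": 9},
]
single_view = post_load_sweep(px_results)
```

In this sketch the value of D that Source X happened to return for Z1 is ignored, because X does not own D. The outcome also does not depend on the order in which records were appended.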
For Page-structure Data Pages, you can still append to .pxResults (always available via inheritance from @baseclass) as described above, but be mindful that your Post Load Processing Activity* will need to:
- Settle on one-and-only-one result to map back to Primary (the data page)
- Remove .pxResults once the 'golden' record has been harvested from it
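As a rough illustration of those two steps, the following Python sketch treats the data page as a dictionary. The `source` tag, the `source_priority` rule for choosing the 'golden' record and the property names are assumptions for the example only. Your own arbitration rule may be quite different.

```python
def harvest_golden_record(primary, source_priority):
    """Settle on one record from primary["pxResults"], map it to primary, drop the list."""
    # pop() both reads the candidates and removes the temporary list from the page.
    candidates = primary.pop("pxResults", [])
    if not candidates:
        return primary
    # Hypothetical rule: prefer the source listed earliest in source_priority.
    rank = {name: i for i, name in enumerate(source_priority)}
    golden = min(candidates, key=lambda r: rank.get(r["source"], len(rank)))
    # Map the chosen record back onto the page itself.
    primary.update({k: v for k, v in golden.items() if k != "source"})
    return primary

# Illustrative page: two sources returned the same customer.
primary = {"pxResults": [
    {"source": "Y", "customer_id": "Z", "name": "Zed"},
    {"source": "X", "customer_id": "Z", "name": "Zed Ltd"},
]}
harvest_golden_record(primary, source_priority=["X", "Y"])
```

After the call, the page carries only the properties of the chosen record and no longer has a pxResults list.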
Resist the temptation to map the results from each data source to a separate top-level page. This is a form of tight coupling between each data source's Response Data Transform and the Data Page's Post Load Processing Activity*. If temporary top-level pages are needed (as list members are merged and/or deleted), encapsulate them entirely within the Post Load Processing Activity*.
Resist the temptation to assume that the Response Data Transform DTb for Data Source B always runs after the Response Data Transform DTa for Data Source A. Suppose DTb makes an assumption like
"if the size of pxResults is 0, then Data Source A didn't find any results, which means I - DTb - can assume some stuff"
This not only tightly couples DTb to DTa, but also requires that the Data Page has successfully contacted Data Source A before Data Source B.
Have each of the Response Data Transforms - in both the above warning scenarios - retain responsibility for record mapping and error handling that is specific to working with that data source. Let the Post Load Processing Activity* arbitrate over the collated result set once it has been gathered.
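This division of responsibility can be sketched in Python as follows. The functions `dt_a`, `dt_b` and `arbitrate`, along with the response shapes and field names, are hypothetical stand-ins for the two Response Data Transforms and the Post Load Processing Activity. They are not real Pega APIs.

```python
def dt_a(response, px_results):
    """Stand-in for DTa: mapping and error handling specific to Data Source A."""
    if response.get("error"):
        return  # source-specific error handling stays here
    for row in response["rows"]:
        px_results.append({"source": "A", "customer_id": row["id"], "name": row["fullName"]})

def dt_b(response, px_results):
    """Stand-in for DTb: never inspects what other sources have appended."""
    if response.get("status") != "OK":
        return
    for row in response["customers"]:
        px_results.append({"source": "B", "customer_id": row["custNo"], "email": row["email"]})

def arbitrate(px_results):
    """Stand-in for the Post Load Processing Activity: works on the collated set."""
    merged = {}
    for record in px_results:
        entity = merged.setdefault(record["customer_id"], {})
        entity.update({k: v for k, v in record.items() if k != "source"})
    return sorted(merged.values(), key=lambda r: r["customer_id"])

# Illustrative responses from the two sources.
resp_a = {"rows": [{"id": "Z1", "fullName": "Zed"}]}
resp_b = {"status": "OK", "customers": [
    {"custNo": "Z1", "email": "z@example.com"},
    {"custNo": "Z0", "email": "y@example.com"},
]}

# Run the transforms in both orders; neither depends on the other.
a_then_b, b_then_a = [], []
dt_a(resp_a, a_then_b)
dt_b(resp_b, a_then_b)
dt_b(resp_b, b_then_a)
dt_a(resp_a, b_then_a)
```

Because neither transform looks at the size or contents of the shared list, `arbitrate` produces the same collated result whichever source responds first.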
* Yes, the use of a Post Load Processing Activity will yield a guardrail warning. Your justification is that this rule is required to make business sense of data obtained from multiple sources. A guardrail warning on its own does not mean you should avoid the approach. Trying to fold de-duplication, merging and sorting behavior into Response Data Transforms just to avoid the guardrail warning increases coupling and reduces cohesion, impairing their testability, reusability and understandability.
Discussion on this topic was sought from the LSA Data Excellence (Pega 8.4) webinar conducted in July 2020. The webinar and the full set of discussions that arose from it are available at LSA Data Excellence: Webinar, Questions & Answers.