Question
futureproofAI
US
Last activity: 4 Dec 2019 1:33 EST
How to effectively load Customer Spine data with Cassandra/DDS in Pega Marketing (Cloud)
Hi guys,
We are using Pega Platform 8.1.5 (Pega Cloud) with Pega Marketing 8.1.5.
In our current setup we use the following repositories to store Customer data:
- Customer DDS table (internal Cassandra) - contains 100% of the Customer data
- Customer DB table (PostgreSQL) - contains only the CustomerID field (we do this due to the limitation that Pega Marketing Segments require a relational database table). We might add new fields if required for segmentation.
The (simplified) load process is as follows:
- We get a Customer file in JSON format as our input. It contains a flat structure of customer attributes.
- Using a combination of a dataflow and several datasets, we:
- truncate the Customer DB table
- truncate the Customer DDS table
- Load the JSON file into Pega
- We then write the Customer properties into:
- the Customer DB table
- the Customer DDS table
So, basically, we do a truncate and an insert.
This works for us because it is a simple data load process using dataflows/datasets, and it provides a flexible integration point (the input JSON file can be modified to contain new attributes, which are transparently loaded into Pega).
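To illustrate the flat structure, a record in the input file looks roughly like the sketch below (only CustomerID is a field we have actually described; the other field names are invented for the example):

```python
import json

# Hypothetical flat customer record, mirroring the structure of the input file.
# Only CustomerID is a field mentioned above; the rest are placeholders.
sample_record = {
    "CustomerID": "C-000123",
    "FirstName": "Jane",
    "LastName": "Doe",
    "Email": "jane.doe@example.com",
    "Segment": "Retail",
}

# Adding a new attribute is just adding another key/value pair; the file dataset
# picks it up without structural changes, which is what makes this flexible.
print(json.dumps(sample_record))
```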
At this point, the file is getting pretty big (and it will get bigger) and it takes about 40 mins to load.
The new requirement is to have 24/7 data availability.
We are considering two options to achieve this:
OPTION 1: Changing the load process from "full" to "incremental".
This is the preferred option right now since we believe it would achieve 24/7 availability.
Basically, instead of doing a truncate/insert, we would never truncate and would do an insert/update instead.
We believe this is easily achievable by configuring the datasets appropriately:
- The DB dataset would be set to "Insert new and overwrite existing records"
- The DDS dataset does an insert/update by default.
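To spell out what those two settings mean at the storage level: the DB dataset option amounts to a relational upsert, and a Cassandra INSERT is already an upsert (last write wins). The sketch below is just a standalone illustration of that semantics with hypothetical table, column, and connection details; it is not how the Pega datasets are implemented internally.

```python
import psycopg2                        # illustration only; on Pega Cloud the datasets do the writing
from cassandra.cluster import Cluster

# --- Postgres side: "Insert new and overwrite existing records" behaves like an upsert ---
pg = psycopg2.connect(dbname="pega", user="marketing", host="localhost")  # hypothetical connection
with pg, pg.cursor() as cur:
    cur.execute(
        """
        INSERT INTO customer_ids (customer_id, segment)
        VALUES (%s, %s)
        ON CONFLICT (customer_id) DO UPDATE SET segment = EXCLUDED.segment
        """,
        ("C-000123", "Retail"),
    )

# --- Cassandra side: INSERT is insert-or-update by default, no separate update needed ---
session = Cluster(["127.0.0.1"]).connect("customer_data")  # hypothetical keyspace
session.execute(
    "INSERT INTO customer (customer_id, email) VALUES (%s, %s)",
    ("C-000123", "jane.doe@example.com"),
)
```

One thing to keep in mind with a pure insert/update load is that customers removed from the source file would linger in both stores until we handle deletions separately.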
The question for this option is:
- Assuming we are loading several million records into DDS/Cassandra, taking approximately 1 hour: what is the impact if records are being retrieved from Cassandra to serve the inbound channel whilst the data load is in progress?
- locking issues?
- performance issues?
- any other issues to watch for?
[Note: As suggested by Kevin in the comments below, we will investigate the option of using partitions and splitting the input file into several to take advantage of parallel data flow runs.]
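As a rough sketch of what splitting the input could look like (assuming the file is newline-delimited JSON; the names and partition count below are made up), each chunk would then be fed to its own file dataset and partitioned data flow run:

```python
import json
from pathlib import Path
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical; one output file per parallel data flow run


def partition_of(customer_id: str) -> int:
    # Stable hash so the same customer always lands in the same partition file.
    return crc32(customer_id.encode("utf-8")) % NUM_PARTITIONS


def split_file(source: Path, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = [
        (out_dir / f"customers_part{p}.json").open("w", encoding="utf-8")
        for p in range(NUM_PARTITIONS)
    ]
    try:
        with source.open(encoding="utf-8") as src:
            for line in src:                    # assumes one JSON record per line
                if not line.strip():
                    continue
                record = json.loads(line)
                outputs[partition_of(record["CustomerID"])].write(line)
    finally:
        for handle in outputs:
            handle.close()


split_file(Path("customers.json"), Path("partitions"))
```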
OPTION 2: Introducing the concept of a staging area.
We think this option would get us close to 24/7 availability, but not to 100%.
We could optimize the process by introducing the concept of a typical "staging area".
If we were doing this with old-fashioned ETL, we would load the data into a Customer DB staging table and, once the load is complete, simply rename the tables.
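For context, this is the kind of near-instant, atomic swap we mean (a sketch with hypothetical table names and connection details; it is exactly the sort of direct DDL we cannot run on Pega Cloud):

```python
import psycopg2  # illustration of the classic-ETL swap, not something we can do on Pega Cloud

conn = psycopg2.connect(dbname="pega", user="etl", host="localhost")  # hypothetical connection
with conn, conn.cursor() as cur:
    # In PostgreSQL, DDL is transactional, so readers see either the old table
    # or the fully loaded new one - never a half-loaded state.
    cur.execute("ALTER TABLE customer RENAME TO customer_old")
    cur.execute("ALTER TABLE customer_staging RENAME TO customer")
    cur.execute("DROP TABLE customer_old")
```

Note that Cassandra has no equivalent table rename, so on the DDS side some form of copy (or switching which DDS dataset is referenced) seems unavoidable anyway.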
We are trying to do something similar with datasets, something like:
- Load the input JSON file into Pega using a File DataSet - no changes here
- Write the data (Customer IDs only) into a staging Customer DB table - change, but no issues here
- Write the data (100% of the data) into a staging Customer DDS - change, but no issues here
- Instead of renaming the Customer DB table (and considering that we are on Pega Cloud and don’t have full DB control), we are thinking about creating a dataflow as follows:
- Source component: the staging Customer DB table
- Destination component: the "final" Customer DB table
- Instead of renaming the Customer DDS, we are thinking about creating a dataflow as follows:
- Source component: the staging Customer DDS
- Destination component: the "final" Customer DDS
So, essentially, we would achieve the same result as in a typical ETL world, but moving the data from the staging area to the "final" area would still take a few minutes.
Do these approaches make sense? Any suggestions to improve the process are highly appreciated.
Thank you very much,
Marcelo.