Does the number of assignments created during data flow execution depend on the number of partitions in the source data, or on the thread count and batch scalability factor?

Question

ABHINANDAN

Member since 2011

32 posts

Areteans Technology Solutions

Posted: 22 Jun 2018 1:39 EDT
Last activity: 23 Jul 2018 14:21 EDT

Closed

Solved

I am a bit confused about how data flow processing occurs, specifically how many assignments are created during execution.

The data flow help states the following:

"Specify the number of the Pega 7 Platform threads that are assigned to process running the data flows and the batch scalability factor to use idle threads for running the data flows.

For example, when the source of a data flow is divided into five partitions, the data flow run is divided into five assignments that can be processed simultaneously on separate threads if there are enough threads.

The number of available threads is calculated by multiplying the thread count by the number of nodes. With two nodes and five threads in the system, the data flow run uses five threads and five threads remain idle. After you set the batch scalability factor to two, all 10 threads are used to process five assignments.

Enter the number of threads.

Note: The number of threads for running data flows is the same across all decision data nodes that are configured for the Data Flow service.
Enter the batch scalability factor."

The italicized lines in the data flow help above suggest that the number of assignments depends on the number of partitions.

But the attached PNG file showing the data flow settings states that Number of assignments = No of nodes * Thread Count * Batch scalability factor.

So the question is: which one is correct, how does data flow parallel processing actually happen, and what is the role of partitions, node count, thread count, and batch scalability factor?

Regards

Abhi

Decision Management

Accepted Solution

Posted: 5 Jul 2018 5:16 EDT

TomVanDuist

PEGA

replied to ABHINANDAN

You are absolutely right: the help documentation is incorrect in implying that multiple threads will process the same partition/assignment.

We will amend the help text documentation with the following description on Batch Scalability:

"It is used to calculate the suggested number of partitions to be used in a data flow run, that number is calculated using this formula: numOfNodes * threadCount * scalabilityFactor. Keep in mind that this calculation will only suggest the number of partitions, it's up to the dataset implementation to decide how many partitions will actually be used."

Thank you for finding this.

Best,
Tom
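The formula quoted in the reply above can be written out as a small calculation. This is only an illustrative sketch in Python: the function name `suggested_partitions` and its parameter names are made up for this example and are not part of any Pega API. As the reply notes, the result is only a suggestion, and the data set implementation decides how many partitions are actually used.

```python
def suggested_partitions(num_nodes: int, thread_count: int, scalability_factor: int) -> int:
    """Suggested partition count: numOfNodes * threadCount * scalabilityFactor.

    Hypothetical helper for illustration only; the source data set
    may still choose a different number of partitions.
    """
    if min(num_nodes, thread_count, scalability_factor) < 1:
        raise ValueError("all inputs must be positive integers")
    return num_nodes * thread_count * scalability_factor


# Two nodes, five threads per node, batch scalability factor of two.
print(suggested_partitions(2, 5, 2))
```

With the values from the help text (two nodes, five threads, factor two), the suggestion is 2 * 5 * 2 partitions.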

Posted: 3 Jul 2018 9:18 EDT

PaulGentile_GCS

PEGA

replied to ABHINANDAN

The way I understand it, the number of assignments is determined by the number of partitions. That is not going to change, regardless of the number of nodes and/or threads.

The number of available threads, however, can change, and if it is higher than the number of assignments, then you have to increase the batch scalability factor in order to take advantage of idle threads. This tweaking will not change the number of assignments.

Posted: 3 Jul 2018 10:13 EDT

TomVanDuist

PEGA

replied to PaulGentile_GCS

This is correct; however, I would like to add to it:

When looking at partitions from the point of view of assignments, you can see the total number of partitions as the total number of assignments that need to be picked up by threads during the execution of a data flow run. This is distinct from the number of simultaneous assignments.

The number of partitions is defined by the source; the number of simultaneous assignments is defined by the number of threads * number of nodes (and the batch scalability factor in the case of batch runs).

However, some sources set the number of partitions based on the number of threads. For example, the 'Monte Carlo data set' partitions its data non-deterministically over all threads. It does this by setting the number of partitions to 'nodes * threads', so all partitions are executed at the same time (as opposed to one by one if you have more partitions than threads).
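The distinction between total partitions and simultaneous assignments can be illustrated with a rough model. The sketch below is an assumption-laden simplification, not Pega's scheduler: `run_profile` is a hypothetical helper, thread slots are taken as nodes * threads (the batch scalability factor is left out), and the "waves" figure assumes every partition takes the same time to process.

```python
import math


def run_profile(partitions: int, num_nodes: int, thread_count: int) -> dict:
    """Rough model of how partitions map onto data flow threads.

    Assumption: one thread handles one partition at a time, and the
    number of thread slots is num_nodes * thread_count.
    """
    slots = num_nodes * thread_count
    busy = min(partitions, slots)  # a partition is never shared between threads
    return {
        "slots": slots,
        "busy_at_start": busy,
        "idle_at_start": slots - busy,
        # Lower bound on rounds of work if all partitions take equal time.
        "waves": math.ceil(partitions / slots),
    }


# Source with more partitions than threads: work proceeds in several rounds.
print(run_profile(partitions=25, num_nodes=2, thread_count=5))
# Source sized like 'nodes * threads': everything runs at once.
print(run_profile(partitions=10, num_nodes=2, thread_count=5))
```

In this model a source that sets its partition count to nodes * threads, as described for the Monte Carlo data set, leaves no thread idle and needs a single round.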

Posted: 3 Jul 2018 15:26 EDT

ABHINANDAN

Areteans Technology Solutions

replied to ABHINANDAN

Thanks for clarifying. So that means the hover text "Number of assignments = No of nodes * Thread Count * Batch scalability factor." in the attached screenshot is wrong and probably needs to be corrected by Pega.

Posted: 4 Jul 2018 3:52 EDT

TomVanDuist

PEGA

replied to ABHINANDAN

It is not necessarily wrong: the number of assignments in this case is the number of simultaneous assignments that will be picked up by data flow threads.

Posted: 4 Jul 2018 9:56 EDT

ABHINANDAN

Areteans Technology Solutions

replied to ABHINANDAN

This is still not very clear. For example, if the source has ten partitions, then there will be ten assignments.

Now, if the number of nodes = 6, thread count = 5, and batch scalability factor = 6, then as per your explanation the number of simultaneous assignments will be 6 * 5 * 6 = 180. That is obviously far more than the 10 assignments, so how will it work?

Based on what is mentioned in the data flow help and what PaulGentile_GCS has said, I believe the formula is probably the following:

No of assignments = No of partitions in the source; no argument on this.

No of parallel threads working on the assignments = (batch scalability factor / No of nodes) * (No of nodes * Thread count) = batch scalability factor * Thread count.

Now, if "No of parallel threads working on the assignments" < No of assignments, then not all assignments will be processed simultaneously.

If "No of parallel threads working on the assignments" = No of assignments, then all assignments will be processed simultaneously, each by one thread.

If "No of parallel threads working on the assignments" > No of assignments, then more than one thread will be working on each assignment.

I am not sure whether it is possible to make the batch scalability factor larger than the number of nodes and, if so, how that would work.

Do you think I am arriving at the right conclusion?

Regards

Abhi

Posted: 4 Jul 2018 11:09 EDT
Updated: 4 Jul 2018 11:08 EDT

TomVanDuist

PEGA

replied to ABHINANDAN

Yes, except for the last conclusion.

Multiple threads cannot work on the same assignment.

If, for example, you have a source that does not support partitioning, there will be only one partition. Multiple threads or nodes cannot divide the work among themselves (this is what partitions are used for), so only one thread will execute this data flow.

So: if "No of parallel threads working on the assignments" > No of assignments, then all assignments will be processed simultaneously, each by one thread.

By the way, it would be better to use the term 'partitions' instead of 'No of assignments' in the above statement. It can then be simplified: if "no of assignments" (the number of simultaneous assignments) > partitions, then all partitions will be processed simultaneously, each by one thread.
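The rule that each partition is handled by exactly one thread, leaving surplus threads idle, can be mimicked with an ordinary thread pool. This is an analogy in plain Python using `concurrent.futures`, not Pega's data flow engine; `process_partition` and `run` are hypothetical names, and the numbers (10 partitions, 6 nodes * 5 threads = 30 threads) are taken from the example earlier in this thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition_id: int) -> tuple:
    # Stand-in for real work: report which thread handled this partition.
    return partition_id, threading.current_thread().name


def run(partitions: int, pool_size: int) -> list:
    # Each partition is submitted as a single task, so it is handled by
    # exactly one thread and is never split across threads.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(process_partition, range(partitions)))


results = run(partitions=10, pool_size=30)
workers_used = {name for _, name in results}
print(len(results), "partitions handled by", len(workers_used), "thread(s)")
```

However large the pool, the number of threads doing work can never exceed the number of partitions; the rest of the pool's capacity goes unused.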

Posted: 4 Jul 2018 13:59 EDT

ABHINANDAN

Areteans Technology Solutions

replied to ABHINANDAN

OK, but in the case of "No of parallel threads" > "No of partitions", if each partition is processed by only one thread, there will be idle threads. Doesn't that defeat the purpose of the batch scalability factor as stated in the data flow help text quoted below?

"The number of available threads is calculated by multiplying the thread count by the number of nodes. With two nodes and five threads in the system, the data flow run uses five threads and five threads remain idle. After you set the batch scalability factor to two, all 10 threads are used to process five assignments."

Am I missing something here?

Regards

Abhi
