non performing batch performance with SQL Server JDBC drivers
environment is PRPC 7.1.3 with Pega Marketing and SQL Server 2016, with JDBC driver sqljdbc42.jar running via WebSphere 8.5 on Linux (RHEL) virtualized servers.
we have noted a widely varying difference in performance when retrieving data from SQL Server via a DataFlow in Batch mode, whereby 96% of the time running the DataFlow is spent reading data from SQL Server (5 tables) against a data set of roughly 20 million rows, and providing unacceptable timing for the process.
the tables are indexed properly, and when you run a test with a batch size reduced to 40 or 50, the performance is very good with queries returning in 30-60 ms. however if you increase the batch size to 100 or more, the same queries now take in excess of 400,000 ms.
for a manually created dataFlow ( to read from SQL Server, build a class, and save the data to Cassandra DDS) we have full control over the performance as we can set the defaultBatchSize to a suitable value (say 50, which works fine).
but, when running Pega Outbound Marketing, the defaultBatchSize is not used (or so it appears), and I believe that it calculates its own value based on the partition size, and passes in the value of pyBatchSize to the DataFlow run. leaving us no control over the process.
environment is PRPC 7.1.3 with Pega Marketing and SQL Server 2016, with JDBC driver sqljdbc42.jar running via WebSphere 8.5 on Linux (RHEL) virtualized servers.
we have noted a widely varying difference in performance when retrieving data from SQL Server via a DataFlow in Batch mode, whereby 96% of the time running the DataFlow is spent reading data from SQL Server (5 tables) against a data set of roughly 20 million rows, and providing unacceptable timing for the process.
the tables are indexed properly, and when you run a test with a batch size reduced to 40 or 50, the performance is very good with queries returning in 30-60 ms. however if you increase the batch size to 100 or more, the same queries now take in excess of 400,000 ms.
for a manually created dataFlow ( to read from SQL Server, build a class, and save the data to Cassandra DDS) we have full control over the performance as we can set the defaultBatchSize to a suitable value (say 50, which works fine).
but, when running Pega Outbound Marketing, the defaultBatchSize is not used (or so it appears), and I believe that it calculates its own value based on the partition size, and passes in the value of pyBatchSize to the DataFlow run. leaving us no control over the process.
for example: using a partition key with 100 values, if we send in a segment size of 5000 customers, with 100 partitions, you get 50 customers per partition, which is assigned to a DataFlow thread to process. we can observe in the log files (with com.pega.dsm.dnode set to DEBUG) that the resulting queries use a where clause with 50 or'd customer ID's. this performs fine, but if we increase the segment to 8000, (80 customers per partition) we can see the where clause contains 80 or'd customer ID's and the performance slows down in the neighborhood of 7-8,000 times slower.
we tried altering the defaultBatchSize and it is definitely ignored in Outbound Marketing.
when you increase the where clause to read a larger batch the JDBC server changes the access to something drastically more inefficient, it almost feels like the index is ignored and you are doing a table scan of the 20 Million Customers.
you can run the exact same query via SQL Server Management Studio, and it performs very quickly (as expected), the explain plan confirming that the index is being used properly. it is only the combination of JDBC and a large where clause that triggers the issue.
I have not seen the issue with Oracle or Postgres, it seems to be unique to SQL Server
we are looking at two fronts:
can we manually fix the batch size for an Outbound Marketing DataFlow run?
is anyone in the Community aware of JDBC configuration settings that affect this, so that we can address the problem at source?
I am aware of the JDBC versions available for SQL Server 2016 https://docs.microsoft.com/en-us/sql/connect/jdbc/microsoft-jdbc-driver-for-sql-server-support-matrix?view=sql-server-2016 and which ones are supported by JDK 1.8
as well as what is specified in the 731_PlatformSupportGuide from Pega.
and we are using the recommended driver:
Microsoft JDBC Driver 6.0 for SQL Server | 6.0 | sqljdbc42.jar |
looking for answers quickly.
***Edited by Moderator: Lochan to add the SR Parallel tag to post***