Discussion
Pegasystems Inc.
JP
Last activity: 17 Oct 2023 9:38 EDT
Understanding Pega Blob and UDF
Hi,
In this post, I will share what I know about Pega Blob technology and UDF. If you find anything incorrect, please let me know I will revise it.
- This article was written back in Pega 8.5 days and some of the contents are old. Please see "Update" at the bottom for UDF support.
1. What is Blob?
Blob (Binary Large Object) is a data type that stores binary data. It is typically an image, audio or other multimedia object that therefore often requires significantly more space than most other data types such as integers, characters, and strings. These are classified as unstructured data. The other classification is semi-structured data such as XML files, and Pega takes this approach. Both of unstructured, and structured data are generally not interpretable by the DBMS. An application must create and edit the content, while the DBMS merely stores it.
2. How does Pega use Blob?
In Pega, there are few tables that do not have Blob column, such as Declare Index table, or Data Type table in the CustomerDATA schema. However, most of the tables have Blob column and it is one of the key technologies that supports Pega's data management. The name of Blob column is "pzPVStream".
Below is the process when Pega stores data into the database.
1. Obfuscation
First, Pega data (XML representation) is obfuscated and gets hard to understand for the reader. In general, "obfuscation" is a data masking technique to obscure its meaning, providing an added layer of security protection. It is similar to "encryption" but doesn't require any secret key and it provides no serious level of security like "true" encryption would (a hacker should be able to de-obfuscate it). The obfuscation algorithm is proprietary to Pega and it is not disclosed.
2. Compression
Next, Pega compresses data by using the java.utils.ZIP libraries, which can reduce Blob size by a third or more. Compression is enabled in the DeflateStreams settings by default, but it is also possible to disable it from Dynamic System Settings if you want to.
3. Encryption
Blob encryption is not enabled by default, but you can do so by using a platform or custom cipher. I've created a How-to document in a separate article. Please see https://support.pega.com/discussion/database-encryption.
The advantages of using Blob are the following:
- Compressed Blob lowers storage overhead
- No size constraint
- Ease of managing complex / Nested structures
- Doesn't need DBA to make complex changes to the database schema
- Fast single-object access
- Allows object model to evolve
- "Only needed" columns are mapped relationally, eliminates need for large amount of SQL construction
- Delivers agility
At the early stage of a project, developers are more focused on business requirement in DCO and initial data structure design may change as you get a bigger picture of application architecture. UDF supports developers' fast development without having to pay much attention to the physical database layer. It is more efficient to optimize the properties once application framework gets solid in the later phase.
3. What is UDF?
Starting in PRPC6.2 SP2, we have introduced a set of UDFs (User-Defined Functions), a.k.a. "Blob reader" or "DirectStreamReader", to provide functionality to retrieve scalar property values directly from Blob in the database. This feature is only available for use in Report Definition. There are three UDFs installed in the database:
(1) pr_read_from_stream
(2) pr_read_int_from_stream
(3) pr_read_decimal_from_stream
These functions are identical, except for the data type of the value returned (String, Int, Decimal). They are installed in both the rules and data schemas. The UDFs are defined and loaded differently in each of our four supported databases. They are implemented using Java on Oracle, PostgreSQL, and DB2 LUW/ZOS, and using C# on Microsoft SQL Server. These are the first UDFs we ship, and this is our first time running Java (to say nothing of C#) in the database.
4. How to use UDF
4-1. Prerequisites
To install UDF, Java must be enabled on the database for Oracle, PostgreSQL, and DB2. For Microsoft SQL Server, CLR (Common Language Runtime) must be enabled.
4-2. Installation
It is installed automatically as part of the standard installation and upgrade processes. If you do not want to install UDF, you need to edit setupDatabase.properties in the ./scripts directory of the distribution media. The default value is blank, which is interpreted as false and UDF is installed. Set it to true to bypass UDF installation. This file is applied not only for command-line installation but also GUI (IUA) installation. UDF is not mandatory to install and all other functionalities work completely fine without UDF. If you have any issues with UDF installation on Windows / PostgreSQL 11+, please see https://support.pega.com/discussion/how-set-sqlj-schema-udf-postgresql-11-windows.
* Update
From Pega 8.8 onwards, the default value is set to true. So, UDF is not installed by default and there is no need to modify the file to skip UDF installation.
4-3. Parameters for UDF
- ref: a property reference indicating the scalar property to return. The property specification must start with a "." (period).
- insKey: the handle (pzInsKey) of the instance whose value you wish to obtain, or NULL.
- stream: the Blob column name (pzPVStream).
4-3. Sample Query using UDF
SQL> select my_schema.pr_read_from_stream('.SomeProperty', pzInsKey, pzPVStream)
from {Class: MyCo-Task}
where pxObjClass = 'MyCo-Task';
You can run the following sample query to check if three UDFs are working or not.
5. What is optimization?
Optimization is to expose a property as a dedicated database column. This eliminates the need to extract data from Blob by UDF and you can significantly improve reporting performance. Technically speaking, it might slow insert and update operations, so make sure you optimize the only properties that are used in the Report Definition (do not optimize unnecessary ones). Also, be noted that you can't optimize properties in classes mapped to pr_others table.
There are two patterns in optimization – Single Value property and Page List property. If you optimize a Single Value property, the column is simply added in the same table. If you optimize a property in Page List property, a new Index table is created and the column is added in there. Please see below for the sample image.
In Dev, optimization is typically done by optimization wizard from Dev Studio. In Prod, if you are on Pega 8+, importing R-A-P takes care of it. The details are explained in the next #6 and #7.
6. How to optimize a property in Dev
6-1. Optimization wizard
You can simply right-click the property and launch optimization wizard from Dev Studio. Below is for Single Value property, but the process is the same for Page List property as well.
6-2. Prerequisites for optimization wizard
There are a couple of prerequisites for developer to be able to use optimization wizard in Dev Studio.
7. How to optimize a property in Prod
I have created a separate article for this topic. Please see https://support.pega.com/discussion/how-optimize-properties-production-environment.
8. Declare Index
When you optimize properties in Page List property, system creates Declare Index rule and some more objects. For your reference, below are the artifacts created for PurchaseRequest sample application earlier.
(1) Declare Index rule: "Index_ItemList"
(2) Index class: "Index-MyCo-Data-Item"
(3) Property rule in Index class: "ID", "Name", and "Price"
(4) Physical table in the database: "PR_INDEX_MYCO_DATA_ITEM"
9. How to unoptimize a property
If you optimize a wrong property which you shouldn't have, or Report Definition requirement is gone and the property doesn't have to be optimized anymore, you may want to revert the optimization for clean-up.
9-1. Dev
9-1-1. Single Value property
The easiest way is to drop the column directly from DBMS, and restart the system (or resave Data-Admin-DB-Table instance). This approach works for all version. Be noted, unless you restart the system, you can't resave the work object (you'll get FATAL pyCommitError), nor can you re-optimize the same property.
9-1-2. Page List property
It is a bit more complex than Single Value property as there are more objects involved. If you want to remove one property and leave others, you can take the similar approach as Single Value property. If you want to remove all the properties and entire Declare Index, you can clean up all the artifacts system had created - in this particular example, all the properties in Index-MyCo-Data-Item class, Index-MyCo-Data-Item class (all the records have to be deleted in advance), physical database table, and Declare Index rule. In this case, system restart is not required as deleting Index-MyCo-Data-Item class will also delete Data-Admin-DB-Table instance and that makes engine know the changes. You can resave the work object or re-optimize the same Page List property right away.
9-2. Prod
If the property is already optimized in Prod and end users are using the application, the first question is whether the risk is worth taking. You have two options - one is to unoptimize it, and the other one is to just keep it as is. The disadvantage of keeping it is that insert and update operations might get slower. If it is just one or few properties, the overhead is probably negligible, but if there are many such properties, you may want to clean up the mess.
In case of optimization, importing Data-Admin-DB-Table instance propagates the optimization state from Dev to Prod. How about the reverse process, unoptimization? If we drop the column in Dev, create a product file and import it into Prod, will PRPC detect the table structure changes and run DDL (DROP COLUMN) to synchronize it? The answer is No, PRPC import process adds columns, but doesn't drop columns. So, for the Single Value property, take the same approach as Dev - drop column from DBMS directly and restart the system. For Page List property, if you want to remove all the properties and entire Declare Index, you can't physically delete these rules because Declare Index and property are rule instances and they are already locked in Prod. What you can do is, withdraw Declare Index rule (and also property rule) in Dev, and import R-A-P to Prod. Make sure new record doesn't get inserted into the Index table anymore.
10. UDF and performance
Probably you've already heard many times UDFs severely degrades the reporting performance, but how bad is it? It is different per version and DBMS type, but just to get a sense of it I have conducted a benchmark test on my local PC.
(1) Environment
- Hardware: Lenovo T460s laptop PC x 1
- Software: Pega Platform 7.4, Tomcat / PostgreSQL, Windows 10 Pro
- CPU: Intel® Core™ i7-6600U CPU @ 2.60GHz
- RAM: 20GB (heap size is 16GB)
(2) Scenario
- Read a single property by Report Definition. Measure elapsed time before and after optimization (UDF vs non-UDF).
(3) Number of rows
- 4 patterns - 1,000, 10,000, 100,000, and 1,000,000
Retrieving a single property value from 1,000,000 rows with UDF takes 114.719 secs while the same query against optimized property without UDF takes 1.789 secs. 114.719 / 1.789 makes more than 60 times differences. Please take this as my personal reference, not Pegasystems official one.
With simple queries, using the UDF to extract a property from the Blob on the database is :
- about 25% shorter in elapsed time than returning the entire Blob to the app server
- about 60 times slower in elapsed time than reading an exposed column
11. How to eliminate unoptimized Report Definitions
Now we understand that all the properties in the Report Definition should be optimized. However, the property rule form doesn't tell us whether it's optimized or not. How can we make sure we have no such Report Definitions?
The easiest way would be to check Guardrails. Referencing unoptimized properties is Severe warning, and it comes on top. So you should be able to navigate to the problematic Report Definitions easily. If your environment doesn't have UDF installed, these reports will fail at runtime. Make sure you have zero such reports.
12. UDF support
12-1. Cloud
If you are on-premise, you have full control over your infrastructure and probably you don't have much issues with setting up UDF. If you are on cloud, since UDF is heavily tied to its environment, depending on the cloud type, UDF can or cannot be installed. For example, if you choose Amazon RDS, UDF can't be installed because Java is not supported (In PostgreSQL, you need to copy PL/Java files on to the database server's filesystem, but accessing to operating system is not allowed in RDS). The same restriction applies to other cloud vendors such as Microsoft Azure or Google Cloud Platform. On the other hand, if you are using Amazon EC2, you can install UDF because EC2 gives you the same level of flexibility as on-premise, and you can freely set up anything on your own. If you are on customer-managed cloud, check if Java is supported in their database.
12-2. PL/Java and PL/V8
As mentioned above, Amazon RDS does not support PL/Java, but Pega Cloud supports UDF. This is achieved by PL/V8, which is a JavaScript language extension for PostgreSQL powered by Google's V8 engine. UDF is specially rewritten by PL/V8 for non-Java environment. Unfortunately, the script is never provided to outside of Pega. PL/V8 is only used internally at Pega Cloud instances.
- Update (9/9/2022)
UDF is now deprecated from Pega 8.6 onwards. We have already removed UDF from the latest Pega Cloud instances. When you create a Report Definition, you need to optimize the properties. If it is difficult for any reason, Obj-Browse is another way to get values of unoptimized properties. For questions about how we will makes changes or further direction of a product, I am not the right person to answer. Please reach out to Pega representatives in your project.
Hope this helps.
Thanks,