Implement String Matching by Probability in PEGA

Discussion

VTALUKDAR

Member since 2010

41 posts

ING Bank N.V.

Posted: Aug 4, 2023

Last activity: Aug 4, 2023

Posted: 4 Aug 2023 11:55 EDT
Last activity: 4 Aug 2023 11:56 EDT

Closed

Recently I got to work on a small but fascinating proof of concept: fuzzy string matching using probability. Suppose you have two separate systems, X and Y, where one can handle special characters and diacritics and the other cannot. In such a scenario, a name such as Renée Brontë would become Renee Bronte or, worse, something garbled like Reneee Brontee in one of the systems, causing name matching to fail.

How do you match such data? String substitutions work to a degree (if you get é, replace it with e), but the approach does not scale: there are only so many substitutions you can try, not to mention letters that are frequently interchanged (I am told that in Dutch, ij and y are often used interchangeably). In such a scenario, a far better option comes to mind: what if we apply statistical probability to such string matching problems? What if we could feed String 1 and String 2 to a piece of code, and have it return the probability of a match between them?
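To make the limits of substitution concrete, here is a minimal plain-Java sketch of the "replace é with e" idea. It is an illustration only, not Pega code: the class name DiacriticFolder and the method fold are hypothetical, and it relies on the standard java.text.Normalizer class to split accented letters into a base letter plus a combining mark.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only (not a Pega rule or API).
final class DiacriticFolder {

    // Decompose accented letters (e-acute becomes 'e' + combining accent),
    // then remove every combining mark so only the base letters remain.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Calling DiacriticFolder.fold on Renée Brontë yields Renee Bronte, so the clean case is covered. The garbled form Reneee Brontee passes through unchanged and still fails an exact comparison, and ij versus y is not handled at all, which is why a similarity score is the more scalable route.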

This is where two algorithms came to my attention: the Levenshtein Distance algorithm (https://lnkd.in/eAAtMRZv) and the Jaro-Winkler algorithm (https://lnkd.in/e-73Dv-Z). I have provided the links to their Wiki pages, which explain the algorithms far better than I ever could, but in short: a bit of research told me that Levenshtein Distance works better for longer strings, while Jaro-Winkler works better for shorter ones such as names. The great advantage of statistical matching is that you can apply a threshold check as well: if the match probability is more than 70%, accept the match; otherwise reject it. If you find after 30 days that the threshold is not working well, you can raise or lower it based on an analysis of the cases being rejected!
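The threshold idea can be sketched in plain Java as follows. This is a self-contained illustration, not the out-of-the-box Pega function: the class and method names are hypothetical, and turning the edit distance into a 0 to 100 score by dividing by the longer string's length is one common normalization that I am assuming here.

```java
// Hypothetical helper, for illustration only (not a Pega rule or API).
final class LevenshteinMatch {

    // Classic dynamic-programming edit distance, keeping only two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 100 means identical, 0 means nothing in common.
    static double similarityPercent(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100.0; // two empty strings are treated as identical
        }
        return 100.0 * (1.0 - (double) distance(a, b) / longest);
    }

    // Accept the match only when the score exceeds the configurable threshold.
    static boolean isMatch(String a, String b, double thresholdPercent) {
        return similarityPercent(a, b) > thresholdPercent;
    }
}
```

For example, Renee Bronte and Reneee Brontee differ by two inserted letters over a longest length of 14, giving a score of about 85.7, so LevenshteinMatch.isMatch accepts the pair at a threshold of 70. Keeping the threshold as a parameter is what makes it easy to tune later.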

Well then, how do you implement it in PEGA? I was surprised to learn that the Levenshtein Distance and Jaro-Winkler calculations come out-of-the-box! I have attached a code snippet showing how to use them in a function. If you are in a quandary about which of the two algorithms to use (the internet tells me that it depends on the use case being implemented), I recommend adding a Max/Min function after that code, to take the higher or lower of the two values based on your requirement.
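As a rough picture of the Max/Min idea, the sketch below computes both scores in plain Java and takes the higher or lower one. It does not call the out-of-the-box Pega calculations: every name here is hypothetical, and the algorithms are written out by hand (Jaro-Winkler with the usual 0.1 prefix weight and a prefix cap of 4, which I am assuming) only so the example is self-contained.

```java
// Hypothetical helper, for illustration only (not a Pega rule or API).
final class NameMatchScore {

    // Levenshtein edit distance mapped to 0..100 by the longer string's length.
    static double levenshteinPercent(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1], prev[j]) + 1, prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 100.0 : 100.0 * (1.0 - (double) prev[b.length()] / longest);
    }

    // Jaro similarity in 0..1: counts characters that match within a sliding
    // window, then penalizes matched characters that appear out of order.
    static double jaro(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        if (s1.isEmpty() || s2.isEmpty()) return 0.0;
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++; // next matched character in s2
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / s1.length() + m / s2.length() + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler as a 0..100 score: boosts Jaro for a shared prefix (max 4).
    static double jaroWinklerPercent(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) prefix++;
        return 100.0 * (j + prefix * 0.1 * (1.0 - j));
    }

    // takeHigher = true is the lenient choice (Max); false is the strict one (Min).
    static double combined(String a, String b, boolean takeHigher) {
        double lev = levenshteinPercent(a, b);
        double jw = jaroWinklerPercent(a, b);
        return takeHigher ? Math.max(lev, jw) : Math.min(lev, jw);
    }
}
```

Taking the higher score accepts a pair when either algorithm considers it close, which suits lenient matching; taking the lower score requires both to agree, which suits stricter use cases. Either result can then be compared against the same configurable threshold.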

Happy coding, and I hope that if you are reading this, you learnt something new, as I did while implementing this!

Pega Platform

Java and Activities

Lead System Architect

Team Lead
