Implement String Matching by Probability in PEGA
Recently I got to work on a small but fascinating proof of concept: fuzzy string matching using probability. Suppose you have two separate systems, X and Y, and one of them can handle special characters and diacritics while the other cannot. In such a scenario, a name such as Renée Brontë would become Renee Bronte or, worse, something garbled like Reneee Brontee in one of the systems, causing the name matching to fail.
How do you match such data? String substitutions work to a degree (if you get é, replace it with e), but they do not scale: there are only so many substitutions you can try, not to mention letters that are frequently interchanged (I am told that in Dutch, ij and y are often used interchangeably). In such a scenario, a far better option comes to mind: what if we could apply statistical probability to such string matching problems? What if we could feed String 1 and String 2 to a piece of code, and have it return the probability of a match between String 1 and String 2?
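To make the limits of the substitution approach concrete, here is a minimal sketch in plain Java (not Pega-specific). It uses the standard java.text.Normalizer to fold accented letters to their base letters. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class DiacriticFolding {

    // Decompose accented letters into base letter + combining mark (NFD),
    // then drop the combining marks, so that "é" becomes "e".
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is é and \u00eb is ë, written as escapes to avoid source-encoding issues.
        System.out.println(stripDiacritics("Ren\u00e9e Bront\u00eb"));
        // Interchangeable spellings such as ij / y are untouched by this step.
        System.out.println(stripDiacritics("Dijkstra"));
    }
}
```

This handles the é to e case, but it does nothing for a garbled value like Reneee Brontee or for interchangeable spellings, which is why a similarity score is the more flexible option.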
This is where two algorithms came to my attention: the Levenshtein Distance algorithm (https://lnkd.in/eAAtMRZv) and the Jaro-Winkler algorithm (https://lnkd.in/e-73Dv-Z). I have provided the links to their Wiki pages, which explain the algorithms far better than I ever could, but in short: a bit of research told me that Levenshtein Distance works better for longer strings, while Jaro-Winkler works better for shorter names. The great advantage of such statistical matching is that you can apply a threshold check as well: if the probability is more than 70 percent, accept the match; otherwise reject it. If you find after 30 days that the number is not working well, you can raise or lower it, based on an analysis of the cases being rejected!
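For readers who want to see what the two algorithms compute, here is a minimal sketch in plain Java. It is a textbook-style implementation, not the Pega out-of-the-box functions, and every name in it is made up for illustration. Turning the Levenshtein distance into a 0 to 100 score by dividing by the longer string's length is an assumed convention, as are the usual Jaro-Winkler settings (prefix capped at 4 characters, scaling factor 0.1).

```java
// Hypothetical helper, for illustration only.
final class FuzzyMatch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed convention: rescale the distance to a 0..100 similarity score.
    static double levenshteinScore(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) return 100.0;
        return 100.0 * (1.0 - (double) levenshtein(a, b) / maxLen);
    }

    // Jaro similarity in the range 0..1.
    static double jaro(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        // Characters count as matching only if they are within this window.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!bMatched[j] && a.charAt(i) == b.charAt(j)) {
                    aMatched[i] = true;
                    bMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / a.length() + m / b.length() + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for a shared prefix (up to 4 chars).
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int limit = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    // Threshold check: accept only when the score is strictly above the threshold.
    static boolean accept(double score, double threshold) {
        return score > threshold;
    }

    public static void main(String[] args) {
        String stored = "Ren\u00e9e Bront\u00eb";   // Renée Brontë
        String incoming = "Renee Bronte";
        double lev = levenshteinScore(stored, incoming);
        double jw = 100.0 * jaroWinkler(stored, incoming);  // rescaled to 0..100
        System.out.println("Levenshtein score: " + lev + ", accepted: " + accept(lev, 70.0));
        System.out.println("Jaro-Winkler score: " + jw + ", accepted: " + accept(jw, 70.0));
    }
}
```

Strictly speaking, both values are similarity scores rather than true probabilities, but they behave the way the threshold check needs: higher means closer. Keeping the threshold as a parameter (70 here) is what makes it easy to tune later.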
Well then, how do you implement it in PEGA? I was surprised to learn that the Levenshtein Distance and Jaro-Winkler calculations come out-of-the-box! I have given below the code snippet on how to use them in a function. If you are in a quandary about which of the two algorithms to use (the internet tells me that it depends on the use case being implemented), I recommend adding a Max/Min function after the code below, to take the higher or lower of the two values based on your requirement.
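The Max/Min idea can be sketched in plain Java as follows. This is not the Pega snippet itself, only an illustration: the class and method names are made up, and it assumes both scores have already been brought onto the same 0 to 100 scale.

```java
// Hypothetical helper, for illustration only.
final class ScoreCombiner {

    // Strict policy: both algorithms must agree the strings are similar,
    // so take the lower of the two scores.
    static double strict(double levenshteinScore, double jaroWinklerScore) {
        return Math.min(levenshteinScore, jaroWinklerScore);
    }

    // Lenient policy: either algorithm is enough, so take the higher score.
    static double lenient(double levenshteinScore, double jaroWinklerScore) {
        return Math.max(levenshteinScore, jaroWinklerScore);
    }

    public static void main(String[] args) {
        double lev = 66.0;  // example values, not measured results
        double jw = 82.0;
        double threshold = 70.0;
        System.out.println("strict match:  " + (strict(lev, jw) > threshold));
        System.out.println("lenient match: " + (lenient(lev, jw) > threshold));
    }
}
```

Taking the minimum reduces false matches at the cost of rejecting more genuine ones; taking the maximum does the opposite. Which one fits depends on how costly a wrong match is in your use case.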
Happy coding, and I hope that if you are reading this, you learnt something new, as I did while implementing it!