NLP With Ruta Script
I have created a Decision Data rule for entity extraction. I am performing NLP using a RUTA script in Pega. My requirement is to extract a policy number from an email.
S represents an alphanumeric character; A represents an alphabetic character.
The policy number has the following formats:
1) With hyphens: SS-SSSSSSS-AAA
2) With spaces instead of hyphens: SS SSSSSSS AAA
3) Without separators: SSSSSSSSSAAA
4) Optionally, the policy number can also be prefixed with 1, so 1SS-SSSSSSS-AAA, 1SS SSSSSSS AAA and 1SSSSSSSSSAAA are also valid combinations.
So the policy number has 3 parts: the first part has length 2 (SS), the second part has length 7 (SSSSSSS) and the third part has length 3 (AAA). Optionally, "1" is a fourth part that is prefixed to the policy number.
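For reference, the format above can be written as one regular expression. The following is a minimal Python sketch, not the RUTA script itself. It assumes that the last three characters are letters (as in the script's `[A-Z]{3}`), that matching is case-insensitive, and that each separator may independently be a hyphen, a space, or absent (as in the script's `("-"|SPACE)?`). The names `POLICY_RE` and `find_policy_numbers` are illustrative only.

```python
import re

# Assumed reading of the format: optional "1" prefix, 2 alphanumerics,
# optional separator, 7 alphanumerics, optional separator, 3 letters.
POLICY_RE = re.compile(
    r"\b1?[A-Z0-9]{2}[- ]?[A-Z0-9]{7}[- ]?[A-Z]{3}\b",
    re.IGNORECASE,
)


def find_policy_numbers(text):
    """Return every substring of text that matches the assumed format."""
    return [m.group(0) for m in POLICY_RE.finditer(text)]


if __name__ == "__main__":
    email = "Please check policy 123-456ABC7-GHI and also ABCD123EFGHI."
    for number in find_policy_numbers(email):
        print(number)
```

Because the optional `1` and the two-character first part are matched at the character level, a value such as 12A-456ABC7-GHI is read as prefix `1` followed by `2A`, regardless of how a tokenizer would split it.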
I have written a RUTA script for this requirement, but it does not work for the combinations in which the policy number is prefixed with 1.
Below is the code from the script:
PACKAGE uima.ruta.example;
Document{-> RETAINTYPE(SPACE)};
DECLARE VarA;
DECLARE VarC;
DECLARE VarE;
("1")? W{REGEXP(".{2}")} ("-"|SPACE)? ((W* NUM* W* NUM* W* NUM* W*)|(NUM* W* NUM* W* NUM* W* NUM*)){REGEXP(".{7}")} ("-"|SPACE)? W{REGEXP(".{3}")->MARK(EntityType,1,6)};
(W* NUM*){REGEXP(".{2}")} ("-"|SPACE)? ((W* NUM* W* NUM* W* NUM* W*)|(NUM* W* NUM* W* NUM* W* NUM*)){REGEXP(".{7}")} ("-"|SPACE)? W{REGEXP(".{3}")->MARK(EntityType,1,5)};
((W|NUM)(NUM|W)*){REGEXP("(?i)\\b[1]{0,1}[A-Z0-9]{2}[A-Z0-9]{7}[A-Z]{3}\\b" )->MARK(EntityType)};
Valid policy numbers: AB-CD123EF-GHI, 1AB-CD123EF-GHI, ABCD123EFGHI, 23-456ABC7-GHI, 123-456ABC7-GHI, 1A3-456ABC7-GHI, 12A-456ABC7-GHI, etc.
I am not able to handle these combinations: 123-456ABC7-GHI, 1A3-456ABC7-GHI and 12A-456ABC7-GHI.
Please help me write a correct script that covers all possible combinations. Thanks in advance.