Discussion
Accenture
IN
Last activity: 16 Oct 2018 12:03 EDT
Pega Robotics Studio: a long-awaited article on OCR. Extracting text from an image using OpenSpan.
Hi All,
I have extracted text from an image by using the Microsoft.Office.Interop.OneNote 15.0 library.
You can find this library in Pega Robotics version 8.0.1030.
OneNote has a built-in feature to extract text from images.
I have added the scripts I used to extract the text. Here are the steps to achieve it.
1. Delete any existing pages in OneNote by using the C# script below.
A. Add a reference to the Microsoft.Office.Interop.OneNote 15.0 library (the library itself, not a DLL).
B. Add a reference to System.Xml.Linq.
Script to delete:
Imports:
using System;
using System.Linq;
using System.Xml.Linq;
using Microsoft.Office.Interop.OneNote;
Parameters:
out string atName,out Microsoft.Office.Interop.OneNote.Application onenoteApp
Body:
onenoteApp = new Microsoft.Office.Interop.OneNote.Application();
string notebookXml;
atName = "";
// Fetch the full hierarchy down to page level as XML.
onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);
var doc = XDocument.Parse(notebookXml);
var ns = doc.Root.Name.Namespace;
foreach (var pageNode in doc.Descendants(ns + "Page"))
{
    try
    {
        // Permanently delete the page; DateTime.MinValue skips the last-modified check.
        onenoteApp.DeleteHierarchy(pageNode.Attribute("ID").Value, DateTime.MinValue, true);
    }
    catch (Exception)
    {
        // Ignore pages that cannot be deleted and continue with the next one.
    }
}
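The delete script depends on reading page IDs out of the hierarchy XML. The following is a minimal sketch of just that parsing step, run against a hand-written sample string so it works without OneNote installed. The sample XML, the IDs in it, and the names HierarchySketch and GetPageIds are illustrative assumptions; real output from GetHierarchy carries more attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class HierarchySketch
{
    // Hand-written sample imitating the shape of the hierarchy XML; all IDs are made up.
    public const string SampleXml =
        "<one:Notebooks xmlns:one=\"http://schemas.microsoft.com/office/onenote/2013/onenote\">" +
        "<one:Notebook name=\"Demo\" ID=\"nb1\">" +
        "<one:Section name=\"New Section 1\" ID=\"sec1\">" +
        "<one:Page name=\"Untitled page\" ID=\"page1\" />" +
        "<one:Page name=\"Scan\" ID=\"page2\" />" +
        "</one:Section></one:Notebook></one:Notebooks>";

    // Returns the ID attribute of every Page element in the hierarchy XML.
    public static List<string> GetPageIds(string hierarchyXml)
    {
        var doc = XDocument.Parse(hierarchyXml);
        // Element names must be qualified with the document namespace,
        // otherwise Descendants("Page") finds nothing.
        XNamespace ns = doc.Root.Name.Namespace;
        return doc.Descendants(ns + "Page")
                  .Select(p => (string)p.Attribute("ID"))
                  .Where(id => id != null)
                  .ToList();
    }

    public static void Main()
    {
        foreach (var id in GetPageIds(SampleXml))
            Console.WriteLine(id);
    }
}
```

Collecting the IDs into a list first keeps the XML parsing separate from the OneNote calls, which makes the parsing easy to check on its own.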
2. Create a new page to paste the image into. The same references have to be added.
Imports:
using Microsoft.Office.Interop.OneNote;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml;
using System.Xml.Linq;
Parameters:
Microsoft.Office.Interop.OneNote.Application onenoteApp, out string pageID
Body:
pageID = "";
//var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();
string notebookXml;
// Fetch the hierarchy down to section level as XML.
onenoteApp.GetHierarchy(null, HierarchyScope.hsSections, out notebookXml);
var doc = XDocument.Parse(notebookXml);
var ns = doc.Root.Name.Namespace;
int i = 0;
foreach (var sectionNode in doc.Descendants(ns + "Section"))
{
    //MessageBox.Show(sectionNode.Attribute("name").Value);
    string sectionId = sectionNode.Attribute("ID").Value;
    // Create the page once, in the first section named "New Section 1".
    if (i == 0 && sectionNode.Attribute("name").Value == "New Section 1")
    {
        onenoteApp.CreateNewPage(sectionId, out pageID, NewPageStyle.npsDefault);
        i = 1;
    }
}
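The section lookup in this step can also be written as a single query. This is a minimal sketch under assumptions: the sample XML is hand-written, and SectionSketch and FindSectionId are hypothetical names that are not part of the OneNote API. It returns the ID of the first section with a given name, or null when no section matches.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SectionSketch
{
    // Hand-written sample with two sections that share a name; all IDs are made up.
    public const string SampleXml =
        "<one:Notebooks xmlns:one=\"http://schemas.microsoft.com/office/onenote/2013/onenote\">" +
        "<one:Notebook name=\"A\" ID=\"nb1\">" +
        "<one:Section name=\"Quick Notes\" ID=\"sec0\" />" +
        "<one:Section name=\"New Section 1\" ID=\"sec1\" />" +
        "</one:Notebook>" +
        "<one:Notebook name=\"B\" ID=\"nb2\">" +
        "<one:Section name=\"New Section 1\" ID=\"sec2\" />" +
        "</one:Notebook></one:Notebooks>";

    // Returns the ID of the first section called sectionName, or null if there is none.
    public static string FindSectionId(string hierarchyXml, string sectionName)
    {
        var doc = XDocument.Parse(hierarchyXml);
        XNamespace ns = doc.Root.Name.Namespace;
        var section = doc.Descendants(ns + "Section")
                         .FirstOrDefault(s => (string)s.Attribute("name") == sectionName);
        return section == null ? null : (string)section.Attribute("ID");
    }

    public static void Main()
    {
        Console.WriteLine(FindSectionId(SampleXml, "New Section 1"));
    }
}
```

A null result gives the caller a chance to report that the expected section is missing before calling CreateNewPage.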
3. Insert your image into OneNote after converting it into structured XML.
Imports:
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using System.Drawing;
using System.Drawing.Imaging;
using Microsoft.Office.Interop.OneNote;
Parameters:
filename is the path of the image file saved locally.
string filename,string pageToBeChange,Microsoft.Office.Interop.OneNote.Application onenoteApp
Body:
string strNamespace = @"http://schemas.microsoft.com/office/onenote/2013/onenote";
// Template for the image element: {0} = Base64 image data, {1} = width, {2} = height.
string m_xmlImageContent =
"<one:Image><one:Size width=\"{1}\" height=\"{2}\" isSetByUser=\"true\" /><one:Data>{0}</one:Data></one:Image>";
// Template for the page: {0} = page content, {1} = page ID, {2} = namespace, {3} = page title.
string m_xmlNewOutline =
"<?xml version=\"1.0\"?><one:Page xmlns:one=\"{2}\" ID=\"{1}\"><one:Title><one:OE><one:T><![CDATA[{3}]]></one:T></one:OE></one:Title>{0}</one:Page>";
// string pageToBeChange = "Untitled page";
// Load the image and re-encode it as PNG in memory.
Bitmap bitmap = new Bitmap(filename);
MemoryStream stream = new MemoryStream();
bitmap.Save(stream, ImageFormat.Png);
// The image data goes into the page XML as a Base64 string.
string fileString = Convert.ToBase64String(stream.ToArray());
string notebookXml;
onenoteApp = new Microsoft.Office.Interop.OneNote.Application();
onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);
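The two templates in this step are filled in with string.Format. The following is a minimal sketch of that composition only, using a few dummy bytes in place of a real image so it runs without OneNote or System.Drawing. PageXmlSketch and BuildPageXml are hypothetical names, and the page ID and title values are made up. The post does not show how the result is used; presumably the string is then handed to onenoteApp.UpdatePageContent, which is an assumption.

```csharp
using System;
using System.Xml.Linq;

public static class PageXmlSketch
{
    const string OneNs = "http://schemas.microsoft.com/office/onenote/2013/onenote";

    // Same templates as in the script: the image element and the page wrapper.
    const string ImageTemplate =
        "<one:Image><one:Size width=\"{1}\" height=\"{2}\" isSetByUser=\"true\" />" +
        "<one:Data>{0}</one:Data></one:Image>";
    const string PageTemplate =
        "<?xml version=\"1.0\"?><one:Page xmlns:one=\"{2}\" ID=\"{1}\">" +
        "<one:Title><one:OE><one:T><![CDATA[{3}]]></one:T></one:OE></one:Title>{0}</one:Page>";

    // Builds the page XML for one image; imageBytes stands in for the PNG stream contents.
    public static string BuildPageXml(byte[] imageBytes, int width, int height,
                                      string pageId, string title)
    {
        string data = Convert.ToBase64String(imageBytes);
        string imageXml = string.Format(ImageTemplate, data, width, height);
        return string.Format(PageTemplate, imageXml, pageId, OneNs, title);
    }

    public static void Main()
    {
        string xml = BuildPageXml(new byte[] { 1, 2, 3 }, 10, 20, "page1", "Untitled page");
        // Parsing the result is a cheap check that the composed XML is well formed.
        Console.WriteLine(XDocument.Parse(xml).Root.Name.LocalName);
    }
}
```

Parsing the composed string with XDocument before sending it anywhere catches template mistakes early, since malformed XML throws at that point.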
4. Read the text from the image.
Imports:
using System;
using System.Linq;
using Microsoft.Office.Interop.OneNote;
using System.Xml.Linq;
using System.IO;
using System.Drawing;
using System.Drawing.Imaging;
Parameters:
The extracted text is shown in a text box.
TextBox t, string PageID, Microsoft.Office.Interop.OneNote.Application onenoteApp
Body:
onenoteApp = new Microsoft.Office.Interop.OneNote.Application();
int i = 0;
int attempts = 0;
string notebookXml;
do
{
    try
    {
        onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);
        //MessageBox.Show(notebookXml);
        var doc = XDocument.Parse(notebookXml);
        var ns = doc.Root.Name.Namespace;
        var pageNode = doc.Descendants(ns + "Page")
            .FirstOrDefault(n => n.Attribute("ID").Value == PageID);
        if (pageNode != null)
        {
            string pageXml;
            onenoteApp.GetPageContent(pageNode.Attribute("ID").Value, out pageXml);
            var doc1 = XDocument.Parse(pageXml);
            var ns1 = doc1.Root.Name.Namespace;
            var ocrNode = doc1.Descendants(ns1 + "OCRText").FirstOrDefault();
            if (ocrNode != null)
            {
                // Value returns the CDATA content directly, so no substring parsing is needed.
                t.Text = ocrNode.Value;
                i = 1;
            }
        }
    }
    catch (Exception)
    {
        // OneNote may still be processing the page; try again.
    }
    // Pause between attempts and give up after about 30 seconds instead of looping forever.
    if (i != 1) System.Threading.Thread.Sleep(500);
} while (i != 1 && ++attempts < 60);
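The recognized text sits inside a CDATA section of the OCRText element in the page XML. The following is a minimal sketch of reading it through the element value, run on a hand-written sample page so it works without OneNote. The sample XML, OcrTextSketch, and ExtractOcrText are illustrative assumptions; real page content contains many more elements.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class OcrTextSketch
{
    // Hand-written sample imitating a page whose image has already been through OCR.
    public const string SamplePageXml =
        "<one:Page xmlns:one=\"http://schemas.microsoft.com/office/onenote/2013/onenote\" ID=\"page1\">" +
        "<one:Image><one:OCRData lang=\"en-US\">" +
        "<one:OCRText><![CDATA[Hello from OCR]]></one:OCRText>" +
        "</one:OCRData></one:Image></one:Page>";

    // Returns the text of the first OCRText element, or null if OCR output is not present yet.
    public static string ExtractOcrText(string pageXml)
    {
        var doc = XDocument.Parse(pageXml);
        XNamespace ns = doc.Root.Name.Namespace;
        var ocr = doc.Descendants(ns + "OCRText").FirstOrDefault();
        // XElement.Value already unwraps the CDATA section.
        return ocr == null ? null : ocr.Value;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractOcrText(SamplePageXml));
    }
}
```

Returning null when the element is absent lets the caller tell "OCR not finished yet" apart from a real error, which is useful when polling.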