OCR with Microsoft Office Document Imaging

If you need cheap and simple OCR functionality, the Microsoft Office Document Imaging Type Library (MODI) is a nice option, as long as its requirements (Microsoft Office 2003 or 2007, as MODI was dropped from Office 2010) and limitations (limited language support) don't bother you. Here is a simple C# function that runs OCR on the image at the specified path and returns the text of its first page:

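// Requires a COM reference to the Microsoft Office Document Imaging Type Library.
static string OCR(string path)
{
    MODI.Document doc = new MODI.Document();
    doc.Create(path);
    // Recognize English text without auto-orienting or straightening the image.
    doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);
    // Text of the first page only; pages are indexed from 0.
    string result = ((MODI.Image)doc.Images[0]).Layout.Text;
    // Close without saving so the image file is not left locked.
    doc.Close(false);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(doc);
    return result;
}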
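
The function above reads only the first page (Images[0]). For a multi-page TIFF you would probably want the text of every page. Here is a minimal sketch of that, assuming the same MODI reference as above; the class and method names (ModiOcr, OcrAllPages) are made up for illustration, and I haven't tested it against every Office version:

```csharp
using System.Runtime.InteropServices;
using System.Text;

static class ModiOcr
{
    // Hypothetical helper, not part of MODI: returns the recognized
    // text of all pages, one page after another.
    public static string OcrAllPages(string path)
    {
        MODI.Document doc = new MODI.Document();
        try
        {
            doc.Create(path);
            doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, false, false);

            StringBuilder text = new StringBuilder();
            // The Images collection is zero-based, like Images[0] above.
            for (int i = 0; i < doc.Images.Count; i++)
            {
                MODI.Image page = (MODI.Image)doc.Images[i];
                text.AppendLine(page.Layout.Text);
            }

            string result = text.ToString();
            doc.Close(false); // close without saving changes
            return result;
        }
        finally
        {
            // Release the COM object even if OCR throws.
            Marshal.ReleaseComObject(doc);
        }
    }
}
```

The try/finally is there because the OCR call can throw, and without it the COM object would stay alive until the garbage collector gets around to it.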

However, there is another problem, this one related to Microsoft Office object model versions. For Office 2003 users to be able to use your application, the project must reference MODI 11.0 (the 2003 version), and the release build must be compiled on a machine with Office 2003 installed. VB6 could still compile such a project on a machine with a newer version of Office installed, because it automatically picked up the newer version of the type library (MODI 12.0 for Office 2007 in this case). C#, on the other hand, prevents that with its strong type checking at compile time.

If you want to keep using Office 2007 and still be able to compile such a project, the only solution I found is to install Microsoft Office Document Imaging as the only component of Office 2003 alongside the existing Office 2007 installation. Unfortunately, this overwrites the 2007 version of the Microsoft Office Document Image Writer printer driver with the older one, so you'll have to go through the lengthy process of repairing the Office 2007 installation afterwards. And don't forget to apply all the service packs and updates for Office 2003 before the repair, since they also overwrite the printer driver and you would have to repair Office 2007 once again. I learned that the hard way.
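
The compile-time dependency on a specific type library version could in principle be avoided with late binding, which is closer to what scripting languages do. The following is only an untested sketch under several assumptions: it needs C# 4 or later (for dynamic), it assumes MODI registers the ProgID "MODI.Document", and it hard-codes 9 as the numeric value of MODI.MiLANGUAGES.miLANG_ENGLISH because the enum is not available without the reference. The class name LateBoundOcr is made up.

```csharp
using System;
using System.Runtime.InteropServices;

static class LateBoundOcr
{
    // Assumption: numeric value of MODI.MiLANGUAGES.miLANG_ENGLISH.
    const int LangEnglish = 9;

    public static string Ocr(string path)
    {
        // Assumption: whichever MODI version is installed registers this ProgID.
        Type docType = Type.GetTypeFromProgID("MODI.Document");
        if (docType == null)
            throw new InvalidOperationException("MODI does not seem to be installed.");

        // Members are resolved at run time, so no type library reference is needed.
        dynamic doc = Activator.CreateInstance(docType);
        try
        {
            doc.Create(path);
            doc.OCR(LangEnglish, false, false);
            string result = doc.Images[0].Layout.Text;
            doc.Close(false); // close without saving changes
            return result;
        }
        finally
        {
            Marshal.ReleaseComObject((object)doc);
        }
    }
}
```

The trade-off is that you lose IntelliSense and compile-time checking, so a typo in a member name only shows up as an exception at run time.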
