Optical character recognition (OCR) in C# and VB.NET

Optical character recognition (OCR) is the process of converting images with text into machine-encoded text. GemBox.Pdf supports OCR via GemBox.Pdf.Ocr.dll.

GemBox.Pdf.Ocr allows you to load text inside images and scanned PDF files into a PdfDocument. This functionality enables you to extract the text or save the document to an editable PDF file.

The following example shows how to run OCR on an image and extract the recognized text.

Text extracted from the image with the GemBox.Pdf.Ocr C#/VB.NET library
using System;
using GemBox.Pdf;
using GemBox.Pdf.Content;
using GemBox.Pdf.Ocr;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        using (PdfDocument document = OcrReader.Read("BookPage.jpg"))
        {
            var page = document.Pages[0];
            var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();

            while (contentEnumerator.MoveNext())
            {
                if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                {
                    var textElement = (PdfTextContent)contentEnumerator.Current;
                    Console.WriteLine(textElement.ToString());
                }
            }
        }
    }
}
Imports System
Imports GemBox.Pdf
Imports GemBox.Pdf.Content
Imports GemBox.Pdf.Ocr

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Using document = OcrReader.Read("BookPage.jpg")

            Dim page = document.Pages(0)
            Dim contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator()

            While contentEnumerator.MoveNext()
                If contentEnumerator.Current.ElementType = PdfContentElementType.Text Then
                    Dim textElement = CType(contentEnumerator.Current, PdfTextContent)
                    Console.WriteLine(textElement.ToString())
                End If
            End While

        End Using

    End Sub
End Module

GemBox.Pdf.Ocr uses Tesseract internally to perform optical character recognition. For that reason, leptonica-1.82.0.dll and tesseract50.dll must be present in the x64 or x86 folder of the output directory. These DLLs are distributed together with GemBox.Pdf.Ocr and were compiled with Visual Studio 2019, so you'll also need to have the Visual Studio 2019 Runtime installed.
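If OCR fails at startup, a quick check that the native DLLs are where they are expected can save some debugging time. The following is a minimal sketch that uses only standard .NET file APIs. The class and method names are hypothetical, and the mapping of a 64-bit process to the x64 folder (and a 32-bit process to x86) is an assumption, not something GemBox.Pdf.Ocr provides.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of GemBox.Pdf.Ocr.
static class NativeOcrDependencies
{
    // File names taken from the paragraph above.
    static readonly string[] RequiredDlls = { "leptonica-1.82.0.dll", "tesseract50.dll" };

    // Returns the full paths of the required DLLs that are missing.
    // Assumption: a 64-bit process uses the x64 folder, a 32-bit process the x86 folder.
    public static List<string> FindMissing(string outputDirectory, bool is64BitProcess)
    {
        string archFolder = Path.Combine(outputDirectory, is64BitProcess ? "x64" : "x86");
        var missing = new List<string>();

        foreach (string dll in RequiredDlls)
        {
            string path = Path.Combine(archFolder, dll);
            if (!File.Exists(path))
                missing.Add(path);
        }

        return missing;
    }
}
```

In an application you could call NativeOcrDependencies.FindMissing(AppContext.BaseDirectory, Environment.Is64BitProcess) before the first OcrReader.Read call and log any paths it returns.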

OCR accuracy depends heavily on the quality of the input, so in many cases you'll get better results by improving the image before you pass it to GemBox.Pdf.Ocr.
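One commonly cited guideline from the Tesseract documentation is that recognition works best on images of at least 300 DPI. As a rough sketch of that idea, the helper below computes the pixel size an image would need in order to reach 300 DPI. The class name, the method name, and the choice of 300 as the target are illustrative assumptions. The actual resizing would be done with an image library of your choice and is not shown.

```csharp
using System;

// Hypothetical helper, not part of GemBox.Pdf.Ocr.
static class OcrImageScaling
{
    // Assumed target resolution, based on general Tesseract guidance.
    public const double TargetDpi = 300.0;

    // Returns the pixel dimensions needed to reach TargetDpi.
    // Images that already meet or exceed the target are left unchanged.
    public static (int Width, int Height) TargetSize(int width, int height, double sourceDpi)
    {
        if (sourceDpi <= 0)
            throw new ArgumentOutOfRangeException(nameof(sourceDpi));

        if (sourceDpi >= TargetDpi)
            return (width, height);

        double scale = TargetDpi / sourceDpi;
        return ((int)Math.Round(width * scale), (int)Math.Round(height * scale));
    }
}
```

For example, an 800 x 600 image scanned at 150 DPI would need to be upscaled to 1600 x 1200. Resolution is only one factor; contrast, skew, and noise also affect the results.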

OCR with different languages

Language data is necessary to perform optical character recognition with Tesseract.

GemBox.Pdf.Ocr comes with data for the English language inside the gembox_tesseract_data folder. To support other languages, you'll need to download the language data and put it inside a dedicated folder that is copied to the output directory.
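A missing language file is an easy mistake to make when the data folder is copied by hand. The sketch below checks a folder for the files a set of languages would need. It assumes the downloaded data follows the standard Tesseract naming convention of one <code>.traineddata file per language (for example, deu.traineddata for German); the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of GemBox.Pdf.Ocr.
static class LanguageDataCheck
{
    // Returns the language codes for which no "<code>.traineddata" file
    // exists in dataFolder. Assumes standard Tesseract file naming.
    public static List<string> MissingLanguages(string dataFolder, IEnumerable<string> languageCodes)
    {
        var missing = new List<string>();

        foreach (string code in languageCodes)
        {
            string path = Path.Combine(dataFolder, code + ".traineddata");
            if (!File.Exists(path))
                missing.Add(code);
        }

        return missing;
    }
}
```

Calling LanguageDataCheck.MissingLanguages(folder, new[] { "deu" }) with the path of your language data folder returns an empty list when the German data file is in place.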

The following example shows how to load a scanned PDF file with German text and save it as an editable PDF file. The screenshot of the result also shows the limitations of OCR when the source text is unclear.

PDF file with recognized text using GemBox.Pdf.Ocr
using System;
using GemBox.Pdf;
using GemBox.Pdf.Ocr;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // TesseractDataPath specifies the directory which contains language data.
        // You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        var readOptions = new OcrReadOptions() { TesseractDataPath = "languagedata" };

        // The language of the text.
        readOptions.Languages.Add(OcrLanguages.German);

        using (PdfDocument document = OcrReader.Read("GermanDocument.pdf", readOptions))
        {
            document.Save("GermanDocumentEditable.pdf");
        }
    }
}
Imports System
Imports GemBox.Pdf
Imports GemBox.Pdf.Ocr

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' TesseractDataPath specifies the directory which contains language data.
        ' You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        Dim readOptions As New OcrReadOptions() With {.TesseractDataPath = "languagedata"}

        ' The language of the text.
        readOptions.Languages.Add(OcrLanguages.German)

        Using document = OcrReader.Read("GermanDocument.pdf", readOptions)
            document.Save("GermanDocumentEditable.pdf")
        End Using

    End Sub
End Module

Next steps

GemBox.Pdf is a .NET component that enables developers to read, merge and split PDF files or execute low-level object manipulations from .NET applications in a simple and efficient way.
