Optical character recognition (OCR) in C# and VB.NET

Optical character recognition (OCR) is the process of converting images with text into machine-encoded text. GemBox.Pdf supports OCR via GemBox.Pdf.Ocr.dll.

GemBox.Pdf.Ocr allows you to load text inside images and scanned PDF files into a PdfDocument. This functionality enables you to extract the text or save the document to an editable PDF file.

The example below shows how to run OCR on an image and extract the recognized text.

Screenshot of text extracted from the image with the GemBox.Pdf.Ocr library
using System;
using GemBox.Pdf;
using GemBox.Pdf.Content;
using GemBox.Pdf.Ocr;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Run OCR on the image and load the recognized text into a PdfDocument.
        using (PdfDocument document = OcrReader.Read("BookPage.jpg"))
        {
            var page = document.Pages[0];
            var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();

            while (contentEnumerator.MoveNext())
            {
                if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                {
                    var textElement = (PdfTextContent)contentEnumerator.Current;
                    Console.WriteLine(textElement.ToString());
                }
            }
        }
    }
}

Imports System
Imports GemBox.Pdf
Imports GemBox.Pdf.Content
Imports GemBox.Pdf.Ocr

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Run OCR on the image and load the recognized text into a PdfDocument.
        Using document = OcrReader.Read("BookPage.jpg")

            Dim page = document.Pages(0)
            Dim contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator()

            While contentEnumerator.MoveNext()
                If contentEnumerator.Current.ElementType = PdfContentElementType.Text Then
                    Dim textElement = CType(contentEnumerator.Current, PdfTextContent)
                    Console.WriteLine(textElement.ToString())
                End If
            End While

        End Using

    End Sub
End Module

GemBox.Pdf.Ocr internally uses Tesseract to perform optical character recognition. For that reason, leptonica-1.80.0.dll and tesseract41.dll must be present in the x64 or x86 folder in the output directory. These DLLs are distributed together with GemBox.Pdf.Ocr and were compiled with Visual Studio 2019, so the machine also needs the Visual Studio 2019 Runtime (the Microsoft Visual C++ Redistributable) installed.
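
If OCR fails at startup, one thing worth ruling out is that the native DLLs did not reach the output directory. The following is a minimal sketch of such a check, not part of GemBox.Pdf.Ocr: the NativeOcrLibraries class and its FindMissing method are hypothetical helpers, and the file names are simply the ones mentioned above, so they may differ for other package versions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not a GemBox API.
static class NativeOcrLibraries
{
    // File names taken from the text above; adjust them if your package ships different versions.
    static readonly string[] FileNames = { "leptonica-1.80.0.dll", "tesseract41.dll" };

    // Returns the full paths of the native DLLs that are missing from the
    // architecture-specific subfolder (x64 or x86) of the given output directory.
    public static List<string> FindMissing(string outputDirectory, bool is64BitProcess)
    {
        string architectureFolder = Path.Combine(outputDirectory, is64BitProcess ? "x64" : "x86");
        var missing = new List<string>();

        foreach (string fileName in FileNames)
        {
            string fullPath = Path.Combine(architectureFolder, fileName);
            if (!File.Exists(fullPath))
                missing.Add(fullPath);
        }

        return missing;
    }
}
```

Calling NativeOcrLibraries.FindMissing(AppContext.BaseDirectory, Environment.Is64BitProcess) before the first OcrReader.Read call and logging the returned paths makes a missing-file problem easier to tell apart from a missing runtime.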

OCR with different languages

Language data is necessary to perform optical character recognition with Tesseract.

GemBox.Pdf.Ocr comes with data for the English language inside the gembox_tesseract_data folder. To support other languages, download their language data and put it inside a dedicated folder that is copied to the output directory.
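
A quick way to confirm that the language data folder was actually copied is to look for the expected files before running OCR. The sketch below assumes the downloaded files follow Tesseract's usual naming, one "<code>.traineddata" file per language (for example deu.traineddata for German); the LanguageDataCheck class is a hypothetical helper and not part of GemBox.Pdf.Ocr.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not a GemBox API.
static class LanguageDataCheck
{
    // Returns the language codes whose "<code>.traineddata" file is missing
    // from the given language data folder.
    // Assumption: files use Tesseract's standard naming, e.g. "deu.traineddata".
    public static List<string> FindMissing(string dataFolder, params string[] languageCodes)
    {
        var missing = new List<string>();

        foreach (string code in languageCodes)
        {
            string filePath = Path.Combine(dataFolder, code + ".traineddata");
            if (!File.Exists(filePath))
                missing.Add(code);
        }

        return missing;
    }
}
```

For instance, LanguageDataCheck.FindMissing(Path.Combine(AppContext.BaseDirectory, "languagedata"), "deu") returns an empty list when the German data file is in place, where "languagedata" is whatever folder name you later pass as the Tesseract data path.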

The following example shows how to load a scanned PDF file with German text and save it to an editable PDF file. The resulting image also shows the shortcomings of OCR when reading unclear text.

Screenshot of PDF file with recognized text using GemBox.Pdf.Ocr
using System;
using GemBox.Pdf;
using GemBox.Pdf.Ocr;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // TesseractDataPath specifies the directory which contains language data.
        // You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        var readOptions = new OcrReadOptions() { TesseractDataPath = "languagedata" };

        // The language of the text.
        readOptions.Languages.Add(OcrLanguages.German);

        // Run OCR on the scanned PDF using the German language data.
        using (PdfDocument document = OcrReader.Read("GermanDocument.pdf", readOptions))
        {
            document.Save("GermanDocumentEditable.pdf");
        }
    }
}

Imports System
Imports GemBox.Pdf
Imports GemBox.Pdf.Ocr

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' TesseractDataPath specifies the directory which contains language data.
        ' You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        Dim readOptions As New OcrReadOptions() With {.TesseractDataPath = "languagedata"}

        ' The language of the text.
        readOptions.Languages.Add(OcrLanguages.German)

        ' Run OCR on the scanned PDF using the German language data.
        Using document = OcrReader.Read("GermanDocument.pdf", readOptions)
            document.Save("GermanDocumentEditable.pdf")
        End Using

    End Sub
End Module

Want more?

Check the next example or select an example from the menu. You can also download our examples from GitHub.


Like it?

If you want to try GemBox.Pdf yourself, you can download the free version. It delivers the same performance and feature set as the professional version, but with some operations limited. To remove the limitations, you need to purchase a license.