Optical character recognition (OCR) in C# and VB.NET
Optical character recognition (OCR) is the process of converting images with text into machine-encoded text. GemBox.Pdf supports OCR via GemBox.Pdf.Ocr.dll.
GemBox.Pdf.Ocr allows you to load text inside images and scanned PDF files into a PdfDocument. This functionality enables you to extract the text or save the document to an editable PDF file.
The example below shows how to run OCR on an image and extract the recognized text.
using GemBox.Pdf;
using GemBox.Pdf.Content;
using GemBox.Pdf.Ocr;
using System;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Run OCR on the image and load the recognized text into a PdfDocument.
        using (PdfDocument document = OcrReader.Read("BookPage.jpg"))
        {
            var page = document.Pages[0];
            var contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
            while (contentEnumerator.MoveNext())
            {
                // Print only the text elements of the page.
                if (contentEnumerator.Current.ElementType == PdfContentElementType.Text)
                {
                    var textElement = (PdfTextContent)contentEnumerator.Current;
                    Console.WriteLine(textElement.ToString());
                }
            }
        }
    }
}
Imports GemBox.Pdf
Imports GemBox.Pdf.Content
Imports GemBox.Pdf.Ocr
Imports System

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Run OCR on the image and load the recognized text into a PdfDocument.
        Using document = OcrReader.Read("BookPage.jpg")
            Dim page = document.Pages(0)
            Dim contentEnumerator = page.Content.Elements.All(page.Transform).GetEnumerator()
            While contentEnumerator.MoveNext()
                ' Print only the text elements of the page.
                If contentEnumerator.Current.ElementType = PdfContentElementType.Text Then
                    Dim textElement = CType(contentEnumerator.Current, PdfTextContent)
                    Console.WriteLine(textElement.ToString())
                End If
            End While
        End Using
    End Sub
End Module
GemBox.Pdf.Ocr internally uses Tesseract to perform optical character recognition, so leptonica-1.82.0.dll and tesseract50.dll must be present in the x64 or x86 folder in the output directory. These DLLs are distributed together with GemBox.Pdf.Ocr and were compiled with Visual Studio 2019, so you'll also need to have the Visual Studio 2019 Runtime installed.
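A missing native library often shows up only as a load failure on the first OCR call, so a startup check can make the cause easier to spot. The following is a minimal sketch that uses only System.IO. The helper FindMissingNativeLibraries is hypothetical and not part of GemBox.Pdf.Ocr, and the sketch assumes that the x64 or x86 folder is selected by the bitness of the running process.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class NativeLibraryCheck
{
    // File names as listed in the paragraph above.
    static readonly string[] RequiredLibraries = { "leptonica-1.82.0.dll", "tesseract50.dll" };

    // Hypothetical helper, not part of GemBox.Pdf.Ocr: returns the full paths of
    // the required native libraries that are missing under outputDirectory.
    public static List<string> FindMissingNativeLibraries(string outputDirectory, bool is64BitProcess)
    {
        // Assumption: the subfolder matches the bitness of the running process.
        string platformFolder = Path.Combine(outputDirectory, is64BitProcess ? "x64" : "x86");

        var missing = new List<string>();
        foreach (string library in RequiredLibraries)
        {
            string path = Path.Combine(platformFolder, library);
            if (!File.Exists(path))
                missing.Add(path);
        }
        return missing;
    }
}

class Program
{
    static void Main()
    {
        // AppContext.BaseDirectory is the directory the application was loaded from.
        var missing = NativeLibraryCheck.FindMissingNativeLibraries(
            AppContext.BaseDirectory, Environment.Is64BitProcess);

        foreach (string path in missing)
            Console.WriteLine("Missing native library: " + path);
    }
}
```

Running this check before the first call to OcrReader.Read lets the application report which file is absent instead of failing inside the OCR call.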
In many cases, to get better OCR results, you'll need to improve the quality of the image you give to GemBox.Pdf.Ocr.
OCR with different languages
Language data is necessary to perform optical character recognition with Tesseract. GemBox.Pdf.Ocr comes with data for the English language inside the gembox_tesseract_data folder. To support other languages, you'll need to download the language data and put it inside a dedicated folder that is copied to the output directory.
The following example shows how to load a scanned PDF file with German text and save it to an editable PDF file. The resulting image also shows the shortcomings of OCR when reading unclear text.
using GemBox.Pdf;
using GemBox.Pdf.Ocr;
using System;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // TesseractDataPath specifies the directory which contains language data.
        // You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        var readOptions = new OcrReadOptions() { TesseractDataPath = "languagedata" };

        // The language of the text.
        readOptions.Languages.Add(OcrLanguages.German);

        // Run OCR on the scanned PDF and save the result as an editable PDF.
        using (PdfDocument document = OcrReader.Read("GermanDocument.pdf", readOptions))
        {
            document.Save("GermanDocumentEditable.pdf");
        }
    }
}
Imports GemBox.Pdf
Imports GemBox.Pdf.Ocr
Imports System

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' TesseractDataPath specifies the directory which contains language data.
        ' You can download the language data files from: https://www.gemboxsoftware.com/pdf/docs/ocr.html#language-data
        Dim readOptions As New OcrReadOptions() With {.TesseractDataPath = "languagedata"}

        ' The language of the text.
        readOptions.Languages.Add(OcrLanguages.German)

        ' Run OCR on the scanned PDF and save the result as an editable PDF.
        Using document = OcrReader.Read("GermanDocument.pdf", readOptions)
            document.Save("GermanDocumentEditable.pdf")
        End Using
    End Sub
End Module
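If the language data folder was not copied to the output directory, reading a document in that language cannot work as intended, so it may help to verify the folder before running OCR. The sketch below is one possible check using only System.IO. The helper HasLanguageData is hypothetical and not part of GemBox.Pdf.Ocr. It assumes the downloaded files follow Tesseract's usual "<code>.traineddata" naming (for example "deu.traineddata" for German) and that the "languagedata" folder from the example above sits in the application's output directory.

```csharp
using System;
using System.IO;

static class LanguageDataCheck
{
    // Hypothetical helper, not part of GemBox.Pdf.Ocr.
    // Assumption: language data follows Tesseract's "<code>.traineddata" naming,
    // e.g. "deu.traineddata" for German.
    public static bool HasLanguageData(string dataPath, string languageCode)
    {
        string file = Path.Combine(dataPath, languageCode + ".traineddata");
        return File.Exists(file);
    }
}

class Program
{
    static void Main()
    {
        // "languagedata" is the folder name used in the example above;
        // resolving it against the output directory is an assumption.
        string dataPath = Path.Combine(AppContext.BaseDirectory, "languagedata");

        if (!LanguageDataCheck.HasLanguageData(dataPath, "deu"))
            Console.WriteLine("German language data not found in: " + dataPath);
    }
}
```

The same check can be repeated for each language added to the read options, using the matching Tesseract language code.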