How to Extract Content from PDF Documents

In today's digital world, PDF documents are essential for storing and sharing information. Whether you're dealing with reports, manuals, invoices, or any other document type, extracting content from these PDFs is often necessary for further processing or analysis.

This article will explore how to perform this task effortlessly using the GemBox.Pdf component. Aimed at C# and VB.NET developers, this guide will walk you through the steps of reading text content from a PDF document, making it easier to integrate this functionality into your applications.

Install and configure the GemBox.Pdf library

Before you start, you need to install GemBox.Pdf. The best way to do that is to install the NuGet Package by following these instructions:

  1. Add the GemBox.Pdf component as a package using the following command from the NuGet Package Manager Console:

    Install-Package GemBox.Pdf

  2. After installing the GemBox.Pdf library, you must call the ComponentInfo.SetLicense method before using any other member of the library.

    ComponentInfo.SetLicense("FREE-LIMITED-KEY");

In this tutorial, by using "FREE-LIMITED-KEY", you will be using GemBox's free mode. This mode allows you to use the library without purchasing a license, but with some limitations. If you purchased a license, you can replace "FREE-LIMITED-KEY" with your serial key.

For other ways to install and set up GemBox.Pdf, see the complete step-by-step installation guide.

Reading Content Within the Given Bounds

GemBox.Pdf supports exporting text with the given parameters. One of the use cases is to extract text within the given bounds. This functionality is particularly useful for extracting specific sections of a document, such as tables, paragraphs, or other elements contained within certain coordinates.

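// If using the Professional version, put your serial key below.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var document = PdfDocument.Load("Invoice.pdf"))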
{
    // Get the page from which we want to make the extraction
    var page = document.Pages[0];

    // Extract text content from the given bounds
    var text = page.Content.GetText(new PdfTextOptions
    {
        Bounds = new PdfQuad(140, 640, 250, 660),
        Order = PdfTextOrder.Reading
    }).ToString();

    // Writing the extracted text
    Console.WriteLine($"Result: {text}");

    if (DateTime.TryParse(text, out var dateTime))
    {
        var dateDifference = DateTime.UtcNow - dateTime;
        Console.Write($"Created {dateDifference.Days} days {dateDifference.Hours} hours {dateDifference.Minutes} minutes ago.");
    }
}

' If using the Professional version, put your serial key below.
ComponentInfo.SetLicense("FREE-LIMITED-KEY")

Dim dateTime As Date = Nothing

Using document = PdfDocument.Load("Invoice.pdf")
    ' Get the page from which we want to make the extraction
    Dim page = document.Pages(0)

    ' Extract text content from the given bounds
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .Bounds = New PdfQuad(140, 640, 250, 660),
        .Order = PdfTextOrder.Reading
    }).ToString()

    ' Writing the extracted text
    Console.WriteLine($"Result: {text}")

    If Date.TryParse(text, dateTime) Then
        Dim dateDifference = Date.UtcNow - dateTime
        Console.Write($"Created {dateDifference.Days} days {dateDifference.Hours} hours {dateDifference.Minutes} minutes ago.")
    End If
End Using

Screenshot of text extracted from a PDF file with GemBox.Pdf
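
The date check in the example above relies on DateTime.TryParse, which interprets the text according to the current culture. As a minimal sketch, assuming your documents print dates in a known, fixed format, the extracted string can be trimmed and parsed independently of culture. The class name and the formats listed here are hypothetical and are not taken from the sample invoice:

```csharp
using System;
using System.Globalization;

public static class ExtractedDateParser
{
    // Hypothetical formats; adjust them to match how your documents print dates.
    private static readonly string[] Formats = { "d MMM yyyy", "yyyy-MM-dd", "M/d/yyyy" };

    public static bool TryParseDate(string extractedText, out DateTime date)
    {
        // Extracted text can carry surrounding whitespace or line breaks.
        var trimmed = (extractedText ?? string.Empty).Trim();

        // InvariantCulture keeps the result the same on every machine.
        return DateTime.TryParseExact(trimmed, Formats, CultureInfo.InvariantCulture,
            DateTimeStyles.None, out date);
    }
}
```

The helper only works on the string returned by GetText(...).ToString(), so it can be swapped in for the DateTime.TryParse call without touching the extraction itself.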

Extracting Tables

You can easily extract tables by defining the coordinates of a table within a PDF. The code below demonstrates how to do that by iterating through all table rows and columns and returning an enumeration of PdfText instances representing the table cell text content. This way, you can extract the entire table content for further analysis or manipulation.

public static IEnumerable<IEnumerable<PdfText>> GetTableRows(PdfPage page, PdfPoint topLeft, IList<double> columnWidths, double rowHeight, int rowCount)
{
    var currentY = topLeft.Y;
    for (var i = 0; i < rowCount; i++)
    {
        // Each row spans from currentY - rowHeight (bottom) up to currentY (top)
        yield return GetRowCells(page, columnWidths, topLeft.X, currentY - rowHeight, currentY);
        currentY -= rowHeight;
    }
}

public static IEnumerable<PdfText> GetRowCells(PdfPage page, IList<double> columnWidths, double startX, double bottom, double top)
{
    var currentX = startX;
    foreach (var columnWidth in columnWidths)
    {
        yield return page.Content.GetText(new PdfTextOptions
        {
            Bounds = new PdfQuad(currentX, bottom, currentX + columnWidth, top)
        });
        currentX += columnWidth;
    }
}

Public Shared Iterator Function GetTableRows(ByVal page As PdfPage, ByVal topLeft As PdfPoint, ByVal columnWidths As IList(Of Double), ByVal rowHeight As Double, ByVal rowCount As Integer) As IEnumerable(Of IEnumerable(Of PdfText))
    Dim currentY = topLeft.Y
    For i = 0 To rowCount - 1
        ' Each row spans from currentY - rowHeight (bottom) up to currentY (top)
        Yield GetRowCells(page, columnWidths, topLeft.X, currentY - rowHeight, currentY)
        currentY -= rowHeight
    Next
End Function

Public Shared Iterator Function GetRowCells(ByVal page As PdfPage, ByVal columnWidths As IList(Of Double), ByVal startX As Double, ByVal bottom As Double, ByVal top As Double) As IEnumerable(Of PdfText)
    Dim currentX = startX
    For Each columnWidth In columnWidths
        Yield page.Content.GetText(New PdfTextOptions With {
            .Bounds = New PdfQuad(currentX, bottom, currentX + columnWidth, top)
        })
        currentX += columnWidth
    Next
End Function

Screenshot of a table extracted from a PDF file with GemBox.Pdf
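
The arithmetic behind the two methods above can be checked without a PDF at hand. The following sketch, which uses a hypothetical TableGrid helper and plain tuples instead of GemBox.Pdf types, lists the rectangle of every cell for a given top-left corner, set of column widths, and row height. It assumes the default PDF coordinate system, in which the Y axis grows upward:

```csharp
using System;
using System.Collections.Generic;

public static class TableGrid
{
    // Returns (left, bottom, right, top) for every cell, row by row from the top.
    public static IEnumerable<(double Left, double Bottom, double Right, double Top)> GetCellBounds(
        double left, double top, IList<double> columnWidths, double rowHeight, int rowCount)
    {
        for (var row = 0; row < rowCount; row++)
        {
            // The Y axis grows upward, so each row sits below the previous one.
            var rowTop = top - row * rowHeight;
            var x = left;
            foreach (var width in columnWidths)
            {
                yield return (x, rowTop - rowHeight, x + width, rowTop);
                x += width;
            }
        }
    }
}
```

Printing these tuples for your own measurements is a quick way to confirm that the column widths and row height line up with the table before running the extraction.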

Automated Data Extraction

When content does not sit at the same position in every document, you can locate it by a label instead. The following class finds the first occurrence of a given label on the page and returns the text to its right, on the same line. This is especially useful when processing a large number of documents automatically.

public class DataExtractor
{
    private readonly PdfPage page;
    private readonly PdfText text;

    public DataExtractor(PdfPage page)
    {
        this.page = page;
        this.text = page.Content.GetText();
    }

    public string ExtractText(string labelText)
    {
        var foundText = this.text.Find(labelText).First();
        var bounds = foundText.Bounds;
        var quads = new PdfQuad(bounds.Right, bounds.Bottom, double.MaxValue, bounds.Top);
        var value = this.page.Content.GetText(new PdfTextOptions { Bounds = quads });
        return value.ToString();
    }
}

Public Class DataExtractor
    Private ReadOnly page As PdfPage
    Private ReadOnly text As PdfText

    Public Sub New(page As PdfPage)
        Me.page = page
        Me.text = page.Content.GetText()
    End Sub

    Public Function ExtractText(labelText As String) As String
        Dim foundText = Me.text.Find(labelText).First()
        Dim bounds = foundText.Bounds
        Dim quads = New PdfQuad(bounds.Right, bounds.Bottom, Double.MaxValue, bounds.Top)
        Dim value = Me.page.Content.GetText(New PdfTextOptions With {
            .Bounds = quads
        })
        Return value.ToString()
    End Function
End Class

Screenshot of content extracted from a PDF file with GemBox.Pdf

In summary, the PdfContentGroup.GetText method simplifies the process of content extraction within defined bounds, opening up numerous possibilities for customization and automation in handling PDF documents. Whether you're building applications for data processing, content management, or accessibility, this method offers a robust solution to meet your needs.

For details on all the options that the PdfContentGroup.GetText method accepts, see the PdfTextOptions documentation page.

Reading Content with Structural Integrity

The PdfContentGroup.GetText method provides an efficient way to read content from a PDF file while preserving the structure as closely as possible to the original document. This enables users to extract data from a PDF without losing formatting, alignments, and other structural elements.

The code below shows how to read content with structural integrity by setting the PdfTextOptions.Whitespaces to include the NewLine option along with Space and Blank options.

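// If using the Professional version, put your serial key below.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

// Load the PDF file to extract content
using (var document = PdfDocument.Load("Structural.pdf"))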
{
    // Get the page from which we want to make the extraction
    var page = document.Pages[0];

    // Extract text content from the page, keeping the structural integrity
    var text = page.Content.GetText(new PdfTextOptions
    {
        FontFace = new PdfFontFace("Consolas"),
        Order = PdfTextOrder.Reading,
        Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.Blank | PdfTextWhitespaces.NewLine
    }).ToString();

    // Writing the extracted text
    Console.WriteLine(text);
}

' If using the Professional version, put your serial key below.
ComponentInfo.SetLicense("FREE-LIMITED-KEY")

' Load the PDF file to extract content
Using document = PdfDocument.Load("Structural.pdf")
    ' Get the page from which we want to make the extraction
    Dim page = document.Pages(0)

    ' Extract text content from the page, keeping the structural integrity
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .FontFace = New PdfFontFace("Consolas"),
        .Order = PdfTextOrder.Reading,
        .Whitespaces = PdfTextWhitespaces.Space Or PdfTextWhitespaces.Blank Or PdfTextWhitespaces.NewLine
    }).ToString()

    ' Writing the extracted text
    Console.WriteLine(text)
End Using

Screenshot of content exported with structural integrity using GemBox.Pdf

The PdfContentGroup.GetText method's ability to read content while maintaining structural integrity has several significant applications.

Document Conversion

The following method returns a page's text with blanks and line breaks preserved, which serves as a starting point for converting PDF documents to other text-based formats (such as TXT or DOCX) while keeping the original layout.

public static string Textify(PdfPage page)
{
    var text = page.Content.GetText(new PdfTextOptions
    {
        FontFace = new PdfFontFace("Arial"),
        Order = PdfTextOrder.Reading,
        Whitespaces = PdfTextWhitespaces.Blank | PdfTextWhitespaces.NewLine
    });
    return text.ToString();
}

Public Shared Function Textify(ByVal page As PdfPage) As String
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .FontFace = New PdfFontFace("Arial"),
        .Order = PdfTextOrder.Reading,
        .Whitespaces = PdfTextWhitespaces.Blank Or PdfTextWhitespaces.NewLine
    })
    Return text.ToString()
End Function
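
To finish a conversion to TXT, the extracted strings still have to be written to a file. A minimal sketch using only System.IO follows. The TextExport class is a hypothetical helper, and separating pages with a form feed character is an assumption of this sketch, not something GemBox.Pdf prescribes:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TextExport
{
    // Writes one text block per page, separated by a form feed character.
    public static void SaveAsText(IEnumerable<string> pageTexts, string path)
    {
        File.WriteAllText(path, string.Join("\f", pageTexts), Encoding.UTF8);
    }
}
```

For instance, the strings returned by the Textify method above, one per page, could be passed as pageTexts.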

Data Migration

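When migrating data from PDF to CSV, the following code preserves the structure of the source so that rows and columns stay consistent: it keeps blanks and line breaks during extraction and then replaces each large gap with a comma. It uses the Regex class from the System.Text.RegularExpressions namespace.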

public static string ToCsv(PdfPage page)
{
    var text = page.Content.GetText(new PdfTextOptions
    {
        Bounds = new PdfQuad(28, 622, 567, 396),
        Order = PdfTextOrder.Content,
        Whitespaces = PdfTextWhitespaces.Blank | PdfTextWhitespaces.NewLine
    }).ToString();

    // Replacing large gaps in text with a separator
    var csv = Regex.Replace(text, @"\s{4,}", ",");
    return csv;
}

Public Shared Function ToCsv(ByVal page As PdfPage) As String
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .Bounds = New PdfQuad(28, 622, 567, 396),
        .Order = PdfTextOrder.Content,
        .Whitespaces = PdfTextWhitespaces.Blank Or PdfTextWhitespaces.NewLine
    }).ToString()

    ' Replacing large gaps in text with a separator
    Dim csv = Regex.Replace(text, "\s{4,}", ",")
    Return csv
End Function

Screenshot of content migrated from PDF to CSV with preserved structural integrity using GemBox.Pdf
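
One detail to watch in the ToCsv method above: in .NET regular expressions, \s also matches line breaks, so a run of trailing blanks, a newline, and leading blanks can be replaced by a single comma, merging two rows. The sketch below is one possible way around that, assuming columns are separated by four or more blanks as in the code above. It processes the text line by line and quotes fields that already contain a comma. GapCsv is a hypothetical helper name:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GapCsv
{
    // Four or more spaces or tabs; unlike \s, this never matches a line break.
    private static readonly Regex Gap = new Regex(@"[ \t]{4,}");

    public static string ToCsv(string text)
    {
        var lines = text
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            .Where(line => line.Trim().Length > 0)
            .Select(line => string.Join(",", Gap.Split(line.Trim()).Select(f => Quote(f))));
        return string.Join(Environment.NewLine, lines);
    }

    // Quote a field only when it contains a comma, a quote, or a line break.
    private static string Quote(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
            return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }
}
```

The method takes the string produced by GetText(...).ToString(), so the extraction options shown above stay unchanged.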

Reading Content by Paragraphs

The PdfContentGroup.GetText method can be used to read content from a PDF file as paragraphs by converting content of the same paragraph into one line. This approach ensures better results when copying and pasting the content into text editors like MS Word, as it preserves the paragraph structure.

// If using the Professional version, put your serial key below.
ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var document = PdfDocument.Load("Chapters.pdf"))
{
    // Get the page from which we want to make the extraction
    var page = document.Pages[0];

    // Get the content of the page
    var content = page.Content;

    // Extract text content from the given bounds, keeping the structural integrity
    var text = content.GetText(new PdfTextOptions
    {
        Order = PdfTextOrder.Content,
        Bounds = new PdfQuad(0, 100, 600, 640),
        Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.NewLine
    });

    var linesRegex = new Regex(@"^.*$", RegexOptions.Multiline);
    // Writing extracted paragraphs starting with a tab character and adding an empty line between them
    foreach (var line in text.Find(linesRegex))
        Console.WriteLine($"\t{line}{Environment.NewLine}");
}

' If using the Professional version, put your serial key below.
ComponentInfo.SetLicense("FREE-LIMITED-KEY")

Using document = PdfDocument.Load("Chapters.pdf")
    ' Get the page from which we want to make the extraction
    Dim page = document.Pages(0)

    ' Extract text content from the given bounds, keeping the structural integrity
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .Order = PdfTextOrder.Content,
        .Bounds = New PdfQuad(0, 100, 600, 640),
        .Whitespaces = PdfTextWhitespaces.Space Or PdfTextWhitespaces.NewLine
    })

    Dim linesRegex = New Regex("^.*$", RegexOptions.Multiline)
    ' Writing extracted paragraphs starting with a tab character and adding an empty line between them
    For Each line In text.Find(linesRegex)
        Console.WriteLine($"{Microsoft.VisualBasic.ControlChars.Tab}{line}{Environment.NewLine}")
    Next
End Using

Screenshot of content extracted from a PDF file using GemBox.Pdf with paragraph preservation
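
Once each paragraph arrives as a single line, plain string handling is enough to turn the text into a list of paragraphs. The sketch below uses a hypothetical ParagraphSplitter helper that is not part of GemBox.Pdf. It accepts Windows, Unix, and old Mac line breaks alike, because the line break style of the input is treated as unknown here, and it drops blank lines:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParagraphSplitter
{
    // Splits text into trimmed, non-empty lines regardless of line break style.
    public static IReadOnlyList<string> Split(string text)
    {
        return text
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0)
            .ToList();
    }
}
```

This avoids a subtlety of the pattern ^.*$ with RegexOptions.Multiline when it is applied to a plain .NET string: $ matches just before \n, so on text with \r\n line breaks each match keeps a trailing \r.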

The ability to merge the text of each paragraph into a single line has various practical applications.

Content Migration

The code below transfers content to a word processor format such as DOCX while keeping each paragraph intact, which preserves the readability of the text. Note that this code requires the GemBox.Document component for creating and saving DOCX files.

// If using the Professional version, put your serial key below.
GemBox.Pdf.ComponentInfo.SetLicense("FREE-LIMITED-KEY");
GemBox.Document.ComponentInfo.SetLicense("FREE-LIMITED-KEY");

using (var pdfDocument = PdfDocument.Load("Chapters.pdf"))
{
    var document = new DocumentModel();

    // Get the page from which we want to make the extraction
    var page = pdfDocument.Pages[0];

    var section = new Section(document);
    document.Sections.Add(section);

    var text = page.Content.GetText(new PdfTextOptions
    {
        Order = PdfTextOrder.Content,
        Bounds = new PdfQuad(0, 100, 600, 640),
        Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.NewLine
    });
    var paragraphs = text.ToString()
        .Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
        .Select(i => new Paragraph(document, i));

    foreach (var paragraph in paragraphs)
        section.Blocks.Add(paragraph);

    document.Save("Chapters.docx");
}

' If using the Professional version, put your serial key below.
GemBox.Pdf.ComponentInfo.SetLicense("FREE-LIMITED-KEY")
GemBox.Document.ComponentInfo.SetLicense("FREE-LIMITED-KEY")

Using pdfDocument = GemBox.Pdf.PdfDocument.Load("Chapters.pdf")
    Dim document = New DocumentModel()

    ' Get the page from which we want to make the extraction
    Dim page = pdfDocument.Pages(0)

    Dim section = New Section(document)
    document.Sections.Add(section)

    Dim text = page.Content.GetText(New PdfTextOptions With {
        .Order = PdfTextOrder.Content,
        .Bounds = New PdfQuad(0, 100, 600, 640),
        .Whitespaces = PdfTextWhitespaces.Space Or PdfTextWhitespaces.NewLine
    })
    Dim paragraphs = text.ToString() _
        .Split({Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries) _
        .Select(Function(i) New Paragraph(document, i))

    For Each paragraph In paragraphs
        section.Blocks.Add(paragraph)
    Next

    document.Save("Chapters.docx")
End Using

Screenshot of a Word file with content transferred from a PDF file, maintaining paragraph integrity, using GemBox.Pdf

Data Analysis

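Analyzing textual content is easier when each paragraph is kept on a single line, especially for natural language processing or text mining. The example below passes the extracted text to a translation service through the GoogleTranslator class from the GTranslate.Translators namespace.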

public static async Task<string> Translate(PdfPage page)
{
    var text = page.Content.GetText(new PdfTextOptions
    {
        Bounds = new PdfQuad(0, 100, 600, 700),
        Order = PdfTextOrder.Content,
        Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.NewLine
    }).ToString();

    var translator = new GTranslate.Translators.GoogleTranslator();
    var result = await translator.TranslateAsync(text, "en");
    return result.Translation;
}

Public Shared Async Function Translate(ByVal page As PdfPage) As Task(Of String)
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .Bounds = New PdfQuad(0, 100, 600, 700),
        .Order = PdfTextOrder.Content,
        .Whitespaces = PdfTextWhitespaces.Space Or PdfTextWhitespaces.NewLine
    }).ToString()

    Dim translator = New GTranslate.Translators.GoogleTranslator()
    Dim result = Await translator.TranslateAsync(text, "en")
    Return result.Translation
End Function
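
Translation is only one kind of processing. As a small text mining illustration that needs no external service, the sketch below counts how often each word occurs in text that has already been extracted. TextStats is a hypothetical helper, and treating every run of letters as a word is a simplifying assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TextStats
{
    // Counts case-insensitive word occurrences in already extracted text.
    public static Dictionary<string, int> WordFrequencies(string text)
    {
        return Regex.Matches(text, @"\p{L}+")
            .Cast<Match>()
            .GroupBy(m => m.Value.ToLowerInvariant())
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Because each paragraph is on one line, the same method can also be called per line to compare word usage between paragraphs.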

Accessibility

Maintaining paragraph integrity improves readability for assistive technologies, giving a more user-friendly experience to individuals with visual impairments or other accessibility needs. The example below demonstrates such a case using the SpeechSynthesizer class from the System.Speech.Synthesis namespace.

public static void Listen(PdfPage page)
{
    var text = page.Content.GetText(new PdfTextOptions
    {
        Order = PdfTextOrder.Content,
        Whitespaces = PdfTextWhitespaces.Space | PdfTextWhitespaces.NewLine
    }).ToString();

    // SpeechSynthesizer is disposable, so release it when speaking is done
    using (var synth = new System.Speech.Synthesis.SpeechSynthesizer())
    {
        synth.SetOutputToDefaultAudioDevice();
        synth.Speak(text);
    }
}

Public Shared Sub Listen(ByVal page As PdfPage)
    Dim text = page.Content.GetText(New PdfTextOptions With {
        .Order = PdfTextOrder.Content,
        .Whitespaces = PdfTextWhitespaces.Space Or PdfTextWhitespaces.NewLine
    }).ToString()

    ' SpeechSynthesizer is disposable, so release it when speaking is done
    Using synth = New System.Speech.Synthesis.SpeechSynthesizer()
        synth.SetOutputToDefaultAudioDevice()
        synth.Speak(text)
    End Using
End Sub

Conclusion

Now you know how to use GemBox.Pdf to export text from your PDF files programmatically. Besides the use cases described in this article, you can use the GemBox.Pdf API to read, write, merge, and split PDF files and execute other low-level object manipulations in a very straightforward and quick way.

For more information regarding the GemBox.Pdf component, check the documentation pages.
