Extract text and tables from PDF in C# and VB.NET

When reading the text content of a PDF file, GemBox.Document recognizes the file's logical structure and represents it using Table and Paragraph elements. You can read more about how GemBox.Document detects a PDF's structure on the "Support level for reading PDF format" help page.

If you don't need the logical structure, but instead want to know the exact position of the text (e.g., on which page and at which coordinates a piece of text is located), take a look at the alternative approach for reading PDF text using GemBox.Pdf.

The following example shows how you can read paragraphs and tables from a PDF file using GemBox.Document.

Reading PDF file and extracting its paragraphs and tables in C# and VB.NET
Screenshot of read text and table from input PDF file
using System;
using System.Linq;
using GemBox.Document;
using GemBox.Document.Tables;

class Program
{
    static void Main()
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("%InputFileName%");

        // Display file's properties.
        var properties = document.DocumentProperties;
        Console.WriteLine($"Title: {properties.BuiltIn[BuiltInDocumentProperty.Title]}");
        Console.WriteLine($"Author: {properties.BuiltIn[BuiltInDocumentProperty.Author]}");

        // Get paragraphs (true = search all descendants, not just direct children).
        var paragraphs = document.GetChildElements(true, ElementType.Paragraph).Cast<Paragraph>();

        // Get tables.
        var tables = document.GetChildElements(true, ElementType.Table).Cast<Table>();

        // Display paragraphs and tables count.
        Console.WriteLine($"Paragraph count: {paragraphs.Count()}");
        Console.WriteLine($"Table count: {tables.Count()}");

        // Display first paragraph's content.
        var paragraph = paragraphs.First();
        Console.WriteLine("Paragraph content:");
        Console.WriteLine(paragraph.Content.ToString());

        // Display last table's content.
        var table = tables.Last();
        Console.WriteLine("Table content:");

        foreach (var row in table.Rows)
        {
            Console.WriteLine(new string('-', 56));
            foreach (var cell in row.Cells)
            {
                // Left-align each cell's text in a 13-character column.
                Console.Write($"{cell.Content.ToString().TrimEnd(),-13}|");
            }
            Console.WriteLine();
        }
    }
}

Imports System
Imports System.Linq
Imports GemBox.Document
Imports GemBox.Document.Tables

Module Program

    Sub Main()

        ' If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("%InputFileName%")

        ' Display file's properties.
        Dim properties = document.DocumentProperties
        Console.WriteLine($"Title: {properties.BuiltIn(BuiltInDocumentProperty.Title)}")
        Console.WriteLine($"Author: {properties.BuiltIn(BuiltInDocumentProperty.Author)}")

        ' Get paragraphs (True = search all descendants, not just direct children).
        Dim paragraphs = document.GetChildElements(True, ElementType.Paragraph).Cast(Of Paragraph)()

        ' Get tables.
        Dim tables = document.GetChildElements(True, ElementType.Table).Cast(Of Table)()

        ' Display paragraphs and tables count.
        Console.WriteLine($"Paragraph count: {paragraphs.Count()}")
        Console.WriteLine($"Table count: {tables.Count()}")

        ' Display first paragraph's content.
        Dim paragraph = paragraphs.First()
        Console.WriteLine("Paragraph content:")
        Console.WriteLine(paragraph.Content.ToString())

        ' Display last table's content.
        Dim table = tables.Last()
        Console.WriteLine("Table content:")

        For Each row In table.Rows
            Console.WriteLine(New String("-"c, 56))
            For Each cell In row.Cells
                ' Left-align each cell's text in a 13-character column.
                Console.Write($"{cell.Content.ToString().TrimEnd(),-13}|")
            Next
            Console.WriteLine()
        Next

    End Sub
End Module
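
The example above prints only the last table. As a minimal sketch of how the same calls could be used to dump every detected table as CSV text, consider the following. The class name PdfTableDump, the helpers ToCsvField and ToCsvLine, and the fallback path "input.pdf" are hypothetical names introduced here for illustration; they are not part of GemBox.Document. The quoting rules follow the common CSV convention of doubling embedded quotes, which is an assumption about the output format you want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using GemBox.Document;
using GemBox.Document.Tables;

static class PdfTableDump
{
    // Hypothetical helper (not a GemBox API): trims a cell's text and
    // quotes it when it contains a comma, a quote, or a line break.
    public static string ToCsvField(string value)
    {
        var text = value.Trim().Replace("\"", "\"\"");
        return text.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0 ? $"\"{text}\"" : text;
    }

    // Hypothetical helper: joins already-extracted cell texts into one CSV line.
    public static string ToCsvLine(IEnumerable<string> cells) =>
        string.Join(",", cells.Select(ToCsvField));

    static void Main(string[] args)
    {
        // If using the Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "input.pdf" is a placeholder; pass a real path as the first argument.
        var document = DocumentModel.Load(args.Length > 0 ? args[0] : "input.pdf");

        // Collect every table in the document, including nested ones.
        var tables = document.GetChildElements(true, ElementType.Table).Cast<Table>().ToList();

        for (int i = 0; i < tables.Count; i++)
        {
            Console.WriteLine($"# Table {i + 1}");
            foreach (var row in tables[i].Rows)
                Console.WriteLine(ToCsvLine(row.Cells.Select(cell => cell.Content.ToString())));
        }
    }
}
```

Keeping the CSV formatting in small helpers that take plain strings means the quoting logic can be checked without loading a PDF; only Main depends on GemBox.Document.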

Next steps

GemBox.Document is a .NET component that enables you to read, write, edit, convert, and print document files from your .NET applications using one simple API. How about testing it today?