Read and Extract PDF Text in C# and VB.NET

When reading a PDF file's text content, GemBox.Document represents it with Paragraph and Table elements, using various heuristics to recognize the PDF file's logical structure. You can read more about this on the Support level for reading PDF format help page.
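
To illustrate what "represented with Paragraph and Table elements" means in practice, here is a minimal sketch that walks the loaded document and prints each recognized paragraph and the size of each recognized table. It assumes the GemBox.Document package is referenced; the path "input.pdf" is a hypothetical placeholder, and what is actually recognized depends on the heuristics mentioned above.

```csharp
using System;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "input.pdf" is a hypothetical path; replace it with your own file.
        DocumentModel document = DocumentModel.Load("input.pdf");

        // Recursively enumerate every Paragraph element, including those inside table cells.
        foreach (Paragraph paragraph in document.GetChildElements(true, ElementType.Paragraph))
            Console.WriteLine($"Paragraph: {paragraph.Content.ToString().TrimEnd()}");

        // Recursively enumerate every Table element and report its row count.
        foreach (Table table in document.GetChildElements(true, ElementType.Table))
            Console.WriteLine($"Table with {table.Rows.Count} row(s)");
    }
}
```

Because the enumeration is recursive, text that sits inside a recognized table is printed as paragraphs too, so the same content can appear under both loops.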

If you don't need the logical structure but rather want to know the exact position of the text (e.g. on which page and at which coordinates some text is located), take a look at this alternative approach for reading PDF text using GemBox.Pdf.

The following example demonstrates how you can read a PDF file with GemBox.Document and extract the file's properties and text that matches a specified regular expression.

Screenshot of text read from the input PDF file
Reading PDF file's text in C# and VB.NET
using System;
using System.Text.RegularExpressions;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        DocumentModel document = DocumentModel.Load("%InputFileName%");
        DocumentProperties properties = document.DocumentProperties;

        // Read PDF file's properties.
        Console.WriteLine($"Author: {properties.BuiltIn[BuiltInDocumentProperty.Author]}");
        Console.WriteLine($"Created on: {properties.BuiltIn[BuiltInDocumentProperty.DateContentCreated]}");
        Console.WriteLine();

        // Read PDF file's text content and match specified regular expression.
        var text = document.Content.ToString();
        var regex = new Regex(@"(?<Hours>\d+)\s+(?<Unit>\d+\.\d{2})\s+(?<Price>\d+\.\d{2})");
        foreach (Match match in regex.Matches(text))
        {
            var groups = match.Groups;
            Console.WriteLine($"Hours={groups["Hours"]} | Unit={groups["Unit"]} | Price={groups["Price"]}");
        }
    }
}
Imports System
Imports System.Text.RegularExpressions
Imports GemBox.Document

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document As DocumentModel = DocumentModel.Load("%InputFileName%")
        Dim properties As DocumentProperties = document.DocumentProperties

        ' Read PDF file's properties.
        Console.WriteLine($"Author: {properties.BuiltIn(BuiltInDocumentProperty.Author)}")
        Console.WriteLine($"Created on: {properties.BuiltIn(BuiltInDocumentProperty.DateContentCreated)}")
        Console.WriteLine()

        ' Read PDF file's text content and match specified regular expression.
        Dim text = document.Content.ToString()
        Dim regex As New Regex("(?<Hours>\d+)\s+(?<Unit>\d+\.\d{2})\s+(?<Price>\d+\.\d{2})")

        For Each match As Match In regex.Matches(text)
            Dim groups = match.Groups
            Console.WriteLine($"Hours={groups("Hours")} | Unit={groups("Unit")} | Price={groups("Price")}")
        Next

    End Sub
End Module
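
The regular expression in the example above can be tried without a PDF file. The following sketch applies the same pattern to a hard-coded string and converts the named groups to typed values. The `InvoiceRowParser` class, its `Parse` method, and the sample text are hypothetical names and data introduced for illustration only; in the real example the text comes from `document.Content.ToString()`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvoiceRowParser
{
    // Same pattern as in the example above: a whole number followed by
    // two numbers that each have exactly two fraction digits.
    private static readonly Regex RowPattern =
        new Regex(@"(?<Hours>\d+)\s+(?<Unit>\d+\.\d{2})\s+(?<Price>\d+\.\d{2})");

    // Returns one tuple per match, parsed with the invariant culture so that
    // "." is always treated as the decimal separator.
    public static List<(int Hours, decimal Unit, decimal Price)> Parse(string text)
    {
        var rows = new List<(int Hours, decimal Unit, decimal Price)>();
        foreach (Match match in RowPattern.Matches(text))
        {
            rows.Add((
                int.Parse(match.Groups["Hours"].Value, CultureInfo.InvariantCulture),
                decimal.Parse(match.Groups["Unit"].Value, CultureInfo.InvariantCulture),
                decimal.Parse(match.Groups["Price"].Value, CultureInfo.InvariantCulture)));
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical sample standing in for the text extracted from a PDF.
        string text = "Design 12 35.00 420.00\nTesting 8 27.50 220.00\nNo numbers here";

        foreach (var row in Parse(text))
        {
            Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
                "Hours={0} | Unit={1} | Price={2}", row.Hours, row.Unit, row.Price));
        }
    }
}
```

Parsing with `CultureInfo.InvariantCulture` is a deliberate choice here: the pattern only accepts "." as the decimal separator, so the conversion should not depend on the machine's regional settings.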
