Read text from PDF files in C# and VB.NET
GemBox.Pdf reads PDF files quickly from your C# or VB.NET application. It can read a 1,000-page PDF file full of text (almost 500,000 words) in just three seconds.
Text extraction is straightforward. Using a simple API and just a few lines of code, you can retrieve the entire text content of a PDF file as a single String, ready for further processing.
The following example demonstrates how you can easily read the text content of each page in your PDF document.

using System;
using GemBox.Pdf;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Iterate through PDF pages and extract each page's Unicode text content.
        using (var document = PdfDocument.Load("%InputFileName%"))
            foreach (var page in document.Pages)
                Console.WriteLine(page.Content.ToString());
    }
}

Imports System
Imports GemBox.Pdf

Module Program
    Sub Main()
        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Iterate through PDF pages and extract each page's Unicode text content.
        Using document = PdfDocument.Load("%InputFileName%")
            For Each page In document.Pages
                Console.WriteLine(page.Content.ToString())
            Next
        End Using
    End Sub
End Module
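
The per-page loop above can also be collapsed into one string, which matches the single-String use case mentioned earlier. The following is only a sketch, not a documented GemBox.Pdf recipe: it reuses the calls already shown (`PdfDocument.Load`, `document.Pages`, `page.Content.ToString()`), and the `JoinPages` helper is a hypothetical name introduced here for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using GemBox.Pdf;

public static class Program
{
    // Hypothetical helper (not part of GemBox.Pdf): joins the text of
    // individual pages into one string, one line break between pages.
    public static string JoinPages(IEnumerable<string> pageTexts) =>
        string.Join(Environment.NewLine, pageTexts);

    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        using (var document = PdfDocument.Load("%InputFileName%"))
        {
            // Extract each page's text, then combine all pages into a single string.
            var allText = JoinPages(
                document.Pages.Select(page => page.Content.ToString()));

            Console.WriteLine(allText);
        }
    }
}
```

Keeping the join in a separate method makes it easy to swap the separator, for example for a form feed or a page marker, without touching the extraction code.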
Reading additional information about a text
GemBox.Pdf simplifies PDF page content operations by representing the content as a sequence of parsed, or compiled, elements, such as text, path, and external objects (images and forms). For more information, see the Content Streams and Resources help page.
The PdfTextContent elements can be used to extract additional information about the text, such as its location, font, and color, as shown in the next example.

using System;
using System.Linq;
using GemBox.Pdf;
using GemBox.Pdf.Content;

class Program
{
    static void Main(string[] args)
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Iterate through all PDF pages and through each page's content elements,
        // and retrieve only the text content elements.
        using (var document = PdfDocument.Load("%InputFileName%"))
            foreach (var textElement in document.Pages
                .SelectMany(page => page.Content.Elements.All())
                .Where(element => element.ElementType == PdfContentElementType.Text)
                .Cast<PdfTextContent>())
            {
                var text = textElement.ToString();
                var font = textElement.Format.Text.Font;
                var color = textElement.Format.Fill.Color;
                var location = textElement.Location;

                // Read the text content element's additional information.
                Console.WriteLine($"Unicode text: {text}");
                Console.WriteLine($"Font name: {font.Face.Family.Name}");
                Console.WriteLine($"Font size: {font.Size}");
                Console.WriteLine($"Font style: {font.Face.Style}");
                Console.WriteLine($"Font weight: {font.Face.Weight}");
                if (color.TryGetRgb(out double red, out double green, out double blue))
                    Console.WriteLine($"Color: Red={red}, Green={green}, Blue={blue}");
                Console.WriteLine($"Location: X={location.X:0.00}, Y={location.Y:0.00}");
                Console.WriteLine();
            }
    }
}

Imports System
Imports System.Linq
Imports GemBox.Pdf
Imports GemBox.Pdf.Content

Module Program
    Sub Main()
        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Iterate through all PDF pages and through each page's content elements,
        ' and retrieve only the text content elements.
        Using document = PdfDocument.Load("%InputFileName%")
            For Each textElement In document.Pages _
                .SelectMany(Function(page) page.Content.Elements.All()) _
                .Where(Function(element) element.ElementType = PdfContentElementType.Text) _
                .Cast(Of PdfTextContent)()
                Dim text = textElement.ToString()
                Dim font = textElement.Format.Text.Font
                Dim color = textElement.Format.Fill.Color
                Dim location = textElement.Location

                ' Read the text content element's additional information.
                Console.WriteLine($"Unicode text: {text}")
                Console.WriteLine($"Font name: {font.Face.Family.Name}")
                Console.WriteLine($"Font size: {font.Size}")
                Console.WriteLine($"Font style: {font.Face.Style}")
                Console.WriteLine($"Font weight: {font.Face.Weight}")
                Dim red, green, blue As Double
                If color.TryGetRgb(red, green, blue) Then Console.WriteLine($"Color: Red={red}, Green={green}, Blue={blue}")
                Console.WriteLine($"Location: X={location.X:0.00}, Y={location.Y:0.00}")
                Console.WriteLine()
            Next
        End Using
    End Sub
End Module
Reading text from a specific rectangular area
With GemBox.Pdf, you can extract a PDF document's text from a specific rectangular area. To do this, you define the coordinates of the targeted area and retrieve only the PdfTextContent elements that are within it, as shown in the next example.

using System;
using System.Drawing;
using System.Linq;
using GemBox.Pdf;
using GemBox.Pdf.Content;

class Program
{
    static void Main(string[] args)
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var pageIndex = 0;
        var area = new Rectangle(400, 690, 150, 30);

        using (var document = PdfDocument.Load("%InputFileName%"))
        {
            // Retrieve first page object.
            var page = document.Pages[pageIndex];

            // Retrieve text content elements that are inside specified area on the first page.
            foreach (var textElement in page.Content.Elements.All()
                .Where(element => element.ElementType == PdfContentElementType.Text)
                .Cast<PdfTextContent>())
            {
                var location = textElement.Location;
                if (location.X > area.X && location.X < area.X + area.Width &&
                    location.Y > area.Y && location.Y < area.Y + area.Height)
                    Console.Write(textElement.ToString());
            }
        }
    }
}

Imports System
Imports System.Drawing
Imports System.Linq
Imports GemBox.Pdf
Imports GemBox.Pdf.Content

Module Program
    Sub Main()
        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim pageIndex = 0
        Dim area = New Rectangle(400, 690, 150, 30)

        Using document = PdfDocument.Load("%InputFileName%")
            ' Retrieve first page object.
            Dim page = document.Pages(pageIndex)

            ' Retrieve text content elements that are inside specified area on the first page.
            For Each textElement In page.Content.Elements.All() _
                .Where(Function(element) element.ElementType = PdfContentElementType.Text) _
                .Cast(Of PdfTextContent)()
                Dim location = textElement.Location
                If location.X > area.X AndAlso location.X < area.X + area.Width AndAlso
                    location.Y > area.Y AndAlso location.Y < area.Y + area.Height Then
                    Console.Write(textElement.ToString())
                End If
            Next
        End Using
    End Sub
End Module
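
The area check in the example above compares a single point, each element's Location, against the rectangle, and it uses strict comparisons, so a point lying exactly on an edge is treated as outside. As a minimal sketch of that logic in isolation, the comparison can be factored into a small helper. The `Area` struct and its `Contains` method are hypothetical names introduced here for illustration and are not part of GemBox.Pdf; the sample coordinates are the ones used in the example.

```csharp
using System;

// Hypothetical helper, not part of GemBox.Pdf: an axis-aligned area described
// by the corner with the smallest X and Y values plus a width and a height.
public readonly struct Area
{
    public Area(double x, double y, double width, double height)
    {
        X = x;
        Y = y;
        Width = width;
        Height = height;
    }

    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public double Height { get; }

    // Returns true when the point lies strictly inside the area,
    // mirroring the comparisons used in the example.
    public bool Contains(double x, double y) =>
        x > X && x < X + Width &&
        y > Y && y < Y + Height;
}

public static class Program
{
    public static void Main()
    {
        var area = new Area(400, 690, 150, 30);

        Console.WriteLine(area.Contains(450, 700)); // True: inside on both axes
        Console.WriteLine(area.Contains(450, 730)); // False: above the area
        Console.WriteLine(area.Contains(400, 700)); // False: exactly on the left edge
    }
}
```

If points on the boundary should count as inside, the strict comparisons can be switched to `>=` and `<=`.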