Supported File Formats

GemBox.Document supports multiple file formats, with varying degrees of support.

Read and write

Microsoft Word Document (DOCX).
OpenDocument Text (ODT).
Adobe Portable Document Format (PDF).
HyperText Markup Language (HTML).
MIME HyperText Markup Language (MHTML).
Flat OPC Format (XML).
Rich Text Format (RTF).
Plain text (TXT).

Read only

Microsoft Word 97-2003 Document (DOC).
WordML (XML).

Write only

Microsoft XML Paper Specification (XPS).
Image formats (SVG, PNG, JPEG, GIF, BMP, TIFF, WMP, EMF).
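As a minimal sketch of how the lists above translate into code, the following C# snippet loads a read-only format (DOC) and saves it to two writable formats. The file names are placeholders, and the snippet assumes the free-limited license key and that the format is inferred from the file extension.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key; replace it with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // DOC is in the "Read only" list, so it can be loaded but not saved.
        // "Input.doc" is a placeholder path.
        DocumentModel document = DocumentModel.Load("Input.doc");

        // Save to formats from the "Read and write" list.
        // The format is chosen from the file extension.
        document.Save("Output.docx");
        document.Save("Output.pdf");
    }
}
```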

Additional conversion outputs

In addition to exporting to a file or a stream, GemBox.Document also supports document printing (DocumentModel.Print() method) and converting to the following types:

XpsDocument, using the DocumentModel.ConvertToXpsDocument(XpsSaveOptions) method, and
ImageSource, using the DocumentModel.ConvertToImageSource(ImageSaveOptions) method.

These outputs are especially useful in WPF applications because they let you embed a document in your application with the DocumentViewer and Image controls, as shown in our GemBox.Document WPF examples.

The following code snippet shows how to assign a DocumentModel instance to the DocumentViewer and Image controls:

C#

// Assign a DocumentModel instance to a DocumentViewer control.
documentViewer.Document = document.ConvertToXpsDocument(SaveOptions.XpsDefault).GetFixedDocumentSequence();

// Assign a DocumentModel instance to an Image control.
image.Source = document.ConvertToImageSource(SaveOptions.ImageDefault);

VB.NET

' Assign a DocumentModel instance to a DocumentViewer control.
documentViewer.Document = document.ConvertToXpsDocument(SaveOptions.XpsDefault).GetFixedDocumentSequence()

' Assign a DocumentModel instance to an Image control.
image.Source = document.ConvertToImageSource(SaveOptions.ImageDefault)

Support level for DOCX and DOC formats

GemBox.Document supports most, but not all, DOCX and DOC features through its API. For example, equations are not accessible through the API.

By default, GemBox.Document will preserve the unsupported features, so you don't lose any relevant document content when loading and saving a document to DOCX format.

For more information, see the Preservation documentation page and Preservation example.

Support level for reading PDF format

There are various options for reading PDF files using GemBox components. Each has its advantages and is suitable for different scenarios.

Logical loading

PDF is a fixed document format: the location of every piece of text, border line, background fill, and so on is specified in page coordinates and is potentially transformed. The GemBox.Document model, by contrast, is a flow document format (like HTML, for example). To read a PDF file into a DocumentModel, elements such as Paragraph and Table must therefore be recognized from the positioned text and lines/paths on the PDF pages.

The recognition of PDF logical structure in GemBox.Document is based on various heuristics that we have implemented and plan to improve and extend over time based on customer feedback. However, note that a fully correct recognition is impossible to achieve just by reading the content of PDF pages because higher level information is required to disambiguate certain cases. For example, a PDF page with text in two columns could be a table with a single row and two cells or a section with two columns. Or, a PDF page with a single small line of text in the middle of it could be a paragraph with left alignment and left indentation, right alignment and right indentation, or some other combination.

For an example, see the Extract text from PDF example. GemBox.Document also supports reading encrypted PDF files.
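As a minimal sketch of logical loading, the following C# snippet loads a PDF file and prints the text of each recognized paragraph. It assumes that logical loading is the default behavior when no load options are passed; "Input.pdf" is a placeholder path.

```csharp
using System;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key; replace it with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: loading without options uses logical loading,
        // so paragraphs and tables are recognized heuristically.
        DocumentModel document = DocumentModel.Load("Input.pdf");

        // Walk all recognized paragraphs, including those nested in tables.
        foreach (Paragraph paragraph in document.GetChildElements(true, ElementType.Paragraph))
            Console.WriteLine(paragraph.Content.ToString());
    }
}
```

Because the recognition is heuristic, the paragraphs you get back may not match the visual layout of the original pages exactly.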

High fidelity loading

High fidelity loading uses text frames to position the text in the same location on a page as it appeared on the PDF page. The PDF page graphics are converted to shapes or rendered into temporary images that are then inserted into a page.

Although the output of this approach looks very similar or identical to the input PDF, it has the following drawbacks:

The logical structure of the document is not available - For example, if you have a table in a PDF file and you want to extract the content of a cell in the second row and third column, that is not possible since there is no table.
Text search is limited - Since logically connected text segments might end up in different text frames, looking for a term that spans two or more text frames is not possible.
Editing is limited - Since text segments are absolutely positioned on a page using text frames, removing text or adding new text doesn't reflow the rest of the content; the positions of all text frames are independent of each other.

For an example, see the Convert PDF to DOCX example.
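The following C# sketch shows how a high fidelity conversion might look. It assumes that high fidelity loading is selected through the PdfLoadOptions.LoadType property with the PdfLoadType.HighFidelity value; check the API reference for your version. The file names are placeholders.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key; replace it with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: LoadType and PdfLoadType.HighFidelity select the
        // text-frame based loading described above.
        var loadOptions = new PdfLoadOptions() { LoadType = PdfLoadType.HighFidelity };

        DocumentModel document = DocumentModel.Load("Input.pdf", loadOptions);

        // The saved DOCX keeps the page appearance, but its text sits in
        // absolutely positioned text frames rather than in flowing paragraphs.
        document.Save("Output.docx");
    }
}
```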

Loading using GemBox.Pdf

Alternatively, you can use the GemBox.Pdf component, which provides lower-level access to PDF elements. It gives you more control when editing the document and more precise information when extracting the content and properties of PDF elements.

How to choose a loading approach

The following table can help you choose which option is the best for your use case.

	Logical loading in GemBox.Document	High-fidelity loading in GemBox.Document	Loading with GemBox.Pdf
Summary	The file is loaded by trying to detect the logical structure of the document.	The file is loaded by absolutely positioning paragraphs, shapes, and images on pages.	The loaded model corresponds directly to the (low-level) PDF specification.
Advantages	The loaded document has a flow structure making it easier to edit. Easy to extract text from tables and paragraphs.	The loaded document looks almost identical to the original PDF. Almost every PDF feature is recognized.	The model gives precise control over all PDF features.
Disadvantages	Only a limited number of elements are recognized. Visually, the loaded document doesn't look exactly like the original PDF.	The absolute positioning of elements makes it harder to edit the document in MS Word or a similar application.	The lower-level model is harder to work with. Doesn't support conversion to DOCX and other office file formats.
When to use	Flow structure extraction (tables, paragraphs). Conversion to other file formats if heavy editing is expected after the conversion.	Conversion to other file formats such as DOCX, RTF, ODT, or XPS.	Editing of PDF documents. Conversion to image file formats.

Support level for writing PDF, XPS, and image formats

Exporting a document to a fixed document format, such as PDF or XPS, or to an image format is done by GemBox.Document's internal paginator and renderer, which are shared by all of these formats. This means that PDF, XPS, and image formats have the same level of support for document features, since they are all rendered in the same way.

The following list contains the GemBox.Document API members that are currently not supported when exporting to PDF, XPS, and image formats:

CharacterFormat properties:
- Border,
- Kerning.
TextWrappingStyle fields Through and Tight.
BorderStyle fields Triple, Wave, and DoubleWave.
UnderlineType fields Wave and DoubleWave.
Bar field.
PenCompoundType fields.
EffectPadding property.
AlignParagraphBordersAndTableEdgesWithPageBorder property.
ShrinkTextOnOverflow field.

Note

Support for these members will be added in future versions of GemBox.Document based on customer feedback.

Support for ISO-standardized versions of PDF

GemBox.Document supports writing to PDF/A, the ISO-standardized version of the Portable Document Format (PDF) specialized for long-term archiving of electronic documents.

The following list contains the conformance levels that are currently supported when exporting to PDF format:

PDF/A-1a,
PDF/A-1b,
PDF/A-2a,
PDF/A-2b,
PDF/A-2u,
PDF/A-3a,
PDF/A-3b,
PDF/A-3u.
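As a minimal sketch, a conformance level could be requested when saving as shown in the following C# snippet. It assumes that the level is set through the PdfSaveOptions.ConformanceLevel property and that the enum member for PDF/A-2b is named PdfConformanceLevel.PdfA2b; verify both names in the API reference. The file names are placeholders.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key; replace it with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        DocumentModel document = DocumentModel.Load("Input.docx");

        // Assumption: ConformanceLevel and PdfConformanceLevel.PdfA2b are the
        // member names that correspond to the PDF/A-2b level listed above.
        var saveOptions = new PdfSaveOptions()
        {
            ConformanceLevel = PdfConformanceLevel.PdfA2b
        };

        document.Save("Output.pdf", saveOptions);
    }
}
```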