Supported File Formats
GemBox.Document supports multiple file formats with varying degrees of support.
File format support
GemBox.Document supports the following file formats:
- Microsoft Word 97-2003 Document (DOC).
Input and output:
- Microsoft Word Document (DOCX).
- HyperText Markup Language (HTML).
- MIME HyperText Markup Language (MHTML).
- Rich Text Format (RTF).
- Plain text (TXT).
- Flat OPC Format (XML).
- Adobe Portable Document Format (PDF).
- Microsoft XML Paper Specification (XPS).
- Image formats (PNG, JPEG, GIF, BMP, TIFF, WMP).
Supporting a file format as an input format means that GemBox.Document can read it; supporting it as an output format means that GemBox.Document can write it.
GemBox.Document support for file formats depends on the framework used, as follows:
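As a minimal sketch of the input/output distinction, the following reads a DOCX document (an input format) and writes it as PDF (an output format). The class and method names are hypothetical, and the license key is the free-mode key used in the GemBox examples; replace it with your own.

```csharp
using System.IO;
using GemBox.Document;

// Hypothetical helper for illustration; not part of the GemBox.Document API.
public static class FormatConversionSketch
{
    // Reads DOCX bytes (input format) and returns PDF bytes (output format).
    public static byte[] DocxToPdf(byte[] docxBytes)
    {
        // Free-mode key from the GemBox examples; replace with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        using (var input = new MemoryStream(docxBytes))
        using (var output = new MemoryStream())
        {
            // The load options select the reader, the save options select the writer.
            DocumentModel document = DocumentModel.Load(input, LoadOptions.DocxDefault);
            document.Save(output, SaveOptions.PdfDefault);
            return output.ToArray();
        }
    }
}
```

With files on disk, the same helper could be called as File.WriteAllBytes("Output.pdf", FormatConversionSketch.DocxToPdf(File.ReadAllBytes("Input.docx"))), where both file names are placeholders.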
Additional conversion outputs
In addition to exporting to a file or a stream, GemBox.Document also supports printing a document (Print() method) and converting a document to the following types:
- XpsDocument using the DocumentModel.ConvertToXpsDocument(XpsSaveOptions) method and
- ImageSource using the DocumentModel.ConvertToImageSource(ImageSaveOptions) method.
These outputs are especially useful in WPF applications, where they let you embed a document using the DocumentViewer and Image controls, as shown in our GemBox.Document WPF examples.
// Assign a DocumentModel instance to DocumentViewer control.
documentViewer.Document = document.ConvertToXpsDocument(SaveOptions.XpsDefault).GetFixedDocumentSequence();
// Assign a DocumentModel instance to Image control.
image.Source = document.ConvertToImageSource(SaveOptions.ImageDefault);
Support level for DOCX and DOC formats
GemBox.Document supports most of the Microsoft Word Document (DOCX) and Word 97-2003 Document (DOC) features through its API, but not all.
For example, GemBox.Document doesn't support macros and SmartArt graphics through its API.
Although GemBox.Document doesn't support all Microsoft Word Document (DOCX) features through its API, it can preserve the unsupported features, so you don't lose any relevant document content when loading a document and saving it back to DOCX format.
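A DOCX round trip that keeps unsupported content could be sketched as follows. This assumes that DocxLoadOptions.PreserveUnsupportedFeatures is the switch that controls this behavior; verify the property name and its default against the API reference for your installed version. The class and method names are hypothetical.

```csharp
using System.IO;
using GemBox.Document;

// Hypothetical helper for illustration; not part of the GemBox.Document API.
public static class DocxRoundTripSketch
{
    // Loads a DOCX document and saves it back to DOCX.
    public static byte[] RoundTrip(byte[] docxBytes)
    {
        // Free-mode key from the GemBox examples; replace with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: this option asks the reader to keep features
        // (for example macros) that the API itself does not expose.
        var loadOptions = new DocxLoadOptions { PreserveUnsupportedFeatures = true };

        using (var input = new MemoryStream(docxBytes))
        using (var output = new MemoryStream())
        {
            DocumentModel document = DocumentModel.Load(input, loadOptions);

            // Edit the supported parts of the document here.

            document.Save(output, SaveOptions.DocxDefault);
            return output.ToArray();
        }
    }
}
```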
Support level for reading PDF format (beta)
GemBox.Document currently supports reading PDF files that contain text in paragraphs and/or tables on .NET Framework and .NET Core for Windows.
PDF is a fixed document format: the location of every piece of text, border line, background fill, etc. is specified in page coordinates and is potentially transformed. The GemBox.Document model, in contrast, is a flow document format (like HTML, for example). To read a PDF file into GemBox.Document, elements such as Paragraph and Table must therefore be recognized from the PDF's positioned text and lines/paths.
The recognition of PDF logical structure in GemBox.Document is based on various heuristics that we have implemented and plan to improve and extend over time based on customer feedback. However, note that a fully correct recognition is impossible to achieve just by reading the content of PDF pages, because higher level information is required to disambiguate certain cases. For example, a PDF page with text in two columns could be a table with a single row and two cells or a section with two columns. Or, a PDF page with a single small line of text in the middle of it could be a paragraph with left alignment and left indentation, right alignment and right indentation, or some other combination.
Some competing products offer high-fidelity PDF reading by using text frames or text boxes to place the text at the same location on a page as in the PDF, and by rendering the PDF page graphics into a temporary image that is then inserted into the page.
Although the output of this approach looks very similar or identical to the input PDF, it has the following flaws:
- The logical structure of the document is not available - For example, if you have a table in a PDF file and you want to extract the content of a cell in the second row and third column, that is not possible since there is no table. There is just a pile of text frames (with text fragments) and pictures (with border lines, background fills, and other graphics) scattered all over the document.
- Text search is limited - Since logically connected text segments might end up in different text frames (for example, if a paragraph is justified), looking for a term which spans two or more text frames is not possible.
- Editing is limited - Since text segments are absolutely positioned on a page using text frames, removing text or adding new text doesn't reflow the rest of the content; the positions of all text frames are independent of each other.
The GemBox.Document PDF reader was designed to handle these scenarios. For an example, see the extract text from PDF example.
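Because the reader recovers logical structure, a recognized table can be addressed by row and column. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and whether a given PDF yields a Table element at all depends on the recognition heuristics described above.

```csharp
using System.Linq;
using GemBox.Document;
using GemBox.Document.Tables;

// Hypothetical helper for illustration; not part of the GemBox.Document API.
public static class PdfTableSketch
{
    // Returns the text of the cell at the given zero-based row and column of the
    // first table recognized in the PDF, or null if no such table or cell exists.
    public static string ReadCell(string pdfPath, int row, int column)
    {
        // Free-mode key from the GemBox examples; replace with your own license key.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // The input format is inferred from the file extension.
        DocumentModel document = DocumentModel.Load(pdfPath);

        // Search the whole document tree for Table elements and take the first one.
        Table table = document.GetChildElements(true, ElementType.Table)
            .Cast<Table>()
            .FirstOrDefault();

        if (table == null || row < 0 || column < 0
            || row >= table.Rows.Count
            || column >= table.Rows[row].Cells.Count)
        {
            return null;
        }

        return table.Rows[row].Cells[column].Content.ToString();
    }
}
```

For the cell in the second row and third column, the call would be PdfTableSketch.ReadCell("Input.pdf", 1, 2), where the file name is a placeholder.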
GemBox.Document also supports reading encrypted PDF files. For an example, see the PDF encryption example.
Based on customer feedback, we might also implement high fidelity PDF reading using text frames, but with the same limitations as mentioned above.
Support level for PDF, XPS and image formats
Exporting a document to a fixed document format, such as PDF or XPS, or to an image format is done by the internal GemBox.Document paginator and renderer, which are shared by all of these formats.
This means that PDF, XPS and image formats share the same level of support for document features, since they are all rendered in the same way.
The following list contains GemBox.Document API members that are currently not supported when exporting to PDF, XPS and image formats:
- ViewOptions class.
- WriteProtection class.
- CharacterFormat properties:
- TextWrappingStyle fields Through and Tight.
- BorderStyle fields Triple, Wave and DoubleWave.
- UnderlineType fields Wave and DoubleWave.
- Bar field.
- PenCompoundType fields.
- EffectPadding property.
Support for these members will be added in future versions of GemBox.Document based on customer feedback.
Support for ISO-standardized versions of the Portable Document Format (PDF)
GemBox.Document supports writing to PDF/A, the ISO-standardized version of the Portable Document Format (PDF) specialized for long-term archiving of electronic documents.
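Writing a PDF/A file might look like the sketch below. It assumes that the conformance level is selected through PdfSaveOptions.ConformanceLevel and that PdfConformanceLevel.PdfA1a is one of the available values; check both names against the API reference for your version. The class and method names are hypothetical.

```csharp
using System.IO;
using GemBox.Document;

// Hypothetical helper for illustration; not part of the GemBox.Document API.
public static class PdfASketch
{
    // Saves the document to an in-memory PDF using a PDF/A conformance level.
    public static byte[] SaveAsPdfA(DocumentModel document)
    {
        // Assumption: ConformanceLevel and PdfA1a are the relevant member names.
        var options = new PdfSaveOptions
        {
            ConformanceLevel = PdfConformanceLevel.PdfA1a
        };

        using (var output = new MemoryStream())
        {
            document.Save(output, options);
            return output.ToArray();
        }
    }
}
```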
The following list contains conformance levels that are currently supported when exporting to PDF format:
Support for Partially Trusted applications
Most Internet Service Providers restrict hosted ASP.NET applications to Medium Level Trust, which, among other things, prevents them from accessing files outside the application directory, as explained in trust Element (ASP.NET Settings Schema) level Attribute.
GemBox.Document support for Partially Trusted applications depends on the file formats used, as follows:
- Word Document (DOCX), Word 97-2003 Document (DOC), HyperText Markup Language (HTML), Rich Text Format (RTF) and Plain Text (TXT) are fully supported in Partially Trusted applications.
- Adobe Portable Document Format (PDF) is supported in Partially Trusted applications if the font location is set to a directory that is available to the Partially Trusted application.
Setting the font location directory is necessary because Partially Trusted applications can only access files inside the application directory, while font files are, by default, located in C:\Windows\Fonts, which such applications cannot access. For more information on how to set the font location directory, see the Private Fonts sample. Font files are usually copyrighted, so make sure you comply with the font license before copying a font file to another location.
- Microsoft XML Paper Specification (XPS) is not supported in Partially Trusted applications because the ReachFramework.dll assembly, where most of the XPS implementation resides, is not decorated with AllowPartiallyTrustedCallersAttribute.
- Creating a digitally signed PDF file is not supported in Partially Trusted applications because ComputeSignature() does not work in partial trust.
- Image formats (PNG, JPEG, GIF, BMP, TIFF, WMP) are not supported in Partially Trusted applications because BitmapEncoder class and its derived classes, used for writing image data to the specific image file format, do not work in partial trust.
- Printing is not supported in Partially Trusted applications because it uses XPS infrastructure.
- Charts are not supported in Partially Trusted applications because they rely on reflection.
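For the PDF case above, pointing the font lookup at a folder inside the application directory could be sketched as follows. The class name, method name and the "Fonts" folder are hypothetical; FontSettings.FontsBaseDirectory is the setting used in the Private Fonts sample, and the fonts you copy there must be licensed for that use.

```csharp
using System;
using System.IO;
using GemBox.Document;

// Hypothetical helper for illustration; not part of the GemBox.Document API.
public static class PartialTrustFontsSketch
{
    // Points the font lookup at a folder under the application directory
    // and returns the resolved path.
    public static string UseApplicationFonts(string relativeFontsFolder)
    {
        string fontsDirectory = Path.Combine(
            AppDomain.CurrentDomain.BaseDirectory, relativeFontsFolder);

        // Set this before saving to PDF so that fonts are loaded from the
        // application folder rather than from C:\Windows\Fonts.
        FontSettings.FontsBaseDirectory = fontsDirectory;
        return fontsDirectory;
    }
}
```

Calling PartialTrustFontsSketch.UseApplicationFonts("Fonts") once at application startup would be enough, since the setting is static.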