Supported File Formats
GemBox.Document supports multiple file formats with varying degrees of support.
File format support
GemBox.Document supports the following file formats:
Input only:
- Microsoft Word 97-2003 Document (DOC).
- WordML (XML).
Input and output:
- Microsoft Word Document (DOCX).
- OpenDocument Text (ODT).
- Adobe Portable Document Format (PDF).
- HyperText Markup Language (HTML).
- MIME HyperText Markup Language (MHTML).
- Flat OPC Format (XML).
- Rich Text Format (RTF).
- Plain text (TXT).
Output only:
- Microsoft XML Paper Specification (XPS).
- Image formats (PNG, JPEG, GIF, BMP, TIFF, WMP).
Supporting a file format as an input format means that GemBox.Document can read that format; supporting it as an output format means that GemBox.Document can write to it.
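As a minimal sketch of reading one format and writing another, the following program loads a DOCX file and saves it as PDF. It assumes the GemBox.Document NuGet package is referenced and that the library picks the format from the file extension; the file names are hypothetical placeholders.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Free-mode key; replace it with your own serial key if you have one.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Read an input format (DOCX). "Input.docx" is a placeholder path.
        var document = DocumentModel.Load("Input.docx");

        // Write an output format (PDF). "Output.pdf" is a placeholder path.
        document.Save("Output.pdf");
    }
}
```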
GemBox.Document support for file formats also depends on the framework used.
Additional conversion outputs
In addition to exporting to a file or a stream, GemBox.Document also supports printing a document (Print() method) and converting a document to the following types:
- XpsDocument using the DocumentModel.ConvertToXpsDocument(XpsSaveOptions) method and
- ImageSource using the DocumentModel.ConvertToImageSource(ImageSaveOptions) method.
These outputs are especially useful in WPF applications because they let you embed a document in your application with the DocumentViewer and Image controls, as shown in our GemBox.Document WPF examples.
The following code snippet shows how to assign a DocumentModel instance to DocumentViewer and Image controls:
```csharp
// Assign a DocumentModel instance to DocumentViewer control.
documentViewer.Document = document.ConvertToXpsDocument(SaveOptions.XpsDefault).GetFixedDocumentSequence();

// Assign a DocumentModel instance to Image control.
image.Source = document.ConvertToImageSource(SaveOptions.ImageDefault);
```
Support level for DOCX and DOC formats
GemBox.Document supports most of the Microsoft Word Document (DOCX) and Word 97-2003 Document (DOC) features through its API, but not all.
For example, GemBox.Document doesn't support macros and SmartArt through its API.
Although GemBox.Document doesn't support all Microsoft Word Document (DOCX) features through its API, it allows you to preserve the unsupported features, so you don't lose any relevant document content when loading a document and saving it to DOCX format.
For more information, see Preservation and Preservation example.
Support level for reading PDF format
There are several options for reading PDF files using GemBox components. Each has its own advantages and suits different scenarios.
Logical loading
PDF is a fixed document format: the location of every piece of text, border line, background fill, etc. is specified in page coordinates and is, potentially, transformed. The GemBox.Document model, by contrast, is a flow document format (like HTML, for example). To read a PDF file into GemBox.Document, elements such as Paragraph and Table must therefore be recognized from PDF-positioned text and lines/paths.
The recognition of PDF logical structure in GemBox.Document is based on various heuristics that we have implemented and plan to improve and extend over time based on customer feedback. However, note that a fully correct recognition is impossible to achieve just by reading the content of PDF pages, because higher level information is required to disambiguate certain cases. For example, a PDF page with text in two columns could be a table with a single row and two cells or a section with two columns. Or, a PDF page with a single small line of text in the middle of it could be a paragraph with left alignment and left indentation, right alignment and right indentation, or some other combination.
For an example, see the extract text from PDF example.
GemBox.Document also supports reading encrypted PDF files. For an example, see the PDF encryption example.
High fidelity loading
High fidelity loading uses text frames to position the text in the same location on a page as it appeared in the PDF page. The PDF page graphics are converted to shapes or rendered into temporary images that are then inserted into a page.
Although the output of this approach looks very similar or identical to the input PDF, it has the following drawbacks:
- The logical structure of the document is not available. For example, if you have a table in a PDF file and you want to extract the content of a cell in the second row and third column, that is not possible since there is no table.
- Text search is limited. Since logically connected text segments might end up in different text frames, looking for a term which spans two or more text frames is not possible.
- Editing is limited. Since text segments are absolutely positioned on a page using text frames, removing text or adding new text doesn't reflow the rest of the content; the positions of all text frames are independent of each other.
For an example, see the convert PDF to DOCX example.
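The sketch below shows how high fidelity loading might be requested when converting a PDF to DOCX. It assumes that the loading approach is selected through a `PdfLoadOptions.LoadType` property with a `PdfLoadType.HighFidelity` value; check the API reference of your GemBox.Document version to confirm these names. The file names are hypothetical.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Free-mode key; replace it with your own serial key if you have one.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: LoadType switches from logical to high fidelity loading.
        var loadOptions = new PdfLoadOptions
        {
            LoadType = PdfLoadType.HighFidelity
        };

        // "Input.pdf" and "Output.docx" are placeholder paths.
        var document = DocumentModel.Load("Input.pdf", loadOptions);
        document.Save("Output.docx");
    }
}
```

The resulting DOCX file should look close to the source PDF, but, as listed above, its text sits in independent text frames rather than in reflowable paragraphs and tables.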
Loading using GemBox.Pdf
Alternatively, you can use the GemBox.Pdf component, which gives you lower-level access to PDF elements, more control over editing the document, and more precise information when extracting the content and properties of PDF elements.
How to choose a loading approach
The following table can help you choose which option is the best for your use case.
| | Logical loading in GemBox.Document | High-fidelity loading in GemBox.Document | Loading with GemBox.Pdf |
|---|---|---|---|
| Summary | The file is loaded by trying to detect the logical structure of the document. | The file is loaded by absolutely positioning paragraphs, shapes and images on pages. | The loaded model corresponds directly to the (low-level) PDF specification. |
| Advantages | | | |
| Disadvantages | | | |
| When to use | | | |
Support level for PDF, XPS and image formats
Exporting a document to a fixed document file format, such as PDF and XPS, and to image formats is handled by the GemBox.Document internal paginator and renderer, which are shared by all of these formats.
This means that PDF, XPS and image formats share the same level of support for document features, since they are all rendered in the same way.
The following list contains GemBox.Document API members that are currently not supported when exporting to PDF, XPS and image formats:
- ViewOptions class.
- WriteProtection class.
- CharacterFormat properties:
- TextWrappingStyle fields Through and Tight.
- BorderStyle fields Triple, Wave and DoubleWave.
- UnderlineType fields Wave and DoubleWave.
- Bar field.
- PenCompoundType fields.
- EffectPadding property.
Note
Support for these members will be added in future versions of GemBox.Document based on customer feedback.
Support for ISO-standardized versions of the Portable Document Format (PDF)
GemBox.Document supports writing to PDF/A, the ISO-standardized version of the Portable Document Format (PDF) specialized for long-term archiving of electronic documents.
The following list contains conformance levels that are currently supported when exporting to PDF format:
- PDF/A-1a,
- PDF/A-1b,
- PDF/A-2a,
- PDF/A-2b,
- PDF/A-2u,
- PDF/A-3a,
- PDF/A-3b,
- PDF/A-3u.
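As an illustration, a conformance level from the list above could be requested when saving, as in the sketch below. It assumes that `PdfSaveOptions` exposes a `ConformanceLevel` property of type `PdfConformanceLevel` with a `PdfA1a` member; verify these names against the API reference of your version. The file names are hypothetical.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Free-mode key; replace it with your own serial key if you have one.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "Input.docx" is a placeholder path.
        var document = DocumentModel.Load("Input.docx");

        // Assumption: ConformanceLevel selects the PDF/A level to write.
        var saveOptions = new PdfSaveOptions
        {
            ConformanceLevel = PdfConformanceLevel.PdfA1a
        };

        // "Output.pdf" is a placeholder path.
        document.Save("Output.pdf", saveOptions);
    }
}
```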
Support for Partially Trusted applications
Most Internet Service Providers restrict hosted ASP.NET applications to Medium Level Trust and, by doing so, disable access to files outside the application directory, among other things, as explained in trust Element (ASP.NET Settings Schema) level Attribute.
GemBox.Document support for Partially Trusted applications depends on the file format used, as follows:
- Word Document (DOCX), Word 97-2003 Document (DOC), OpenDocument Text (ODT), HyperText Markup Language (HTML), Flat OPC Format (XML), WordML (XML), Rich Text Format (RTF) and Plain Text (TXT) are fully supported in Partially Trusted applications.
- Adobe Portable Document Format (PDF) is supported in Partially Trusted applications if the font location is set to a directory that is available to the Partially Trusted application.
Important
Setting the font location directory is necessary for Partially Trusted applications because they can only access files inside the application directory, and font files are, by default, located in C:\Windows\Fonts, which Partially Trusted applications cannot access. For more information on how to set the font location directory, see Private Fonts sample. Font files are usually copyrighted, so make sure you comply with the font license before copying a font file to another location.
- Microsoft XML Paper Specification (XPS) is not supported in Partially Trusted applications because the ReachFramework.dll assembly, where most of the XPS implementation resides, is not decorated with AllowPartiallyTrustedCallersAttribute.
- Creating a digitally signed PDF file is not supported in Partially Trusted applications because ComputeSignature() does not work in partial trust.
- Image formats (PNG, JPEG, GIF, BMP, TIFF, WMP) are not supported in Partially Trusted applications because the BitmapEncoder class and its derived classes, used for writing image data to the specific image file format, do not work in partial trust.
- Printing is not supported in Partially Trusted applications because it uses XPS infrastructure.
- Charts are not supported in Partially Trusted applications because they rely on reflection.
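The font location requirement for PDF output in Partially Trusted applications could be met along the lines of the sketch below. It assumes the static `FontSettings.FontsBaseDirectory` property is the setting that controls where fonts are looked up, and that a hypothetical `Fonts` folder inside the application directory contains font files you are licensed to copy. The file names are also hypothetical.

```csharp
using System;
using System.IO;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Free-mode key; replace it with your own serial key if you have one.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Hypothetical "Fonts" folder located inside the application directory,
        // so that a Partially Trusted application is allowed to read it.
        string appDirectory = AppDomain.CurrentDomain.BaseDirectory;
        FontSettings.FontsBaseDirectory = Path.Combine(appDirectory, "Fonts");

        // "Input.docx" and "Output.pdf" are placeholder file names.
        var document = DocumentModel.Load(Path.Combine(appDirectory, "Input.docx"));
        document.Save(Path.Combine(appDirectory, "Output.pdf"));
    }
}
```

All paths stay inside the application directory, which matches the file access restriction described above.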