Supported File Formats
GemBox.Document supports multiple file formats with varying degrees of support.
The following sections explain file format support in more detail and present other valuable, format-specific information.
GemBox.Document supports the following file formats:
Microsoft Word 97-2003 Document (DOC).
Microsoft Word Document (DOCX).
HyperText Markup Language (HTML).
Rich Text Format (RTF).
Plain text (TXT).
Adobe Portable Document Format (PDF).
Microsoft XML Paper Specification (XPS).
MIME HyperText Markup Language (MHTML).
Image formats (PNG, JPEG, GIF, BMP, TIFF, WMP).
Supporting a file format as an input format means that GemBox.Document can read that format; supporting it as an output format means that GemBox.Document can write to that format.
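As an illustration of the input/output distinction, here is a minimal sketch that reads one format and writes another. The file names are placeholders, and the free-limited license key is an assumption made for the sake of a runnable example; substitute your own key and paths.

```csharp
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key is used for illustration only.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Input: DOCX is read; the format is inferred from the file extension.
        // "Input.docx" is a placeholder file name.
        DocumentModel document = DocumentModel.Load("Input.docx");

        // Output: PDF is written; the format is again inferred from the extension.
        // "Output.pdf" is a placeholder file name.
        document.Save("Output.pdf");
    }
}
```

A conversion like this only works when the source format is supported for input and the target format is supported for output.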
In addition to exporting to a file or a stream, GemBox.Document also supports printing a document (DocumentModel.Print method) and converting a document to the following types:
XpsDocument using the DocumentModel.ConvertToXpsDocument(XpsSaveOptions) method.
ImageSource using the DocumentModel.ConvertToImageSource(ImageSaveOptions) method.
These outputs are especially useful in WPF applications, by providing a means to embed a document in your WPF application with DocumentViewer and Image controls, as shown in our GemBox.Document WPF examples.
// Assign a DocumentModel instance to DocumentViewer control.
documentViewer.Document = document.ConvertToXpsDocument(SaveOptions.XpsDefault).GetFixedDocumentSequence();

// Assign a DocumentModel instance to Image control.
image.Source = document.ConvertToImageSource(SaveOptions.ImageDefault);

' Assign a DocumentModel instance to DocumentViewer control.
documentViewer.Document = document.ConvertToXpsDocument(SaveOptions.XpsDefault).GetFixedDocumentSequence()

' Assign a DocumentModel instance to Image control.
image.Source = document.ConvertToImageSource(SaveOptions.ImageDefault)
GemBox.Document supports most of the Microsoft Word Document (DOCX) and Word 97-2003 Document (DOC) features through its API, but not all.
For example, GemBox.Document doesn't support document revision changes, structured document tags, custom XML tags, or comments through its API.
Although it does not support all Microsoft Word Document (DOCX) features through its API, GemBox.Document preserves the unsupported features, so you don't lose any relevant document content when you load a DOCX document and save it back to DOCX format.
GemBox.Document currently supports reading PDF files that contain text in paragraphs and/or tables.
PDF is a fixed document format: the location of every piece of text, border line, background fill, and so on is specified in page coordinates and is potentially transformed. The GemBox.Document model, by contrast, is a flow document format (like HTML, for example). To read a PDF file into GemBox.Document, elements such as Paragraphs and Tables must therefore be recognized from the positioned text and lines/paths in the PDF.
The recognition of PDF logical structure in GemBox.Document is based on various heuristics that we have implemented and plan to improve and extend over time based on customer feedback.
However, note that a fully correct recognition is impossible to achieve just by reading the content of PDF pages, because higher level information is required to disambiguate certain cases. For example, a PDF page with text in two columns could be a table with a single row and two cells or a section with two columns. Or, a PDF page with a single small line of text in the middle of it could be a paragraph with left alignment and left indentation, right alignment and right indentation, or some other combination.
Some competing products offer high-fidelity PDF reading by using text frames or text boxes to place the text at the same location on a page as in the PDF, and by rendering PDF page graphics into a temporary image that is then inserted into the page.
Although the output of this approach looks very similar or identical to the input PDF, it has the following flaws:
The logical structure of the document is not available.
For example, if you have a table in a PDF file and you want to extract the content of a cell in the second row and third column, that is not possible since there is no table. There is just a pile of text frames (with text fragments) and pictures (with border lines, background fills, and other graphics) scattered all over the document.
Text search is limited.
Since logically connected text segments might end up in different text frames (for example, if a paragraph is justified), looking for a term which spans two or more text frames is not possible.
Editing is limited.
Since text segments are absolutely positioned on a page using text frames, removing text or adding new text doesn’t reflow the rest of the content; the positions of all text frames are independent of each other.
The GemBox.Document PDF reader was designed to handle these scenarios. For an example, see the extract text from PDF example.
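The table scenario described above (reading the cell in the second row and third column) can be sketched as follows. This is a minimal sketch, not a definitive implementation: the file name is a placeholder, the free-limited license key is an assumption, and whether a table is actually recognized depends on the heuristics mentioned earlier, so the code checks before indexing.

```csharp
using System;
using System.Linq;
using GemBox.Document;
using GemBox.Document.Tables;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key is used for illustration only.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "Input.pdf" is a placeholder file name.
        DocumentModel document = DocumentModel.Load("Input.pdf");

        // Because the content is read into a flow model, the whole text
        // is available as one string.
        Console.WriteLine(document.Content.ToString());

        // Find the first recognized table anywhere in the document, if any.
        Table table = document
            .GetChildElements(true, ElementType.Table)
            .Cast<Table>()
            .FirstOrDefault();

        // Guard against a missing table or a table that is too small.
        if (table != null && table.Rows.Count > 1 && table.Rows[1].Cells.Count > 2)
        {
            // Second row, third column (indexes are zero-based).
            Console.WriteLine(table.Rows[1].Cells[2].Content.ToString());
        }
    }
}
```

With the text-frame approach described above, this lookup has no equivalent because the document contains no table element to query.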
GemBox.Document also supports reading encrypted PDF files. For an example, see the PDF encryption example.
Based on customer feedback, we might also implement high fidelity PDF reading using text frames, but with the same limitations as mentioned above.
Exporting a document to a fixed document file format, such as PDF or XPS, or to an image format is handled by the GemBox.Document internal paginator and renderer, which are shared by all of these formats.
This means that PDF, XPS and image formats share the same level of support for document features, since they are all rendered in the same way.
The following list contains GemBox.Document API members that are, currently, not supported when exporting to PDF, XPS and image formats:
Support for these members will be added in future versions of GemBox.Document based on customer feedback.
Most Internet Service Providers restrict hosted ASP.NET applications to Medium Level Trust and, by doing so, disable access to files outside the application directory, among other things, as explained in trust Element (ASP.NET Settings Schema), level Attribute.
GemBox.Document support for Partially Trusted applications depends on the file format used, as follows:
Word Document (DOCX), Word 97-2003 Document (DOC), HyperText Markup Language (HTML), Rich Text Format (RTF) and Plain Text (TXT) are fully supported in Partially Trusted applications.
Adobe Portable Document Format (PDF) is supported in Partially Trusted applications if font location is set to a directory that is available to the Partially Trusted application.
Setting the font location directory is necessary for Partially Trusted applications because they can only access files inside the application directory, and font files are, by default, located in C:\Windows\Fonts, which Partially Trusted applications cannot access.
For more information on how to set the font location directory, see the Private Fonts sample.
Font files are usually copyrighted, so make sure you comply with the font license before copying a font file to another location.
Microsoft XML Paper Specification (XPS) is not supported in Partially Trusted applications because the ReachFramework.dll assembly, where most of the XPS implementation resides, is not decorated with AllowPartiallyTrustedCallersAttribute.
Image formats (PNG, JPEG, GIF, BMP, TIFF, WMP) are not supported in Partially Trusted applications because the BitmapEncoder class and its derived classes, used for writing image data to the specific image file format, do not work in partial trust.
Printing is not supported in Partially Trusted applications because it uses the XPS infrastructure.
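For the PDF case above, the font location can be pointed at a directory inside the application directory before saving. The following is a minimal sketch under stated assumptions: the "Fonts" subfolder and the file names are hypothetical, the free-limited license key is used for illustration, and the font files must already have been copied there in accordance with their licenses.

```csharp
using System;
using System.IO;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // Assumption: the free-limited key is used for illustration only.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Hypothetical layout: a "Fonts" folder inside the application directory,
        // which a Partially Trusted application is allowed to read.
        string appDirectory = AppDomain.CurrentDomain.BaseDirectory;
        FontSettings.FontsBaseDirectory = Path.Combine(appDirectory, "Fonts");

        // Placeholder file names, both inside the application directory.
        DocumentModel document = DocumentModel.Load(Path.Combine(appDirectory, "Input.docx"));
        document.Save(Path.Combine(appDirectory, "Output.pdf"));
    }
}
```

The font directory is set before saving so that the PDF export looks for fonts in that folder rather than in C:\Windows\Fonts.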