Read and write Open XML files (MS Office 2007)
Here you can download the latest version of Excel 2007 file format (XLSX) demo application,
together with C# source code:
Download Excel2007ASPNet.zip
(last updated on 2006-12-22)
After downloading, extract ZIP file to folder on your hard drive and open the solution
file in Visual Studio 2005. If you don't have Visual Studio 2005 you can run the
application directly by running the executable in the bin folder.
Introduction
With Office 2007, Microsoft decided to change default application formats from old,
proprietary, closed formats (DOC, XLS, PPT) to new, open and standardized XML formats
(DOCX, XLSX and PPTX). New formats share some similarities with old Office XML formats
(WordML, SpreadsheetML) and some similarities with competing OpenOffice.org OpenDocument
formats, but there are many differences. Since new formats will be default in Office
2007 and Microsoft Office is the most predominant office suite, these formats are
destined to be popular and you will probably have to deal with them sooner or later.
This article will explain basics of Open XML file format and specifically XLSX format,
the new format for Excel 2007. Presented is a demo application which writes / reads
tabular data to / from XLSX files. Application is written in C# using Visual Studio
2005. Created XLSX files can be opened using Excel 2007.
Microsoft Open XML format
Every Open XML file is essentially a ZIP archive containing many other files. Office-specific
data is stored in multiple XML files inside that archive. This is in direct contrast
with old WordML and SpreadsheetML formats which were single, non-compressed XML
files. Although more complex, new approach offers few benefits:
-
You don’t need to process entire file in order to extract specific data.
-
Images and multimedia are now encoded in native format, not as text streams.
-
Files are smaller as a result of compression and native multimedia storage.
In Microsoft’s terminology, Open XML ZIP file is called a package. Files inside
that package are called parts. It is important to know that every part has a defined
content type and there are no default type presumptions based on the file extension.
Content type can describe anything; application XML, user XML, images, sounds, video
or any other binary objects. Every part must be connected to some other part using
a relationship. Inside package are special XML files with “.rels” extension which
define relationship between parts. There is also a start part (sometimes called
“root”, which is a bit misleading because graph containing all parts doesn’t have
to be a tree structure), so entire structure looks like in a Picture 1.

Picture 1: Parts and relations inside XLSX file.
To cut a long story short, in order to read the data from an Open XML file you need
to:
- Open package as a ZIP archive – any standard ZIP library will do.
- Find parts that contain data you want to read – you can navigate through
relationship graph (more complex) or you can presume that certain parts have defined
name and path (Microsoft can change that in the future).
- Read parts you are interested in – using standard XML library (if they are
XML) or some other method (if they are images, sounds or of some other type).
On the other side, if you want to create a new Open XML file, you need to:
- Create/get all necessary parts - by using some standard XML library (if they
are XML), by copying them or by using some other method.
- Create all relationships – create “.rels” files.
- Create content types – create “[Content_Types].xml” file.
- Package everything into a ZIP file with appropriate extension (DOCX, XLSX or
PPTX) – any standard ZIP library will do.
The whole story about packages, parts, content types and relations is the same for
all Open XML documents (regardless of they originating application) and Microsoft
refers to it as Open Packaging Conventions.
Excel 2007 Open XML specifics
Excel 2007 extends on the basis of Open Packaging Conventions by adding its own
application-specific XML types. Reference schemas for all XML files used in Office
can be downloaded from MSDN, but note some things are still open to change until
the final Excel 2007 release.
We just want to write / read worksheet data, so we need to look in folder “\xl\worksheets”
inside XLSX file, where all the worksheets are located. For every worksheet there
is a separate XML file; “sheet1.xml”, “sheet2.xml” and so on. When you open such
file you will notice that all of the sheet data is inside <sheetData> element.
For every row there is a <row> element, for every cell there is a <c>
element. Finally, value of the cell is stored in a <v> element.
However, real world XML is never simple as schoolbook XML. You will notice that
numbers get encoded as numbers inside <v> element:
<c r="A1"&rt;
<v>100</v>
</c>
However, string value (like “John”), also gets encoded as number:
<c r="B1" t="s"&rt;
<v>0</v>
</c>
That is because MS Excel uses internal table of unique strings (for performance
reasons). Zero is an index of that string in an internal table of strings and attribute
t="s" tells us that underlying type is a string, not a number. So where is the table
of unique strings located? It is in “\xl\sharedStrings.xml” XML file, and contains
all strings used in entire workbook, not just specific worksheet.
This approach is used for many other things; cell styles, borders, charts, number
formats etc. In fact, that becomes the major programming problem when working with
XLSX files – updating and maintaining various tables of some unique Excel objects.
In this article we will just write / read data values, but if you require some complex
formatting you should probably be better using some commercial component which does
all tedious work for you.
Implementation
Our demo is a Web Forms application (see Picture 2), written in C# using Visual
Studio 2005. Since there is no support for ZIP files in .NET Framework 2.0 (only
for ZIP algorithm), our demo is using an open-source ZIP library called SharpZipLib
(available at: http://www.icsharpcode.net/OpenSource/SharpZipLib). For
demonstration purposes we will extract entire ZIP files to TEMP folder, so we can
examine contents of that folder and files while debugging demo application. In real
world application you may want to avoid extracting to temporary folder and just
read to / write from ZIP file directly.
For XML processing, the choice is simple. For reading XML files we use XmlTextReader
class and for writing we use XmlTextWriter class. Both come with .NET Framework,
but you can also use any other XML processing library.

Picture 2: Demo application in action.
Data reading
We want to read a simple “In.xlsx” file (in the “Input” folder) and copy its contents
to the DataTable. That file contains a list of people with their first and last
names (text values) and their IDs (number values). When “Read input .xlsx file”
button in clicked, the following code is executed:
protected void
buttonReadInput_Click(object sender,
EventArgs e)
{
//
Input file name.
string
fileName = Request.PhysicalApplicationPath + @"..\Input\In.xlsx";
// Delete contents of the temporary directory.
ExcelRW.DeleteDirectoryContents(tempDir);
// Unzip input XLSX file to the temporary directory.
ExcelRW.UnzipFile(fileName,
tempDir);
// Open XML file with table of all unique
strings used in the workbook..
FileStream
fs = new FileStream(tempDir
+ @"\xl\sharedStrings.xml",
FileMode.Open, FileAccess.Read);
// ..and call helper method that parses
that XML and returns an array of
// strings.
ArrayList
stringTable = ExcelRW.ReadStringTable(fs);
// Get DataTable with people from session variable.
DataTable
data = (DataTable)Session["people"];
// Open XML file with worksheet data..
fs =
new FileStream(tempDir +
@"\xl\worksheets\sheet1.xml",
FileMode.Open, FileAccess.Read);
//
..and call helper method that parses that XML and fills DataTable with
// values.
ExcelRW.ReadWorksheet(fs,
stringTable, data);
// Update GridView.
this.SetDataBinding();
}
Nothing unusual happens here. XLSX file is unzipped to the TEMP folder and then
necessary XML parts (now files) are processed. File “sharedStrings.xml” contains
global table of unique strings while file “sheet1.xml” contains data for the first
sheet. Helper methods are pretty straightforward XML reading code -- you can download
demo application code to examine them in more detail.
If everything is OK, after the button click all data will show up in the GridView.
Data writing
Now we want to write data from a DataTable to the “Out.xlsx” file in the “Output”
folder. When “Write output .xlsx file” button is clicked, the following code is
executed:
protected void
buttonWriteOutput_Click(object sender,
EventArgs e)
{
// Output file name.
string fileName = Request.PhysicalApplicationPath
+ @"..\Output\
Out.xlsx";
// Delete contents of the temporary
directory.
ExcelRW.DeleteDirectoryContents(tempDir);
// Unzip template XLSX file to the temporary
directory.
ExcelRW.UnzipFile(templateFile,
tempDir);
// We will need two string tables; a
lookup Hashtable for fast searching
// and an ordinary ArrayList where items
are sorted by their index.
Hashtable lookupTable;
DataTable data = (DataTable)Session["people"];
// Call
helper methods which creates both tables from input data.
ArrayList stringTable = ExcelRW.CreateStringTables(data,
out
lookupTable);
// Create XML file..
FileStream fs =
new FileStream(tempDir +
@"\xl\sharedStrings.xml",
FileMode.Create);
// ..and fill it with unique strings
used in the workbook
ExcelRW.WriteStringTable(fs,
stringTable);
// Create XML file..
fs = new
FileStream(tempDir + @"\xl\worksheets\sheet1.xml",
FileMode.Create);
// ..and fill it with rows and columns
of the DataTable.
ExcelRW.WriteWorksheet(fs,
data, lookupTable);
// ZIP temporary directory to the XLSX
file.
ExcelRW.ZipDirectory(tempDir,
fileName);
if (this.CheckBoxDownload.Checked)
{
// Stream XLS to browser.
Response.Clear();
Response.ContentType =
"application/vnd.ms-excel";
Response.AddHeader("Content-Disposition",
"attachment;
filename=Out.xlsx");
Response.WriteFile(fileName);
Response.End();
}
}
This time code is a bit more complicated. In order not to generate all necessary
parts needed for XLSX file, we decide to use a template file. We extract template
file to the temporary folder and then just change XML parts containing shared string
table and worksheet data. All other parts, relationships and content types stay
the same -- so we don’t need to generate any of that. Note that we use two string
tables; a lookup Hashtable for fast searching and an ordinary ArrayList where items
are sorted by their index. We could pull it out only with ArrayList but then we
would need to search entire ArrayList every time we add a new string (to check if
it is already there). CreateStringTables() helper method builds both string tables,
WriteStringTable() helper method writes string table XML and WriteWorksheet() helper
method writes worksheet data XML.
Again, download demo application code to examine helper methods in more detail.
Alternative ways
As always in programming, there is more than one method to achieve the same thing.
You could use Office automation to start an instance of Excel 2007 (or any other
Office application) and then use interop calls to create a document and save it.
However, using automation has some drawback I have already written about (see why
is GemBox.Spreadsheet better than automation).
Next version of .NET Framework (codename WinFX) will have support for Open Packaging
Conventions (package handling and navigating the relationships) but it seems there
will be no support for accessing application specific data so you will still need
to process XML parts manually.
As another option, you could use some third party component which will come with
support for Open XML format. This will probably cost you some money but has advantage
that usually more than one format (for example; XLS, XLSX, CSV) are supported within
the same API, so your application will be able to target different file formats
using the same code.
About the author

Zeljko Svedic (Željko
Švedić) works as a lead developer at
GemBox Software. He primarily works on GemBox.Spreadsheet - a .NET component for reading and writing XLS,
CSV, HTML and XLSX files. He got interested in Open XML as a part of
his work.
|