We have all met the .docx file format, recently introduced by Microsoft for its word processor, just as we have met the .xlsx file format for spreadsheets (which I covered in a previous tutorial). Today's goal is to understand what the .docx format is about and, of course, to build ourselves a reader for it in C#. Are you ready?

What is the .docx file format?

Nothing other than an Office Open XML format which, as its name tells us, is a set of XML files zipped into a single package: one of those XML files contains the actual document text, and the others, as mentioned, cover templates, formatting, tables, configuration, culture support, etc.
Click here for a wiki about the .docx file format.

To read the document, we are going to use the following:

. The ICSharpCode.SharpZipLib library
. The System.Xml namespace and its XML handling functions
. A sample .docx file, which you will find in the attachment

So, before we start, you had better download the zip library here

and take a look at the .NET System.Xml namespace and its methods in case this is your first encounter with it.

In the attachment you will find the file changes.docx, which is a proper .docx file from Silverlight; if you open it with WinRAR, you will find that there are many files inside it.

As you can see, there is one XML file for the document itself, where the text lives, and then there are XML files for the fonts, settings, styles, etc.
We are going to focus on document.xml only, in order to extract just the document's text.

So basically, we have to unzip the file, find document.xml and parse it. Let's do it.
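
If you would rather peek inside the package from code than from an archive tool, here is a minimal sketch that lists its entries. It assumes the SharpZipLib assembly is referenced and relies on its `ZipFile` and `ZipEntry` classes; `DocxEntryLister` and `ListEntries` are names made up for this illustration, not part of any library.

```csharp
using System.Collections.Generic;
using ICSharpCode.SharpZipLib.Zip;

// Hypothetical helper, for illustration only.
public static class DocxEntryLister
{
    // Returns the name of every file stored inside the .docx package.
    public static List<string> ListEntries(string docxPath)
    {
        List<string> names = new List<string>();

        // A .docx is a regular zip archive, so ZipFile can open it directly.
        using (ZipFile zip = new ZipFile(docxPath))
        {
            foreach (ZipEntry entry in zip)
            {
                // Skip directory entries, keep only real files.
                if (entry.IsFile)
                {
                    names.Add(entry.Name);
                }
            }
        }

        return names;
    }
}
```

Calling `DocxEntryLister.ListEntries("changes.docx")` on a document saved by Word would typically return names such as `[Content_Types].xml`, `word/document.xml`, `word/styles.xml`, `word/settings.xml` and `word/fontTable.xml`, although the exact set depends on the document.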

If you are going to do this yourself, here is what you should do:

. Create a Windows Forms application
. In the form, place a button that opens a file-open dialog, which you will use to choose the .docx file to be read
. Add to your project a reference to the previously downloaded ICSharpCode.SharpZipLib.dll
. Add a new class for the DocxTextReader, and paste the following code into it:
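
A minimal sketch of what such a class could look like, under a few assumptions: the main document part sits at `word/document.xml` (where Word normally writes it), the text lives in `w:t` elements inside `w:p` paragraphs, and your SharpZipLib version lets `ZipFile` be used in a `using` block. The `ExtractText` method names are my own choice. Tabs, line breaks inside a paragraph, and nested content such as text boxes are not handled here.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using ICSharpCode.SharpZipLib.Zip;

public class DocxTextReader
{
    // Main WordprocessingML namespace used by word/document.xml.
    private const string WordNamespace =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumed location of the main document part inside the package.
    private const string DocumentEntryName = "word/document.xml";

    private readonly string docxPath;

    public DocxTextReader(string docxPath)
    {
        this.docxPath = docxPath;
    }

    // Unzips the package in memory, finds document.xml and returns its text.
    public string ExtractText()
    {
        using (ZipFile zip = new ZipFile(docxPath))
        {
            ZipEntry entry = zip.GetEntry(DocumentEntryName);
            if (entry == null)
            {
                throw new InvalidDataException(
                    "No " + DocumentEntryName + " entry found; is this a .docx file?");
            }

            using (Stream stream = zip.GetInputStream(entry))
            {
                return ExtractText(stream);
            }
        }
    }

    // Parses the XML of document.xml and returns one line per paragraph.
    public static string ExtractText(Stream documentXml)
    {
        XmlDocument document = new XmlDocument();
        // Keep whitespace so runs that contain only a space are not lost.
        document.PreserveWhitespace = true;
        document.Load(documentXml);

        // XPath needs a prefix-to-namespace mapping to match w:p and w:t.
        XmlNamespaceManager namespaces = new XmlNamespaceManager(document.NameTable);
        namespaces.AddNamespace("w", WordNamespace);

        StringBuilder text = new StringBuilder();
        foreach (XmlNode paragraph in document.SelectNodes("//w:p", namespaces))
        {
            // A paragraph is split into runs; concatenate the text of each one.
            foreach (XmlNode textNode in paragraph.SelectNodes(".//w:t", namespaces))
            {
                text.Append(textNode.InnerText);
            }
            text.AppendLine();
        }

        return text.ToString();
    }
}
```

Splitting the work into a path-based method and a stream-based one keeps the XML parsing independent from the zip library, so it can be tried on a plain document.xml file as well.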

You will finally get something like this:

Just in case, this is the way you call the reader helper.
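
As a rough sketch, the form could look like the following. It assumes that the DocxTextReader class takes the file path in its constructor and exposes an `ExtractText()` method returning the text; both are hypothetical names, so adjust them to whatever your reader class actually offers. `MainForm`, `openButton` and `output` are also just illustrative names.

```csharp
using System;
using System.Windows.Forms;

public class MainForm : Form
{
    private readonly Button openButton = new Button();
    private readonly TextBox output = new TextBox();

    public MainForm()
    {
        Text = "Docx reader";

        openButton.Text = "Open .docx";
        openButton.Dock = DockStyle.Top;
        openButton.Click += OpenButton_Click;

        output.Multiline = true;
        output.ScrollBars = ScrollBars.Vertical;
        output.Dock = DockStyle.Fill;

        // Add the filling control first so the button keeps the top edge.
        Controls.Add(output);
        Controls.Add(openButton);
    }

    private void OpenButton_Click(object sender, EventArgs e)
    {
        using (OpenFileDialog dialog = new OpenFileDialog())
        {
            dialog.Filter = "Word documents (*.docx)|*.docx";
            if (dialog.ShowDialog(this) != DialogResult.OK)
            {
                return; // The user cancelled the dialog.
            }

            // Assumed API: constructor taking the path, ExtractText() returning the text.
            DocxTextReader reader = new DocxTextReader(dialog.FileName);
            output.Text = reader.ExtractText();
        }
    }

    [STAThread]
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new MainForm());
    }
}
```

If you created the project from the Visual Studio template, you already have a form and a `Main` method, so only the body of the click handler is really needed.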

So today we learned what the .docx and Office Open XML file formats are all about, we got introduced to the ICSharpCode libraries, which are very helpful for managing zipped files, and we learned how to find our good old Word text content inside all that zipped XML. Not bad, I would say.

