When Microsoft introduced Office 2007 they also revealed that they would now store their documents in an XML-based file format. To show the difference between the “old” format and the new, files stored in the XML-based format get an ‘x’ added to the file extension (.doc becomes .docx).
Retrieving images from a Word document (or any other Office document, for that matter) has never been a walk in the park. In earlier Office versions it wasn’t exactly easy, but most people got by with copy and paste. You could also save the document as an HTML file and retrieve the images from the folder it created.
With the “new” XML-based file formats it’s actually gotten way easier…
The document is an archive
What many of us didn’t know is that the new format is not a single XML file. XML on its own is a tag-based markup format, much like HTML. An Office 2007 document is actually a ZIP archive that stores a set of XML files together with the resources they depend on.
A normal Word document with images stores the images alongside various XML documents containing formatting, text etc. Because the container is a regular ZIP archive, these resources can be extracted with any tool that opens ZIP files.
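If you want to check this for yourself before renaming anything, here is a minimal sketch in Python that uses only the standard zipfile module. The function names list_package_parts and is_zip_package are made up for this illustration, and the exact entries you see will depend on the document.

```python
import zipfile


def is_zip_package(document_path):
    """Return True if the file can be read as a ZIP archive."""
    return zipfile.is_zipfile(document_path)


def list_package_parts(document_path):
    """Return the names of all entries stored inside the document."""
    with zipfile.ZipFile(document_path) as package:
        return package.namelist()
```

Calling list_package_parts("name.docx") on a Word document with images would typically show entries such as word/document.xml and word/media/image1.png. A document saved in the older .doc format is not a ZIP archive, so is_zip_package should return False for it.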
Rename the document
First, let me say it’s important that the document is stored in the 2007/2010 format for this to work. In this example I’m using a Word document, but the same steps apply to any Office document.
- Make a copy of your document and rename the extension to ZIP (e.g. name.docx -> name.zip)
- Open or extract the “ZIP file” using your preferred archive tool
- In the archive you will find a folder named word (or xl for an Excel workbook, ppt for a PowerPoint presentation)
- Inside that folder you will find one called media; open it
- There are your media files
The media files have been renamed using continuous numbering (image 1, image 2 …). The advantage is that the files themselves haven’t been altered, so there is no data loss as you could experience with the “old” Office file format.
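If you have many documents to go through, the manual steps above can also be scripted. What follows is only a sketch in Python with the standard library, not an official tool. The function name extract_media and the output folder name are assumptions made for this example. It relies on the media files sitting one level down, in a path like word/media/image1.png, and it reads the .docx directly, so no copy or rename is needed.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(document_path, output_dir):
    """Copy every file from the document's media folder into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(document_path) as package:
        for info in package.infolist():
            parts = PurePosixPath(info.filename).parts
            # Keep only entries shaped like <folder>/media/<file>,
            # for example word/media/image1.png.
            if len(parts) == 3 and parts[1] == "media" and not info.is_dir():
                target = output_dir / parts[2]
                # Copy the bytes unchanged, so the image is not re-encoded.
                with package.open(info) as source, open(target, "wb") as destination:
                    shutil.copyfileobj(source, destination)
                extracted.append(target)
    return extracted
```

For example, extract_media("name.docx", "name_images") would place the images in a folder called name_images and return the list of files it wrote. A document without images has no media folder, in which case the list is simply empty.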