Tuesday, February 13, 2007

What is an Outlook MSG file?




This is going to get technical real fast. But since you really wanted to know, here is your answer.




Short Answer:


Outlook MSG files are Outlook messages saved as files. They are saved as COM structured storage (OLE2 compound) documents, which is the same technique used by various Microsoft applications such as Word and Excel.



Um OK, that's great but it does not help me..




Sorry, but it is what it is. If you want to really understand all this then put on your propeller hat, give it a good hard spin, fill your pocket protector with fresh pens, re-tape your glasses, and read on....




So what is Structured Storage?



Structured storage (also known as COM structured storage or OLE structured storage) is a technology developed by Microsoft as part of its Windows operating system for storing hierarchical data within a single file. Strictly speaking, the term structured storage refers to a set of COM interfaces that a conforming implementation must provide, and not to a specific implementation, nor to a specific file format like Outlook MSG files (in fact, a structured storage implementation need not store its data in a file at all). In addition to providing a hierarchical structure for data, structured storage may also provide a limited form of transactional support for data access. Microsoft provides an implementation that supports transactions, as well as one that does not. The latter, called simple-mode storage, is limited in other ways as well, although it performs better.

Structured storage is widely used in Microsoft Office applications, although newer releases (starting with Office 2007) use a new XML-based format by default. It is also an important part of both COM and the related Object Linking and Embedding (OLE) technologies. Other notable applications of structured storage include Microsoft SQL Server, the Windows shell, and many third-party CAD programs.

What was the motivation to use Structured Storage?

Structured storage addresses some inherent difficulties of storing multiple data objects within a single file. One difficulty arises when an object persisted in the file changes in size due to an update. If the application that is reading/writing the file expects the objects in the file to remain in a certain order, everything following that object's representation in the file may need to be shifted backward to make room if the object grows, or forward to fill in the space left over if the object shrinks. If the file is large, this could be a costly operation. Of course, there are many possible solutions to this difficulty, but often the application programmer does not want to deal with low level details such as binary file formats.

Structured storage provides an abstraction known as a stream, represented by the interface IStream. A stream is conceptually very similar to a file, and the IStream interface provides methods for reading and writing similar to file input/output. A stream could reside in memory, within a file, within another stream, etc., depending on the implementation. Another important abstraction is that of a storage, represented by the interface IStorage. A storage is conceptually very similar to a directory on a file system. Storages can contain streams, as well as other storages.
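If the directory analogy helps, here is a small Python sketch of the idea. It is a toy in-memory model only, not the COM interfaces: the class names Storage and Stream, and the stream names used at the bottom, are made up for illustration and do not reflect the real layout of an MSG file.

```python
import io


class Stream:
    """Toy stand-in for an IStream: a named, file-like run of bytes."""

    def __init__(self, name):
        self.name = name
        self._buf = io.BytesIO()

    def write(self, data):
        # Always append, like writing to the end of a file.
        self._buf.seek(0, io.SEEK_END)
        return self._buf.write(data)

    def read(self):
        return self._buf.getvalue()


class Storage:
    """Toy stand-in for an IStorage: a named container, like a directory."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def create_stream(self, name):
        stream = Stream(name)
        self.children[name] = stream
        return stream

    def create_storage(self, name):
        storage = Storage(name)
        self.children[name] = storage
        return storage

    def walk(self, prefix=""):
        """Yield the path of every stream below this storage."""
        for name, child in sorted(self.children.items()):
            path = f"{prefix}/{name}"
            if isinstance(child, Storage):
                yield from child.walk(path)
            else:
                yield path


# Hypothetical names, chosen only to show nesting.
root = Storage("root")
root.create_stream("subject").write(b"Hello")
attachment = root.create_storage("attachment_0")
attachment.create_stream("data").write(b"\x00\x01")
paths = list(root.walk())
```

Walking the root lists one stream directly under it and one nested inside the child storage, which is the same picture as files inside folders.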

If an application wishes to persist several data objects to a file, one way to do so would be to open an IStorage that represents the contents of that file and save each object in its own IStream. One way to accomplish the latter is through the standard COM interface IPersistStream. OLE depends heavily on this model to embed objects within documents.

The format

Microsoft's implementation uses a file format known as compound files, and all of the widely deployed structured storage implementations read and write this format. Compound files use a FAT-like structure to represent storages and streams. Chunks of the file, known as sectors (these may or may not correspond to sectors of the underlying file system), are allocated as needed to add new streams and to increase the size of existing streams. If streams are deleted or shrink, leaving unallocated sectors, these sectors can be reused for new streams.
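To see why a FAT-like table avoids shifting data around, here is a simplified Python sketch of a sector allocation table. It is a teaching model under my own assumptions, not the real on-disk compound file layout: the SectorTable class, its method names, and the sentinel values are all illustrative.

```python
FREE = -1          # illustrative marker for an unallocated sector
END_OF_CHAIN = -2  # illustrative marker for the last sector of a stream


class SectorTable:
    """Toy FAT: next[i] holds the sector that follows sector i."""

    def __init__(self, sector_count):
        self.next = [FREE] * sector_count
        self.starts = {}  # stream name -> first sector of its chain

    def _allocate(self):
        index = self.next.index(FREE)  # raises ValueError when full
        self.next[index] = END_OF_CHAIN
        return index

    def chain(self, name):
        """Return the list of sectors that make up a stream, in order."""
        sectors = []
        current = self.starts.get(name, END_OF_CHAIN)
        while current != END_OF_CHAIN:
            sectors.append(current)
            current = self.next[current]
        return sectors

    def grow(self, name, extra=1):
        """Append sectors to a stream without moving any other stream."""
        for _ in range(extra):
            new = self._allocate()
            existing = self.chain(name)
            if existing:
                self.next[existing[-1]] = new
            else:
                self.starts[name] = new

    def delete(self, name):
        """Free every sector of a stream so it can be reused."""
        for sector in self.chain(name):
            self.next[sector] = FREE
        self.starts.pop(name, None)


table = SectorTable(8)
table.grow("a", 2)
table.grow("b", 2)
table.grow("a", 1)      # "a" grows past "b" without moving it
chain_a = table.chain("a")
chain_b = table.chain("b")
table.delete("b")
table.grow("c", 2)      # "c" reuses the sectors "b" gave back
chain_c = table.chain("c")
```

Stream "a" ends up scattered across non-adjacent sectors, and nothing had to be copied to make room. That is the trade the format makes: a little bookkeeping in the table in exchange for cheap growth and reuse.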




Conclusion




Although you could, with some difficulty, write code that opens an MSG file, the exact structure of the streams within the file is not documented. You would therefore have no reference as to what the individual streams represent, the type of data within each stream, or how to consume the data properly. Furthermore, since Microsoft could change the way data is stored in the MSG file at any time, you would be plagued with support for the rest of your life for whatever solution you try to build on your own. For this reason, Microsoft does not document the file structure, and the only method they provide for working with Outlook MSG files is the Outlook client.


There is a happy ending to this story!

So, if you need tools to work with the Outlook MSG file format, and don't have time to play pin the tail on the donkey, visit the MSG Technologies page for a complete set of applications and tools for working with Outlook MSG files. They have done all the hard work so you can simply focus on delivering your solution without needing to understand the underlying structure of an Outlook MSG file.
Good news, you can toss that propeller hat and lose the pocket protector!
Cheers!

Convert MSG files to XML

Let's talk about converting Outlook-based MSG files to XML documents using a tool from Priasoft, the only tool I know of that can do this, called the MSG to XML conversion tool. I think this is a great way to solve many of the business requirements that come up when Outlook email is stored in the MSG file format.


First off, why would you want to convert an Outlook MSG file to XML?

Good question. The reasons are many, far too many to cover in full detail here, but here are a few that come to mind:

a) You want MSG files stored with all their related properties in a form that can be consumed easily by any application that supports XML.
b) You want to store metadata about MSG files that exist on local and network server volumes.
c) You want to archive MSG files to an open, standard format, like XML, that can be viewed in most web browsers or consumed by applications that support XML.
d) You want a complete representation of an Outlook MSG file in a format that can be indexed by search applications, since XML is, at its core, a structured text file.
e) You need to populate a database with an index of Outlook MSG files and their properties, and you have quickly discovered that Outlook MSG files are in a proprietary, undocumented Microsoft format that cannot be simply parsed, as most properties are in a binary format.

And so on...

So what is an XML file anyway?

XML, short for Extensible Markup Language, is a W3C-recommended general-purpose markup language that supports a wide variety of applications. XML languages or 'dialects' are easy to design and to process. XML is also designed to be human-legible, and to this end, terseness was not considered essential in its structure. XML is a simplified subset of Standard Generalized Markup Language (SGML). Its primary purpose is to facilitate the sharing of data across different information systems, particularly systems connected via the Internet. To learn more about XML click here.
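To give a feel for what message properties might look like as XML, here is a short Python sketch using the standard library's ElementTree module. The element and attribute names (message, property, name) and the sample values are my own assumptions for illustration. This is not the output format of the Priasoft tool, and it does not read a real MSG file.

```python
import xml.etree.ElementTree as ET


def message_to_xml(properties):
    """Serialize a dict of message properties to an XML string.

    The <message>/<property name="..."> layout is hypothetical.
    """
    root = ET.Element("message")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property", name=name)
        prop.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up sample values standing in for properties pulled from a message.
sample = {"subject": "Quarterly report", "sender": "alice@example.com"}
xml_text = message_to_xml(sample)

# Any XML-aware application can now read the properties back.
parsed = ET.fromstring(xml_text)
subjects = [p.text for p in parsed.findall("property[@name='subject']")]
```

Once the properties are plain text in a structure like this, indexing them or loading them into a database is an ordinary XML parsing job.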

This post was simply to let you know that this option exists to help solve some Outlook MSG file issues you may be facing, and to cover at a high level why you would want to convert MSG files to XML and what an XML file is. If you think you would like to explore this option, Priasoft offers a free download of the application so you can test scenarios and also review the XML output. Get it here.

Please also post any comments or questions you have on the MSG file format or the XML format here for further discussion.