While working on the beta of Officeshots.org I ran into an interesting problem with file type and MIME type detection of OpenDocument files. When a user uploads an ODF file to Officeshots I want to determine the MIME type myself using the PHP Fileinfo extension. Windows users who do not have any ODF-supporting application installed will report ODF files as application/zip, which is of no use to me. In addition, a malicious user could attempt to upload an executable file and report its MIME type as that of an ODF file.
On Linux, the PHP Fileinfo extension relies on the magic file that is provided by the file package. The magic file contains a series of tests that can determine the file type and MIME type of a file by its contents. I found out that the magic file is incomplete for OpenDocument files. Below I will show you what is wrong with the magic file and how you can fix it.
If you don’t care about the technical explanation, you can skip directly to the fix.
The problem with magic
First off, some tests. I ran these tests on Debian Lenny, but I have seen other distributions with incomplete file magic support for OpenDocument Format as well. Here is what I get when I test an odt file using the file command.
- ~$ file document.odt
- document.odt: OpenDocument Text
- ~$ file --mime document.odt
- document.odt: application/vnd.oasis.opendocument.text
So far, so good. Both the file type description and the MIME type are right. But for any other type of OpenDocument file only the description is correct. The MIME type is not. Below I am testing an ods spreadsheet.
- ~$ file spreadsheet.ods
- spreadsheet.ods: OpenDocument Spreadsheet
- ~$ file --mime spreadsheet.ods
- spreadsheet.ods: application/octet-stream
The file type "OpenDocument Image Template" is missing from the magic file altogether. There is another problem with the magic file too. An OpenDocument file is basically a zip archive that contains several XML files. The OpenDocument specification does not specify what version of zip to use. The magic file only searches for zip 2.0, which is what most ODF applications use, but not all. Some applications use version 1.0 instead, and according to the ODF spec that is valid. Here is what happens when you try to detect an ODF file zipped with the zip 1.0 standard.
- ~$ file document.odt
- document.odt: Zip archive data, at least v1.0 to extract
- ~$ file --mime document.odt
- document.odt: application/zip
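To make the version problem concrete, here is a minimal Python sketch of the start of an ODF container. It is not taken from the magic file or from the patch. It assumes the usual ODF packaging layout, in which the first zip entry is an uncompressed file named mimetype, and it uses the standard zip local file header layout (30 fixed bytes, then the file name). The helper names build_minimal_odf, read_first_entry and with_version_needed are made up for this illustration.

```python
import io
import struct
import zipfile

ODT_MIME = b"application/vnd.oasis.opendocument.text"


def build_minimal_odf(mime: bytes) -> bytes:
    """Build an in-memory zip whose first entry is an uncompressed `mimetype` file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", mime, compress_type=zipfile.ZIP_STORED)
    return buf.getvalue()


def read_first_entry(data: bytes):
    """Parse the first zip local file header.

    Returns (version_needed, name, content). The 30-byte fixed header is:
    signature, version needed, flags, method, mtime, mdate, crc32,
    compressed size, uncompressed size, name length, extra length.
    """
    header = struct.unpack("<4sHHHHHIIIHH", data[:30])
    signature, version_needed = header[0], header[1]
    method, compressed_size = header[3], header[7]
    name_len, extra_len = header[9], header[10]
    if signature != b"PK\x03\x04":
        raise ValueError("not a zip local file header")
    if method != 0:
        raise ValueError("first entry is compressed; cannot read it directly")
    name = data[30:30 + name_len]
    start = 30 + name_len + extra_len
    return version_needed, name, data[start:start + compressed_size]


def with_version_needed(data: bytes, version: int) -> bytes:
    """Return a copy with the 'version needed to extract' field overwritten.

    Only the first local header is touched (not the central directory),
    which is enough to illustrate what a byte-level test at offset 4 sees.
    """
    return data[:4] + struct.pack("<H", version) + data[6:]


if __name__ == "__main__":
    original = build_minimal_odf(ODT_MIME)
    as_v10 = with_version_needed(original, 10)  # 10 means zip 1.0
    for label, blob in (("as written", original), ("marked zip 1.0", as_v10)):
        version, name, content = read_first_entry(blob)
        print(label, version, name, content)
```

The point of the sketch is that the mimetype entry sits at the same place and reads the same regardless of the version field, so a test that insists on one particular version value rejects files that are otherwise identical.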
Fixing magic detection
I have written a patch for the magic file that fixes all of the above problems. It removes the version test for the ODF zip container, adds the correct MIME type for all the different ODF file types and adds the missing OpenDocument Image Template. This patch is written for /usr/share/file/magic on Debian Lenny. If you want to patch your own Linux distribution then you may need to adapt it. You can view the patch in our Officeshots Trac or download the patch directly from Subversion.
Update 2009-06-29: I have now also created a patch against the original upstream file-5.0.3.
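The logic behind the patch (ignore the zip version, then map the contents of the mimetype entry to a file type) can be sketched in Python as follows. This is an illustration under assumptions, not the patch itself: it uses Python's zipfile module instead of magic rules, the function names detect_odf and make_sample are hypothetical, and the ODF_TYPES table is a partial list with descriptions chosen for the example rather than copied from the magic file.

```python
import io
import zipfile

ODF_PREFIX = "application/vnd.oasis.opendocument."

# Partial, illustrative table: MIME subtype -> human-readable description.
ODF_TYPES = {
    "text": "OpenDocument Text",
    "text-template": "OpenDocument Text Template",
    "spreadsheet": "OpenDocument Spreadsheet",
    "spreadsheet-template": "OpenDocument Spreadsheet Template",
    "presentation": "OpenDocument Presentation",
    "presentation-template": "OpenDocument Presentation Template",
    "graphics": "OpenDocument Drawing",
    "graphics-template": "OpenDocument Drawing Template",
    "image": "OpenDocument Image",
    "image-template": "OpenDocument Image Template",
}


def detect_odf(source):
    """Return (mime, description) for an ODF package, or None otherwise.

    `source` may be a path or a binary file-like object. The zip version
    is never inspected; only the `mimetype` entry decides the result.
    """
    try:
        with zipfile.ZipFile(source) as zf:
            names = zf.namelist()
            # ODF packages list `mimetype` as their first entry.
            if not names or names[0] != "mimetype":
                return None
            mime = zf.read("mimetype").decode("ascii", "replace").strip()
    except zipfile.BadZipFile:
        return None
    if not mime.startswith(ODF_PREFIX):
        return None
    subtype = mime[len(ODF_PREFIX):]
    return mime, ODF_TYPES.get(subtype, "OpenDocument (unrecognised subtype)")


def make_sample(mime: str) -> io.BytesIO:
    """Build a stub package in memory; real documents contain more entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", mime, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", "<placeholder/>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(detect_odf(make_sample(ODF_PREFIX + "spreadsheet")))
    print(detect_odf(make_sample(ODF_PREFIX + "image-template")))
    print(detect_odf(io.BytesIO(b"MZ this is not a zip archive")))
```

A check like this could also serve as a second opinion next to Fileinfo when validating uploads, since it does not depend on the installed magic database.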
First, make a backup of your original magic file. Then apply the patch to magic.
- ~# cd /usr/share/file
- /usr/share/file# cp magic magic.orig
- /usr/share/file# patch < ~/magic.patch
- patching file magic
After this you need to recompile the magic file. This will create magic.mgc which is the file that is actually used by the file command and the PHP Fileinfo extension.
- /usr/share/file# file -C magic
Now your magic file will correctly identify all OpenDocument file types.
- ~$ file --mime spreadsheet.ods
- spreadsheet.ods: application/vnd.oasis.opendocument.spreadsheet
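If you want to repeat this check for several ODF types at once, a small script can do it. The following Python sketch is one possible way, under a few assumptions: it calls the file command with the --brief and --mime-type options, which exist in current releases of file but may not be available in the version shipped with Lenny; it generates stub packages rather than real documents, so the installed magic database may treat them differently from files saved by an office suite; and the function names are made up for the example.

```python
import shutil
import subprocess
import tempfile
import zipfile
from pathlib import Path


def file_mime_type(path):
    """Return the MIME type reported by `file`, or None if it cannot be run."""
    exe = shutil.which("file")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def write_sample(path, mime):
    """Write a stub ODF-like package: uncompressed `mimetype` entry first."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("mimetype", mime, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", "<placeholder/>")


def check_samples(expected):
    """Map each file name to (expected MIME type, MIME type reported by `file`)."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for filename, mime in expected.items():
            path = Path(tmp) / filename
            write_sample(path, mime)
            results[filename] = (mime, file_mime_type(path))
    return results


if __name__ == "__main__":
    prefix = "application/vnd.oasis.opendocument."
    samples = {
        "document.odt": prefix + "text",
        "spreadsheet.ods": prefix + "spreadsheet",
        "image-template.oti": prefix + "image-template",
    }
    for name, (want, got) in check_samples(samples).items():
        status = "ok" if want == got else "MISMATCH"
        print(f"{name}: expected {want}, file says {got} [{status}]")
```

What the script prints depends entirely on the magic database installed on the machine, so treat a mismatch as a prompt to test with a real document before concluding that the patch did not apply.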
And that’s all there is to it. Have fun with ODF!
Comments
#1 Frank Groeneveld (http://techfield.org)
#2 Sander Marechal (http://www.jejik.com)
#3 Polprav (http://polprav.blogspot.com/)
Can I quote a post in your blog with the link to you?
#4 Sander Marechal (http://www.jejik.com)
That means you can use, quote, change or sell my article, whatever you want, as long as you mention my name (or link back to me) and you also share your article under the same license.
#5 Anonymous Coward
Can I read the magic file from anywhere to know the file type of a file?
#6 Sander Marechal (http://www.jejik.com)
Instead, I suggest that you use the proper library and API. For example, from C and C++ you can use libmagic. In PHP you can use the Fileinfo extension. I am sure that Perl, Python and any other language also have a library that wraps around libmagic. That is much, much easier than trying to do it yourself.
#7 fenderbirds