Daily Archives: April 3rd, 2003

Getting at your data

Possibly the best aspect of OpenOffice is its open file format. Here’s a crude but quick and easy way, on a Unix platform, to get at something close to the raw text in an OO document:

   unzip -p mydocument.sxw  content.xml | sed 's/<[^>]*>/ /g'

This extracts the contents of the document as XML, then strips out most XML tags, replacing each with a space. It could be improved in many ways, but it’s fine if you want to run the text through grep, wc, etc., or just want to get your paragraphs into a text editor. Is any normal user likely to do this? No, but it’s important that it can be done, even on a machine which knows nothing about OpenOffice.
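As a rough sketch of one such improvement, the filter below puts each paragraph or heading on its own line and decodes the five standard XML entities, so that grep and wc see something closer to the original text. The function name sxw_xml_to_text is made up for illustration, and the sketch assumes that paragraphs and headings close with </text:p> and </text:h>; it is still a crude text filter, not a real XML parser.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): reads the XML of
# content.xml on stdin and writes rough plain text to stdout.
# Assumes paragraphs and headings end with </text:p> and </text:h>.
sxw_xml_to_text() {
    # 1. Turn each closing paragraph/heading tag into a newline.
    # 2. Delete every remaining tag.
    # 3. Decode the standard XML entities, &amp; last so that an escaped
    #    entity such as "&amp;lt;" ends up as "&lt;" rather than "<".
    sed -e 's/<\/text:[ph]>/\
/g' \
        -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Usage, with the same unzip call as above:
#   unzip -p mydocument.sxw content.xml | sxw_xml_to_text
```

Tags are deleted here rather than replaced with a space, so a word that is only partly bold or italic is not split in two. The cost is that anything which relied on a tag for separation, such as tabs or table cells, will run together.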

(Here’s why I think this sort of thing is important, by the way.)

© Copyright Quentin Stafford-Fraser