One of the challenges, when storing or transmitting the image of a scanned multi-page document, is that it takes an awful lot of space. Unless, of course, you compress the data. But how should you do this?
The kind of lossy compression used by the JPEG and MPEG standards is great for photos and movie frames, but not much good for text – it makes the edges blurry. And the hard-edged, often lossless, compression used by things like PNG and GIF is great for text but copes badly with embedded photos or background textures: the files balloon, and a limited palette like GIF’s does nasty things to the colours. So how do you handle, say, a typical magazine page, with crisp text, embedded photos and graduated background colours?
In the late 90s, my friend Yann LeCun and others created the DjVu format, which cunningly works out how to split a document up and compress each bit using the most appropriate system, then reassemble them for viewing later. It was particularly good for things like digitising historical manuscripts – it would separate the script from the parchment, deal with them separately and still produce a realistic-looking copy afterwards, while taking a fraction of the amount of data that most other schemes would have used. That was especially important in those pre-broadband days. Some of the same concepts are now in the JBIG2 standard for black-and-white images, which is supported in PDF and built into many devices, including Xerox copiers.
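To make the splitting idea concrete, here is a toy sketch in Python. It is emphatically not how DjVu does it (the real segmentation is far cleverer); it just assumes that anything darker than a made-up `ink_threshold` is text, and the function names `split_layers` and `reassemble` are my own inventions for illustration.

```python
# Toy page: rows of greyscale pixels, 0 = black ink, 255 = white paper.
# Assumption: "text" is simply any pixel darker than ink_threshold.

def split_layers(page, ink_threshold=64):
    """Split a greyscale page into a 1-bit text mask and a background layer."""
    mask = [[1 if px < ink_threshold else 0 for px in row] for row in page]
    background = []
    for row, mask_row in zip(page, mask):
        # Fill the holes left by the ink with the average paper colour of the
        # row, so the background stays smooth and compresses well lossily.
        paper = [px for px, m in zip(row, mask_row) if not m]
        fill = sum(paper) // len(paper) if paper else 255
        background.append([fill if m else px for px, m in zip(row, mask_row)])
    return mask, background


def reassemble(mask, background, ink=0):
    """Paint the text mask back over the background for viewing."""
    return [
        [ink if m else px for px, m in zip(bg_row, mask_row)]
        for bg_row, mask_row in zip(background, mask)
    ]


page = [
    [200, 10, 210],
    [205, 205, 20],
]
mask, background = split_layers(page)
restored = reassemble(mask, background)
```

The mask can then go to a hard-edged, text-friendly compressor and the background to a photo-friendly one. Note that the restored page has pure black ink where the original had dark grey: it looks right, but it is not the same data.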
There’s another way to save space and time: once you’ve separated the text and other symbols from the background, it’s fairly easy to see whether any symbols are re-used. So you don’t have to store the image of every ‘e’ in the document – you can store a representative sample of each size, font, etc. and simply insert the appropriate one wherever it is used in the original. All very cunning.
Assuming you get it right.
But this story on the BBC describes how some Xerox photocopiers may not have been getting it right, occasionally substituting incorrect digits in their copies. This can be something of a problem if you are, say, an accountant or an architect. It’s not clear from the article whether this has caused anybody serious problems yet, or has just been noticed in the lab, but you can imagine the potential lawsuits…
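Here is a toy Python sketch of the symbol-reuse trick and of how it can go wrong. I have no idea what the copiers actually do internally, and real JBIG2 encoders match symbols in more sophisticated ways; this version simply assumes two glyphs are "the same" if they differ in at most `max_diff` pixels. The names `hamming`, `encode`, `decode` and `max_diff` are all made up for the example.

```python
# Tiny 3x5 bitmaps: '#' is ink, '.' is paper. A 6 and an 8 differ by one pixel.
SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")


def hamming(a, b):
    """Count the pixels at which two equal-sized glyph bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def encode(glyphs, max_diff):
    """Store each distinct-looking glyph once, plus a list of references."""
    dictionary, refs = [], []
    for glyph in glyphs:
        for index, stored in enumerate(dictionary):
            if hamming(glyph, stored) <= max_diff:
                refs.append(index)  # "close enough": reuse the stored glyph
                break
        else:
            dictionary.append(glyph)  # genuinely new symbol
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decode(dictionary, refs):
    """Rebuild the page by pasting in the stored glyph for each reference."""
    return [dictionary[i] for i in refs]


page = [SIX, EIGHT, SIX]

exact = decode(*encode(page, max_diff=0))  # strict matching: faithful copy
sloppy = decode(*encode(page, max_diff=1))  # loose matching: the 8 becomes a 6
```

With strict matching the copy is identical to the original. Loosen the threshold by a single pixel and the output is still perfectly crisp, just wrong: every digit on the page is now a 6.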
It’s a potential danger of any technology that reassembles a perfect-looking output when in fact some data may have been lost along the way from the input. You could save a lot of mobile-phone bandwidth if you noticed that someone had just used the same word that they used a few minutes ago, for example…
Xerox fought hard to preserve their trademark by not allowing it to become a generic verb meaning ‘to photocopy’. But I guess they’d like it less if it came to mean something else.
“Ah, hello, is that my tax accountant? I was wondering if you could…. ahem… Xerox this year’s figures for me?…”
Thanks to Mike Flynn for the link.