Judging by some of the comments arriving below -- lossy compression does not work that way!
Interpreting one set of data as another is a clever trick for puzzles (thematically similar, for example, to http://web.mit.edu/puzzle/www/2013/coinheist.com/oceans_11/t... -- but that's only a hint, the solution is still lots of fun!), but it's kind of wrong in practice.
Compression is about conveying a similar 'message' while removing the stuff that just doesn't compress well. In signals such as pictures and audio, this is often seen/heard in high frequency detail.
To the layman, the simplest lossy compression of an image is to resize it to half its size in each dimension and then stretch the output back up (not unlike the retina-display artifacts you may have seen). Assuming you do it right (with an averaging filter) -- you've just removed the upper half of the representable frequencies; it just so happens that most of what the eye cares about isn't in those ranges, and so the image still looks decent. A little blocky if you stretch it, but still pretty good. JPEG is just a more technically advanced version of similar tricks.
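If it helps to see it concretely, here's a minimal pure-Python sketch of that halve-then-stretch trick on a grayscale image treated as a plain 2D list of 0-255 values (the toy random image, the 4x4 size, and the assumption of even dimensions are all just for illustration):

    import random

    def halve(img):
        # Average each 2x2 block into one pixel -- a crude low-pass (averaging) filter.
        # Assumes the width and height are even.
        h, w = len(img), len(img[0])
        return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
                 for x in range(0, w, 2)]
                for y in range(0, h, 2)]

    def stretch(img):
        # Blow each pixel back up into a 2x2 block (nearest-neighbour upscale: blocky).
        out = []
        for row in img:
            doubled = [p for p in row for _ in (0, 1)]
            out.append(doubled)
            out.append(list(doubled))
        return out

    # Toy 4x4 grayscale "image"; the pixel-to-pixel noise is exactly the detail that gets lost.
    img = [[random.randint(0, 255) for _ in range(4)] for _ in range(4)]
    print(stretch(halve(img)))  # same dimensions, but the top half of the frequencies is gone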
A better analogy for lossy compression of text is actually xkcd's Up Goer Five (http://xkcd.com/1133/) -- using only the "ten hundred" most common words, you can still convey concepts (albeit simplistically) and stay under ten bits a word, or a little less than two ASCII characters per word. If you were to map the dictionary into "up-goer five" speak, you could compress Shakespeare pretty well. It would lose a lot in the translation -- but so does a heavily-compressed image. If you limit yourself to the first 64k words, you have a much larger vocabulary, still limited, and still fitting within two bytes per word. Though you may have to use more words, a word like "wherefore" still counts for four words' worth of compressed space when replaced by "why". Tradeoffs!
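Back-of-the-envelope, that's where the "ten bits" and "two bytes" figures come from, and a fixed-codebook encoder is only a few lines -- the tiny TOP_WORDS list below is a made-up stand-in for a real agreed-upon top-64k list:

    import math
    import struct

    print(math.log2(1000))   # ~9.97 bits per word if the vocabulary is 1,000 words
    print(math.log2(65536))  # exactly 16 bits, i.e. 2 bytes, for a 64k vocabulary

    # Hypothetical shared codebook -- in reality, an agreed-upon top-64k word list.
    TOP_WORDS = ["the", "and", "why", "art", "thou", "romeo"]
    INDEX = {w: i for i, w in enumerate(TOP_WORDS)}

    def encode(text):
        # Two bytes per word; anything not in the list has to be reworded first (the lossy part).
        return b"".join(struct.pack(">H", INDEX[w]) for w in text.lower().split())

    print(len(encode("why art thou romeo")))  # 8 bytes for four words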
That'd be a curious hack -- sort the words in Romeo and Juliet by word frequency. Rewrite the least common words in terms of the other words. Compress.
I wrote a quick script to convert text into the most common 1000 words as best possible, here's your comment. Needs a bit of work, and a better synonym list, but whatever.
as some comments are coming below -- lossy press does not work that way!
interpreting one set of data as another is a able action for puzzles (thematic, for example, to: http://web.mit.edu/puzzle/www/2013/coinheist.com/oceans_11/t.... -- but that's only a direction, the action is still army of fun!) but is kind of wrong in act.
press is about giveing a similar 'account' while removing the air that just doesn't cut well. in act such as pictures and clear, this is often seen/heard in high light army.
to the christian, the simplest lossy press of an image is to resize it by a half in each area and then answer the data. (not different retina-display artifacts you may have seen). big you do it right (with an averaging filter) -- you've just alone the c half of the representable frequencies, it just so happens that most of what the eye care about isn't in those ranges -- and so the image still air christian. a little blocky if you answer it, but still beautiful good. jpeg is just a more technically developed account of similar actions.
a better approach of lossy press on back is actually xkcd's up goer five (http://xkcd.com/1133/) -- using only the "ten hundred" most common words, you can still give concepts (albeit simplistically) and stay under ten cut a word, or a little less than two ascii act per. if you were to art the cant into "up-goer five" speak, you could cut shakespeare beautiful well. it would clear a lot in the change -- but so does a heavily-cuted image. if you all i to the first 64k words, you have a much larger cant, still alled, and still change within two bytes per word. again you may have to use more words, a word like "ground" still counts for four words account of cuted space, when replaced by "why". tradeoffs!
that'd be a concerned common -- sort the words in man and juliet by word light. record the least common words in terms of the other words. cut.
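(The actual script isn't shown here, but the general shape of such a down-mapper is roughly the sketch below; the ALLOWED set and SYNONYMS table are tiny made-up stand-ins, not the author's word list or synonym list.)

    # Hypothetical reconstruction, not the original script: keep allowed words,
    # swap everything else for an allowed "synonym" if one exists.
    ALLOWED = {"press", "does", "not", "work", "that", "way"}   # stand-in for the top 1000
    SYNONYMS = {"compression": "press", "vocabulary": "cant"}   # stand-in synonym list

    def downmap(text):
        out = []
        for word in text.lower().split():
            if word in ALLOWED:
                out.append(word)
            elif word in SYNONYMS:
                out.append(SYNONYMS[word])
            else:
                out.append(word)  # a real version would keep walking a thesaurus here
        return " ".join(out)

    print(downmap("Compression does not work that way"))  # -> "press does not work that way"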
> If you limit yourself to the first 64k words, you have a much larger vocabulary, still limited, and still fitting within two bytes per word.
Maybe I'm misunderstanding what you're saying, but if you're compressing adaptively by building a dictionary of the target material, 64k words would give you lossless compression of Shakespeare -- the unique words in the collected works (with some wiggle room for 'word forms') are in the 30-40k range.
True! You could do that too. Sending the codebook is a cromulent strategy. It'd be lossless too. That's (to a first order) how LZ works.
I was arguing you can use a standard codebook (some agreed-upon top-N list) instead. I imagine if you did you'd lose a couple words ("wherefore" springs to mind, as it's not in common English usage) but by and large you'd have just about everything.
Technically, you can also have a lossless JPEG (or at least, within epsilon), but that's not why it's done.
More importantly, the entropy of English text isn't really that high -- ~ 1 bit per letter -- which means lossless compression works pretty well on it. Images less so.
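(~1 bit per letter is roughly Shannon's classic estimate, with a human doing the predicting; a general-purpose compressor won't get that low, but a quick zlib check still makes the point. The filenames below are placeholders for whatever English text file and JPEG you have handy.)

    import zlib

    text = open("hamlet.txt", "rb").read()   # any plain-English text file
    packed = zlib.compress(text, 9)
    print(8 * len(packed) / len(text), "bits per character")  # typically ~2-3 for prose

    jpeg = open("photo.jpg", "rb").read()    # an already lossy-compressed image
    print(len(zlib.compress(jpeg, 9)) / len(jpeg))  # close to 1.0 -- nothing left to squeeze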
The lesson here being that compressing English text as suggested in the article is rather meaningless, and can be done much, much better. When the article asks at the end:
> We're sensitive to data loss in text form: we can only consume a few dozens of bytes per second, and so any error is obvious.
> Conversely, we're almost blind to it in pictures and images: and so losing quality doesn't bother us all that much.
> Should it?
My answer is "mu". I wanted to answer "no", but even that would mean accepting the question as valid.
EDIT: s/in this way/as suggested in the article/ for clarity.
I suppose my confusion arises from the fact that pretty much any practical compression scheme is essentially 'content adaptive' rather than 'standard codebooky'. In the lossy cases, you add whatever magical filtering is needed to toss out the perceptually irrelevant frequencies.
I agree with you that the article misunderstands compression and coding in general; I just couldn't quite figure out what one of your counter-examples was about.
I found that when I tried to use up goer five speak (e.g. http://m50d.github.io/2013/01/18/girl-with-not-real-power.ht...) I used many more words to express the same number of concepts. So I'd be interested to see how much more or less compressible it actually makes things.
Compression depends on the data being full of repeated pieces. Lossy compression improves the hit rate of repeated pieces by replacing rare pieces with similar, common ones. In the case of colours, naive lossy compression means "Apricot" and "Burnt Sienna" both become "Orange", which becomes "Org"; "Chartreuse" and "Pistachio" both become "Lime", which gets abbreviated to "Lm". A final list of all the colours gets tallied at the end, and colours like "Orange" and "Lime" may get abbreviated all the way down to "O" and "L".
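A toy version of that colour example, with a made-up nearest-common-colour table and prefix abbreviations standing in for a real entropy coder:

    from collections import Counter

    # Lossy step: collapse rare colours onto nearby common ones (a made-up mapping).
    NEAREST_COMMON = {"Apricot": "Orange", "Burnt Sienna": "Orange",
                      "Chartreuse": "Lime", "Pistachio": "Lime"}

    data = ["Apricot", "Orange", "Burnt Sienna", "Chartreuse", "Orange", "Pistachio"]
    lossy = [NEAREST_COMMON.get(c, c) for c in data]

    # Lossless step: tally the pieces and give the most-repeated ones the shortest codes
    # (toy prefix abbreviations standing in for a real coder like Huffman).
    counts = Counter(lossy)
    codes = {colour: colour[:rank + 1]
             for rank, (colour, _) in enumerate(counts.most_common())}
    print(lossy)                       # ['Orange', 'Orange', 'Orange', 'Lime', 'Orange', 'Lime']
    print([codes[c] for c in lossy])   # ['O', 'O', 'O', 'Li', 'O', 'Li']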
> "The point isn't how many words you're using, but how repetitiously those words appear."
Not under the compression scheme barakm is proposing, which is to fix a universal dictionary (the commonest 1k or 64k English words across some very wide corpus), so that you don't have to transmit it.
Under the 64k scheme, every word is 2B, but you probably have to use a few more words. It's a naive scheme, but a very comprehensible one, and relatively useful for shorter texts.
As an alternative, sure, you can make a custom dictionary, in which case you can use whatever words you like, as long as you can keep the distinct number down, and ideally heavily-repeat certain words. This results in a larger filesize for small messages, because you have to start by sending the dictionary, but a much better filesize for nearly any large message.
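Rough numbers make the tradeoff concrete; every figure below (average word length, average code size, message sizes, distinct-word counts) is a made-up illustration, not a measurement:

    def fixed_codebook_bytes(n_words):
        # Agreed-upon 64k list: two bytes per word, no dictionary to transmit.
        return 2 * n_words

    def custom_dict_bytes(n_words, n_distinct, avg_word_len=6, avg_code_bytes=1.2):
        # Ship the word list up front, then spend shorter codes per word on average;
        # 1.2 bytes/word is a made-up Zipf-ish figure, not a measurement.
        dictionary = n_distinct * (avg_word_len + 1)
        return dictionary + int(n_words * avg_code_bytes)

    # (words, distinct words): tweet-ish, essay-ish, collected-Shakespeare-ish guesses
    for n_words, n_distinct in ((50, 40), (5_000, 1_500), (900_000, 30_000)):
        print(n_words, fixed_codebook_bytes(n_words), custom_dict_bytes(n_words, n_distinct))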
Or, hmn. I guess you could allow recursive compression, or back-references, or something, so that you could compress repetition of word-sequences.
And, hmn, that might not always work out. Well, look, you could allocate the first byte of the message to switching between compression schemes, so that you can choose between 256 different schemes, and choose the best one.
Oh look, if I keep this up for another hour or two, maybe I'll catch up to the state of the art from before I was born. No, I don't know what wavelets are, nor who Fourier is, why do you ask?
Anyway, popping out of fake-naive mode, my point is that Huffman compression isn't the only compression technique.
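For what it's worth, the back-reference idea from a couple of paragraphs up is only a few lines in toy form (word-pair matches only, with a made-up window size -- real LZ-style coders match variable-length byte runs):

    def backref_encode(words, window=32):
        # Toy back-references: if the next two words already appeared recently,
        # emit (distance, 2) instead of repeating them.
        out, i = [], 0
        while i < len(words):
            match = None
            for j in range(max(0, i - window), i):
                if i + 2 <= len(words) and words[j:j + 2] == words[i:i + 2]:
                    match = (i - j, 2)
            if match:
                out.append(match)
                i += 2
            else:
                out.append(words[i])
                i += 1
        return out

    print(backref_encode("to be or not to be that is the question".split()))
    # -> ['to', 'be', 'or', 'not', (4, 2), 'that', 'is', 'the', 'question']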
Shakespeare is particularly evil in this regard, given the terms he coined. Some we have picked up and others haven't made it to the modern age.
But, extra credit for compressing such that it still pentas its iambic meter. :-)
It would be fun to take arbitrary slices of force-ranked words and constrain composition likewise. Words of ordinal frequency 1500-3000 would create a bit of a challenge along the lines of writing a book without a single letter e. (http://books.google.com/books/about/Gadsby.html?id=jG1JU82Bd...)
Or, conversely: in images, the data bits are the pixels; in text, the data bits are two steps removed from the words (one step is a character encoding, the second is a language).