
The easiest/quickest lossy text compression that could theoretically almost halve the size of your data is to drop capitalization (assuming you're also using some minimal encoding that can only encode the characters you're actually using). There is some semantic info imparted by capitals, but not much.


No, that would not cut the length in half, or anything close to it.

One form of "minimal encoding" you're describing would be to use arithmetic coding where the predictor simply assigns equal probability to each symbol actually used in the input. For example, if you encode a 1000-letter message which includes at least one instance of every letter (both upper and lower case), then you'd have 52 symbols, and the prediction step would assign 1/52 as the probability of each symbol. This results in a constant log2(52) ≈ 5.7 bits of output per symbol, so your 1000-letter message would encode at about 5700 bits.
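
A quick back-of-the-envelope version of that in Python (a sketch assuming an ideal arithmetic coder with a uniform model over all 52 symbols):

    import math

    symbols = 52           # upper- and lower-case letters, uniform model
    message_length = 1000  # letters in the example message

    bits_per_symbol = math.log2(symbols)            # ~5.70 bits
    total_bits = message_length * bits_per_symbol   # ~5700 bits
    print(f"{bits_per_symbol:.2f} bits/symbol, {total_bits:.0f} bits total")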

Now suppose you discard capitalization. There are 26 symbols left, each assigned a probability of 1/26, for a constant log2(26) ≈ 4.7 bits per symbol. Your 1000-letter message now encodes at about 4700 bits, so you only save about 17% in this very generously constructed case.
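
The same sketch for the lowercase-only alphabet, compared against the ~5700-bit figure above:

    import math

    message_length = 1000
    bits_lower_only = message_length * math.log2(26)  # ~4700 bits
    saving = 1 - bits_lower_only / 5700               # vs. the mixed-case total
    print(f"{bits_lower_only:.0f} bits, about {saving:.1%} saved")  # ~17.5%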

The mistake you've made is to assume that cutting the number of symbols in half will cut the symbol length in half. In fact, halving the number of symbols only shaves a single bit off each symbol. For example, you need 8 bits to distinguish between 256 possibilities, but if you cut the possibilities down to 128, you still need 7 bits.
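
You can sanity-check that by taking log2 of a few alphabet sizes:

    import math

    for n in (256, 128, 64, 52, 26):
        print(f"{n:>3} symbols -> {math.log2(n):.2f} bits per symbol")
    # 256 -> 8.00, 128 -> 7.00, 64 -> 6.00, 52 -> 5.70, 26 -> 4.70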

In addition, lossy compression is usually combined with lossless compression, and a smart predictor could use context to reduce the cost of capital letters to much less than one bit per symbol (assuming that the text has a predictable structure, like human language).
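
To put a rough number on that: if a context model predicts the correct case, say, 98% of the time (a made-up figure, purely for illustration), the average cost of each case decision under an ideal coder is the binary entropy of that prediction, which is far below one bit:

    import math

    def binary_entropy(p):
        """Average bits to encode a two-way choice predicted with probability p."""
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    for p in (0.5, 0.9, 0.98):
        print(f"predicted correctly {p:.0%} of the time -> {binary_entropy(p):.2f} bits per case decision")
    # 50% -> 1.00, 90% -> 0.47, 98% -> 0.14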



