Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I am in the process of writing a program to sort one million 32bit integers in 2MB in RAM.

The trick is to notice that sorting the numbers loses information. One million sorted numbers take (in theory) log_2 (1000000!) less bits to store than the same numbers shuffled randomly. That's around 18488884.82 bits.

So you get an ideal storage requirement of +32000000-18488884=13511116 bits. That's about 1.6MB (thus less than 2MB). If you're clever you'll stay close enough to the theoretical limit in your implementation.

Strangely bz2 seems to be able to compress my list of random numbers (in a binary file) to 164KB. That's even less than what my theory says.

At first I suspected my pseudo-random number generator (/dev/urandom on a recent Ubuntu) my produce hidden patterns. But I since tested it on true random numbers from random.org. Same result.

I'll investigate.



I believe your ideal storage requirement, but what's the time complexity of your algorithm, and could you describe it in more detail than "clever"?

I'm pretty sure random reads and writes on any compressed form will be worse than O(1), and I'm pretty sure your sort will do O(n log n) of those, so I believe your algorithm is asymptotically (and in practice for n = 1 million) slower than chmike's two-pass algorithm.


I was thinking along eru's lines. The specifics I came up with: 1) Compress sorted lists by encoding the differences between successive numbers using a code that's shorter for smaller numbers. 2) Use in-memory merge sort. That way you don't need random access to the lists, and it's overall O(n log n). 3) In order to merge large lists when memory is already almost full, divide the lists into blocks, allocate blocks for the merged list while you're deallocating blocks from the two source lists. 4) To do this in Python, use the "array" module.


Ah, I made a mistake with compressing the real file - I assumed 2 Million 16 Bit numbers.

Using 1 Million 32 Bit numbers I still get down to 1891867 Bytes with 7zip. That's less than 2 MB but more than the theoretic minimum of 13511116 bits (=1688889.5 Byte).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: