
I don't understand how 4 random words from a 50k dictionary gives 90 bits of entropy.

    50k ^ 4 => 6.25e+18

    2 ^ 90  => 1.24e+27
Am I missing something?


The words are separated (and I guess terminated) by a random symbol (of which there are apparently 32 possibilities) or (randomly) a random two-digit number. log(50000^4*(100+32)^4)/log(2) = 90.6
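That arithmetic can be sanity-checked directly (a sketch, assuming a 50,000-word list and a separator drawn from 32 symbols plus the two-digit numbers 00-99, i.e. 132 choices):

```python
import math

words = 50_000          # dictionary size
separators = 32 + 100   # 32 symbols plus two-digit numbers 00-99

# four words, each followed by a random separator
combinations = words**4 * separators**4
entropy_bits = math.log2(combinations)
print(round(entropy_bits, 1))  # -> 90.6
```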


Doesn't that defeat the "Trivial to remember" part of the xkcd concept of "Trivial to remember, hard to guess?"


That means you only have 7 entities to remember, which is supposed to be what humans can handle easily. (Though I understand now that this is stale knowledge.)


The number refers to the capacity of working memory and has indeed been revised over time. It's a complicated issue, but 4 units of information is now considered a more accurate approximation. But if you're remembering a password, you're dealing with long-term memory, which doesn't have this limitation.

http://en.wikipedia.org/wiki/Working_memory#Capacity


Long-term memory usually accepts most information from short-term memory rather than from direct sensory input, so it is easier to memorize things if you break them down into chunks first.



Can you offer a link on how these measurements are done and what exactly they mean?

Reading the Wikipedia article about entropy (in computing), I stumbled across this[1]:

" Around 2011, two of the random devices were dropped and linked into a single source as it could produce hundreds of megabytes per second of high quality random data on an average system. "

What does "high quality random data" mean in this context? I mean, 'random' has a precise definition: something either 'is random' (e.g. the distance to the next prime :-P, kill Riemann) or is 'not random' (the distance to the next prime if Riemann's ζ(s) is solved).

[1] https://news.ycombinator.com/user?id=dasil003


The information theory (mathematical) definition of randomness is explained nicely in Chapter V of An Introduction to Information Theory:

http://www.amazon.com/Introduction-Information-Theory-Symbol...

In the particular case of a series of n equally probable, independent events, the entropy is given as H = -log(1/n), measured in bits. For example, the entropy of a fair die is -log(1/6) = 2.58 bits per throw.

In this case, the random event is words chosen from a word list. Four words are chosen from fifty thousand, with each word having equal probability of being chosen. So the entropy (measure of randomness) is - log 1/50,000 = 15.6 bits per word, or 62.4 bits per four-word combination. (The script also adds random numbers or symbols, to add up to 90 bits.)
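The per-word and per-combination figures above can be reproduced in a couple of lines (using base-2 logarithms, as the bit counts imply):

```python
import math

n = 50_000                     # size of the word list
per_word = -math.log2(1 / n)   # entropy of one uniformly chosen word
print(round(per_word, 1))      # -> 15.6
print(round(4 * per_word, 1))  # -> 62.4 bits for four words
```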


Ugh, the other replies aren't quite getting at the crux of the matter. A function f : {0, 1}^k -> {0, 1}^n -> {0, 1}^m is considered to be pseudo-random if (when provided with a key K) there is no way of distinguishing between its output and the output of a random function F : {0, 1}^n -> {0, 1}^m in polynomial time (it's a little more complicated than that, but that's the gist).

In modern computers, the basic principle is that one generates a random seed (through heat noise, keyboard movements, mouse movements, user interactions with the computer, etc.) and then uses it as the key for a pseudo-random function such that the next output of the function cannot be predicted given all of the previous ones. One can simply iterate if one so chooses (first n bits output = f_K(0), next f_K(1), ...).
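A minimal sketch of that counter-mode construction, using HMAC-SHA256 as the keyed pseudo-random function (an illustrative choice; real CSPRNGs such as the kernel's are considerably more involved):

```python
import hmac
import hashlib
import os

def prf_stream(key: bytes, n_blocks: int):
    """Yield pseudo-random blocks f_K(0), f_K(1), ... using HMAC-SHA256 as the PRF."""
    for counter in range(n_blocks):
        msg = counter.to_bytes(8, "big")
        yield hmac.new(key, msg, hashlib.sha256).digest()

seed = os.urandom(32)  # the real entropy lives here: an unpredictable seed
stream = b"".join(prf_stream(seed, 4))  # 128 pseudo-random bytes
```

Given the same key, the stream is fully deterministic; all unpredictability comes from the seed, which is why seed quality matters so much.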

In this case, he's simply working from the fact that if one knows the target is using this scheme, one knows there are at most 50000^4 word combinations the target could be using. 90 bits of entropy means there are 2^90 passwords the target could be using. The difference here is that this program also outputs a special character, or a 2-digit number, between each word, meaning that we have 50000^4*(100+32)^4 > 2^90. It's kinda messy, and probably not that easy to remember.

In general, with passwords, one wants about 80 bits of entropy - that's high enough to be incredibly difficult to crack even when hashed insecurely. Once we start using proper algorithms like bcrypt it's even harder.

To give a comparison, a password output by this is roughly as hard to crack as a 15 character random password encoded in ASCII, or if unicode is allowed, a 6 character password in that encoding.
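To check that comparison (a rough estimate, assuming a random password drawn from the 95 printable ASCII characters):

```python
import math

printable_ascii = 95  # assumption: the 95 printable ASCII characters
bits_15_chars = 15 * math.log2(printable_ascii)
print(round(bits_15_chars, 1))  # -> 98.5, the same ballpark as 90.6
```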


"Distance of next prime" is not random, it's entirely deterministic. It's a well-defined function. Just because we don't have an analytical solution for it doesn't make it random.

As for "high quality", some random variables are less predictable than others. A weighted coin that comes up heads 51% of the time is still random (not deterministic), but more predictable than a fair coin.


I don't like calling non-uniform distributions 'predictable'.

Think of it like this: a coin which produces heads 99/100 times is not predictable, since you cannot predict when the next tails is coming; you just know that tails are rarer.

As for the distance-to-the-next-prime thing, you are just plain wrong, in the sense that he is talking about a seeded PRNG, i.e. the starting prime is a seed and you don't get to know which one it is, out of all the integers...


> in the sense that he is talking about a seeded PRNG

That little "p" makes all the difference. The post I was replying to was asking about the precise definition of "random", not "pseudorandom".

As you allude to, any real randomness in such a scheme needs to come in the form of a seed.


That's random in maths, but we're not talking maths here. In the real world, "random" means a value that's hard to predict. For example, a dice throw is a high-quality random source under normal circumstances.

A low-quality random source would depend on some variables that make the next random number easier to predict (perhaps only statistically). The worst-quality random source would be pseudo-random numbers, where each next value follows directly from the previous value and perhaps some secret second value.

It might be distributed like a true random number, but eventually the secret could be guessed and the numbers would become predictable.


A simple explanation is that "randomness" here is related to (the logarithm of) the minimum average time [1] you would take to guess the password correctly: H = sum(p_i * log(1/p_i)). Since the random number generator supposedly gives a uniform distribution over a set of n words, this simplifies to H = n * (1/n * log(n)) = log(n).

[1] Your average time is at least t* = 1/2 * 2^H, so log(t*) >= H - 1.
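The simplification above can be verified numerically: summing p_i * log2(1/p_i) over a uniform distribution collapses to log2(n). A sketch, using the 50,000-word list from this thread:

```python
import math

n = 50_000
p = [1 / n] * n  # uniform distribution over n words

# General Shannon entropy sum...
H = sum(pi * math.log2(1 / pi) for pi in p)
print(round(H, 1))  # -> 15.6

# ...equals the closed form log2(n) for the uniform case.
assert abs(H - math.log2(n)) < 1e-9
```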


I wrote a post about this some time ago: http://ruudvanasseldonk.com/2012/07/07/passphrase-entropy



