# A tale of lost entropy

Recently, while looking at a JavaScript function intended to generate a cryptographically-secure random IV to be used in AES-GCM, I noticed something interesting which I immediately suspected was not unique to this project. Sure enough, Matt, my awesome colleague, sent me a link to a how-to article describing the process of generating random values in Node.js that included the exact same quirk.

Here is their example (with minor edits so as not to call out the author of that how-to post too explicitly):

Do you notice anything fishy?

Perhaps you notice that len/2 and think that’s a little strange. This quirk is described thusly (I can’t help but quote here):

“In hexadecimal format a single byte generates two characters. So there is no point to generate the same number of bytes as number of output characters. We can only generate half of the bytes.”

What happens when you call crypto.randBytes(X)? You get X random bytes in a Buffer. You can easily represent these in hex string format with .toString(‘hex’), as in the example above. So for a call to crypto.randBytes(8).toString(‘hex’) you might get “71d75f6f286f2e96”.

Cool! That looks like 16 characters though, and we only wanted 8… Here you might say to yourself, “Aha! I’ll just generate 8 bytes like in the example!”

And if you DID think that, I now have to bore you with some math.

Those familiar with hexadecimal will remember that we have 16 distinct hex characters: 0-9 and A-F. A byte is composed of 2 hex characters, or 8 binary bits. For instance, 0x0F in binary is 00001111 (15; binary padded for emphasis). 0xFF is 11111111 (255), and so on.

When you ask the crypto module to give you 8 random bytes, think of it like getting 8 numbers between 0 and 255. You can think of this as your “character set” for random number generation. These numbers can be represented in hexadecimal form like we saw earlier, or they can be represented in something like UTF-8. Here is what crypto.randBytes(8).toString(‘utf-8’) might look like: o������1

When you generate 8 random bytes, you have 256^8 (1.84467441e19) possibilities. What if we convert the bytes to hex? Well, we know hex has 16 possible characters, and since we represent each random byte as 2 hex characters, 8 bytes turns into a 16-character hex string. We can see that there are the same number of possibilities in both cases: 16^16 = 1.84467441e19 = 256^8.

Perhaps you can see where I’m going with this now. If we shave off those last 8 characters of the hex string, we’re now looking at 16^8 (= 4294967296) possibilities, or the square root of 1.84467441e19. We’ve reduced the total possible entropy considerably.

To illustrate this more clearly, if we were representing the random bytes in binary, and we decided we only wanted 8 binary characters, that would give us only 256 possibilities (2^8), or one **actual** byte of entropy.

You’re probably thinking:

It’s true, how much entropy you need depends 100% on the context, and this whole discussion may be overkill for some situations, but if you take a look at NIST SP800-38D (Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC), it says the following:

The IVs in GCM must fulfill the following “uniqueness” requirement: The probability that the authenticated encryption function ever will be invoked with the same IV and the same key on two (or more) distinct sets of input data shall be no greater than 2^-32. Compliance with this requirement is crucial to the security of GCM. Across all instances of the authenticated encryption function with a given key, if even one IV is ever repeated, then the implementation may be vulnerable to … forgery attacks … In practice, this requirement is almost as important as the secrecy of the key.

Funnily enough, this is exactly how likely an 8-character hex string is to be repeated if you assume that the possibilities are handed out 100% randomly, since 2^32 = 16^8! While you might think that your IV composed of 8 random bytes, or 64 bits, would give you some breathing room on this “uniqueness” requirement, your 8-character hex string does not (and you should be using 96-bit IVs anyway, but whatever). This paper, referenced in the NIST document above, explains several attacks on GCM, and is a little more pointed about that “uniqueness” requirement:

… if for some reason a small number of IVs are repeated, it is usually expected that beyond this basic attack which reveal the corresponding messages, nothing too bad should happen. With an ordinary counter mode, this is indeed the case. However, with both versions of GCM, allowing a small number of repeated IVs leads to a serious leak where the authentication key is revealed.

…

To sum it up, the only thing the adversary cannot do is recover the main key K. However, the main goal of cryptography is to protect the security of data not keys.

Probably best to avoid tempting fate here.

Sadly, I’ve since found several examples of this misguided byte-reduction on Github, and it’s pretty much impossible to miss the how-to article referenced at the beginning of this post when Googling on this topic. Again, context is key, but this is a cautionary tale against making too many assumptions in a cryptographically sensitive context.

If you’re convinced that you should let those extra bytes be free, but you don’t want to pass around some arbitrary UTF-8 string in e.g. JSON, my suggestion would be to encode it with base64 to save a little space, or to save the full hex string.

If you enjoyed this post, reach out to me on Twitter (@ccneill) and let me know!