Back in the old days, “going online” meant calling up with your modem at 300 bits per second and interacting slowly with a basic command prompt (sometimes BASIC in the literal sense).
Noise on the line and other electrical interference could easily turn zeros into ones and vice versa, causing corruption in your session, such as BANANA
turned into BANAMA
, MANGO
into MaNGO
, or even ONLINE
into +++ATZ
.
A common way to spot obvious errors automatically was by using a checksum, quite literally calculated by checking the sum of all the numeric values of every byte in the message.
Checksums were used because they were quick to calculate, as far back as the 1960s and 1970s, because even the most underpowered processors usually had an ADD
or ACCUMULATE
instruction that could efficiently maintain a running total of this sort.
But checksums were error prone.
If you swap round any of the bytes in a message, the checksum remains unchanged because A+B = B+A.
Likewise, two errors can easily cancel out, for example if BANANA
gets corrupted into CANAMA
, because (A+1) + (B-1) = A+B.
Enter the CRC
CRCs, short for cyclic redundancy checks, were a better solution, using a comparatively simple series of bit-shifts and XOR operations to maintain a bigger accumulator that wasn’t so easily fooled by double errors or swapped bytes.
CRC-32 produces a 32-bit (4-byte) checksum – today, the term checksum is used metaphorically, not literally to mean that the bytes were added together – designed to do a much better job of detecting accidental errors such as those caused by mistyping or by modem corruption.
But CRCs aren’t any good against deliberate errors.
That’s because CRCs are based on a process involving a form of long division, making the algorithm predictable, so the output can be tweaked to be whatever you like by tweaking the input slightly in a way that can be calculated automatically.
That makes it trivial to create a message with any checksum you like, for example so that its checksum matches an earlier message in order to create confusion, commit fraud, or worse.
Note that there are only 4 billion (232) different possible CRC-32 values, so that at modern computer speeds you could forge a CRC-32 without any arithmetical trickery by brute force, simply by making billions of random modifications to the message until you hit paydirt.
But even if you extend your CRC to 64 bits, 128 bits or even more to make accidental duplicates as good as impossible, it’s still easy to calculate forgeries very quickly, with no need to rely on brute force.
Moving up to cryptographic hashes
For security-related purposes, you need what’s commonly referred to as a cryptographic checksum or cryptographic hash.
This sort of algorithm is designed not only to detect accidental errors, but also to be “untrickable” enough to prevent deliberate errors.
In particular, a cryptographic hash, denoted here as a mathematical function H()
, should have at least these characteristics:
- If you deliberately create two messages
M1
andM2
(any two messages; you get to choose both of them) such thatH(M1) = H(M2)
, you have a collision, so thatH
has failed as a digital fingerprint. Therefore you should not be able to construct a collision, other than by trying over and over with different inputs until you hit the jackpot by chance. - If you know that
H(M) = X
, but you don’t know my messageM
, then you should not be able to “go backwards” fromX
toM
, other than by trying different messages over and over until you hit the jackpot by chance. - If I choose
M
and tell you what it is, so you can computeH(M) = X
for yourself, you should not be able to come up with a different messageM’
that also hasH(M’) = X
, other than by guesswork. (This is much tougher than case 1 because you don’t get to choose any matching pair of hashes from a giant pool of messages. You have to match my hash, not any hash, which squares the effort needed.)
For many years, an algorithm called MD5 was widely used because it claimed to provide these three protections against abuse, but it is now forbidden in the cryptographic world because it is known to fail on Promise One above.
Once a hashing algorithm fails in respect of Promise One, it’s prudent to assume it won’t meet its other design goals either, even if it seems to be holding out on the other two promises.
MD5 collisions are easy to generate on purpose, so the algorithm can no longer be trusted.
SHA-1 replaces MD5
SHA-1 was the next-most-complex hash after MD5, and was widely used as a replacement when MD5 fell out of favour.
Greatly oversimplified, the SHA-1 algorithm consumes its input in blocks of sixteen 32-bit words (512 bits, or 64 bytes), mixing each block into a cumulative hash of five 32-bit words (160 bits, or 20 bytes).
for block in blocks() do for i = 17 to 80 do -- each step here extends the original 16-word input -- block to 80 words by adding one word made by mixing -- together four of the previous sixteen words. block[i] = minimixtogether(block,i) end for i = 1 to 80 do -- each step here mixes one of the words from the 80-word -- "extended block" into the five-byte hash accumulator hash = giantmixtogether(block,i) end end
The giantmixtogether()
function that scrambles the extended input into the hash uses a range of different operations, including NOT, AND, OR, XOR, ADD and ROL (rotate left); the minimixtogether()
function used to massage the input data uses XOR and ROL.
The algorithm certainly looks complicated, and at first glance you would assume that it mixes-minces-shreds-and-liquidises its input more than well enough to be “untrickable”.
Indeed, the complexity of SHA-1 was considered sufficient to immunise it against the weaknesses in the similar but simpler MD5 algorithm.
At the same time, SHA-1 was not so much more complicated than MD5 that it would run too slowly to be a convenient replacement.
SHA-1 considered harmful
For years, however, experts have been telling everyone to stop using SHA-1, and to use more complex hash algorithms such as SHA-2 and SHA-3 instead, predicting that the first actual real-world, in-your-face chink in SHA-1’s armour would turn up soon.
Google’s Chrome browser, for example, stopped accepting web site certificates with SHA-1 hashes at the start of 2017, considering them no longer safe enough.
The Mozilla Firefox browser will soon follow suit.
The reason is simple: as soon as someone actually turns theory into reality, and produces a hash collision, you can no longer rely on saying, “She’ll be right for a while yet,” because your “while yet” period just expired.
So it’s a good idea to get ahead of the game and to abandon creaky cryptographic code before it goes “Bang!”
Even if a collision takes an enormous amount of work – imagine that you’d need 110 top-end graphics cards running flat out for a whole year, for example – the first actual collision would be what you might call the digital disproof of the pudding.
The digital disproof
So, to cut what has become a long story short, you need to know that researchers from Google and the CWI Institute in Amsterdam…
…have just disproved the pudding.
Bang!
A hash collision that in theory should have taken them thousands of years to stumble upon by chance has been created on purpose within all of our lifetimes – and that should simply never have happened.
Apparently, they did indeed need 110 top-end graphics cards running for a whole year, but that is still 100,000 times faster than the design goals (and the theoretical strength) of SHA-1, making SHA-1 a risky proposition for evermore.
TL;DR: SHA-1 really is broken, so use a stronger hash from now on, because cryptographic attacks only ever get faster.