Back in the old days, “going online” meant calling up with your modem at 300 bits per second and interacting slowly with a basic command prompt (sometimes BASIC in the literal sense).
Noise on the line and other electrical interference could easily turn zeros into ones and vice versa, causing corruption in your session, such as BANANA
turned into BANAMA
, MANGO
into MaNGO
, or even ONLINE
into +++ATZ
.
A common way to spot obvious errors automatically was by using a checksum, quite literally calculated by checking the sum of all the numeric values of every byte in the message.
Checksums were used because they were quick to calculate, as far back as the 1960s and 1970s, because even the most underpowered processors usually had an ADD
or ACCUMULATE
instruction that could efficiently maintain a running total of this sort.
But checksums were error prone.
If you swap round any of the bytes in a message, the checksum remains unchanged because A+B = B+A.
Likewise, two errors can easily cancel out, for example if BANANA
gets corrupted into CANAMA
, because (A+1) + (B-1) = A+B.
Enter the CRC
CRCs, short for cyclic redundancy checks, were a better solution, using a comparatively simple series of bit-shifts and XOR operations to maintain a bigger accumulator that wasn’t so easily fooled by double errors or swapped bytes.
CRC-32 produces a 32-bit (4-byte) checksum – today, the term checksum is used metaphorically, not literally to mean that the bytes were added together – designed to do a much better job of detecting accidental errors such as those caused by mistyping or by modem corruption.
But CRCs aren’t any good against deliberate errors.
That’s because CRCs are based on a process involving a form of long division, making the algorithm predictable, so the output can be tweaked to be whatever you like by tweaking the input slightly in a way that can be calculated automatically.
That makes it trivial to create a message with any checksum you like, for example so that its checksum matches an earlier message in order to create confusion, commit fraud, or worse.
Note that there are only 4 billion (232) different possible CRC-32 values, so that at modern computer speeds you could forge a CRC-32 without any arithmetical trickery by brute force, simply by making billions of random modifications to the message until you hit paydirt.
But even if you extend your CRC to 64 bits, 128 bits or even more to make accidental duplicates as good as impossible, it’s still easy to calculate forgeries very quickly, with no need to rely on brute force.
Moving up to cryptographic hashes
For security-related purposes, you need what’s commonly referred to as a cryptographic checksum or cryptographic hash.
This sort of algorithm is designed not only to detect accidental errors, but also to be “untrickable” enough to prevent deliberate errors.
In particular, a cryptographic hash, denoted here as a mathematical function H()
, should have at least these characteristics:
- If you deliberately create two messages
M1
andM2
(any two messages; you get to choose both of them) such thatH(M1) = H(M2)
, you have a collision, so thatH
has failed as a digital fingerprint. Therefore you should not be able to construct a collision, other than by trying over and over with different inputs until you hit the jackpot by chance. - If you know that
H(M) = X
, but you don’t know my messageM
, then you should not be able to “go backwards” fromX
toM
, other than by trying different messages over and over until you hit the jackpot by chance. - If I choose
M
and tell you what it is, so you can computeH(M) = X
for yourself, you should not be able to come up with a different messageM’
that also hasH(M’) = X
, other than by guesswork. (This is much tougher than case 1 because you don’t get to choose any matching pair of hashes from a giant pool of messages. You have to match my hash, not any hash, which squares the effort needed.)
For many years, an algorithm called MD5 was widely used because it claimed to provide these three protections against abuse, but it is now forbidden in the cryptographic world because it is known to fail on Promise One above.
Once a hashing algorithm fails in respect of Promise One, it’s prudent to assume it won’t meet its other design goals either, even if it seems to be holding out on the other two promises.
MD5 collisions are easy to generate on purpose, so the algorithm can no longer be trusted.
SHA-1 replaces MD5
SHA-1 was the next-most-complex hash after MD5, and was widely used as a replacement when MD5 fell out of favour.
Greatly oversimplified, the SHA-1 algorithm consumes its input in blocks of sixteen 32-bit words (512 bits, or 64 bytes), mixing each block into a cumulative hash of five 32-bit words (160 bits, or 20 bytes).
for block in blocks() do for i = 17 to 80 do -- each step here extends the original 16-word input -- block to 80 words by adding one word made by mixing -- together four of the previous sixteen words. block[i] = minimixtogether(block,i) end for i = 1 to 80 do -- each step here mixes one of the words from the 80-word -- "extended block" into the five-byte hash accumulator hash = giantmixtogether(block,i) end end
The giantmixtogether()
function that scrambles the extended input into the hash uses a range of different operations, including NOT, AND, OR, XOR, ADD and ROL (rotate left); the minimixtogether()
function used to massage the input data uses XOR and ROL.
The algorithm certainly looks complicated, and at first glance you would assume that it mixes-minces-shreds-and-liquidises its input more than well enough to be “untrickable”.
Indeed, the complexity of SHA-1 was considered sufficient to immunise it against the weaknesses in the similar but simpler MD5 algorithm.
At the same time, SHA-1 was not so much more complicated than MD5 that it would run too slowly to be a convenient replacement.
SHA-1 considered harmful
For years, however, experts have been telling everyone to stop using SHA-1, and to use more complex hash algorithms such as SHA-2 and SHA-3 instead, predicting that the first actual real-world, in-your-face chink in SHA-1’s armour would turn up soon.
Google’s Chrome browser, for example, stopped accepting web site certificates with SHA-1 hashes at the start of 2017, considering them no longer safe enough.
The Mozilla Firefox browser will soon follow suit.
The reason is simple: as soon as someone actually turns theory into reality, and produces a hash collision, you can no longer rely on saying, “She’ll be right for a while yet,” because your “while yet” period just expired.
So it’s a good idea to get ahead of the game and to abandon creaky cryptographic code before it goes “Bang!”
Even if a collision takes an enormous amount of work – imagine that you’d need 110 top-end graphics cards running flat out for a whole year, for example – the first actual collision would be what you might call the digital disproof of the pudding.
The digital disproof
So, to cut what has become a long story short, you need to know that researchers from Google and the CWI Institute in Amsterdam…
…have just disproved the pudding.
Bang!
A hash collision that in theory should have taken them thousands of years to stumble upon by chance has been created on purpose within all of our lifetimes – and that should simply never have happened.
Apparently, they did indeed need 110 top-end graphics cards running for a whole year, but that is still 100,000 times faster than the design goals (and the theoretical strength) of SHA-1, making SHA-1 a risky proposition for evermore.
TL;DR: SHA-1 really is broken, so use a stronger hash from now on, because cryptographic attacks only ever get faster.
Chris
Cool article on cryptography and checksums! FYI, misspelled the first GiantMixTogether() in an actual paragaph.
Paul Ducklin
Fixed the typo – thanks for spotting it :-)
Jonathon
Wow, it feels too soon for SHA-1. However, there it is.
There is lots of sample code floating around that still uses MD5 as the default hashing algorithm. I recently saw a website that recommends SHA-256. (SHA2 32-bit) (I personally would recommend SHA-512 (SHA2 64-bit) instead.)
However, I would like to see SHA3 available in the default code of most languages. More importantly, I’d like to see more widespread availability of bcrypt and scrypt stretched-hashing algorithms for easy use.
It’s a relief to see PHP get libsodium. I look forward to learning more about that, and hope that it becomes widely available to all major languages soon. libsodium appears to have scrypt built-in, although it appears to be scrypt based on SHA-256 (32 bit). Again, I’d prefer SHA-512 as a baseline. Of course, the ultimate preference is probably SHA3 as a baseline.
(edited to remove link)
Mike Crowley
There is not industry-wide consensus that SHA-3 is better than SHA-2. (edited to remove link)
Sandor
Jonathon, I assume this was just a typo, but SHA-256 is 256-bit (i.e. 32 bytes, not 32 bits). Similarly SHA-512 produces a 64 bytes long output (and not 64-bit).
Paul Ducklin
I think the OP meant to refer to the internal word size used inside SHA-256 and SHA-512.
SHA-256 is based internally on 32-bit operations; 32-bit constants used in the mixing functions; and arithmetic modulo 232, meaning that when you add 1 to 4,294,967,295, better known as 232 – 1 you get zero, not 4,294,967,296.
SHA-512 is based internally on 64-bit operations; 64-bit constants; and integers that overflow only after 1,844,674,407,370,9551,615 (264 – 1).
Keeley Monroy
As a non-programmer, non-super geek, I really liked your article. Well written, easy to comprehend, and enjoyable. The vast world of 1s and 0s always makes my head spin. You have the capability it make it stop for awhile. I can now actually explain to others why MD5 and SHA-1 without saying. “I heard it somewhere.”
Paul Ducklin
Thanks for your kind words. I can’t explain why you have any downvotes at all for your comment, let alone a whole bunch of them…but there’s no pleasing some people, I suppose.
Alex
The files have the same SHA-1 but different MD5. So isn’t it still secure by checking with different algorithms?
Paul Ducklin
Yes. The thing is, that if you recognise the need to switch algorithms…
…why switch *back* to MD5, which is yet more insecure than SHA-1?
Perhaps you’re suggesting using multiple algorithms at the same time and accepting that the files are the same if any one of those algorithms matches? But that puts you in tension with the equally meaningful approach of using mutliple algorithms and treating the files as different if any one of the algorithms *doesn’t* match :-)
Why not just pick one algorithm that you think isn’t broken. (And have a procedure in your business for switching cryptosystems in future. This will happen again.)
Alex
Yes my point was that maybe using different algorithms might solve this problem. But then you are adding time and steps. I just didn’t really buy the SHA-1 is dead argument. This is a lab experiment which took a year to produce. Time to plan for the next step but no need to panic is all I’m saying.
Paul Ducklin
Thing is, we’ve been planning for the next step for years already.
So, if you are cryptanalytically cautious (i.e. you accepted that SHA-1 was past its use-by date on the theoretical evidence available long before the recent collision actually happened) but cryptographically progressive (i.e. you embrace cryptographic improvemnents rather than seeking to defer them), then…
…you would instead say, “Time to *take* the next step, but no need to panic.”
That’s all I’m saying :-)