How to fit all of Shakespeare in one tweet (and why not to do it!)

Here at Naked Security, we’ve written about steganography before.
Steganography is a fascinating trick for sending secret messages – and it’s intriguingly different from cryptography, even though the two techniques are often lumped together as if they were the same.
Simply put, cryptography scrambles messages so that only the intended recipient can read them.
Generally speaking, attackers who intercept encrypted messages know that something was said, but can’t figure out what it was.
For example, I confidently predict that you will not be able to unravel the original content of this coded message:

   CBTEM YPQNE TUTMQ WLJFJ FKRBG
   OFYIA DQTLP GNCBD LDOHN AHMOR
   MHUUG EJOWN CSCXA VGTUH SXTLN
   BOTXN ATFMU WLHID RXDWC IJMEA
   KWQEI PUGMF KPHSL HCHUY TMUJE

But those five-letter groups, reminiscent of an enciphered World War Two message, will almost certainly make you suspicious.
You might therefore assume I’ve got something important to hide, a fact that could land me in even hotter water than if I’d blurted out the message openly.
Steganography therefore aims to disguise messages not only to keep the contents secret, but also to hide the existence of a message in the first place.
The process could be as simple as agreeing in advance on a special interpretation of a casual word or gesture.
According to rumours, if Queen Elizabeth II innocently twiddles her wedding ring while at an event, it’s a discreet signal that she’d like one of her entourage to extract her gracefully from her current conversation so she can move on to other guests without causing offence.

Hiding in images and sound

Another approach to steganography involves finding existing files or messages that contain data that’s unimportant and unlikely to be scrutinised, and replacing it with secret data instead, thus effectively hiding in plain sight.
One well-known trick for sneaking data into files that are popularly shared involves adding what’s supposed to look like “noise” into files such as photos, videos or music.
For example, each pixel in an 8-bit greyscale photo can have a value from 0 (totally black) to 255 (brightest white), even though you’d be hard pressed to tell the difference between a grey level of, say, 129 and 130, or between 8 and 9.
In fact, it might be technically impossible for you to discern a difference, given the combination of inaccuracies introduced by your camera when capturing the image, your screen when displaying it, and your eye when viewing it.
You can therefore replace some of the data in your image to encode information from somewhere else, for example by overwriting the bottom few bits of each image pixel with data of your own.
This leaves you with a hybrid file that is part image and part personal information.
But if anyone opens up your image-with-data-stashed-inside, it still looks like an image, albeit a noisy or low-quality one.

You can then use public services such as social media websites and photo-sharing services to distribute your hidden-message pictures in an innocently open sort of way.
Unfortunately, images distributed via online photo-sharing sites often get altered, or transcoded, along the way, on the grounds that online versions don’t have to be perfect copies of the original.
The image someone else sees after you’ve uploaded the original may have been scaled down to a standardized size, converted lossily between different formats, re-compressed to save space, or any combination of these tranformations.
The data you’ve hidden in the image pixels is unlikely to surive this sort of modification intact.
Also, as you can see in the examples above, the more data you try to hide in the pixels of an image, the less natural the image looks, limiting the data-carrying capacity of steganographic images dramatically.
Data that’s been embedded covertly in files such as images can often be detected because the resulting files don’t quite look right.

The data stashed in the images above is a pseudorandom sequence of bits, similar to what you’d see if you compressed or encrypted your personal data before hiding it in the image. Although this randomness looks like typical digital noise to the human eye, it doesn’t really match the noise produced by camera sensors, which is unpredictable but not particularly random. Indeed, images that seem to have been doctored in this way may arouse even more suspicion than explicitly encrypted information, and analytical tools exist that aim to detect unnaturally created data of this sort.

Hiding publicly on Twitter

This gave security researcher David Buchanan food for thought.
Just how much data could you stash reliably in a Twitter image upload, for example?
Twitter usually scales, crops, compresses and transcodes your images before publishing them via its website or in its apps – indeed, there are good cybersecurity reasons for Twitter to stick to a well-defined set of of image formats, sizes and content types.
Lots of image formats allow you to add what’s called metadata – information about the information in the image itself – to record details such as where a photo was taken, what camera type was used, what lens and exposure settings were used, and your personal comments about the image.
Twitter, along with many other on-line services, keeps some of the metadata from your images, but dumps the rest, at least in part to stop you sneakily using its servers to store or distribute data that isn’t really supposed to be there.
Buchanan quickly found that Twitter discarded most of the metadata in the images he uploaded, but not all of it.
Twitter always seems to keep added metadata of a type called ICC_PROFILE, short for International Color Consortium cross-platform profile format.
Because the ICC profile is supposed to explain how to display the image, rather than how it was originally captured or what the creator thought of it, retaining the profile data seems like a vital part of rendering it correctly when it’s viewed in a browser or an app.
But the ICC Specification, which takes up 130 pages, offers plenty of little known and rarely used components for stashing data that Twitter will retain but subsequently never use, providing that each chunk of data you embed is less than 64KB in size.

Data formats with 64KB limits typically have that restriction for exactly the same reason that early microcomputers couldn’t use more than 64KB of RAM – a consequence of how much storage was used to keep a count of how much storage was needed. If you routinely allocate just two bytes to store the sizes of any lumps of data in your file, you can’t keep track of lengths longer than 16 binary digits, for the same reason that the 5-digit mechanical odometers in older cars wrap round after 99,999 kilometres and go back to reading zero. The binary number 1111111111111111 (0xFFFF in hexadecimal, 65,535 in decimal) effectively fills up two bytes of storage – you can’t add to that number without wrapping back to zero and confusing your application.
Buchanan was able to perform the following data-stashing trick:

Take the Complete Works of Shakespeare, published for free in a single HTML file by Project Gutenberg, and weighing in at 7,033,657 bytes.
Compress it into a multipart RAR archive. Each archive part was limited to 64,512 bytes, for 31 parts in all totalling 1,938,197 bytes. (English text compresses rather well, because it uses a few characters very often, some characters rarely, and many characters not at all.)
Package the 31 RAR files into a single ZIP file. ZIP files contain internal headers that allow an unzipping program to skip over additional bytes between each file embedded in the ZIP.
Arrange extra data between each ZIP component to form a large and correctly formatted, though purposeless, ICC_PROFILE data file.
Add the ICC_PROFILE data to a tiny 64×64 pixel JPEG image of William Shakespeare, Esq.
Upload the image to Twitter.

If you view his tweet, the image looks like a low-resolution 10KByte image of Shakespeare…

Assuming this all works out, the image in this tweet is also a valid ZIP archive, containing a multipart RAR archive, containing the complete works of Shakespeare.

This technique also survives twitter's thumbnailer :P pic.twitter.com/P0Owq9abRC
— David Buchanan (@David3141593) October 29, 2018

…but if you download the image, you can feed it directly into an UNZIPping program, which will ignore all the interleaved data – including the image itself – that doesn’t belong to the compressed parts of the ZIP:

   $ unzip downloaded-image.jpg
   Archive:  downloaded-image.jpg
   error [downloaded-image.jpg]:  missing 454 bytes in zipfile
     (attempting to process anyway)
    extracting: shakespeare.part001.rar
    extracting: shakespeare.part002.rar
    extracting: shakespeare.part003.rar
    [. . . .]
    extracting: shakespeare.part029.rar
    extracting: shakespeare.part030.rar
    extracting: shakespeare.part031.rar
   $

Thanks to a rather forgiving file format, your unzipper spits out 31 sequentially named fragments of a RAR archive.
If you then feed the first fragment into an UNRARing program, the unarchiver will correctly figure out that it’s supposed to decompress all the numbered fragments in turn, spitting out shakespeare.html:

   $ unrar x shakespeare.part001.rar
   UNRAR 5.61 freeware      Copyright (c) 1993-2018 Alexander Roshal
   Extracting from shakespeare.part001.rar
   Extracting  shakespeare.html               3%
   Extracting from shakespeare.part002.rar
   ...         shakespeare.html               6%
   [. . . .]
   Extracting from shakespeare.part030.rar
   ...         shakespeare.html              99%
   Extracting from shakespeare.part031.rar
   ...         shakespeare.html              OK
   All OK
   $

You can then open the HTML file directly in your browser and read the Bard’s collected work at your leisure, where you will find, both literally and figuratively that:

There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.

Buchanan reported this to Twitter as “a bug”, which we think is a reasonable assessment…
…but Twitter replied that it wasn’t a bug, meaning that it seems fair to treat it as a feature:

I tried reporting this techinque to twitter's bug bounty program, but it's #notabug. Fair enough, but that just means we can have some fun with it 🤣
— David Buchanan (@David3141593) October 29, 2018

Should you do it?

Interestingly, if you look at the etymology of the word steganography – from the Ancient Greek στεγανός [stegaNOS], meaning covered or water-tight – you can argue that this technique does, literally at least, qualify as steganographic.
The stashed data is packaged up safely, and won’t leak out or disappear after you upload it to Twitter.
But the data isn’t κρυπτός [krypTOS], meaning hidden or secret, at all – the bloated size of the image file is a dead giveway.
In other words, if you are looking for a steganographic technique that actually conceals your data as well as containing it – the way that the word steganography is usually used these days – then be careful.
This technique is not for you if your need is to hide your data in plain sight, and might even attract additional scrutiny if you are spotted using it in real life.

How to fit all of Shakespeare in one tweet (and why not to do it!)

Hiding in images and sound

Hiding publicly on Twitter

Should you do it?

Paul Ducklin

Read Similar Articles

What to expect when you’ve been hit with Avaddon ransomware

What’s New in Sophos EDR 4.0

Sophos XDR: Driven by data

1 Comment

Ed