The Big Data picture – just how anonymous are “anonymous” records?

You know those "anonymous surveys" you may have filled in? They don't tell anyone it was you. Or do they?

On Naked Security we regularly write about, or at least make mention of, something called Big Data.

It’s what I think of as an “anything and everything” term.

There are things that obviously aren’t big data, like the modest collection of pictures you took of your cat until you snapped one that made a decent computer wallpaper image.

And there are things that obviously are big data, like the giant database of Wi-Fi access points from Google’s StreetView cars that it uses to aid and abet its geolocation services.

Of course, even your cat pictures – the ones that were captured with a single short press of the BURST option on your new iPhone – probably take up several times more storage than your first computer had in total.

But they fail to make the cut as “big data” not only because they’re small by modern standards, but also because you can’t dissect, compare and contrast them to look for patterns in the whole cat world, and from that to draw inferences about one particular cat in the database.

Now, if you had pictures of 1,000,000 different cats, organised by location, that would be big data.

Big Data versus privacy

Clearly, the words “big data” ought to raise privacy concerns.

Automatic Number Plate Recognition (ANPR) cameras are a good example, because your plate number stays constant while your location changes.

One ANPR data point might bust you for speeding, or running a red light, which is fair enough.

But put a city’s worth – or a State’s worth, or, heck, why think small, a whole country’s worth – of ANPR data together, and you have intrusive surveillance, especially if the data includes everyone, even drivers who haven’t come near breaking any laws.

You can nevertheless argue that, even though raw ANPR dumps on their own may be “big data,” they are essentially anonymous.

Therefore they are safe, and possibly valuable, to make available broadly:

State/Plate  Date        Time      Location
-----------  ----------  --------  -------------------------
NSW  NSG123  2014-12-01  11:54:11  Harbour Bridge N Approach 
QLD  556ARX  2014-12-01  11:54:14  Harbour Bridge N Approach
QLD  189BBQ  2014-12-01  11:54:17  Cahill Expressway
NSW  BA45MO  2014-12-01  11:54:22  Lang Park
NSW  AM99WA  2014-12-01  11:54:23  Harbour Bridge N Approach
VIC  RST776  2014-12-01  11:54:32  Carrington St  
NSW  XLR8    2014-12-01  11:54:33  Lang Park
NSW  BA45MO  2014-12-01  11:54:34  Carrington St 
NSW  44BSD   2014-12-01  11:54:37  Lang Park

After all, unless you also have a database to turn the plate numbers into vehicle owners, all you know is that a car, some car but the same car, plated BA.45.MO, passed Lang Park and made it into Carrington Street within 12 seconds. (You could do it, but you’d need the hammer down.)

So for things like planning road safety measures, predicting traffic volumes, helping fuel companies decide where to build petrol stations, and so on, perhaps ANPR “big data” is OK, and useful, to release?

In fact, you could go one step further so that no-one, not even the vehicle licensing agencies, could actually work out which cars were there:

Hash    Date        Time      Location
------  ----------  --------  -------------------------
OEERIB  2014-12-01  11:54:11  Harbour Bridge N Approach 
7K5NR5  2014-12-01  11:54:14  Harbour Bridge N Approach
IFQS8K  2014-12-01  11:54:17  Cahill Expressway
ZJXJUN  2014-12-01  11:54:22  Lang Park
CPU069  2014-12-01  11:54:23  Harbour Bridge N Approach
6VJNJU  2014-12-01  11:54:32  Carrington St  
GG38UB  2014-12-01  11:54:33  Lang Park
ZJXJUN  2014-12-01  11:54:34  Carrington St 
6MBHSI  2014-12-01  11:54:37  Lang Park

→ If you ever need to do this sort of anonymisation, a salting-and-hashing system, like you might use for passwords, can help. But don’t make a hash of it and leave the data at risk of an attack that works backwards to the original plate data, like New York City did with its cab drivers.
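As a rough sketch of the idea, and not the method any real agency uses, here is one way to derive tags with a keyed hash (HMAC), which has the same effect as a secret salt shared by every record. The names SECRET_KEY and plate_tag are made up for illustration. The key matters because the space of valid plates is small enough to enumerate: without a secret, anyone could hash every possible plate and work backwards.

```python
import base64
import hashlib
import hmac
import secrets

# Assumption: one secret key for the whole data set, generated once and
# stored well away from the published file.
SECRET_KEY = secrets.token_bytes(32)


def plate_tag(state, plate, key=SECRET_KEY, length=12):
    """Turn a state and plate into a short tag using HMAC-SHA256."""
    message = f"{state}:{plate}".upper().encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    # Base32 keeps the tag to upper-case letters and the digits 2 to 7.
    # Shorter tags are easier to read but more likely to collide.
    return base64.b32encode(digest).decode("ascii")[:length]


# The same plate always maps to the same tag, so journeys still link up.
print(plate_tag("NSW", "BA45MO") == plate_tag("NSW", "BA45MO"))
```

Note that the tags stay linkable by design, which is exactly what makes the rest of this article possible.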

How random is random?

Of course, even the “randomly-assigned identifier” approach has some problems.

Let’s say I happen to know for sure that you turned onto the Cahill Expressway at 11:54:17 on the given date – perhaps I was tailing you in the car behind, or was able to match your car up with a CCTV camera feed of my own.

I can now assume that your anonymous tag is IFQS8K, and track you throughout the rest of the database.
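Here is a minimal sketch of that attack, using the made-up rows from the hashed table above. The Sighting type and the function names tags_seen and track are invented for this example.

```python
from typing import NamedTuple


class Sighting(NamedTuple):
    tag: str
    date: str
    time: str
    location: str


# The made-up sample rows from the hashed ANPR table.
SIGHTINGS = [
    Sighting("OEERIB", "2014-12-01", "11:54:11", "Harbour Bridge N Approach"),
    Sighting("7K5NR5", "2014-12-01", "11:54:14", "Harbour Bridge N Approach"),
    Sighting("IFQS8K", "2014-12-01", "11:54:17", "Cahill Expressway"),
    Sighting("ZJXJUN", "2014-12-01", "11:54:22", "Lang Park"),
    Sighting("CPU069", "2014-12-01", "11:54:23", "Harbour Bridge N Approach"),
    Sighting("6VJNJU", "2014-12-01", "11:54:32", "Carrington St"),
    Sighting("GG38UB", "2014-12-01", "11:54:33", "Lang Park"),
    Sighting("ZJXJUN", "2014-12-01", "11:54:34", "Carrington St"),
    Sighting("6MBHSI", "2014-12-01", "11:54:37", "Lang Park"),
]


def tags_seen(sightings, date, time, location):
    """Tags recorded at exactly this date, time and place."""
    return {s.tag for s in sightings
            if (s.date, s.time, s.location) == (date, time, location)}


def track(sightings, tag):
    """Every row in the data set that carries the given tag."""
    return [s for s in sightings if s.tag == tag]


# One precise observation of my own is enough to learn your tag...
candidates = tags_seen(SIGHTINGS, "2014-12-01", "11:54:17", "Cahill Expressway")
# ...and the tag then unlocks every other row about you.
for tag in candidates:
    print(tag, track(SIGHTINGS, tag))
```

In a nine-row sample there is little else to find, but against a real dump the same two steps would return every journey the car made.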

That’s worrying, but the privacy risk is mitigated by the fact that I need a precise data point of my own in order to zoom in on you so precisely.

In other words, it seems as though only someone already keenly interested in me, who already has a good picture of my movements, could use the anonymised ANPR data to construct a good picture of my movements.

What about vague data?

And that raises the question, “If I don’t have precise data to get you in my sights, how much vague data would I need instead?”

And the answer, of course, depends entirely on the nature of the data, and your definition of vague.

For example, with an Australia-wide ANPR data dump, how many cars would show up in three different states some time in three consecutive weeks?

I don’t know the answer, but you can see where this is going: it’s all about intersecting sets.

Of the 150,000 cars that cross the Harbour Bridge each day, you might guess that no more than 1% also go on the Melbourne City Link in the same month.

Of that 1%, let’s assume that only 1% went on the Gateway Bridge in Brisbane as well. (I suspect the ratios are smaller than 1%, but let’s keep things simple.)

So even if all I know is that you happened to go on those three roads at some time in the last month, I’ve already pinned you down to one of just 15 cars!
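The arithmetic behind that figure is just repeated filtering, sketched below with the guessed percentages from the example above. The function name remaining is invented, and whole-number percentages are used to avoid floating-point rounding.

```python
def remaining(population, *percent_filters):
    """Apply each percentage filter in turn and return how many are left."""
    for percent in percent_filters:
        population = population * percent // 100
    return population


# 150,000 cars on the Harbour Bridge, of which 1% also use the Melbourne
# City Link, of which 1% also use the Gateway Bridge.
print(remaining(150_000, 1, 1))  # 15
```

Each extra fact I know about you is another filter, and the candidate pool can only shrink.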

Now add in a bit more detail, such as that you used the Gateway Bridge once and only once, and it was in the morning, because you blogged about getting the sun in your eyes…


That’s a made-up, theoretical example of deanonymising so-called “safe” big data.

What about the real world?

But can this sort of thing work in the real world?

It certainly can!

This paper [paywall] by a group of MIT graduate students shows you why.

It’s a tricky read, because it’s weighed down by jargon, and it’s written for a mildly technical audience.

But even a non-technical skim-read proves the point.

The authors started with three months of credit card data, which was an anonymised transaction log a bit like the made-up ANPR data we showed above.

They tried to “mine” it – to match up individuals with their anonymous transaction tags – using ever-less precise information about each transaction.

Note that this imprecision can be applied either to what you know about the individual you are tracking, or to the data as a whole.

The authors were particularly interested in the latter: how big a privacy-sapping problem would remain even if the data points in the original data were all made wildly imprecise to “assure” privacy?

For example, in the ANPR sample, perhaps the data would be rendered harmless to privacy if all it said was:

Hash    Date        City
------  ----------  ------------
7K5NR5  2014-12-01  NORTH SYDNEY
IFQS8K  2014-12-01  SYDNEY
ZJXJUN  2014-12-01  SYDNEY
CPU069  2014-12-01  NORTH SYDNEY
6VJNJU  2014-12-01  SYDNEY
GG38UB  2014-12-01  SYDNEY
ZJXJUN  2014-12-01  SYDNEY
6MBHSI  2014-12-01  SYDNEY

How vague is vague?

Your gut feeling might be that this sort of vagueness would inevitably stop you from working out who’s who in the data set, no matter how much data it contained.

But with the credit card data, our MIT authors found that vague can still be surprisingly precise.

For example, they “defocused” the payment card records in three ways. They:

  • Grouped each payment into the first half or the second half of the month. (Actually, a 15-day window.)
  • Grouped payments into collections of shops near to each other. (Each group had 350 shops counted as if they were one.)
  • Grouped price into a series of ranges. (As an example, prices from $5 to $16 were considered as one.)

In other words, if you bought a jam doughnut and a coffee from the snack shop at the ferry terminal on the 12th of the month, your transaction would look the same, apart from its anonymous tag, as someone who bought a ticket to Ryde at the train station on the 7th of the month.
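To make the three coarsening steps concrete, here is a small sketch; it is my own illustration, not the authors’ code. Only the $5 to $16 price band comes from the paper as described above. The other price edges, the shop group label, the prices and the month are all invented, and splitting each month at the 15th is a simplification of the 15-day window.

```python
import bisect
from datetime import date

# Assumption: only the 5 and 16 edges come from the article; the rest are invented.
PRICE_EDGES = [5, 16, 50, 150]

# Assumption: both shops fall into the same made-up group of nearby shops.
SHOP_GROUP = {
    "ferry terminal snack shop": "G17",
    "train station ticket office": "G17",
}


def coarsen(day, shop, price):
    """Reduce one transaction to (year, month, half, shop group, price band)."""
    half = 1 if day.day <= 15 else 2
    band = bisect.bisect_right(PRICE_EDGES, price)
    return (day.year, day.month, half, SHOP_GROUP[shop], band)


doughnut = coarsen(date(2014, 12, 12), "ferry terminal snack shop", 7.50)
ticket = coarsen(date(2014, 12, 7), "train station ticket office", 6.20)

# After coarsening, the two purchases are indistinguishable.
print(doughnut == ticket)  # True
```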

That’s pretty jolly vague, isn’t it?

Indeed, it’s vague enough that when the authors knew the details of any four transactions you’d made during the three-month data period (as any shop that you had visited four times would, for example), they had a less than 15% chance of guessing which anonymous tag in the file was yours.

But with 10 known transactions, something you might easily rack up with multiple retailers due to daily habits at a coffee shop, a parking lot, or a newsagent, their chance of pinpointing you rose above 80%.

Loosely speaking, the anonymous data they had access to, even when coarsened astonishingly, turned out to be not-so-anonymous after all.
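One simple way to picture why more known transactions help is to count how many anonymous tags are consistent with everything you know. The sketch below is a toy with invented records, not the authors’ actual method, and the name matching_tags is made up.

```python
from collections import defaultdict


def matching_tags(records, known_points):
    """Return the tags whose coarsened records include every known point."""
    seen = defaultdict(set)
    for tag, point in records:
        seen[tag].add(point)
    needed = set(known_points)
    return {tag for tag, points in seen.items() if needed <= points}


# Invented coarse records: (tag, (half of month, shop group, price band)).
RECORDS = [
    ("A", ("H1", "G17", 1)), ("A", ("H2", "G03", 2)),
    ("B", ("H1", "G17", 1)), ("B", ("H2", "G09", 1)),
    ("C", ("H1", "G02", 3)),
]

# One known purchase leaves two candidates...
print(matching_tags(RECORDS, [("H1", "G17", 1)]))
# ...but a second known purchase singles out one tag.
print(matching_tags(RECORDS, [("H1", "G17", 1), ("H2", "G03", 2)]))
```

When the set that comes back has exactly one tag in it, the “anonymous” record has a name.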

Interestingly, and I offer this without comment or interpretation, they claim to be able to guess the identity of women about 1.2x more accurately than men.

Likewise, rich people are allegedly about 1.75x easier to pinpoint than poor people.

Big Data matters

And that, my friends, is why Big Data matters.

I’m afraid that I don’t really know what to advise you, except to say that when someone claims they have “anonymised” something, they simply might not be sure.

You can’t rely on your gut feeling about just how anonymous it ended up; nor can they.

Even the vaguest-looking data might have your name in it, if only you know how to look.

So, stick to the advice we gave on Safer Internet Day: if in doubt, don’t give it out.

Image of ginger kitten courtesy of Shutterstock.

Image of anonymous cats courtesy of Shutterstock.

Image of speed camera available under CC BY-SA 2.0 licence.


Great article! I kept stumbling over the hashing, given that you’d have to have a hashing algorithm that’s not reversible yet has no collisions (of hashes, not cars) over a huge data set. However, even if you forget the math and just assign arbitrary numbers from 1 to whatever large number to each individual, and keep the number mapping super top secret, this shows you can still get quite a bit of information from an anonymized data set. For example, say an identifiable famous person makes some expensive public purchases, and then you notice the same credit card being used later that evening on porn sites….


I just mentioned “hashing” because that’s how the NYC taxi people did it, and they hashed the implementation. Then I thought “hash” would be an OK moniker for “random string of identification”, or what you might hear called a UUID.

If you never later need to be able to show (e.g. using a carefully stored secret key) that an anonymous tag was derived from a specific number plate in your data set, you don’t need a hash. Just generate a decent-quality random number of suitable length.

If you use long enough random strings, you won’t have collisions, although you could check for them if you really wanted to and re-generate any repeats.
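For what it’s worth, here is a minimal sketch of that random-tag approach, including the optional collision check. The function name assign_random_tags is invented, and the 16-byte tag length is just an assumption about what counts as “long enough”.

```python
import secrets


def assign_random_tags(plates, nbytes=16):
    """Give each distinct plate its own random tag, re-drawing any repeats."""
    tags = {}
    used = set()
    for plate in plates:
        if plate in tags:
            continue  # this plate already has a tag
        tag = secrets.token_hex(nbytes)
        while tag in used:  # vanishingly unlikely with 16 random bytes
            tag = secrets.token_hex(nbytes)
        used.add(tag)
        tags[plate] = tag
    return tags


# The mapping itself is the secret: keep it, or destroy it if you never
# need to link a tag back to a plate.
mapping = assign_random_tags(["BA45MO", "XLR8", "BA45MO"])
print(len(mapping))  # 2
```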

And, hey, if you did get collisions, the data’s only approximate eh? And no-one will be wronged as an individual by association with another vehicle if the data is truly anonymous :-)


You mentioned ANPR cameras with a picture of a Gatso Speed Camera alongside it. The Gatso does not have an ANPR capability; it doesn’t have any connections externally. It just takes pictures of vehicles (cars, motorbikes, lorries, buses, pedal cycles, etc) that are exceeding the limit applicable at that site by enough to trigger it into action. An ANPR camera needs a data connection to a computer system so that the validity, or otherwise, of the plate seen can be checked and a response triggered if there is a doubt about that vehicle.

Liked the rest of the article though, very thought provoking.


OK, that probably is only a speed camera. In fact, it might even be one of the UK models that still used film when everyone else in the world was pulling in 20x as many fines by using digital cameras. Apparently, the law required, and the courts would only prosecute with, an analog photo, until the infra-red models were ratified for use. (They can work front-facing because they don’t need a flash.)

But it’s an archetypal “law enforcement” camera. As visual communication goes, I think it’s a picture that’s worth 875 words, if not the full 1000. (And I already had the image, cropped, sized and licensed for immediate deployment.)

After all, every time you’ve driven past a speed camera since, well, since they first appeared and you knew what they were, you’ve wondered….*what else is that lens looking for*.

You have, haven’t you?

And, lo and behold, some GATSOs can, and many do, have ANPR, albeit as an optional extra.

Like the trendy STATIO model.

It’s basically a red light camera that also does speed enforcement, and has add-on modules for live video, ANPR, *and* something rather worryingly called “machine vision.” (I’m guessing that’s “facial recognition” in softer words.)

Oh, and it doesn’t have cables because it’s wireless.

You know you want one.

I’ll leave you with this.

How do you know those old GATSOs “that everybody knows only measure speed” haven’t been sneakily retrofitted with the latest modular hardware upgrades? *That* would be cunning, wouldn’t it: hiding ANPR units in plain sight!


I see some cars with licence plates covered by thick, almost opaque plastic, and trucks with plates covered by dirt. Maybe do something like this for your vehicle’s plates. Might work also for toll bridges and highways where the toll is calculated by the licence plate.

And there are devices that can block 3g and 4g wireless signals, like cell phone blockers. Put a couple of powerful ones in your car.


Also, deliberately obscuring your number plate is illegal, sooo….

Not sure about cell phone blockers, but they will certainly be annoying. And potentially life threatening. What if you have an accident, and people can’t call the emergency services? Oops…


Anonymous & Private

The argument never well made is why we should be so concerned about privacy at all.

People who do not grasp the general sense of the topic may be unreachably obtuse, however, the fact remains that the argument in a world full of glass houses appears to be lost on a lot of people who should be more cognizant of privacy concerns & how it impacts them. Capitulation and outright surrender to the status quo is dumb.

Seems to me that if a thing doesn’t need to be known – why share it?

Everybody screams like a stuck pig when they get clobbered personally, but shrug it off if “everybody’s doing it.” As though safer in a crowd of suckers.

Simple Example: Micro-targeting consumers alone is an important issue:
We depend on slop in the system to create swells & troughs, whether in the financial markets, or in individual purchase power.

When all store coupons are electronic & I, as the seller, know everyone’s price point, why would I offer you a 50% off coupon when I know I can definitively sucker you for 25% and pocket the difference?

(This is personally poignant, as I just got a store email declaring an upcoming “PREFERRED CUSTOMER WEEK, All customers welcome”).

If I know you like the color orange & kittens – I can sway your pocketbook by theming my advertisement around such detailed knowledge, perhaps getting you to buy something you really didn’t need or even want.

And that’s the most benign scenario.


