Sophos News

Facebook: It’s too tough to find personal data in our huge warehouse

On 25 May 2018, the EU’s General Data Protection Regulation (GDPR) became enforceable.
Mind you, the law itself had actually been in place for more than two years. The game changer: as of May, people could now demand that organizations hand over the data they hold on them – via subject access requests (SARs) – for free.
…which is how technology policy researcher Michael Veale, of University College London, wound up banging on the door of Facebook’s data warehouse.
As The Register reports, Veale submitted an SAR to the platform on 25 May, asking for whatever data it had collected on his browsing behavior and activities away from Facebook.
Facebook’s response: to slam the door in his face. Sorry, it told Veale: it’s too tough to find your information in our ginormous data warehouse.
That’s not going to fly, Veale has argued, given that the information Facebook picks up can be used to suss out highly personal information about somebody, including their religion, medical history or sexuality… and that goes for both Facebook users and non-Facebookers alike.
In particular, we’re talking about data scooped up by Facebook Pixel: a tiny but powerful snippet of code embedded on many third-party sites that Facebook has lauded as a clever way to serve targeted ads to people, including non-members.
Veale is taking the matter up with the Irish Data Protection Commissioner (DPC), given that Facebook’s European headquarters are in Ireland.
The Irish DPC has launched an inquiry into the matter, telling Veale that the case will likely be referred to the European Data Protection Board, given that it involves cross-border processing.
In the complaint, which he shared with The Register, Veale seeks to find out whether Facebook holds web history on him that pertains to medical domains and sexuality, two areas where Facebook is known to do highly targeted marketing:

Both of these concerns have been triggered and exacerbated by the way in which the Facebook platform targets adverts in highly granular ways, and I wish to understand fair processing.

Veale says he has used the tools Facebook offers the public for finding out what it knows about us, such as Download Your Information and Ads Preferences. Whichever of those tools he tried, they proved “insufficient,” he said.
As Mark Zuckerberg repeatedly said over two days of testimony before the US Congress in April, and as Facebook reiterated in a “Hard Questions” blog post in the aftermath of that question-fest, Facebook uses the data it collects, even when users aren’t on Facebook, to improve safety and security and to improve its own and its partners’ products and services.


But unlike Google, which offers a tool to see what it knows about us, Facebook earlier this year told activist Paul Olivier Dehaye that it can’t hand over some of the data it holds on users.

We’re all stuck in the Hive

According to an emailed response from Facebook that Dehaye shared with the UK House of Commons digital committee, he had asked what ads he saw as a result of advertisers’ use of Facebook’s Custom Audiences product. He also asked what data Facebook got on him via Facebook Pixel on third-party sites. That data isn’t available through Facebook’s self-service tools because it’s tucked away in a Hive data warehouse.
The Hive data is kept separate from the relational databases that power the Facebook site, Facebook told him, and is primarily organized by hour, in log format. The warehouse is vast and stuffed with people’s personal data, but that data is way too hard to get at, Facebook said, and if everybody lines up to ask for theirs, the company will blow a gasket.
The data isn’t indexed by user, Facebook explained. In order to extract one user’s data from Hive, each partition would need to be searched for all possible dates in order to find any entries relating to a particular user’s ID.
From the company’s response to Dehaye:

Facebook simply does not have the infrastructure capacity to store log data in Hive in a form that is indexed by user in the way that it can for production data used for the main Facebook site.
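
To make the trade-off Facebook describes a little more concrete, here is a minimal Python sketch of the general idea. It is not Facebook’s actual schema or code: the hourly partitions, the (user ID, URL) rows, and the function names are all made up for illustration. It contrasts scanning every hourly partition for one user’s entries with looking that user up in a per-user index, which answers the same question directly but has to be built and stored.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_partitions():
    """Hypothetical log store: one partition per hour, rows of (user_id, url)."""
    start = datetime(2018, 5, 25, 0)
    rows = [
        (0, "u1", "news.example/a"),
        (0, "u2", "shop.example/b"),
        (1, "u2", "health.example/c"),
        (2, "u1", "travel.example/d"),
        (2, "u3", "news.example/e"),
    ]
    partitions = defaultdict(list)
    for hour_offset, user_id, url in rows:
        partitions[start + timedelta(hours=hour_offset)].append((user_id, url))
    return dict(partitions)


def scan_for_user(partitions, user_id):
    """Full scan: read every row of every hourly partition, keep the matches."""
    hits, rows_read = [], 0
    for hour in sorted(partitions):
        for row_user, url in partitions[hour]:
            rows_read += 1
            if row_user == user_id:
                hits.append((hour, url))
    return hits, rows_read


def build_user_index(partitions):
    """Per-user index: user_id -> [(hour, url), ...]. Costs extra storage."""
    index = defaultdict(list)
    for hour in sorted(partitions):
        for row_user, url in partitions[hour]:
            index[row_user].append((hour, url))
    return dict(index)


partitions = build_partitions()
hits, rows_read = scan_for_user(partitions, "u1")
index = build_user_index(partitions)

print(f"scan read {rows_read} rows to find {len(hits)} entries for u1")
print(f"index lookup returns {len(index['u1'])} entries for u1 directly")
```

In this toy version the scan reads every row in the store to answer a request about a single user, while the index answers it in one lookup at the cost of keeping a second, user-keyed copy of the log entries. Whether that cost is prohibitive at Facebook’s scale is exactly what is in dispute.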

As Dehaye points out, Facebook’s claims mean that as its user base grows, its data protection obligation “effectively decreases, as a result of deliberate architecture choices.”
Likewise, Veale isn’t buying Facebook’s argument. He pointed out that those who research Big Data have already clearly established that even if such data isn’t stored alongside a user ID, web browsing histories can be linked to individuals using only publicly available data. Toss machine learning into the mix, and even more patterns begin to emerge, he told The Register, including information on sexuality, purchasing habits, health information or political leanings:

Web browsing history is staggeringly sensitive.
Any balancing test, such as legitimate interests, must recognize that this data is among the most intrusive data that can be collected on individuals in the 21st century.

He told The Register that he wants to debunk the notion that it’s beyond the technical wherewithal of Facebook – or of any other online platform, for that matter – to handle requests like his:

I hope to refute emerging arguments that the data processing operations of big platforms relating to tracking are too big or complex to regulate.
By choosing to give user-friendly information (like ad interests) instead of the raw tracking data, it has the effect of disguising some of its creepiest practices. It’s also hard to tell how well ad or tracker blockers work without this kind of data.