In April, the power of online genealogy databases to help track down and identify people became clear.
That’s when police arrested Joseph James DeAngelo on suspicion of being the Golden State Killer: the man allegedly responsible for more than 50 rapes, 12 murders and more than 120 burglaries across the state of California during the 1970s and 1980s.
Investigators had collected and stored DNA samples from the crime scenes over the years. They ran the genetic profile they derived from those samples through an online genealogy database and found it matched with what turned out to be distant relatives – third and fourth cousins – of whoever left their DNA at the crime scenes.
Getting a match with the database’s records helped investigators to first locate DeAngelo’s third and fourth cousins. The DNA matches eventually led to DeAngelo himself, who was arrested on six counts of first-degree murder.
DeAngelo himself hadn’t submitted a DNA sample to any of the numerous online genealogy sites, such as 23andMe or AncestryDNA. Rather, the search was made possible by relatives whose genetic makeup was similar enough to that of whoever left their saliva on something at a crime scene.
The more people who submit DNA samples to these databases, the more likely it is that any of us can be identified. According to new research published in the journal Science, the US is on track to have so much DNA data on these databases that 60% of searches for individuals of European descent will result in a third cousin or closer match, which can allow their identification using demographic identifiers.
Given the rate at which individuals are uploading genetic samples to sites that analyze their DNA, “nearly any US individual of European descent” will be identifiable in the near future, researchers say.
The study comes out of Columbia University. Its lead researcher was computer scientist Yaniv Erlich, who’s also chief science officer at MyHeritage: a DNA testing and family history company.
To test the growing power of the online DNA sites, the Columbia University researchers set out to see whether they could find a person’s name and identity starting from nothing more than a piece of DNA and a small amount of biographical information. People upload full genomes to these sites so that powerful computers can crunch through them, searching for stretches of matching DNA sequences that can be used to build out a family tree.
They note that third-party services, such as DNA.Land and GEDmatch, now allow participants to upload their raw genotype files for further analysis, and nearly all of those services offer to find genetic relatives by locating identity-by-descent (IBD) segments that can indicate a shared ancestor. That can lead to matches with distant relatives, such as second or third cousins, and has led to “success stories” such as reunions of adoptees with their biological families.
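To give a rough feel for what “locating shared segments” means, here is a toy sketch in Python. It is not how DNA.Land or GEDmatch actually work: real services measure segment length in genetic distance and apply statistical filters, whereas this example simply looks for runs of consecutive SNPs where two people share at least one allele. The function names, the `min_len` threshold and the sample genotypes are all made up for illustration.

```python
from typing import List, Tuple


def shares_allele(g1: str, g2: str) -> bool:
    """True if two unphased genotypes (e.g. 'AG' and 'GG') share at least one allele."""
    return bool(set(g1) & set(g2))


def candidate_segments(a: List[str], b: List[str], min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_len consecutive SNPs where the two people share an allele.

    Toy stand-in for IBD detection: long runs hint at a shared ancestor,
    short runs are likely to be chance matches.
    """
    segments = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if shares_allele(a[i], b[i]):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and n - start >= min_len:
        segments.append((start, n))
    return segments


# Hypothetical genotypes at eight consecutive SNP positions.
person_a = ["AA", "AG", "GG", "CT", "TT", "CC", "AG", "AA"]
person_b = ["AG", "GG", "AG", "CC", "AA", "GG", "AA", "AG"]

print(candidate_segments(person_a, person_b))  # [(0, 4)]
```

The first four positions form one run; the two matching positions at the end are too short to count under the chosen threshold.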
Law enforcement has been far from oblivious to the potential of finding people through distant genetic relatives, as the case of the Golden State Killer’s arrest shows. You can see why investigators would prefer to use these genetic databases rather than forensic databases, which can only identify close (1st or 2nd degree) relatives and which are highly regulated, the researchers note.
Besides the Golden State Killer case, at least 13 other cases were reportedly solved by long-range familial searches between April and August 2018. One forensic DNA company, Parabon NanoLabs, has also announced a division that will use long-range familial searches. The company has already uploaded 100 cold cases to third-party services, pointing to the potential for long-range familial searches to become a standard investigative tool, the researchers said.
The study started with a full DNA sequence from somebody whose genetic information was published anonymously as part of an unrelated scientific study. The researchers had actually identified her in a previous study, but for the purposes of this study, they pretended they didn’t know who she was.
Erlich’s team uploaded her genetic code to GEDmatch and ran a search to see if they could turn up any relatives. They found two: one from North Dakota and one from Wyoming.
The amount of DNA those two shared with her translated into a distant match, as in, six to seven degrees of separation. But the two relatives also shared DNA showing that they were distantly related to each other via an ancestral couple who lived four to six generations ago.
The researchers could tell that all these people were related because they shared a number of single nucleotide polymorphisms, or SNPs. These are spots in the genome where a single letter commonly differs from one person to the next, among the roughly 3 billion As, Cs, Ts and Gs that make up the human genome: adenine (A), cytosine (C), guanine (G) and thymine (T). Uracil (U) takes thymine’s place in RNA, not DNA.
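As a simplified illustration of comparing SNPs, the Python sketch below parses two small genotype files and reports the fraction of shared SNP positions where both people carry the same pair of letters. It assumes a whitespace-separated layout of SNP identifier, chromosome, position and genotype, with `#` marking comment lines; the identifiers and values are invented, and real relative-matching is considerably more sophisticated than this raw count.

```python
from typing import Dict


def parse_genotypes(text: str) -> Dict[str, str]:
    """Map SNP identifier -> genotype, skipping blank lines and '#' comments.

    Assumed line layout (hypothetical): id, chromosome, position, genotype.
    """
    genotypes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        snp_id, _chrom, _pos, genotype = line.split()
        genotypes[snp_id] = genotype
    return genotypes


def identical_fraction(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Fraction of SNPs present in both files with the same unordered genotype."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    # 'AG' and 'GA' describe the same genotype, so compare sorted letters.
    same = sum(1 for snp in common if sorted(a[snp]) == sorted(b[snp]))
    return same / len(common)


# Made-up sample data; the identifiers do not refer to real SNPs.
SAMPLE_A = (
    "# id\tchromosome\tposition\tgenotype\n"
    "rs0001\t1\t1000\tAG\n"
    "rs0002\t1\t2000\tCC\n"
    "rs0003\t2\t1500\tTT\n"
    "rs0004\t2\t2500\tGA\n"
)
SAMPLE_B = (
    "rs0001\t1\t1000\tGA\n"
    "rs0002\t1\t2000\tCT\n"
    "rs0003\t2\t1500\tTT\n"
    "rs0005\t3\t100\tAA\n"
)

fraction = identical_fraction(parse_genotypes(SAMPLE_A), parse_genotypes(SAMPLE_B))
print(f"{fraction:.2f}")  # 0.67: two of the three shared SNPs match
```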
Within about an hour of work, the researchers identified who the ancestral couple was, using publicly available genealogical records. They then searched for the couple’s descendants who would match the demographic data – for example, what were their expected years of birth?
That wasn’t a trivial step, given that the ancestral couple had 10 children and hundreds of grandchildren and great-grandchildren. But after a full day of work, the scientists eventually traced the identity of their target, who was the same person they had previously re-identified, based on surname inference from the Y-chromosome, which is one of several genetic re-identification tactics.
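The narrowing-down step can be pictured as a simple filter over a family tree. The Python sketch below is purely illustrative: the `Descendant` record, its fields and the example people are hypothetical, and the researchers’ actual process relied on manual work with genealogical records rather than a script like this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descendant:
    name: str        # placeholder label, not a real person
    birth_year: int
    sex: str         # "F" or "M"


def filter_candidates(
    descendants: List[Descendant],
    earliest: int,
    latest: int,
    sex: Optional[str] = None,
) -> List[Descendant]:
    """Keep descendants born within [earliest, latest] and, if given, of the stated sex."""
    return [
        d for d in descendants
        if earliest <= d.birth_year <= latest and (sex is None or d.sex == sex)
    ]


# Invented descendants of a hypothetical ancestral couple.
tree = [
    Descendant("Person 1", 1921, "M"),
    Descendant("Person 2", 1948, "F"),
    Descendant("Person 3", 1952, "M"),
    Descendant("Person 4", 1979, "F"),
]

matches = filter_candidates(tree, earliest=1940, latest=1960, sex="F")
print([d.name for d in matches])  # ['Person 2']
```

Each extra demographic detail shrinks the candidate list, which is why a small amount of biographical information goes a long way once the right family tree has been found.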
The takeaway
The study’s success in identifying individuals was based on the use of the genomic data of 1.28 million individuals tested with consumer genomics – a data set that grows bigger every day as more people submit their DNA samples to these services.
The researchers suggest that we need to re-evaluate how we use this powerful data. Law enforcement, policy makers and even the general public may well be in favor of using these “enhanced forensic capabilities” for solving crimes, but we need to keep in mind that these databases and services are open to everyone, and not everyone will use them with good intentions.
For example, research subjects can be re-identified from their genetic data. Yet rules that, starting in 2019, will regulate federally funded human subject research fail to define genome-wide genetic datasets as “identifiable” information.
The researchers say that their work shows that such datasets are indeed capable of identifying individuals. That’s why they’re encouraging US Health and Human Services (HHS) to rethink that classification.
To better protect our genomes, Erlich and his team are proposing that the text files of raw genetic data be cryptographically signed:
Third-party services will be able to authenticate that a raw genotyping file was created by a valid [direct-to-consumer] provider and not further modified. If adopted, our approach has the potential to prevent the exploitation of long-range familial searches to identify research subjects from genomic data. Moreover, it will complicate the ability to conduct unilaterally long-range familial searches from DNA evidence. As such, it can complement previous proposals regarding the regulation of long-range familial searches by law enforcement and offers better protection in cases where the law cannot deter misuse.
Erlich and his team have uploaded demo source code to GitHub that can be used to sign and verify the raw genotype files using a previously published digital signature scheme.
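The researchers’ demo code is not reproduced here. As a rough sketch of the idea of tamper detection on a raw genotype file, the Python example below uses HMAC-SHA256 from the standard library. This is a stand-in, not the published scheme: HMAC relies on a shared secret key, whereas a true digital signature lets a third-party service verify a file with a public key without being able to sign files itself. The function names, key and file contents are invented for illustration.

```python
import hashlib
import hmac


def sign_genotype_file(data: bytes, key: bytes) -> str:
    """Return a hex authentication tag over the file's bytes.

    Stand-in for a provider's signature; uses a shared secret (HMAC),
    not a public-key signature scheme.
    """
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_genotype_file(data: bytes, tag: str, key: bytes) -> bool:
    """True only if the tag matches the file bytes exactly (constant-time compare)."""
    expected = sign_genotype_file(data, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical provider key and file contents.
provider_key = b"demo-key-not-for-real-use"
raw_file = b"rs0001\t1\t1000\tAG\nrs0002\t1\t2000\tCC\n"

tag = sign_genotype_file(raw_file, provider_key)

print(verify_genotype_file(raw_file, tag, provider_key))                 # True
print(verify_genotype_file(raw_file + b"edited", tag, provider_key))     # False
```

Any change to the file, however small, makes verification fail, which is the property that would let a service reject genotype files that did not come unmodified from a recognized provider.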
This is the clarity and data protection we need to keep genomics going, they wrote:
Overall, we believe that technical measures, clear policies for law enforcement in using long-range familial searches, and respecting the autonomy of participants in genetic studies are necessary components for long term sustainability of the genomics ecosystem.