Researchers at Elon Musk’s AI think tank OpenAI have created what amounts to a text version of a deepfake, and they are too worried about misuse to release the full version.
Its AI writing tool generates reasonable-looking text on a wide range of subjects. It is based on the organization’s research into predicting the next word in a sequence of text, as it explains in a blog post on the topic. The tool takes a sample piece of text written by a human and then writes the rest of an article, producing dozens of sentences from a single introductory phrase.
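To make the idea of next-word prediction concrete, here is a minimal sketch. It is not OpenAI’s method: the real tool is a large neural network, whereas this toy swaps in a simple bigram table that records which word followed which in a tiny sample text. The function names `train_bigrams` and `generate`, and the sample corpus, are invented for illustration.

```python
import random
from collections import defaultdict


def train_bigrams(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, prompt, max_words=20, seed=0):
    """Extend the prompt one word at a time by sampling an observed follower."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = prompt.split()
    if not output:  # nothing to continue from
        return ""
    for _ in range(max_words):
        candidates = followers.get(output[-1])
        if not candidates:  # the last word was never followed by anything
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


# Hypothetical toy corpus; the real system learned from far more text.
corpus = "the train left the station and the train was late"
model = train_bigrams(corpus)
print(generate(model, "the train", max_words=5))
```

Like the real tool, the sketch only continues text in a way that resembles what it has seen. Nothing in it checks whether the result is true.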
The tool doesn’t discriminate between topics. Instead, it uses over 40GB of text gathered from the internet to help it produce convincing-sounding copy on anything from Miley Cyrus to astrophysics.
The problem is that while the copy sounds convincing, all the facts in it are fabricated. The tool writes names, facts and figures effectively synthesized from something that the system read online. It’s like an electronic version of that old school friend who you regrettably accepted a Facebook invitation from and who now keeps writing bizarre posts with ‘alternative facts’. For example, it takes the following phrase…
A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.
…and builds an entire news story around a fictional event. It fabricates a quote from Tom Hicks, who it says is the US Energy Secretary. At the time of writing, that role is occupied by Rick Perry.
OpenAI built the training data set, consisting of eight million web pages, by scraping outbound links posted on Reddit that received at least three karma (the site’s reward for popular content). The researchers were not necessarily looking for truth here, so much as interesting text that was either educational or funny.
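The selection step described above can be sketched as a simple filter. This is an illustration only, not OpenAI’s code, which was not released: the `Submission` record, the `select_links` function and the example URLs are hypothetical, and the threshold assumes OpenAI’s stated cut-off of at least three karma.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical record for a link posted on Reddit."""
    url: str
    karma: int


MIN_KARMA = 3  # assumed cut-off: keep links with at least this much karma


def select_links(submissions, min_karma=MIN_KARMA):
    """Return unique URLs that meet the karma threshold, in first-seen order."""
    seen = set()
    selected = []
    for submission in submissions:
        if submission.karma >= min_karma and submission.url not in seen:
            seen.add(submission.url)
            selected.append(submission.url)
    return selected


# Made-up sample data for illustration.
sample = [
    Submission("https://example.com/funny-story", 12),
    Submission("https://example.com/ignored-post", 1),
    Submission("https://example.com/funny-story", 4),  # duplicate link
    Submission("https://example.com/explainer", 3),
]
print(select_links(sample))
```

Note that the filter measures popularity, not accuracy, which is why the resulting text can be interesting without being true.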
The tool is also good at reading, understanding, summarizing and answering questions about text, along with translating.
This isn’t going to replace factual reporting anytime soon (phew), but it could automate some darker things online. It’s an article spinner’s dream, and as OpenAI points out, it could easily be used to write fake Amazon reviews by the thousand.
Perhaps the most worrying use case is the production of fake news via social media and blog posts. Marry it with other forms of deepfake, such as the fake faces generated by the recently launched ThisPersonDoesNotExist (built on NVIDIA’s StyleGAN) and deepfake video and audio, and you have the makings of an automated disinformation-spewing social media machine.
OpenAI realises this. It says:
These findings, combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns. The public at large will need to become more skeptical of text they find online, just as the “deep fakes” phenomenon calls for more skepticism about images.
No wonder the researchers decided not to release the fully trained model. Instead, they released a much smaller version, together with the sampling code needed to run it. They didn’t release the broader 40GB dataset or the code used to train the model. However, reproducing what they did is only a matter of time, they admitted:
We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems.
That’s the problem in a world where knowledge – or the power to get it – is easily distributed. Secrets are difficult to keep. And with computing power increasingly cheap, AI’s processor-intensive training is becoming easier to reproduce.
Andy Loates
How do we know this article was actually written by Danny Bradbury, and not this new ‘tool’?
(jus’ askin’ . . .)
delayedthoughtengineering
I think that’s part of the problem, and why this system has been compared to “deepfakes”. You wouldn’t really know, unless the software’s output is so different from the real writer’s style, that the style difference could be used reliably as a “fingerprint”.
Then again, given enough writing examples from the original writer to analyze in the AI learning engine, the AI text generator could eventually adopt the original writer’s style, so the AI could generate copy while simulating the original writer’s style, thereby targeting the writer.
Steve
We need to see him reading it in a video to prove it’s really him. Oh, wait…….
Danny Bradbury
You wouldn’t be the first reader to tell me I was a bit of a tool…
Anonymous
How do we know these comments are not from bots?
David C.
When people say “don’t believe everything you read”, I frequently reply sarcastically with “I don’t believe anything I read”. We’re quickly coming to the point where it won’t be sarcastic anymore. Which is sad.
Wilbur
“The researchers were not necessarily looking for truth here, so much as interesting text that was either educational or funny.”
Odd, I would have thought “educational” material would be truthful (factual), but apparently not.
Danny Bradbury
Boolean logic applies. If they were only looking for material that was educational, then they would be looking for truth. If they’re looking for material that is educational OR funny, then truth isn’t a priority.
# Selection criteria applied to the text gathered from Reddit
criteria = {"funny", "educational"}
# Truth is a priority only when "educational" is the sole criterion
truth_is_priority = (criteria == {"educational"})
print(truth_is_priority)  # False
Damo
lol, this is awesome. IMO AI will require us to define what is human, a step closer. Not sure what they’re worried about, publicity stunt? Doesn’t sound like this will be hard to reverse lookup and label ‘fake’… like SPAM hasn’t stopped people using email. Probably more useful for social engineering and phishing; a system’s weakest point is becoming the space between the seat and the keyboard. So now they will build AI blockers into browsers and search engines? It’s like whitehat/blackhat AI, where we need AI to fight AI, initially to save time recognising the fake. Like using AI to verify authenticity of a login system first, then allow the more easily tricked human; AI is more ‘trustworthy’ than a human… (when interacting with the computers)
John Griffith
It is a Brave New World, and Welcome to it.