People are gaga for voice recognition apps: A new study from Juniper Research has found that smart speakers such as Amazon Echo and Google Home will be installed in over 70 million US households by 2022, reaching 55% of all homes.
Imagine all those voice-recognizing apps, ordering our pizzas and sending our Amazon Christmas presents to Grandma.
And allowing us to conduct phone banking, of course. Voice recognition is the darling of a number of banks that have opted to fight fraud by using voice prints: supposedly unique biometric markers that can be made up of more than 100 characteristics, based on the physical configuration of the speaker’s mouth and throat.
Why “supposedly”? Well, for one thing, they’ve been outfoxed at least once that we know of: a BBC reporter’s non-identical twin outsmarted a bank’s voice recognition system, logged in to his brother’s account and viewed his balances. He was also offered the chance to transfer money between accounts.
For another thing, it turns out that you can fool voice recognition systems by squeaking like a little kid or lowering your tone like an old person. Sure, you might well be a known fraudster who’s been probing a targeted bank’s voice authentication system, but try out your little-kid impression, and behold! That pesky you’re-a-known-crook voice print turns into a brand new, unrecognized, clean-slate voice print.
The finding that voice impersonators can fool speaker recognition systems comes from the University of Eastern Finland.
For her doctoral dissertation, researcher Rosa González Hautamäki and her team analyzed speech from two professional impersonators who mimicked eight Finnish public figures. According to the abstract of the team’s paper, which is titled “Acoustical and perceptual study of voice disguise by age modification in speaker verification,” impersonators were able to fool automatic systems and listeners when mimicking some speakers:
Skillful voice impersonators are able to fool state-of-the-art speaker recognition systems, as these systems generally aren’t efficient yet in recognizing voice modifications.
The team also studied voice disguise using acted speech from 60 Finnish speakers who took part in two recording sessions. The speakers were asked to modify their voices to fake their age, attempting to sound like an old person and like a child. With this acted speech, the researchers found that sounding like a child was especially effective at confounding the automatic systems: recognition performance degraded under that type of voice camouflage.
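To see why an age disguise causes trouble, it helps to picture how a typical automatic speaker verification system decides: it boils an utterance down to a fixed-length “voice print” embedding and compares it with the one enrolled for the claimed speaker, accepting the caller only if the similarity score clears a threshold. The Python sketch below is a simplified illustration of that decision step, not the researchers’ system; `embed()` is a hypothetical stand-in for whatever model produces the embeddings, and the 0.75 cosine threshold is an arbitrary assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled: np.ndarray, test: np.ndarray, threshold: float = 0.75) -> bool:
    """Accept the caller only if the test utterance scores above the
    threshold against the enrolled voice print (threshold is illustrative)."""
    return cosine_similarity(enrolled, test) >= threshold

# embed() is a placeholder for the bank's (unknown) embedding model:
#   enrolled  = embed("customer_enrollment.wav")
#   disguised = embed("same_person_childlike_voice.wav")
#   verify(enrolled, disguised)
# A child-like disguise pulls the test embedding away from the enrolled one,
# so scores crowd the threshold and the system starts making mistakes in both
# directions: rejecting genuine customers and losing track of voices it was
# supposed to flag.
```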
Modifying your voice is, of course, a common way to hide your identity, particularly in situations where face-to-face communication isn’t necessary. Phone banking comes to mind, but so too do prank calls and crimes such as blackmail or harassment. Or, say, telling Alexa where to deliver that pizza.
As voice commands for mobile devices become ever more widespread, more and more people expect to rely on the technology for authentication or public safety, the researchers point out. But misuse is going to rise right along with the popularity of voice apps, they say.
We’ve already seen voice attacks against speaker recognition accomplished through purely technical means. For example, in September 2017, researchers demonstrated that Siri – along with every other voice assistant they tested – will respond to commands that don’t come from a human at all. The commands aren’t just outside the human vocal range; they’re completely inaudible to humans.
The laundry list of what they were able to get Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana and Amazon’s Alexa to do by sending ultrasonic voice commands, at frequencies of more than 20 kHz, is eye-popping.
We’re talking about the capability to trick any of those voice assistants into visiting a malicious website, which could then launch a drive-by download attack or exploit the device with 0-day vulnerabilities; to spy on the user by initiating outgoing video or phone calls, thereby getting access to the sights and sounds around the device; or to conceal an attack by dimming the screen and lowering the device’s volume, for example.
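The underlying trick, as described in that research, is to amplitude-modulate an ordinary spoken command onto a carrier above 20 kHz: the emitted signal is inaudible, but non-linearities in the target device’s microphone hardware demodulate the command back into the audible band, where the assistant processes it as if it had been spoken aloud. The NumPy sketch below shows only the modulation step, as a conceptual illustration; the 192 kHz sample rate, 25 kHz carrier and scaling are assumptions, and a real attack also depends on specialised ultrasonic speaker hardware.

```python
import numpy as np

FS = 192_000          # sample rate high enough to represent an ultrasonic carrier (assumed)
CARRIER_HZ = 25_000   # carrier above the ~20 kHz limit of human hearing (assumed)

def modulate_ultrasonic(voice: np.ndarray, fs: int = FS,
                        carrier_hz: float = CARRIER_HZ) -> np.ndarray:
    """Amplitude-modulate a baseband voice command onto an ultrasonic carrier.

    `voice` is a mono command already resampled to `fs` and normalised to
    [-1, 1]. The output carries the command in its envelope, which a
    microphone with a non-linear front end can recover.
    """
    t = np.arange(len(voice)) / fs
    carrier = np.cos(2 * np.pi * carrier_hz * t)
    # Conventional AM: the DC offset keeps the envelope non-negative so the
    # command survives envelope (non-linear) demodulation intact. In practice
    # the command is also band-limited so its sidebands stay above hearing range.
    return 0.5 * (1.0 + voice) * carrier
```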
The Finnish researchers note that other technical voice attacks include voice conversion, speech synthesis and replay attacks. The scientific community is “systematically developing techniques and countermeasures against technically generated attacks,” they say.
But voice attacks via voice modification? Produced by humans?
Those aren’t easy to detect, they say.