It’s easy enough to stalk kitty cats or track fugitives to, say, the jungles of Guatemala if you have photo EXIF data.
After all, EXIF data reveals, among other things, GPS latitude and longitude coordinates of where a photo was taken.
But really. EXIF data? Bah!
Enter Google. It don’t need no stinkin’ EXIF data.
Tobias Weyand, a computer vision specialist at Google, along with two other researchers, has trained a deep-learning machine to work out the location of almost any photo, going by its pixels alone.
To be fair, the learning machine did get trained, initially, on EXIF data.
Make that a huge amount of EXIF data: after all, imagine how many images Google can wrap its tentacles around.
It trained its system on 126 million of them.
The result is a new machine that significantly outperforms humans at determining the location of images – even images captured indoors, without geolocation giveaways such as palm fronds, street signs, billboards in the local language, or Niagara Falls misting away in the background.
Sites such as GeoGuessr and View From Your Window suggest that humans are pretty good at integrating clues to guess a photo’s geolocation: we pick up tips from landmarks, weather patterns, vegetation, road markings, and architectural details to figure out at least an approximate location, and sometimes even an exact one.
When computers try to figure it out, they’ve usually relied on image retrieval: matching a query photo against a big database of geotagged images and borrowing the location of the closest match.
In contrast, Weyand and his colleagues approached it as a classification problem.
As they explain in their paper, titled PlaNet – Photo Geolocation with Convolutional Neural Networks, they first divided the earth’s surface into a grid of over 26,000 cells of varying sizes, depending on how many photos are taken in each area.
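If you’re curious what that kind of adaptive partitioning looks like, here’s a rough Python sketch. It isn’t Google’s code – the paper leans on the S2 geometry library, and the photo-count thresholds below are invented – but it shows the gist: cells get subdivided where photos are dense and dropped where they’re scarce.

```python
# Rough illustration only: PlaNet actually uses Google's S2 geometry cells,
# and the photo-count thresholds here are made up. The idea: recursively
# split crowded cells into smaller ones, and drop cells with too few photos.

def build_grid(photos, cell=(-90.0, 90.0, -180.0, 180.0),
               max_photos=10_000, min_photos=50):
    """photos: list of (lat, lon) pairs; cell: (lat_min, lat_max, lon_min, lon_max)."""
    lat_min, lat_max, lon_min, lon_max = cell
    inside = [(la, lo) for la, lo in photos
              if lat_min <= la < lat_max and lon_min <= lo < lon_max]

    if len(inside) < min_photos:      # too sparse (oceans, poles): ignore
        return []
    if len(inside) <= max_photos:     # small and dense enough: keep as one cell
        return [cell]

    lat_mid = (lat_min + lat_max) / 2  # otherwise split into four sub-cells
    lon_mid = (lon_min + lon_max) / 2
    cells = []
    for sub in [(lat_min, lat_mid, lon_min, lon_mid),
                (lat_min, lat_mid, lon_mid, lon_max),
                (lat_mid, lat_max, lon_min, lon_mid),
                (lat_mid, lat_max, lon_mid, lon_max)]:
        cells.extend(build_grid(inside, sub, max_photos, min_photos))
    return cells
```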
Next, they trained a deep network using millions of geotagged images.
Of course, that meant that the system had a lot more images to go on for cities, where scads of photos are taken, and far fewer for remote regions where people don’t take many photos, such as oceans or polar regions – so the team simply ignored those areas.
They created that huge database of 126 million photos with EXIF geolocations mined from all over the web.
It’s a noisy data set. The Google team excluded non-photos, such as diagrams or clip art, as well as porn.
That left all manner of photos: those taken indoors, portraits, pet photos, food snaps, and other images that don’t have geolocation cues.
The Google team trained the neural network on 91 million of these images to teach it to work out the grid cell from the image alone: feed in a photo, get back a particular grid cell or a set of likely candidates.
They used the rest of the images – 34 million of them – to validate the results.
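To make that concrete, here’s what such a classification setup might look like – purely as an illustration, in TensorFlow/Keras, and not the model from the paper (Google used an Inception-style architecture; the backbone, image size and cell count below are stand-ins):

```python
import tensorflow as tf

# Illustrative sketch, not Google's model: the paper used an Inception-style
# network; here a stock ResNet50 backbone and an assumed cell count stand in.
NUM_CELLS = 26_000   # roughly one output class per grid cell

backbone = tf.keras.applications.ResNet50(
    include_top=False, pooling="avg", input_shape=(224, 224, 3))

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(NUM_CELLS, activation="softmax"),  # P(cell | image)
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # labels are cell indices
              metrics=["accuracy"])

# train_ds and val_ds would be the 91-million-image training set and the
# 34-million-image hold-out, yielding (image, cell_index) pairs:
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```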
Of course, it’s easy to be correct when there’s a famous landmark in the photo, like the Statue of Liberty, the Sydney Opera House or Big Ben.
But PlaNet also learned to recognize locally typical landscapes and objects (think red phone booths), architectural styles, and even plants and animals.
To gauge how well it was doing, the team pitted PlaNet against 10 well-traveled humans in a game of GeoGuessr.
GeoGuessr presents players with a random Street View panorama and asks them to place a marker on a map at the location where the panorama was captured.
It normally allows players to pan and zoom, but not to navigate to adjacent panoramas. To keep the comparison fair, the Google team didn’t allow the humans to pan or zoom.
You’d imagine that well-traveled humans would have an advantage: knowing, for example, that Google Street View isn’t available in countries including China would let them narrow down their guesses.
But PlaNet, trained solely on image pixels and geolocations, still beat the humans by a decent margin: it localized 17 panoramas at country level, for example, while the humans managed only 11.
From the paper:
We think PlaNet has an advantage over humans because it has seen many more places than any human can ever visit and has learned subtle cues of different scenes that are even hard for a well-traveled human to distinguish.
Or, to phrase it with more “in your FACE, humans!”, PlaNet is “superhuman.”
In total, PlaNet won 28 of the 50 rounds with a median localization error of 1131.7 km, while the median human localization error was 2320.75 km. [This] small-scale experiment shows that PlaNet reaches superhuman performance at the task of geolocating Street View scenes.
When it comes to geolocating photos that don’t have location cues, such as those taken indoors, the team figured out how to teach PlaNet to scrutinize photos that are part of albums.
Even if PlaNet can’t determine that a picture of, say, a toaster is in China, if it’s in an album with photos of the Great Wall, it can assume that the toaster’s in the same place.
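A crude way to picture that is to pool each photo’s predicted probabilities over grid cells across the whole album. The paper actually does something smarter – it feeds the sequence of photos through a recurrent (LSTM) network – but the toy sketch below captures the intuition:

```python
import numpy as np

# Toy illustration, not the paper's method (PlaNet runs the photo sequence
# through an LSTM). Here we simply average each photo's probabilities over
# grid cells, so a confident Great Wall photo drags the ambiguous toaster
# photo toward the same cell.

def locate_album(per_photo_probs):
    """per_photo_probs: array of shape (n_photos, n_cells), rows sum to 1."""
    album_probs = per_photo_probs.mean(axis=0)   # pooled distribution
    return int(np.argmax(album_probs))           # most likely grid cell

# Two photos, four cells: the toaster photo is uninformative,
# the Great Wall photo strongly favours cell 2.
album = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.05, 0.05, 0.85, 0.05]])
print(locate_album(album))   # -> 2
```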
Do you like the idea of Google using Street View photos and its mighty search muscle to pinpoint your photos’ geolocation?
And how do you feel about Google unleashing that might onto mobile phones?
It could well happen. For a deeply powerful neural network, PlaNet is one svelte bit of code:
Our model uses only 377 MB, which even fits into the memory of a smartphone.
Google has already seen its share of privacy wreckage over Street View.
It’s long been a challenge for the company to operate Street View in countries with stronger privacy laws than the US, such as in the European Union.
Though Google uses technology to blur faces and license plates in Street View images, European data protection authorities have also required that Google notify the public before the Street View cars start driving on European streets and that it limit the amount of time that it keeps unblurred images of faces and license plates.
So. Imagine this: Google Street View on steroids, beefed up with machine learning and running on the fuel of all the images Google has access to, in back pockets throughout the land.
Images, mind you, that don’t necessarily have to use EXIF data for geolocation but can instead be crunched by pixels alone.
Please do give us your thoughts on that scenario in the comments below.
Image of Google Street View car courtesy of 1000 Words / Shutterstock.com