The Limits of Automatic Tag Extraction

I'm working on rewriting an app with my daughters, and one of the features we wanted to test was whether or not we could add automatic tag extraction from photos. I've long been interested in services like Google's Cloud Vision API, which seem to do amazing things on images taken with a camera or modern phone.

Today we experimented with three competing tag extraction services: Google's Cloud Vision API, Imagga, and Clarifai.

Our goal was to identify objects captured from an IP camera feed, specifically things like cars, people walking, and animals. In Google Photos I can search for any of those terms and get back reasonable results, so I was hopeful I could do it here as well.

For our test we used three different photos:

no object (i.e., just the road) as a baseline
a car on the road, but far away
a car on the road, close to the camera

Here are the three images, along with some of the top tags that each API identified and extracted automatically, with the confidence for each tag rounded to a whole percentage. NOTE: I haven't included everything that each API returns, but the rest is generally more of the same sort of thing, with lower confidence.

1. No Object

Google: Property (88), Land Lot (87), Road (84), Ecosystem (84), Grass (76), Plain (72)
Imagga: road (96), way (74), highway (45), landscape (29), asphalt (20)
Clarifai: road (99), landscape (99), nature (97), tree (97), guidance (96), grass (96), field (96), no person (89)

2. Car Far Away

Google: Property (87), Land Lot (86), Road (86), Ecosystem (84), Pasture (83), Grass (79)
Imagga: road (97), way (83), highway (40), landscape (29)
Clarifai: road (99), landscape (99), guidance (97), field (97), tree (97), nature (97)

3. Car Close to Camera

Google: Ecosystem (84), Land Lot (83), Lawn (70), Screenshot (58), Vehicle (54), Driving (52)
Imagga: car (64), road (24), motor vehicle (23), vehicle (23), transportation (20), automobile (19)
Clarifai: road (99), car (99), transportation system (98), vehicle (97), landscape (96)
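
For anyone who wants to repeat this comparison in code, here is a minimal Python sketch. It hard-codes the rounded scores listed above rather than calling any of the APIs. The set of tag names that count as "a vehicle" (VEHICLE_TAGS) and the cutoff of 50 (THRESHOLD) are my own assumptions for illustration; neither comes from the services themselves.

```python
# Rounded confidence scores copied from the three result lists above.
RESULTS = {
    "no object": {
        "Google": {"Property": 88, "Land Lot": 87, "Road": 84,
                   "Ecosystem": 84, "Grass": 76, "Plain": 72},
        "Imagga": {"road": 96, "way": 74, "highway": 45,
                   "landscape": 29, "asphalt": 20},
        "Clarifai": {"road": 99, "landscape": 99, "nature": 97, "tree": 97,
                     "guidance": 96, "grass": 96, "field": 96, "no person": 89},
    },
    "car far": {
        "Google": {"Property": 87, "Land Lot": 86, "Road": 86,
                   "Ecosystem": 84, "Pasture": 83, "Grass": 79},
        "Imagga": {"road": 97, "way": 83, "highway": 40, "landscape": 29},
        "Clarifai": {"road": 99, "landscape": 99, "guidance": 97,
                     "field": 97, "tree": 97, "nature": 97},
    },
    "car close": {
        "Google": {"Ecosystem": 84, "Land Lot": 83, "Lawn": 70,
                   "Screenshot": 58, "Vehicle": 54, "Driving": 52},
        "Imagga": {"car": 64, "road": 24, "motor vehicle": 23,
                   "vehicle": 23, "transportation": 20, "automobile": 19},
        "Clarifai": {"road": 99, "car": 99, "transportation system": 98,
                     "vehicle": 97, "landscape": 96},
    },
}

# Assumption: these tag names (compared case-insensitively) mean "a vehicle".
VEHICLE_TAGS = {"car", "vehicle", "motor vehicle", "automobile"}
# Assumption: an arbitrary cutoff, not a value recommended by any service.
THRESHOLD = 50


def best_vehicle_score(tags):
    """Return the highest score among vehicle tags, or None if there are none."""
    scores = [score for name, score in tags.items()
              if name.lower() in VEHICLE_TAGS]
    return max(scores, default=None)


def vehicle_hits(results, threshold=THRESHOLD):
    """Map each photo to the services whose best vehicle tag meets the cutoff."""
    hits = {}
    for photo, by_service in results.items():
        hits[photo] = {}
        for service, tags in by_service.items():
            score = best_vehicle_score(tags)
            if score is not None and score >= threshold:
                hits[photo][service] = score
    return hits


if __name__ == "__main__":
    for photo, services in vehicle_hits(RESULTS).items():
        print(photo, services or "no vehicle tag above threshold")
```

Under those assumptions only the close-up photo registers a vehicle: Google at 54, Imagga at 64, and Clarifai at 99. Raising the cutoff to 60 would drop Google from the list, which shows how sensitive a rule like this is to the threshold you pick.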

We also tried another series of photos that included people walking and bicycles, both close and far away. None of the three services was able to identify any of these.

I have to admit that I was somewhat surprised at how poorly suited these services are to the kind of thing I wanted to do. I suspect that part of the issue is the quality of the image feed, which is a 704 x 480 pixel JPG. When you use any of these APIs with good quality images, the results are much better. I think the other reason is that most of each photo is not actually about what interests us: these are photos of grass, road, and landscape, and the car is a minor detail. It is minor as measured in pixels, but major as measured in perceptual context for an informed viewer. Credit where credit is due: all three services found the car when it was fully in frame and large enough, and Clarifai did so with the highest confidence.
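
To put a rough number on "minor as measured in pixels", here is a small Python sketch. The frame size is the 704 x 480 feed mentioned above. The two box sizes are hypothetical values I picked for illustration; I did not measure the cars in our test photos.

```python
# Resolution of the IP camera feed, as stated in the post.
FRAME_WIDTH, FRAME_HEIGHT = 704, 480


def frame_share(box_width, box_height):
    """Return the fraction of the frame covered by a box of the given pixel size."""
    return (box_width * box_height) / (FRAME_WIDTH * FRAME_HEIGHT)


# Hypothetical bounding boxes in pixels, for illustration only (not measured).
EXAMPLE_BOXES = {
    "car far away": (40, 20),
    "car close to camera": (300, 200),
}

if __name__ == "__main__":
    for label, (width, height) in EXAMPLE_BOXES.items():
        print(f"{label}: {frame_share(width, height):.1%} of the frame")
```

With those made-up sizes, the distant car covers well under 1% of the frame, so it is not surprising that a general-purpose tagger describes the road and the grass instead.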

In the end, we've decided that it's probably not the right time to invest our time and money in this approach. I have no doubt that things will get better, to the point that even images of this quality can be analyzed with great accuracy. We do not live in that future, however. My youngest daughter thought that this was a good sign, actually: "Dad, you don't want to turn a computer into a person."