Tunnel Vision

April 2020

Philipp Schmitt is an artist, designer, and researcher based in Brooklyn, NY, USA. His practice engages with the philosophical, poetic, and political dimensions of computation. His current work addresses notions of opacity in artificial intelligence research and its history.


Declassifier is a software artwork based on a modified computer vision algorithm that recognises objects in pictures. Instead of showing the program’s prediction, a photo is overlaid with images from COCO, the training dataset from which the algorithm learned in the first place. The piece exposes the myth of magically intelligent machines, highlighting that prediction is based on others’ past experience: that it takes, for instance, a thousand dog photos in order to recognise another one. Declassifier visualises exactly which data conditioned a certain prediction and encourages chance encounters with the biases and glitches present in the dataset. Ultimately, it helps us intuit how machines see.

For the Data/Set/Match programme at The Photographers' Gallery, Declassifier is presented in dialogue with a new series of 35mm photos, titled Tunnel Vision. This text contextualises the series through notebook entries and speculates on how computational logic in computer vision might come to influence human perception.

Introduction

When a computer vision algorithm recognises something in a picture, it soberly frames what it ‘sees’ in confetti-coloured rectangles, digital hues that contrast with the everyday shapes and colours we perceive with the naked eye. Neatly labelled with a single category each, these annotations highlight answers but don’t give explanations. To the uninitiated, it seems almost magical, or at least akin to some sort of intelligence.

Computers don't really see like humans, of course. They recognise pixel formations statistically similar to the data they have previously learned from. Each prediction by a computer vision algorithm is an encounter with an image dataset — a collection of annotated pictures showing what-is-what in the real world. This encounter is generally invisible. The data is seen as a given while the incredible amount of labour that went into a collection’s creation remains unacknowledged.
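The idea that ‘recognition’ is statistical similarity to previously seen, labelled data can be sketched in a few lines of code. Everything below is invented for illustration: the tiny ‘dataset’, its labels, and the use of raw pixel distance stand in for systems that, trained on collections like COCO, learn far subtler notions of similarity from hundreds of thousands of images.

```python
def nearest_label(query, dataset):
    """Return the label of the training example closest to `query`.

    Each example is a (pixels, label) pair, where `pixels` is a flat
    list of brightness values standing in for an image.
    """
    def distance(a, b):
        # Squared Euclidean distance between two pixel lists.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(dataset, key=lambda example: distance(query, example[0]))
    return best[1]

# Invented training data: each "image" is just four pixel values.
training = [
    ([0.9, 0.8, 0.9, 0.7], "dog"),
    ([0.1, 0.2, 0.1, 0.3], "cat"),
    ([0.5, 0.5, 0.5, 0.5], "frisbee"),
]

# A new picture is "recognised" only by its closeness to the past.
print(nearest_label([0.8, 0.9, 0.8, 0.6], training))  # prints "dog"
```

The point of the sketch is that the answer is never produced from nothing: every prediction is a comparison with someone’s earlier, labelled pictures.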

COCO, short for Common Objects in Context, is one such image collection. Originally published by researchers at Microsoft, the dataset is maintained by, and attributed to, a consortium of 13 contributors (all but one male) at major American universities and tech companies. COCO consists of 328,000 annotated images containing “91 object types that would be easily recognisable by a 4-year-old”. To make sure these types would “form a representative set of all categories” (all presumably understood in the sense of all there is), the authors began by editing an existing list of the most frequently used words to come up with an initial list of classes. They voted, among themselves, on the best categories, and even consulted several children aged four to eight. What a responsibility for a four-year-old! Soccer ball didn’t make the cut; baseball bat did. What is common, of course, depends on who is looking.

Tunnel Vision

For my commission for the Data/Set/Match programme at The Photographers’ Gallery, I photographed computer-recognisable examples of COCO’s object categories in the city of New York, where I live. My goal was to test the dataset’s alleged universality in one actual place. [1] Marking a shift of perspective I frequently explore in my work, the project required me to reorient my own gaze: to look through the viewfinder of my camera from the worldview the dataset puts forward. My photography would become algorithmic, its results embodying the feedback loops of (mis-)representation, over- and underexposure that datasets like COCO inevitably produce.

pizza, fire hydrant

On the first day, I copy the 80 main categories from COCO into my notebook. The vehicles: bicycle, car, motorcycle, airplane, bus, train, truck, and boat. The animals: bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, and giraffe. I try to memorise them so that I can immediately recognise them when I encounter them later, outdoors.

Some categories jump out. The ones that appear uncommon, like a cow in New York, or out of season, like a frisbee in January. When I discover them after all, I remember the occasions fondly: The cast-metal elephant in Brower Park, the discarded remote control in Bushwick, the life-sized plastic cow in someone’s front yard. When I show the photos to the algorithm, it doesn’t smile, or blush, and as weird as the sight may be, it doesn’t show any signs of surprise.

Over the course of the project, I constantly think about whether it’s okay to photograph subjects without their approval. Scraping is a violent act executed from an office chair, abstracted and administered in code. On the street, considering a single picture at a time, my choice of analogue photography adds to making it a visceral experience.

Person is the biggest category in the COCO dataset. I don’t ask for people’s permission when I take their picture. The heist is only a matter of seconds. I press the shutter. Curtains slide open explosively. Light is diverted in the lens and thrown at the film where it starts a chemical reaction and burns the picture into the emulsion. The curtains close with an audible crack, like breaking a bone. A person becomes data.

clock, clock, person

These photos are for a computer ‘audience’ first; the human viewer is secondary. I shoot matter-of-factly, with little occlusion, as too much complexity will confuse the algorithm. It remains a challenge to take pictures that I myself find interesting. When I feel confident that I can see through its eyes, I take pictures that I hope will trick the computer. Sometimes this works. Other times it doesn’t.

toilet, sink, frisbee

To re-enact the combinatory logic by which the original dataset was compiled, I give myself the rule that each photo should include two, or more, object categories. Not every type of picture is included in a dataset, anyway.

The COCO authors wanted, for example, pictures of cats on couches, but not of cats posing in front of a white background for a feline studio shoot. To avoid images of single objects in isolation, the authors used combinations of object categories as search terms. Considering that Flickr already had ten billion pictures in its database by 2015, with more than a million uploaded every day, it is reasonable to assume that there would be a photo of almost anything. Of course, there are pictures online for the search term cat+sink (COCO contains 198 images showing both categories), and it isn’t at all inconceivable that cats could be in sinks, yet I would argue that they generally have no business in there. Similarly, one rarely finds umbrella and toilet together, or one in the other, unless the encounter is amplified, and thus naturalised, by the search query. This is how common objects are put in context: not primarily by any natural context in which they might appear, but explicitly by their appearance together with other dataset classes. This explains why the objects in the COCO dataset frequently seem neither common nor in context.
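This combinatory logic, counting how often pairs of categories appear together across a collection’s annotations, can be sketched briefly. The per-image category lists below are invented for illustration; COCO’s real annotation files similarly record which categories are present in each image, which is what makes counts like the 198 cat-and-sink images possible.

```python
from collections import Counter
from itertools import combinations

# Invented per-image category lists standing in for real annotations.
annotations = [
    {"cat", "sink"},
    {"cat", "couch"},
    {"umbrella", "toilet"},
    {"cat", "sink", "person"},
]

# Count every pair of categories that co-occurs in the same image.
pair_counts = Counter()
for categories in annotations:
    for pair in combinations(sorted(categories), 2):
        pair_counts[pair] += 1

print(pair_counts[("cat", "sink")])  # prints 2: two images show both
```

Searching by such pairs is also how unlikely pairings get amplified: once umbrella+toilet is a query, its results are, by construction, images in which the two appear together.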

These instances aren’t particularly concerning by themselves. What I am suggesting is that, as a result, COCO represents more than an obviously biased collection of objects. It captures a logic of how things should be connected: in this other world, umbrellas should be in toilets, and cats in sinks. Artificially intelligent systems adopt the data’s coded logic and are then deployed in our world. Consider all the ways digital technology has subtly morphed minds for generations. Is it unreasonable to speculate that it could instil its logic in the minds of people who are constantly subjected to its algorithmic worldview? What would be considered normal, what absurd? In other words, could data change us cognitively?

sandwich, zebra

Day after day, in moments of clarity after hours of searching for objects in the city, I realise that I am indeed fully immersed in the combinatory logic of COCO. The experience resembles a form of algorithmic tunnel vision in which peripheral details are blacked out and nothing but COCO objects come into focus; only that the tunnels are multiple. Motifs stand out to me when multiple categories intersect. If it’s together, it makes sense. Otherwise logic becomes secondary.

I become aware of my empty stomach and take a lunch break, for human computers have to eat. I am beyond excited when my bagel sandwich comes with a zebra napkin! Is this what computer vision feels like?

References

[1]

I find New York an interesting location to conduct this experiment, considering its geographical and cultural proximity to the site of COCO’s genesis. My demographic affiliation with the dataset’s creators might be a limitation; in this case, however, I see it as a method of working, a recipe I would love to see adapted by others, elsewhere. Try it yourself: the online version of Declassifier lets anyone test their photos against the algorithm. Just drag an image onto the browser window.