On Lacework: watching an entire machine-learning dataset

Everest Pipkin is a drawing and software artist currently based in Pittsburgh, Pennsylvania.

I proposed what would become Lacework in the summer of 2019. In my proposal, I described a cycle of videos curated from MIT's 'Moments in Time' dataset, each then slowed down, interpolated, and upscaled immensely into imagined detail, one flowing into another like a river.

In many ways, Moments in Time is unremarkable. Like so many datasets with similar goals, it is intended to train AI systems to recognize actions. It contains one million 3-second videos scraped from websites like YouTube and Flickr, each tagged with a single verb like asking, resting, snowing or praying.

There are a few things that are particular about Moments in Time. It tries to break down most possible actions into just 339 'doing' verbs. It also doesn't specify the subjects of its videos- for instance, it is more interested that something is flying than whether that thing is a bee, flower, person, plane, satellite, or bird. Moments in Time decenters human actions in favour of how words might apply to broader swaths of doing.

Because these decisions require a particular type of logic, the language of this research is oddly poetic. The associated whitepaper reads: "three seconds is a temporal envelope which holds meaningful actions between people, objects and phenomena"; and "visual and auditory events can be symmetrical in time ("opening" is "closing" in reverse), and either transient or sustained."

One million videos is not a particularly large archive in the world of big data, but it is still a number meant for machine comprehension, not human. The first time I opened the dataset I got dizzy. One million moments, placed in folders named simply- sleeping, slicing, sliding, smelling, smiling. I felt like I’d been set down somewhere on the surface of the earth and told to walk home.

Without a methodology, I began haphazardly moving through the folders, dipping into specific verbs, pulling up a few dozen random videos until I found one that I liked. This was enough to test the aesthetic process through which I would approach the work but did little to teach me about the structure of the dataset itself. It was not until I entered quarantine in March, months after I’d begun, that this process moved from a random selection to a deliberate and focused one.

When the pandemic moved the project online, I considered focusing on videos about touch, a subject that had been transmuted from a casual action to a precious and dangerous one. I pulled roughly 20 folders from the archive by verb- cuddling, hugging, punching, reaching. These folders still contained 60,000 videos between them, but this was a scale that I felt was possible to approach, as one person stuck in a room.

As the world entombed, I spent my time watching these short, tiny moments of intimacy- two lovebirds beak to beak, a tiger putting paws around a keeper, the contact of a fist to a face, a kiss, a crowd, a touch.

I was looking for videos with a lot of texture to them, with subjects at a distance, in shapes or colours that were already confusing. I wanted compositions that the AI upscaling could catch onto and drag against, pulling a cloud into a mountain or a face into a coastline.

Somehow, I had expected the act of watching Moments in Time to be calming or exploratory, like seeing the world out of a window. But the archive is not entertaining, poetic, beautiful, or joyful- even though many videos that evoke those feelings are contained within it. It is an archive with purpose, an archive of actions for an inhuman eye. Here is the world, here are things that are done there. It feels raw.

Eventually, I built myself a Tinder-like interface for screening the videos. Each video would pop up and I would then tap the left or right arrow key to sort it into place- Yes or No, keep or throw away.
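The essay doesn't publish the screening tool itself, but its logic is simple enough to sketch. A minimal version in Python, assuming the screened videos sit in one folder and that keep/throw-away map to two subfolders ('yes' and 'no' are my names here, not the author's):

```python
import shutil
from pathlib import Path

def sort_video(video: Path, keep: bool, out_dir: Path) -> Path:
    """Move a screened video into a Yes or No folder and return its new path."""
    dest = out_dir / ("yes" if keep else "no")
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / video.name
    shutil.move(str(video), target)
    return target

def screen(videos, out_dir: Path, decide=input):
    """Present each video in turn and sort it by a two-key decision.

    decide is injectable so the answer can come from any key source -
    in a real tool, the left/right arrow events of a video-player window.
    """
    for video in videos:
        answer = decide(f"{video.name} - keep? [y/n] ").strip().lower()
        sort_video(video, answer == "y", out_dir)
```

In a real interface the `decide` callback would be wired to the arrow keys of whatever window is playing the clip; the essential structure is just the binary sort.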

Out of these 60,000 videos I selected roughly 400. I slowed each one down dramatically, drawing the video out to 15 seconds from its original three. I interpolated the frames, attempting to make a new flowing motion to replace the smoothness lost in the time conversion. I wanted to be able to study these moments intended to teach machines what it is to touch. I wanted to see each small choice in each body taking action, removed from human time.
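The arithmetic of that slowdown is worth making explicit. Stretching three seconds to fifteen is a 5x slowdown; played back at an unchanged frame rate, that leaves four frames to invent between every original pair. A small sketch (the 30fps figure is my assumption; the dataset's clips vary):

```python
def interpolation_plan(src_seconds=3.0, dst_seconds=15.0, fps=30):
    """How many frames must be synthesised to stretch a clip
    without lowering its playback frame rate?"""
    factor = dst_seconds / src_seconds      # 5x slowdown
    src_frames = int(src_seconds * fps)     # 90 original frames
    dst_frames = int(dst_seconds * fps)     # 450 frames needed at the same fps
    new_per_gap = factor - 1                # frames to invent between each original pair
    return factor, src_frames, dst_frames, new_per_gap
```

Motion-interpolation filters such as ffmpeg's `minterpolate` synthesise these in-between frames from estimated motion vectors; the essay doesn't name the specific tool the author used.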

It was now early April. I had watched 6% of the dataset, about 50 hours of videos, and from them had made a little over an hour of material.

I could have stopped there. But I kept thinking about the rest of the archive- all those other verbs, all this other life. I felt that the videos around touch were not enough of the story. They didn’t say enough about what the dataset was trying to do, what it contained.

I decided I needed to watch the rest of the dataset. This time I started at the top, alphabetically.

Aiming, applauding, arresting, ascending, asking.

Last fall, when we could still gather in rooms together, I taught a class called 'Data Gardens'. One of the units was around performance as AI. We talked about 'Wizard of Oz prototyping' and human labour in machine-learning systems. We watched movies of actors pretending to be androids, replicants, and machines. We looked at histories of automata, including the miraculous 18th-century chess-playing Mechanical Turk, which in reality hid a petite human chess master among its clockwork.

When I first started watching the dataset I assumed that the team of researchers who had put it together at MIT had seen the bulk of it, but I’m now convinced that assumption was wrong. This is because so much of the archive is so, so hard to watch.

This is partly to do with time. The videos in Moments in Time have a severe, automated cut (3 seconds, sharp) that severs these moments, sometimes chopping them in the middle of the action which they are meant to describe. I eventually found that I had to mute the videos to keep watching at all- that the images could dissolve into colours and shapes but the jarring severance of the sound remained distinct and pointed no matter how much I watched.

The difficulty of watching is also partly to do with consent. Moments in Time severs the relationship between recorded action and original maker. The researchers did not ask for permission to use these videos, and all ownership of- and control over- the image is pulled away from the person who held the camera, and from what that camera depicts.

In the archive, there are moments of extreme emotion and personal vulnerability- tears, screaming, and pain. Moments of questionable consent, including pornography. Racist and fascist imagery. Animal cruelty and torture. And worse; I saw horrible images. I saw dead bodies. I saw human lives end.

Even though I’m probably the first person to watch all of Moments in Time, every part of the dataset has had human eyes on it before. This is because after being gathered and cut, the videos of Moments in Time were automatically uploaded to Amazon Mechanical Turk for annotation. Amazon Mechanical Turk is an Amazon-owned crowdsourcing service that connects 'requesters' to 'workers' who generally perform small, computer-like tasks for pennies; it takes its name from the fake chess-playing machine.

The Moments in Time whitepaper describes the process of annotating the videos: “Each AMT worker is presented with a video-verb pair and asked to press a Yes or No key signifying if the action is happening in the scene. Positive responses from the first round are sent to subsequent rounds of annotation. Each HIT (a single worker assignment) contains 64 different 3-second videos that are related to a single verb”.
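The batching the quote describes can be sketched directly: videos for one verb are grouped into HITs of 64, and only positive responses advance to the next round of annotation. A minimal sketch (the function and field names are mine, not the paper's):

```python
def make_hits(videos, verb, hit_size=64):
    """Group the candidate videos for a single verb into HITs of hit_size
    (64 is the figure given in the whitepaper)."""
    return [
        {"verb": verb, "videos": videos[i:i + hit_size]}
        for i in range(0, len(videos), hit_size)
    ]

def next_round(videos, responses):
    """Only positive ('Yes') responses advance to the next round of annotation."""
    return [v for v, yes in zip(videos, responses) if yes]
```

The structure makes the quote's cascade concrete: each round is a sieve, and everything a worker rejects simply disappears from the pipeline.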

I’m reminded of my own years spent as an AMT worker, which kept me employed at well under minimum wage during and after my undergraduate education. I think about all those thousands of tasks which involved the repetition of my labour. Hitting buttons with my hands, matching emotions with my face, recording words with my voice. How many datasets my body must be contained in. What those datasets are used for. How much violence my body does to others, through them.

The example interface described by the Moments in Time whitepaper has an unsettling resemblance to my own handmade video-screening method, with the right and left arrow keys standing in for Yes and No. Both contain only two options- include, discard. No ability to report a video, reclassify it or clarify its inclusion, no middle ground.

This goes some way to explaining the brutality within. When you ask:

Does a dog fight match ‘barking’?

Does a sexual assault match ‘kissing’?

Does a police murder match ‘arresting’?

Sometimes the answer is Yes.

Around hour 250 of the dataset, sometime in late April, I started having The Dream. In The Dream, I am living on- or perhaps am- a research satellite, orbiting a planet and monitoring planetary broadcasts. For each brief moment I am floating over a place, I can see just a flash of what is happening under me. Just a fleeting, tiny vision of something happening then something happening then something happening. The dream goes on this way for years.

Even awake, I’ve come to see patterns in camera quality, in shadows, in the colour of paint and the types of trees, in movement, in texture that could unfold into ever more detail. I see patterns in everything, patterns that unify all of the videos which have nothing to do with their subjects and everything to do with the way the lens sweeps up to capture a running dog, or pivots to see the sunset or the way the light flares and the compression pushes against the edges of the frame.

I wonder if this is how a sorting algorithm feels.

By now, I am running the upscaling algorithm parallel to my own curatorial work. My computer hums and struggles as it processes and reprocesses images, imagining detail where there is none- from 80 pixels to 160, from 160 to 320, 320 to 640, and finally to 1280.
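The chain of resolutions is a simple doubling schedule: four passes, each doubling the side length and so quadrupling the pixel count, for 256 times the original pixels overall. A sketch of that schedule:

```python
def upscale_schedule(start=80, target=1280):
    """Each pass doubles the side length (quadrupling the pixel count)
    until the target resolution is reached."""
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * 2)
    return sizes
```

At each of these passes the model must invent three new pixels for every one it is given, which is why the detail compounds into hallucination rather than recovery.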

I’m using Topaz Labs' proprietary software AI Gigapixel to do the bulk of the upscaling, which is wrapped in an easy-to-use command-line interface. It was trained on yet another dataset of pictures at various resolutions. It sees patterns too. This process has been described as hallucinatory, which is an accurate marker- it is a recurrent looking, a push in and in for ever more detail which then spirals into something else entirely.

Like looking at images of the earth via satellite.

By the time I am halfway through the dataset in early May, I am running out of time. I start to spend whole days in the archive, watching from the moment I wake to when I go to sleep. I’m this far, and I want to do it right. The act of watching itself has become important to me.

In this archive of actions, I want to perform action. I become grateful to wake up every day knowing how I will spend it. I’m not building a cathedral, but I think about what building a cathedral would let me do, how it would allow me to move my hands in a task and see something monumental grow very slowly, with immense care. A bricklayer understands brick in a way that is devotional.

Repetition is devotional.

Very slowly, over and over, my body learns the rules and edges of the dataset. I come to understand so much about it; how each source is structured, how the videos are found, the words that are caught in the algorithmic gathering.

I see the subjects of the videos, the people living their lives. I meet their dogs, I see their homes. I see wild animals, strange weather, places I’ll never get to visit, video games I haven’t played. I see so much life.

I can also see the hands of the person who held the camera, and the hands of the workers who first sorted the videos. These others who have also watched this exact moment, who had to decide before I did- Yes, or No.

I memorise qualities of the pattern: light, colour, noise, compression, blur, frame-rate. I know how these aspects will interact with the interpolation and the upscaling. I don’t have to think about it anymore- it is all automatic.

I learn the exact length of 3 seconds.

Every once in a while, in the satellite dream, there will be no broadcasts. I can’t pick up anything- I just see the landscape that is underneath me, the mountains, the coastline. I can zoom the picture in, I can get closer, but all I see is world, not actions.

Sometimes I zoom in so far that I can see the lines in the dirt left by a tractor, by thousands of tractors, the ways in which they are interrupted by the boundaries of jagged creeks. I see the wind, the wind-formed dunes, the oil wells and the patterns they scar on the desert, the ocean, the wakes of boats, the wakes of islands. From above, the intertwined complexity of it all is so clear.

I see all these millions of lives, all of this infinite detail, this lacy intricacy that grows ever more granular the closer I get, then grows again, and again.


Suggested Citation:

Pipkin, E. (2020) 'On Lacework: watching an entire machine-learning dataset', The Photographers’ Gallery: Unthinking Photography. Available at: https://unthinking.photography/articles/on-lacework