An Introduction to Image Datasets
Data / Set / Match is a year-long programme by The Photographers' Gallery's digital programme, seeking new ways to present, visualise and interrogate contemporary image datasets. This introductory essay presents some key concepts and questions that make the computer vision dataset an object of concern for artists, photographers, thinkers and photographic institutions.
Computer vision is the production of an understanding of the content of digital images, and the generation or transformation of images through software. In recent years, algorithms have become increasingly successful at automatically tagging photographs, reading number plates or assessing the presence of tumours in medical images. Once a black box, the digital photograph has become a terrain of experimentation and product development. Advances in computer vision have produced techniques to optimise photographs, as all sorts of filters have become available on social platforms (see for example Snapchat or Facebook Messenger). The same techniques and advancements have also powered the generation of the uncanny psychedelic imagery of deep dreams and the deep fakes that have become cultural references. Computer vision algorithms are not mere technical improvements: they intervene in the common understanding of what an image is, what it can do and whether it can be trusted. These developments have been achieved by algorithmically emulating the ways humans see, interpret and produce images. To emulate these cognitive abilities, computer vision algorithms make heavy use of collections of images called datasets.
A dataset in computer vision is a curated set of digital photographs that developers use to train, test and evaluate the performance of their algorithms. The algorithm is said to learn from the examples contained in the dataset. What learning means in this context has been described by Alan Turing (1950): “it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English. This process could follow the normal teaching of a child. Things would be pointed out and named, etc.” A dataset in computer vision therefore assembles a collection of images that are labelled and used as references for objects in the world, to ‘point things out’ and name them.
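In code, such a dataset is little more than a pile of image files paired with labels. As a minimal sketch (assuming PyTorch and torchvision are installed, and a hypothetical folder of photographs sorted into one sub-directory per label), the ‘pointing out and naming’ looks like this:

```python
# Minimal sketch of a labelled image dataset, assuming a hypothetical layout such as:
#   photos/dog/001.jpg, photos/dog/002.jpg, photos/cat/001.jpg, ...
# where each sub-directory name acts as the label that "points out and names" its images.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # resize every photo to a fixed input size
    transforms.ToTensor(),           # convert the image into a numerical tensor
])

# Each sample becomes a (tensor, label_index) pair; the label is simply the folder name.
dataset = datasets.ImageFolder("photos", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(dataset.classes)               # e.g. ['cat', 'dog'] -- the taxonomy in miniature
for images, labels in loader:        # batches an algorithm would "learn" from
    print(images.shape, labels[:5])
    break
```

Everything that follows in this essay, the collecting, the annotation labour, the taxonomy, is compressed into that folder structure and those label indices.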
The example of ImageNet
This learning-by-example approach depends on a particular scale for its success. Algorithms trained with larger datasets perform significantly better than those trained on smaller ones. With more data come more variations, and the algorithm can learn from the myriad differences of the visual world. In contemporary machine learning, an order-of-magnitude change in the amount of sample data provokes a qualitative change in the algorithm's performance. To understand more concretely how such collections of images are assembled in vast quantities, let's examine one of the largest databases of human-annotated visual content to date: ImageNet (Deng et al., 2009).
Datasets such as ImageNet are built on an array of practices of photographic mediation: collecting, labelling, composing, assembling and distributing images.
Such practices, familiar to the world of photography, are now translated to an industrial scale. The ImageNet project, for instance, is a collection of tens of millions of images manually annotated, sorted and organised according to a taxonomy. ImageNet, exclusively composed of digital photographs[1: For example, the neural networks achieving state-of-the-art performance are trained using datasets with millions of labelled faces: Facebook’s DeepFace and Google’s FaceNet were trained using 4 million and 200 million training samples, respectively (Hu et al., 2016)], functions as a large cache of photos culled from the internet. The dataset's content comes from various sources: amateur sites, blogs, stock agencies, news sites, forums, etc. Background information as well as authorship notices are absent from the dataset; the link to the environment where the image was liked, shared, commented on and tagged is severed. And the authors of the photographs, as well as the people portrayed in them, are unaware of their presence in the collection. In visual datasets, photographs are treated as self-standing documents, free from the contexts from which they originated and through which they travelled.
The delegation of vision
Computer vision datasets depend on the availability of large volumes of photographs. Each category of ImageNet reportedly contains a minimum of 1,000 images, and categories cover a vast variety of topics, from plants and geological formations to persons and animals. At the same time, the amount of annotation work involved in the production of datasets is even more impressive than the amount of photographs[2: After all, 14 million images is only a fraction of the monthly 575 million public uploads of photos on a platform like Flickr] it contains. The work of manually cross-referencing and labelling the photos is what makes datasets like ImageNet so unique[3: And hard to replicate for specific domains. Different research projects are attempting to produce image datasets artificially rather than collect the images]. In fact, rarely in history have so many people been paid to look at images and report what they see in them (Krishna et al., 2016). The automation of vision has not reduced but increased the number of eyeballs looking at images, of hands typing descriptions, of taggers and annotators. Yet what has changed is the context in which the activity of seeing takes place, how retinas are entangled in heavily technical environments and how vision is driven at an extraordinary speed.
Computer vision researchers make use of crowdsourcing platforms like Amazon Mechanical Turk (AMT) to recruit workers who classify images for them under precarious labour conditions[4: A worker is typically paid 1 to 4 cents (US currency) per annotation]. The annotators are not passive retinas: they are asked to interpret, filter, clean and perform these tasks at a rapid pace. To produce a dataset at “the scale of the web” implies imposing a particular way of seeing images, of pointing and naming. Through the interfaces of AMT, the workers are not only guided but also monitored; their vision is oriented and framed. On such platforms, the work is divided into micro-tasks and the workers, if they want to make a living or just a semi-decent income from the execution of these tasks, need to work at a pace that barely allows them to see the images.
From this perspective, the speed of vision is built into the platform economically. For the annotators, structurally, the glance is the norm, not the gaze. The speed also answers the ever-increasing need of the software industry for training sets to be produced fast, which means that many workers are mobilised intensively for a short period of time. Through the interface of AMT, the requesters (employers, in AMT's parlance) manage the cadence of the annotation work. They want to ensure the workers go fast enough to meet production deadlines. At the same time, they attempt to prevent them from rushing so much that they neglect their task. The interfaces of annotation are designed to control workers' productivity, to find the optimal trade-off between speed and precision. The cost of the labelling effort leads computer vision researchers to approach visual content in terms of informational currency and attention scarcity. As the volume of requests grows, the unit of measurement for a labelling task is moving towards the millisecond rather than the second. This raises questions about the nature of what can be perceived at that speed, what is emphasised, what is overlooked and how the complexity of the photographic object is dealt with.
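To give a concrete sense of the pace this economy implies, a back-of-the-envelope calculation can be sketched. The per-annotation rate comes from the figure cited above; the target hourly wage is an illustrative assumption, not a figure from this essay.

```python
# Back-of-the-envelope sketch of the pace implied by micro-payment annotation work.
# The 1-4 cents per annotation figure is cited in the text; the target hourly wage
# below is an illustrative assumption (roughly the US federal minimum wage).
pay_per_annotation = 0.02      # US dollars, mid-range of the 1-4 cent figure
target_hourly_wage = 7.25      # assumed target wage, for illustration only

annotations_per_hour = target_hourly_wage / pay_per_annotation
seconds_per_image = 3600 / annotations_per_hour

print(f"{annotations_per_hour:.0f} annotations per hour")   # ~363 annotations
print(f"{seconds_per_image:.1f} seconds per image")         # ~10 seconds, including loading and submitting
```

Under these assumptions, a worker has roughly ten seconds per image, and any higher income, or any unpaid waiting time between tasks, pushes that window further towards the glance.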
Classification, stratification
To organise the labelled images they collect, computer scientists often rely on pre-existing classification systems. ImageNet, for example, makes use of WordNet, a baroque construct containing 117,000 categories of words, whose stated aim is to provide an extensive lexical coverage of the English language. Using a subset of WordNet as its semantic backbone, ImageNet includes a large botanical section featuring a dizzying array of plant varieties and a detailed catalogue of geological formations. In addition, its category of natural objects stretches from rock to extra-terrestrial object. A place of honour in the top categories is given to sport activities and fungus. The core of the database is constituted by the sets ‘person’, ‘animal’ and ‘artefact’, which offer an overview of the wide variety of living bodies and manufactured objects. Finally, the rather poetic ‘Misc’ category brings together disparate entries such as tabernacle, meunière, shit and puce.
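The way ImageNet hangs its categories on WordNet can be glimpsed in a few lines of code. The sketch below uses the NLTK interface to WordNet, a freely available implementation of the lexicon, to look up one noun, walk the chain of broader categories it is filed under, and derive the kind of synset identifier ImageNet uses to name its categories. The choice of the word ‘trout’ is purely illustrative.

```python
# Sketch: browsing the WordNet taxonomy that ImageNet uses as its semantic backbone.
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

synset = wn.synsets("trout")[0]            # the first WordNet sense of 'trout'
print(synset.definition())                 # WordNet's gloss for this category
print(synset.lemma_names())                # the words grouped under this category

# The chain of broader categories this entry is filed under, up to the root.
for path in synset.hypernym_paths():
    print(" -> ".join(s.name() for s in path))

# ImageNet names its categories after the WordNet synset offset,
# prefixed with the part of speech, e.g. 'n01234567' for a noun synset.
print(f"n{synset.offset():08d}")
```

Every image in the dataset is attached to one such identifier, so the decisions baked into WordNet's hierarchy travel, unexamined, into the visual training material.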
The fact that WordNet can be used free of charge is an important factor in its adoption. Classification systems are expensive to develop and, for that reason, datasets tend to recycle existing ones. Pressing an existing classification system into service, however, brings its own share of problems, omissions and decision-making issues. WordNet, for instance, unreflexively integrates and naturalises racial and gender binaries, and its structure contributes to reifying social norms. The category ‘transvestite’, for example, is filed under ‘abnormal’, together with the entries ‘eccentric’, ‘aberrant’ or ‘zombie’.
This aspect of the classification logic is particularly crucial, as a category is not a mere placeholder for a set of images. The photographs are, in theory, subordinated to the categories; they are said to image a given category. Images are curated to fill in and fit the categories. Their presence in the dataset depends on how well they ‘illustrate’ and confirm the taxonomic structure. But while doing so, they also dramatically redefine the category itself. For instance, the category ‘wrongdoer’ is populated with a selection of mugshots stigmatising a particular segment of the population. By excluding images of white-collar (or female) criminals and focusing on people of colour, the photos introduce their own racial discrimination into the wrongdoer category. The category ‘Bornean’, a sub-category of people, contains only photographs of monkeys, whereas ‘English woman’ contains a huge proportion of brides in white dresses.
In other sections, different animals are given contrasting treatments. A hammerhead fish is seen swimming, through the lens of an underwater telescopic camera, whereas a trout is mostly shown dead in the hands of a proud fisherman and a lobster is represented cooked on a serving plate. The photographs in these three categories provide more than three different kinds of specimens. They bring in competing regimes of animal representation, each with its own pictorial conventions (underwater, amateur and food photography). They are also bound to specific social practices of photography. In the world of ImageNet, a hammerhead fish is an object of scientific observation, a trout is a dead trophy one posts to a Flickr group and a lobster illustrates a restaurant menu. Photographs always bring more than an ideal entity to the dataset. They encode socio-technical practices of representation and circulation. The photographic object always gives too much and too little to the classification. The photograph never adheres fully to the label and exceeds its meaning. This makes the dataset discriminatory, uncertain and wobbly.
Dialogue
The scale of contemporary datasets, the speed at which they need to be produced, the specific ways of seeing they impose on the annotators, the uncritical recycling of taxonomies and the treatment of the photograph as an object that can represent categories are among the major factors that generate tensions in the learning of vision. These tensions are the object of increasing awareness within the computer vision community, and, as a recent post on the ImageNet website testifies, research efforts are underway to 'mitigate' these concerns constructively. However, it remains that computer science has an epistemic monopoly on the visual training of machines. As the model of vision that subtends the relation between words and images in the digital world is of utmost importance, it becomes urgent to find a common language to speak about, and act upon, this technology. As computer vision raises issues that are far larger than its disciplinary boundaries, it needs to open itself to dialogue. This confronts an institution of photography such as The Photographers' Gallery with the following challenge: to cultivate modes of learning informed by the practices of computer scientists while remaining critical of the assumptions that underpin their knowledge; to resist the temptation to give technologists full authority over the subject while engaging with the techniques they have created.
The photographic institution needs to firmly highlight the role of the various processes of photographic mediation in the training of visual algorithms: to curate images, to assemble datasets, to look at and describe photographs, to design classifications are ways of making computer vision as much as writing code is. The complexity of photographic objects and practices creates a need for discourses and engagements that interfere with computer vision's questions and inflect its trajectory. This gives some urgency to finding a ground for understanding, and for learning how machines learn to see, that exceeds the confines of the AI lab and extends beyond the white walls of the gallery.