The Performativity of Ground-Truth Data
Images are the fuel of computer vision. Applications such as facial and object recognition, emotion detection, and self-driving cars rely on models trained on datasets of hundreds of thousands of images, many of which have been labeled by human workers. In most forms of (supervised) computer vision, the dependent variable is called ground truth. Ground-truth labels synthesise the often complex realities contained in each image to make them “readable” in computational terms: “this is a woman”, “this is a weapon.” Computer vision models then learn to recognise and predict this ground truth based on a number of independent variables, or features, that they detect as patterns.
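To make this mechanism concrete, the following is a minimal, purely illustrative sketch of supervised training: a model is optimised to reproduce whatever ground-truth labels it is given. The label set, the random stand-in images, and the tiny network are our own hypothetical placeholders, not code from any project discussed in this essay.

```python
# Minimal sketch: supervised training reproduces the ground-truth labels.
# All names and data here are hypothetical placeholders.
import torch
import torch.nn as nn

# Ground truth: each image is reduced to one of a few predefined classes.
CLASSES = ["person", "weapon", "other"]           # hypothetical label set

images = torch.randn(8, 3, 64, 64)                # stand-ins for labeled photos
labels = torch.randint(len(CLASSES), (8,))        # the "truth" assigned by labelers

# The independent variables (features) are patterns the model extracts itself.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, len(CLASSES)),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Training loop: the loss penalises any prediction that deviates from the
# labels, so whatever the labels encode is what the model learns to "see".
for _ in range(3):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```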
Ground-truth labeling sets the tone of how computer vision systems will learn to “see” and what they will consider to be “true” about the world. However, the power to inscribe “truth” in computer vision is unevenly distributed, and some actors have the authority to impose their interpretations over others. Even if ground-truth labels are often presented as self-evident, the interpretation and classification of images are not neutral or straightforward, but are shaped by specific values and worldviews. The question is: whose worldviews get encoded in computer vision systems?
Whose truth?
The ascription of “truth” to training datasets is carried out by large numbers of precarized[1]“Precarized” is a term used to highlight the active and purposeful action of placing someone in a position of precarity. “Precarious,” instead, suggests that being in a position of precarity or having a precarious job are unavoidable situations that can’t be subject to change. workers called data labelers, many of whom are located in the so-called Global South. Labeling work for computer vision is fundamentally about making sense of images, that is, about classifying and annotating photographs and video footage to interpret the information they contain. This work is commonly outsourced to crowdsourcing platforms or specialised companies, largely based in these same regions, where workers are paid a few cents per labeled image. For five years, we have collaborated with data labelers at such companies in Argentina, Syria, Germany, and Bulgaria. We have had the opportunity to observe how data collection and labeling projects are conducted. We also had the chance to interview labelers, managers, and several of their clients.
Data labelers constitute automation’s “last mile”[2]Gray, Mary L. and Siddharth Suri (May 2019). Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Boston: Houghton Mifflin Harcourt. ISBN: 978-1-328-56624-9., that is, the bottom end of a hierarchical labour structure where the power to define the ground truth is unevenly distributed. Although these workers are the ones who effectively assign the labels, they do so by following strict instructions provided by clients and managers. The instructions typically comprise a set of predefined, mostly mutually exclusive, classes and attributes that will be used to label the images: male/female for gender; white/african american/latinx/asian/indian for race; anger/disgust/fear/happiness/sadness/surprise for emotion. The quality of the labelers’ work is then defined in terms of how faithful they are to the truth values instructed by the clients. In their own words, low quality is “anything that contradicts instructions.” Moreover, clients very commonly refuse to pay, arguing that the quality of the labeling is low, in cases where labelers questioned or contradicted the instructions. Under such conditions, labelers’ income depends on their obedience to the predefined truth values imposed onto them, and through them, on data. Economic dependency makes labelers prone to accept the judgments and decisions instructed by clients without questioning them[3]Miceli, Milagros, Julian Posada, and Tianling Yang (2022). “Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power?” In: Proc. ACM Hum.-Comput. Interact. 6, GROUP, Article 34, 14 pages. DOI: 10.1145/3492853. URL: https://doi.org/10.1145/3492853., which reinforces the authority of those with more economic power to define the “truth” in data and thereby in computer vision systems.
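To illustrate how such instructions constrain interpretation, here is a hypothetical sketch of a labeling schema built from the categories listed above. The schema itself mirrors the essay’s examples; the validation function and its behaviour are our own illustrative assumptions about how a quality check defined as “fidelity to instructions” might operate, not the interface of any real platform.

```python
# Hypothetical labeling instructions: a closed set of mostly mutually
# exclusive classes that labelers must choose from, whatever they see.
LABELING_SCHEMA = {
    "gender": ["male", "female"],
    "race": ["white", "african american", "latinx", "asian", "indian"],
    "emotion": ["anger", "disgust", "fear", "happiness", "sadness", "surprise"],
}

def validate_label(attribute: str, value: str) -> str:
    """Reject anything outside the predefined classes, mirroring a notion of
    quality defined as fidelity to the client's instructions."""
    allowed = LABELING_SCHEMA[attribute]
    if value not in allowed:
        raise ValueError(f"'{value}' contradicts instructions; use one of {allowed}")
    return value

# A labeler who finds none of the classes adequate still has to pick one;
# an answer such as "none of these apply" would simply be rejected.
label = validate_label("emotion", "happiness")
```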
One of the labelers that we interviewed, Sarah, is a single mother and lives with her two young children in Bulgaria under refugee status after escaping Afghanistan by road while pregnant and with a toddler. Her second son was born on the road and became severely ill. He now needs permanent medical attention. In Bulgaria, Sarah’s only source of income is her work as a data labeler. Her experience in image data services is vast and ranges from annotating satellite imagery to classifying selfies according to gender, race, and age. Like Sarah, most labelers are located in impoverished areas of the world and/or belong to vulnerable populations. This makes workers highly dependent on the earnings from labeling tasks and allows requesters to pay meagre wages. Moreover, labeling work is often performed under precarious labor conditions and without job security. As Sarah puts it, the worst part is the lack of work continuity: when a project finishes, she has to wait for new projects and never knows how long this will take, which makes her income unstable.
The availability of a large workforce that depends on labeling work to make ends meet is a blessing for requesters looking to outsource, at low cost, work no one else wants to do. While the AI industry insists on portraying labelers as low-skill, the truth is that labeling work can be extremely hard. It requires high levels of accuracy and attention to detail. Workers need to develop a feel for the relationship between data and instructions that allows them, for instance, to interpret satellite imagery and quickly tell a tree from a bush and a bomb from a pile of firewood. All this while acting as a proxy of the client and labeling in the way instructed by them. Moreover, labeling work often involves working with images that might cause discomfort and even trauma. Graphic depictions of abuse, violence, murder, and blood are hard to forget and leave their mark on the labelers: “I am scarred for life,” whispers Omar, one of the workers we interviewed in Germany. The lack of protection faced by data labelers also means that Omar and his colleagues do not receive mental-health support after reviewing disturbing material.
Defining the ground truth in training data is a process full of arbitrary decisions. Clients, i.e., computer vision engineers, generally have the power to define how labelers will interpret and label data and what computer vision systems will consider true. In this way, ground-truth labels are often based on economic imperatives and technical possibilities: clients base their definition of ground truth on the features that best fit the model they are developing and its commercial application. For instance, to train a computer vision-based scanner that is able to recognise contamination on hands, a large dataset with images of hands (big, small; dirty, clean) was produced in Bulgaria. In this case, a computer vision engineer was in charge of defining what “contamination”, the ground-truth label, would mean. As the lead engineer on this project, his knowledge of the model alone conferred on him enough authority to decide how the training images were to be interpreted. “I’m the ground truth that [we]’re training the model off of,” he explains eloquently. The truth value he decided on was outsourced to a group of data labelers, including Sarah, who then proceeded to label thousands of images following the engineer’s interpretation.
Epistemic Authority
Labels are imposed top-down on labelers, and through them, on the images. This imposition is rendered invisible because it is considered business as usual: it seems common sense that labeling services are carried out according to the preferences of paying clients. However, the naturalisation of such hierarchical processes conceals the fact that computer vision systems learn to “see” and “create” reality in line with the interests of those with the financial means or the epistemic authority to impose their worldviews on data[4]Miceli, Milagros, Martin Schuessler, and Tianling Yang (2020). “Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision”. In: Proceedings of the ACM on Human-Computer Interaction 4.CSCW2, pp. 1–25. ISSN: 2573-0142. DOI: 10.1145/3415186. URL: https://dl.acm.org/doi/10.1145/3415186..
Because labelers assume that those in higher positions know better, managers’ interpretations are accepted without question. As Mariam, a Syrian labeler, describes, in the face of ambiguity or doubt, the judgement of managers is broadly perceived as correct:
“She knows better than me. She taught me, no, this is right, this is a tree. And I thought okay.”
This process reinforces the standardisation of ground-truth labels following clients’ wishes. It also naturalises the epistemic authority of managers and clients, and the role of labelers as obedient executors of their orders.
The question of epistemic authority, that is, the power to impose meanings that appear as legitimate and part of a natural order of things, becomes even more complex in cases where ground-truth labels are defined by so-called subject matter experts whose academic degree or employment record confers on them enough authority for their definition of truth to prevail. Such “experts” are recruited to provide classifications of higher precision and quality, to review labeled images, or to increase the accuracy of labeling tasks. What their status as experts tends to conceal is that their ground-truth definitions are also subjective in nature. For instance, one of the labeling projects we observed involved the development of a computer vision model that rates images according to their perceived beauty. To develop the ground-truth labels, subject matter experts such as visual artists and art curators were recruited. Personal and subjective as their aesthetic judgements were, they became the crucial criteria according to which the computer vision model learns to recognise and judge beauty.
In critical areas such as security and healthcare, specialised knowledge is often sought to interpret and classify image data. In such domains, the potential consequences that subject matter experts’ subjective interpretations could have are even more severe. One of our interview partners working at a data labeling company specialised in military and defence applications describes what kind of expertise is required to interpret the satellite imagery that trains computer vision models to recognise weapons:
“[the expert is] a retired Air Force person. He actually trained people in the military to look at the images and to analyze images.[...] His speciality was a certain region of the world. So he knows very well when he looks at synthetic aperture radar data. He can look at it and just look at the shape and how things reflect and can tell you oh that is a car or building or that is a helicopter [...] we believe what the experts say. They are the experts on this.”
Whereas the subjectivity of precarized labelers is often presented as a hazard to the quality of data[5]Miceli, Milagros, Julian Posada, and Tianling Yang (2022). “Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power?” In: Proc. ACM Hum.-Comput. Interact. 6, GROUP, Article 34, 14 pages. DOI: 10.1145/3492853. URL: https://doi.org/10.1145/3492853., the interpretations of subject matter experts are perceived as plain truth. While data labelers are instrumentalized to label data according to categories and instructions imposed by actors with more power, the epistemic authority of managers, “experts”, and requesters is consolidated and naturalised in a multi-layered data labeling process. In this process—that of producing computer vision datasets—truth is created.
Producing Data, Creating Reality
Falling into the water in winter in high-latitude regions can be life-threatening. For a computer vision model that aims to detect such incidents and improve safety at a harbour front, Syrian labeler Mariam and her colleagues were asked to label humans in video footage. The model trained on this data automatically identifies people falling into the water and sends a warning signal to the harbour authorities. As some of the labelers working on the training images explained, they were asked not to label animals that appeared in the collected footage. The clients who provided these instructions explained that budget limitations prevented them from having animals labeled and that the few images of animals present in the footage were too underrepresented to count as relevant data. Underlying this justification is the implicit decision to exclude animals from the list of beings worth saving by the system. This decision gets inscribed in the training data and becomes legitimised as the computer vision model functions according to this “truth”.
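As a purely hypothetical illustration of how such an exclusion gets inscribed in the data, the sketch below shows an annotation target with no class for animals: everything that is not a person collapses into the background and can never trigger a warning. The class map and the mapping function are our own assumptions, not the project’s actual configuration.

```python
# Hypothetical annotation setup for the harbour-front footage: only humans
# are annotatable, so anything else is, by construction, invisible to the model.
ANNOTATION_CLASSES = {0: "background", 1: "person"}   # no class for animals

def to_training_target(observed: str) -> int:
    """Map whatever the labeler sees to the classes the client paid for."""
    for class_id, name in ANNOTATION_CLASSES.items():
        if name == observed:
            return class_id
    return 0  # animals, and everything else, are folded into "background"

print(to_training_target("person"))  # 1 -> the kind of event that triggers a warning
print(to_training_target("dog"))     # 0 -> treated as background, never a warning
```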
In this way, the power involved in image data labeling produces a twofold form of legitimisation: it legitimises certain “truths” over others, and it also legitimises certain images by naming them “data”. While an image can still be incomplete or biased, data is perceived as neutral and indisputable. The making of images into data and their classification through labeling is not a mere descriptive exercise. The making of a category—the act of classifying—reifies the category’s existence[6]Crawford, Kate and Trevor Paglen (2019). Excavating AI: The Politics of Images in Machine Learning Training Sets. URL: https://www.excavating.ai (visited on 02/07/2020)., thus creating reality. A reality, for instance, where the category “beauty” exists and where it is represented by the images it serves to classify. Data production is “a doing” that reifies data’s existence and the existence of what is accounted for in data. The performativity of data thus refers to data’s potential to constitute and form reality.
Performativity is the capacity of speech and communication to consummate an action[7]Austin, J. L. (1962). How to Do Things with Words. Second edition. Cambridge, MA: Harvard University Press. ISBN: 9780674411524.. Naming something is not just a representation of the thing named, but also “enacts or produces that which it names”[8]Butler, Judith (1993). Bodies that matter: on the discursive limits of “sex”. New York: Routledge. ISBN: 978-0-415-90366-0 978-0-415-90365-3.. The performative act of naming is influenced by the power dynamics at play, where those with greater power can impose their worldview[9]Bourdieu, Pierre (1992). Language and Symbolic Power. Cambridge: Blackwell Publishers. ISBN: 978-0-7456-1034-4.. Similarly, the concept of performativity extends to data[10]Blouin, Gabriel G. (2020). “Data Performativity and Health: The Politics of Health Data Practices in Europe”. In: Science, Technology, & Human Values 45.2, pp. 317–341. ISSN: 0162-2439, 1552-8251. DOI: 10.1177/0162243919882083. URL: http://journals.sagepub.com/doi/10.1177/0162243919882083.[11]Currie, Morgan and Umi Hsu (2019). “Performative Data: Cultures of Government Data Practice”. In: Journal of Cultural Analytics. DOI: 10.22148/16.045. URL: https://culturalanalytics.org/article/11042., which has the capacity to shape the realities it represents. The production of image data and the values encoded within them determine which experiences are brought into existence and which are erased. For instance, the underrepresentation of Black people in training data makes them less visible to facial recognition systems[12]Buolamwini, Joy and Timnit Gebru (2018). “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification”. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Vol. 81. PMLR, pp. 77–91. URL: http://proceedings.mlr.press/v81/buolamwini18a.html., effectively denying their personhood within such systems. Similarly, the encoding of cis-normativity into datasets can make airport security scanners flag trans bodies as alert-worthy[13]Costanza-Chock, Sasha (2020). Design Justice: Community-Led Practices to Build the Worlds We Need. Cambridge, MA: The MIT Press. ISBN: 978-0-262-04345-8. URL: https://design-justice.pubpub.org/.. This exemplifies how preconceptions about gender and the body shape the classification and labeling of ground-truth data, defining which bodies are deemed “normal” and which are not. Furthermore, and following the examples presented in this essay, the realities produced by ground-truth data can range from a contamination scanner flagging a worker for not washing their hands properly to a weapon-detection system misrecognising a threat to public security. This creates a reality in which the worker is a rule-breaker liable to lose their job and a knife is just a kitchen tool. These realities are dependent on the “truths” taught to the respective models through training data.
Computer vision systems are imprinted with power differentials between those who are in the position to impose their worldviews on data, and those who are subject to imposed classifications. The performative capacity of image data reveals the danger of leaving preconceptions encoded in computer vision unquestioned and the power imbalances involved in their production uncontested. Moreover, these arbitrary truths unobtrusively shape realities. We often do not consciously choose how and what realities are produced by image datasets because they are set as fixed and “default” in computer vision applications used in everyday life. The many technical complexities of how image datasets are created and used obscure the subjective assumptions encoded in computer vision systems, making them likely to escape public scrutiny. By using and being subject to such systems, we automatically enact specific worldviews. The “realities” created by computer vision deserve our attention. Making the production of image data visible and explicit is the first step towards challenging, contesting, and, sometimes, actively rejecting the often incomplete “truths” that are imposed onto us.