“The cat sits on the bed”, Pedagogies of vision in human and machine learning.
I am watching Professor Fei-Fei Li, director of the Stanford Artificial Intelligence Lab, giving a TED Talk titled ‘How we teach computers how to see’, uploaded to the YouTube platform. She begins her lecture by evoking an image of a three-year-old girl: “She might still have a lot to learn about this world, but she is already an expert at one very important task: to make sense of what she sees.” (00:38)
The little girl is the first of the many children that will illustrate the presentation, including Leo, Li’s son. However, another child (albeit never qualified as such) looms in the background: the machine learning algorithm.
The sequence of images which open Li’s talk – the cat on the bed, the elephant in the zoo and the airplane on the tarmac – will eventually reappear at the end, this time described by the artificial voice of a computer. Between these two moments Li narrates the epic adventures of the development of computer vision, and how computer scientists, inspired by the capacity of the human child’s brain, can transfer visual knowledge to machines. This is a story of two parallel translations: from the words of the little girl to those of the artificial voice, and from the human brain to the neural network.
For Li, this translation is not only desirable but also necessary, because “collectively as a society, we are very much blind because our smartest machines are still blind.” This blindness is not limited to our sensory organs; it is also a blindness of the mind: “Just like to hear is not the same as to listen, to take pictures is not the same as to see. And by seeing we really mean understanding.”
Such an understanding requires a very specific kind of education, a form of learning that can only be attained “through real world experiences and examples”. When conceived as such, learning can only be translated to the machinic realm if a series of features of the human subject and the machine can be treated interchangeably. Or, as Li speculates:
“If you consider a child’s eyes as a pair of biological cameras, they take one picture about every two hundred milliseconds, the average time an eye movement is made. So by age three, a child would have hundreds of millions of pictures of the real world. That's a lot of training examples. So instead of focusing on solely better and better algorithms, my insight was to give the algorithms the kind of training data that a child was given by experiences, in both quantity and quality.” (06:02)
In a few sentences, Li establishes a whole series of equivalences. The human and the machinic bleed into each other: eyes and cameras, experience and training, looking and taking pictures. The computer becomes more biological, and the child becomes more robotic. Or, to put it differently, the biological and the machinic flow from one figure to the other, and their borders become blurred.
Perhaps more subtly, another border is being eroded. Vision, once the sense that grounded human subjectivity, becomes collective. To gather the training data required for the computer to make sense of the visual world, Li created ImageNet, a database of 15 million images, in 2007:
“Luckily we didn't have to mount a camera on our head and wait for many years. We went to the Internet, the biggest treasure trove of pictures that humans have created. We downloaded nearly a billion images.”
In contrast to the child who experiences the visual world through her two eyes, the computer vision algorithm is immersed in the visual world of millions of people. Nurtured by the Internet, the algorithm has a collective vision tied together by the computer network.
At this point in the talk, the intelligent machine described by Li could be understood as a hybrid: a machine largely understood in biological terms that assimilates experience, but whose vision is made of millions of eyes and a collective brain. Li is careful to avoid presenting this machine as a monster challenging our accepted categories and divisions between humans and technical objects. The computer vision algorithm is likened to the child and contextualized in a comforting domestic sphere, surrounded by pets and familiar objects. The computer vision algorithm’s impressive ability to track car plates and brands is counter-balanced by its inability to understand the subtleties of social life and basic human emotions. And all the examples of its usefulness are geared towards human restoration: machines will help us cure bodies, give back vision to the blind, repair the damage caused by natural catastrophes, etc.
When the artificial voice reads the image, the relationship between the child’s perception and the computer vision algorithm is reinforced. The public applauds. The intersection between human interpretation and what the machine has learned is confirmed. But has it, really? ‘A boy patting an elephant’ and ‘a man standing next to an elephant’ do not describe exactly the same thing. What has been lost in translation is what grounds perception in a body: touch and affect. For Li, it is only a matter of time before these nuances are incorporated by the algorithm. The loop is closing. Learning is what humanises the machine. It’s all in the family.
The video ends. But the experience proposed by YouTube’s interface doesn’t end there. A playlist offers up suggestions of related videos alongside those I have watched earlier. Beneath the YouTube player, a series of comments and icons invite me to interact with Li’s talk and to enrich it in various ways. I am watching a video and the page is watching me in return. Every click or comment is registered, linked to my profile, correlated with my browsing history. The page around the video is constructed with this information: it speaks to me in my language, and it invites me to linger based on what it knows about me.
Interestingly, my interaction with the webpage reflects the essence of Li’s talk. The algorithms at work on the page are not only recording my activity but learning from it too. Whether I am watching, commenting on or sharing the page, I am training YouTube’s algorithms. As I listen to Li talk about teaching computers how to see, the website is learning from me how to watch a video. If I take literally the message of the video that ‘machines are being taught how to see’, I also have to accept that every time I use the social web I become a pedagogue for algorithms.
Our shares and likes, our annotations and social metadata are training a generation of AI agents. Every day, we are all already teaching bots and algorithms how to look at images. If we consider the extent of our relationship with algorithms, we realise the magnitude of the effort of teaching and learning that is taking place. This vast operation of photographic learning is happening outside of the institutions of education: on our phones, tablets, and computers. And it is not students who are being trained in the art of visual literacy, but machines.
If we acknowledge this fact, several questions may be asked. If we are recruited as teachers, what is our responsibility towards our trainees? As instructors of machines, how are we able to question the pedagogical methods that are in place? And as photography is being learned by machines, what should be the role of the institutions traditionally dedicated to photographic education? What can they learn about the dynamics of machine image learning? How can they contribute to it, transform it and be transformed by it? Or, to formulate this in terms even closer to Fei-Fei Li’s, how can we think productively about the fact that a generation of humans and algorithms are learning together to look at images?
Suggested Citation: Malevé, N. (2016) ‘“The cat sits on the bed”, Pedagogies of vision in human and machine learning’, The Photographers’ Gallery: Unthinking Photography. Available at: https://unthinking.photography/articles/the-cat-sits-on-the-bed-pedagogies-of-vision-in-human-and-machine-learning