Language-Using, Collaborative Robots: Why Humanoids, Why Now?

Robots seem to be everywhere all of a sudden. From demonstrations of robots using the state-of-the-art in AI to follow human commands, to promises that robots will become fixtures in workplaces and homes in the near future, a survey of technology media could give the impression that the world is about to be inundated with sophisticated autonomous machines. Furthermore, we are led to understand that an important aspect of these machines operating in our midst will be their ability to participate in linguistic communication, both understanding language and expressing themselves with words.

But something seems a bit off here, because, while there has been a great deal of press about robots in 2024, robots are still distinctly missing from our everyday lives: robots are not helping us out around the house, or doing dangerous work on construction sites, or providing assistance to overstretched healthcare staff. This blog is about why we seem to hear about robots all over the place, and yet not really see them anywhere at all.

In one sense, one might argue that robots actually are all over the place. If we think of a robot in its simplest form – as a thing with sensors that give it an ability to detect what’s happening in its environment and actuators that give it an ability to do things to that environment, coupled through a central computational control mechanism – then yes, such devices have become commonplace in our daily life at home and work. Mechanical robots are indeed currently in operation in industrial settings, where for instance robotic arms help assemble vehicles and move heavy things around, while robotic technology enables devices ranging from toys to precise and increasingly complex surgical applications. In the broadest sense, even something like a mobile phone is a robot, because it senses with things like a microphone, a camera, and GPS, and then it actuates through things like a speaker, a screen, and a vibrator.
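
To make this simplest form concrete, here is a toy sketch in Python of a sense-compute-act loop. The "sensor" and "actuator" are simulated stand-ins invented purely for illustration; a real device would read hardware sensors and command motors instead.

```python
# A toy illustration of the "simplest form" of a robot described above:
# sensors feeding a central control computation that drives actuators.
import random

def read_temperature_sensor():
    """Stand-in for a sensor: report the ambient temperature."""
    return 18 + random.random() * 10  # degrees Celsius, made up

def set_heater(on):
    """Stand-in for an actuator: switch a heater on or off."""
    print("heater", "on" if on else "off")

def control_step(target=21.0):
    """The central control mechanism coupling sensing to acting."""
    reading = read_temperature_sensor()
    set_heater(reading < target)

for _ in range(5):  # a robot is, at minimum, this loop running continuously
    control_step()
```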

But what is missing from the heterogeneous mix of machines that interact, with varying degrees of autonomy, with humans and our environment is machines that actually resemble humans in some combination of form and function. What this means is that, while we have robots that are quite effective at handling very specific tasks in particular environments, we don’t have robots that are generally good at coexisting with us and participating in a helpful, collaborative, communicative way in the various aspects of being human at home, at work, and out in the world. The idea of a world in which autonomous machines are integrated into our daily lives, capable of communicating directly with us and sharing in our formulation and execution of goals, is ambitious, but it is also worthwhile. Robots like that could change life for us in radically positive ways, making it easier for us to take care of ourselves, each other, and even the planet itself.

In what follows, we’ll explore what it will take to have a world where robots coexist with us in our day-to-day lives. We’ll first examine some of the reasons that the human form makes sense for collaborative robots, including how they fit into our world and also how they align with the language we use to talk about doing things in that world. Then we’ll get deeper into the nature of human linguistic and cognitive development, and think about how building communicative humanoid robots might present a challenge to current trends in language technology, particularly around recent advances in AI. And then finally we’ll reflect on how the impetus to build robots that can coexist alongside us might indicate some new directions in technological research and development.

Why Humanoids?

Of the various reasons that we should build robots that, on a number of levels, look and seem like us, the first and probably most obvious has to do with us and our environment, not with the mechanics or information processing abilities of robots. To put it simply, we live in a human-shaped world, because the amazing thing about humans is that we have built up a world that fits us. The structures we spend our lives in, the spaces we pass through, and the objects we deal with every day are full of features – handles, buttons, portals, passageways – that are ergonomically matched to our own bodies and the things we like and need to do. It therefore stands to reason that agents designed to collaborate with us in this world – built not so much in our image as around our image – should resemble us in a fairly fine-grained way. For instance, a robot that is expected to be helpful around the house will need to be involved in activities like opening doors and handling objects with graspers that are functionally similar to human hands, and moving through spaces that are meant to be amenable to the movement of human bodies.

Then there is the fact that humans themselves are geared towards communicating collaboratively with other humans in ways that have to do with the human form, and so collaboration with robots will be naturally enhanced if the robots also match a generally human profile. Think for instance of phrases like "hand me that thing", "stand by for a minute", or "run this downstairs" that involve actions made by human bodies, even if the phrase doesn’t necessarily need to refer literally to a body part or function. Human bodies and abilities are of course varied, but there is nonetheless something essential about having a body built into some of the most basic elements of the language we use, especially when communicating about accomplishing practical tasks. Using this sort of language requires some kind of bodily understanding on either side of the conversation.

On top of this, consider that a specific robot will need to learn how to do things in a specific environment – a certain household, say, or a particular construction site. When we transfer knowledge through demonstration, we use a dynamic combination of language and action to simultaneously describe and enact the activity we’re demonstrating. How do we demonstrate and at the same time explain the operation of, say, a particular vacuum cleaner to an agent that doesn’t have a body similar enough to our own to use the device as we do? Moreover, a good collaborator is one who knows how to ask for help when they need it. How would a robot go about seeking assistance with some new way of doing things or some unforeseen challenge if it doesn’t have at its disposal the same basic bodily affordances as the person who will be helping it to learn?

“A drawing of a person trying to show a Dalek how to use a vacuum cleaner” generated by Microsoft Designer

The basic conceptualisations that humans form about the relationships between their bodies and the world are called image schemas by cognitive linguists: these structures for representing being in and moving through space are the basis for phrases like "opening up", "moving along", and "sorting out". A significant thing about these phrases is that they are very often employed in linguistic constructions which are on the one hand not really literal and on the other hand very fundamental to our way of communicating about things that are happening. For instance in the course of a collaboration a task leader might say something like "we’ll sort this out later, let’s move along to the next thing", even if the task doesn’t have anything to do with physically sorting or moving through space. We use language like this all the time without thinking much about it, and at the same time our understanding of this sort of language is tied up with our experience of having a body, and often a particular type of body, in a world of spaces that we can be inside, outside, alongside, moving through, moving around, and so forth.

A consideration of the role of image schemas in everyday conventional communication makes it clear that language learning involves something important about having a body in space. There is ongoing theoretical discussion over the ways in which very young children begin to learn words and phrases, and over the relationship between these first encounters with language and the way a human begins to have thoughts; we are not going to come to conclusions about these issues here. But it seems certain that the universal experiences of being in and moving through space, of encountering objects that have insides and outsides and relations to one another in that space, and of having these experiences with a certain type of body are all wrapped up in the process through which we have learned language. If we accept that language learning is entangled with our physiology, questions arise about how we might go about emulating the process of language learning in order to build robots that have language in a way that makes them useful for us.

Why Now?

Something worth noticing about the way humans become language users is that we each begin with no specific language at all – and, concomitantly, no real capacity for using our bodies in the world. To be blunt, we are all born useless. A debate has been unfolding amongst linguists over the course of many decades as to the extent to which the capacity for language use in general is encoded, perhaps through evolution on the level of genes, in being human, but there is no doubt that no human comes into this world knowing a specific language. And it is an important feature of language that we are not born with any language in particular: this gives languages, in general, their ability to adapt to a world that is itself always changing. If humans had evolved to be born with a particular vocabulary some 2 million years ago, then we would be dependent on random genetic mutations to update our vocabulary and syntax, in order to keep pace in our linguistic communication with various changes in society and advances in technology. So the fact that every human who uses language – which is almost all humans, including those of us who do not have the ability to see or hear – has to learn the specific language of their time and place is in fact a powerful feature of the emergence of linguistic communication: it is arguably the feature of being human that makes our species uniquely capable of superseding the mere randomness of natural selection to become a lifeform with a destiny of its own.

How can this ineluctably learnable property of any particular natural language inform the way we develop humanoid robots? Another thing that’s clear from the study of developmental linguistics and cognitive development in general is that humans learn to understand language, express themselves through language, reason about the world, and just be basically capable with their bodies in that world all at the same time. This is significant: a newborn doesn’t just not understand language; they also don’t have the fundamental control over, for instance, their breathing and the intricate muscles of the larynx needed to perform the nuanced articulation associated with spoken language. It’s also worth noting that babies who are not able to hear still exhibit vocal behaviour similar to that of babies who can hear for about the first six months of their lives, and that deaf children develop linguistically in parallel to children who can hear, just through alternative modes of expression such as sign language. This suggests that something about learning to collaborate through communication has to do with a mind growing into a body at the same time as that body grows into the world.

What would it mean for a physically mature and capable but non-language-using human to be suddenly given the ability to communicate about the world they inhabit and the way they might use their bodies to do things in that world? This is a question that is becoming more than just a thought experiment, because it is basically what’s happening right now with state-of-the-art applications of language-using robots. Robots with the ability to navigate tricky spaces and manipulate a variety of objects, such as the ones that big robotics companies like Boston Dynamics and Tesla have demonstrated, are now being equipped with large multimodal models that are ostensibly capable of taking fairly complex natural language as input and mapping it to output that combines robotic action policies with natural language responses appropriate to whatever is happening around the robot.

It’s important to emphasise just how far removed techniques in robotic design enabled by recent advances in AI are from the way that real humans become capable collaborators and communicators. The foundational principle of the current approach to AI is that massive amounts of data can provide sufficiently complex pattern-detecting networks with an ability to generalise from observed data to predictions about unobserved data. These networks begin as random configurations, and are gradually shaped into configurations that accurately predict ways to continue or complete inputs observed as parts of training data. Once the information-processing AI is adequately accurate at predicting its own training data, it is coupled with a machine that has been engineered to be able to deal with basic animal-level navigation of the world and can be prompted to perform combinations of physical routines like manipulating objects and moving from one place to another. Humans, on the other hand, begin "blank" in terms of linguistic, cognitive, and physical capabilities, but we do not begin random; we are primed by our evolution to grow into functional, interactive adults. Compared to AI-equipped robots, this process of growing into a human mind-and-body happens slowly when measured in hours, days, and years, but incredibly rapidly when measured by the number of experiences we have. To put it in data scientific terms, our training data is sparse, but our experiences are rich.
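
To make that contrast concrete, here is a minimal sketch in Python (using PyTorch) of the training principle just described: a network that begins as a random configuration and is nudged, step by step, towards predicting continuations of its own training data. The corpus, model, and hyperparameters are invented for illustration; foundation models follow the same basic recipe at an incomparably larger scale.

```python
# Toy sketch of data-driven "learning to continue the training data".
import torch
import torch.nn as nn

corpus = "the robot opened the door and put the cup on the table".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in corpus])

class TinyPredictor(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # starts as random numbers
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.out(h)

model = TinyPredictor(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Each step: show the model a prefix of the corpus and penalise it for
# failing to predict the next word; gradients reshape the random weights.
for step in range(200):
    inputs, targets = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, len(vocab)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```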

Linguists refer to this aspect of human linguistic development in particular as the poverty of the stimulus: what foundational AI models learn from rapid, iterative computations over hundreds of billions of words in a setting effectively isolated from anything like the real world, human children learn through exposure to merely hundreds of thousands of words, but we do it more slowly and through a perceptually rich dynamic with the humans teaching us to communicate and think, entangled with the environment we’re learning about. A difference in process does not necessarily result in a defect in outcome, but we do have to pause to wonder whether machines built in this way can really become useful collaborators working alongside us in our world. Both humans and intelligent machines are expected to generalise from their learning in order to deal with novel situations in the world. But humans do this deductively and abductively, by determining likely causes for unexpected information in a way that fits with a detailed and multifaceted model of the world. Data-driven computational systems, on the other hand, behave inductively, fitting whatever specific information they encounter to general patterns in similar information rather than reaching clear conclusions about cause and effect, or, in more strictly linguistic terms, meaning and intent. They are in a sense doing "book learning" as compared to the hands-on way that humans learn, with the important caveat that a data-driven model doesn’t actually even comprehend the conceptual content of the data it processes; it simply learns to mimic a likely reaction to it.

Would we ever be able to rely on machines like this to coexist with us in a truly collaborative way? Consider the nuance we regularly apply in reconciling instantaneous perception with what’s really happening in the world – telling the sound of children being raucous from the sound of children yelling for help, noticing that a door to a dangerous place has not closed properly, making sure that all of a corrosive cleaning substance has been wiped off a surface where food is prepared, knowing an enthusiastic dog from a dangerous one, detecting people who are speaking or behaving in a hostile way. Asked to analyse how we make determinations about things like this, we might, with considerable thought, be able to come up with some hypotheses – something about the tone of a voice, realising something we see or hear is a bit out of context – but the way we react to these things in the moment has to do with complex interactions of conscious and unconscious mental processes. It is certainly the case that we learn to notice the things we need to notice through experience, but it is just as certainly the case that an important aspect of this has to do with having these learning experiences in the rich context of the real world. In the end there is something going on under the surface that has to do with trying to fit perceptions to explanations of the causes of those perceptions, and this is not something that is available in decontextualised streams of data, no matter how long, broad, or fast those streams are.

“A drawing of two robbers holding up frightened guests at a cocktail party while a robot serves them a platter of hors d’oeuvres” generated by Microsoft Designer

So there is reasonable cause for concern that we are currently on course to build robots that are dangerously good at seeming capable enough to convince us that they’re working alongside us in our lives, but that turn out to lack a crucially human way of understanding the world in the critical, often surprising situations where we most need help. In fact, beyond being bad at identifying a crisis precipitated by unexpected external causes, the chance that these robots will have no understanding of the circumstances in which their own established patterns of action become problematic seems good enough that we should be worried. If this is the case, then the time to think about how we might want to change course in the development of collaborative robots is right now.

What Next?

If we want robots to be able to use language in the course of their interactions with us, it will be important to consider how these robots learn the languages they use, with particular focus on emulating, at an appropriate level of abstraction, the way that humans learn language in the context of their developing human bodies.

Current applications of AI to robotics, while well motivated and perhaps useful to a point, do not seem to be on this track of embodied language learning. For instance Nvidia have undertaken a project to develop "a general-purpose foundation model for humanoid robots". The idea here seems to be to take a simulation of a fully physically developed robot and use it as a basis for generating a great deal of data about things that might happen to the robot in the world, including goal-oriented linguistic encounters and corresponding demands for reasoning about tasks, actions, and outcomes. This data then becomes the basis for developing a model that can generalise to helpful reactions to novel situations. The purpose of this blog is not to rubbish this approach; the availability of hardware and software capable of supporting this programme is exciting, and there could be some valuable results. But if all the points made above about the way humans learn to use language and at the same time to reason about the world they’re in make sense, it also makes sense to anticipate a point at which a big-data approach to robotics will come up short, perhaps catastrophically so. And so, given that this research is happening now, it makes sense to think right now about what we can do next to establish an alternative programme of research and development aimed at eventually integrating with commercially available collaborative robots.
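
To illustrate the general shape of that simulation-to-data-to-model loop – and emphatically not Nvidia’s actual system, whose details are not discussed here – the following toy Python sketch generates a batch of synthetic instruction-and-action episodes from a pretend simulator and then mines them for patterns that can be applied to unseen instructions. Every name, template, and number in it is invented for illustration.

```python
import random
from collections import Counter, defaultdict

ACTIONS = ["pick_up", "put_down", "walk_to", "open"]
OBJECTS = ["cup", "box", "door", "shelf"]
TEMPLATES = ["please {verb} the {obj}", "can you {verb} the {obj}", "{verb} the {obj} now"]

def simulate_episode():
    """One synthetic example: an instruction and the action the simulated robot took."""
    action, obj = random.choice(ACTIONS), random.choice(OBJECTS)
    verb = action.replace("_", " ")
    return random.choice(TEMPLATES).format(verb=verb, obj=obj), action

# Step 1: generate a large batch of simulated experience.
dataset = [simulate_episode() for _ in range(10_000)]

# Step 2: fit a crude bag-of-words model: how often each word co-occurs with each action.
word_action_counts = defaultdict(Counter)
for instruction, action in dataset:
    for word in instruction.split():
        word_action_counts[word][action] += 1

# Step 3: generalise to an unseen instruction by scoring each action against its words.
def choose_action(instruction):
    scores = Counter()
    for word in instruction.split():
        scores.update(word_action_counts.get(word, Counter()))
    return scores.most_common(1)[0][0] if scores else "ask_for_help"

print(choose_action("please open the door"))
```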

Academic research programmes have considered, and will continue to consider, alternative, theoretically motivated approaches to robotics. An area of especial interest here is developmental robotics, which seeks to take inspiration from ideas about human development in order to conceive of machines that can learn in and adapt to their environment in an ongoing way, and likewise to use experiments involving robots to explore hard questions about human development. Findings in this area can no doubt be the basis for advances in building commercially viable, environmentally integrated robots. But it is also the case that the academic approach to developmental robotics tends, for good reasons, to take a holistic perspective on exploring machines that emulate human development, examining ways to model the gradual overall development of a human from the time of conception. The result is models that have explanatory power in terms of why very young humans are the way they are, but not so much machines that could be useful to humans in general in the chaotic environments we inhabit. To put it simply, the present ambitions of developmental robotics research programmes tend towards building machines with the capacities of infant humans.

This presents a challenge from a commercial perspective. Why would an investor be interested in funding a project aimed at developing a robot that might ambitiously be expected to behave like a stroppy toddler? The advances necessary to get much beyond that point are so significant that it probably makes sense to talk about the timeline for development in terms of decades right now, and this in itself is not an appealing pitch for a commercial research and development programme. But an alternative – or maybe an adjunct – to a quest for all-out robotic emulation of a human being might be to focus on particular aspects of the way that humans perceive and reason about the world. An overall project to explore what we might label intelligent sensing could take on aspects of human cognitive and linguistic development piecemeal, resulting in milestones with clear commercial applications. For instance, developing software and hardware to emulate the way that human (and, more generally, mammalian) ears support both the perception of spoken language and the monitoring of the environment for signals that require attention could lead directly to compelling applications for security systems that use audio signals to operate with a degree of autonomy. Similar stories could be told about things like machine vision, or the way that proprioception – an agent’s sense of the situation of its body in space – is modelled.
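
As a very rough illustration of what a first step towards intelligent sensing in the audio domain might look like, the following Python sketch monitors a signal and flags moments that stand out from the recent background. The synthetic signal, threshold, and features are all invented for illustration; a real system would work on live microphone input with far richer acoustic representations.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_rate = 16_000
frame_len = 1_024  # roughly 64 ms frames

# Fake ten seconds of quiet background noise with a loud event in the middle.
signal = 0.01 * rng.standard_normal(10 * sample_rate)
signal[5 * sample_rate : 5 * sample_rate + frame_len * 4] += 0.5 * np.sin(
    2 * np.pi * 880 * np.arange(frame_len * 4) / sample_rate
)

background = None
for start in range(0, len(signal) - frame_len, frame_len):
    frame = signal[start : start + frame_len]
    energy = float(np.mean(frame ** 2))
    if background is None:
        background = energy  # first frame seeds the estimate of "normal"
        continue
    if energy > 20 * background:
        # Flag frames whose energy is far above the running background estimate.
        print(f"attention needed at t={start / sample_rate:.2f}s (energy {energy:.4f})")
    else:
        # Slowly adapt the background estimate so gradual changes are not flagged.
        background = 0.95 * background + 0.05 * energy
```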

Taking a compartmentalised approach to environmentally situated robotics, by way of intelligent sensing, opens research and development up to the application of exciting emerging theories about modelling cognition and language. Finding a way to explore this kind of approach in parallel with the application of massively complex models trained on likewise massive amounts of data could be an important aspect of the forthcoming development of collaborative robots.

