The Autonomous Future: Humanoids vs. Self-Driving Cars - Part 3: The power (and business) of orchestration
This is part 3 of a three-part series of posts on the topic. You can find part 1 and part 2 here.
The previous post explored the two other key advantages humanoids have over self-driving cars, namely "Interacting at Human Speed" and "More Benign Failure Modes". Now let's look at potentially the most important advantage of them all:
3) Orchestration and Context Dependency: This final element is the most decisive advantage, and the one that will allow humanoids to get by on far less power- and cooling-hungry hardware than their four-wheeled automotive cousins. As humans, we tend to operate and interact within defined contexts, and we switch sequentially from context to context. We get up, we walk across the room, we brew a cup of coffee. But we don't do all this while juggling three colored balls and dictating poetry to our assistant at the same time. Yes, some contexts are more involved and complex than others, but taking the initial premise of our chores- and labor-based humanoid, these contexts still tend to be quite bounded. This is particularly true in the labor context. A bot working on an assembly line, much like its human counterpart, will perform only a limited number of operations at its station. This is precisely because the line was designed for humans, and humans tend to get better with experience and training, which involves repeating a sequence of process steps and, in the process, refining the skill with which those steps are performed. While this approach may be best for labor productivity, it crushes the human worker with boredom and likely leads to repetitive strain injuries over time. That is why we rotate human workers between a number of stations. Humanoids, on the other hand, don't need this level of intellectual engagement and, if correctly maintained, should be immune to repetitive strain and wear.
Of course, one might argue that autonomous cars should, by that definition, have it easier still, since they perform just one well-bounded task: driving. Unfortunately for us, this task is actually quite broad and complex, with a large number of high-stakes decision points constantly thrown at the system: reckless drivers, pedestrians and animals, roadworks and other temporary obstructions, all compounded by changing weather and lighting conditions, and almost any of these diverse challenges may manifest unannounced with sub-second speed. A self-driving car may need to avoid a child darting out between parked cars in the rain one moment, face a construction site missing road markings the next, and get out of the way of a police chase the moment after. Each of these sub-tasks can appear from second to second, and thus, with current neural-network-based technologies at least, models able to cope with these situations need to be large, well trained, and able to operate at low latency to avoid potentially catastrophic failures. All of these requirements quickly drive up the cost, power, and thus cooling demands of the hardware running them. And this is before we have even built situational awareness, at adequate latency, to give the vehicle context within which to make its path-finding decisions. Dedicated sensor hardware helps to deliver that situational awareness with the requisite redundancy and latency at a manageable cost, although some automakers insist on camera-only systems, thereby cramming a much more complex object detection, classification, and depth estimation pipeline into an already short decision-making window. This interaction between model complexity and decision speed is detrimental to the resource efficiency of the compute hardware running such a model.
Humanoids, on the other hand, just like real humans with their sets of learned skills, could quite feasibly switch between skill-specific "sub-models" to deal with their labor or chores. These specialist models can be swapped in and out on demand without any major impact on the overall viability of the humanoid. Walking across the room using the walking model, then making coffee using a coffee-making model/skill loaded upon arrival at the countertop, may take a second to initialize, but that is still faster than many humans initialize their own coffee-making routine as they fumble through the kitchen cupboard to find the beans. Simply put, a humanoid designed to operate in the human world at human speed, even if that is the speed of an arthritic pensioner, can still successfully accomplish most common tasks, from household chores to industrial labor, by simply swapping requisite specialist AI models in and out as needed. This works because our world and affairs are largely organized in such a way that most such activities can be broken down into concrete skills or steps that compact AI models can be optimized for. Indeed, the complexity of driving, and the challenge it poses for autonomous cars, is reflected in the fact that humans require considerable training and an official examination of their proficiency by means of a driving test, whereas folding clothes or making coffee is something we can generally accomplish after a quick, informal demonstration.
In my estimation, computer hardware and the upcoming generation of AI chips are sized in such a way that, with a well-designed system, the entire working memory representing the active model could be swapped out from onboard storage within a second or two. That leaves the final question: how does our humanoid know which model to load? This is where orchestration models come in. We are seeing these concepts just starting to be explored as part of agent-based AI, whereby an orchestration model picks smaller specialized sub-models to accomplish specific tasks. The orchestration model is what bounds the possible applications of a humanoid. If experiences with agent-based AI in the context of LLMs are anything to go by, a smaller orchestration model using several context-specific specialist models to accomplish a task is likely to be far less resource-intensive than a very large generalist model in the classic single-model approach. It is for this reason that I believe the experience in orchestration and result hand-off between small specialist sub-models, gained in the context of model swarms by the LLM-centric AI industry, will provide the conceptual foundations for making humanoids a reality sooner than autonomous cars. The former will, by the very nature of their approach and operating environment, simply have much more modest hardware and power demands, putting them on a viable economic and technical footing much sooner than fully autonomous, driverless vehicles.
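To make the idea concrete, here is a minimal sketch of the orchestration pattern described above: one specialist model resident in working memory at a time, swapped in from onboard storage only when the task context changes. All names, the skill registry, and the loading mechanics are my own illustrative assumptions, not any real robotics API.

```python
# Hypothetical registry mapping task contexts to specialist model files
# held on onboard storage. Purely illustrative.
SKILL_LIBRARY = {
    "walk": "models/walk.bin",
    "make_coffee": "models/make_coffee.bin",
    "fold_clothes": "models/fold_clothes.bin",
}

class Orchestrator:
    """Keeps exactly one specialist model in working memory at a time."""

    def __init__(self, library):
        self.library = library
        self.active_skill = None  # nothing loaded at start-up

    def _load(self, skill):
        # Stand-in for streaming weights from onboard storage into the
        # accelerator's working memory; assumed to take a second or two.
        return f"<weights from {self.library[skill]}>"

    def run(self, skill):
        if skill not in self.library:
            raise ValueError(f"no specialist model for skill: {skill}")
        if skill != self.active_skill:
            self._load(skill)      # context change: evict old model, load new
            self.active_skill = skill
        return f"executing {skill} with specialist model"

bot = Orchestrator(SKILL_LIBRARY)
bot.run("walk")         # loads the walking model
bot.run("make_coffee")  # swap occurs on context change
bot.run("make_coffee")  # no swap: model already resident
```

The key property is that memory and compute are sized for the largest single specialist model, not for a generalist model covering every skill at once.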
This raises the obvious question: could orchestration not be applied to self-driving? With current technology, and unlike in the more forgiving humanoid environment, context-triggered specialist model activation appears to be more of a challenge. There might, however, be room for a different kind of segmentation: models that work on a limited set of city blocks, with a handoff to the next location-specific model at the boundaries. Essentially, a vast patchwork quilt of geo-fenced driving domains. To the best of my knowledge, no company has yet officially endorsed this approach, although Tesla has been "caught" optimizing model variants for the common routes of Musk and some influencers, thereby perhaps indirectly validating its merit. Other "robo-taxi" providers such as Waymo and Cruise also limit operations to geo-fenced localities, presumably not just for regulatory reasons.
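The patchwork idea can be sketched in a few lines: carve the map into a coarse grid, give each tile its own location-specific driving model, and perform a model handoff whenever the vehicle crosses a tile boundary. The tile size, the grid scheme, and the handoff logic below are all assumptions chosen for illustration.

```python
def tile_for(lat, lon):
    """Map a GPS position to a grid tile by truncating to a 0.01-degree
    grid (very roughly city-block scale); each tile would own one
    location-specific driving model."""
    return (int(lat * 100), int(lon * 100))

def drive(route):
    """Walk a route of (lat, lon) points, recording the tiles whose
    models had to be swapped in along the way."""
    active_tile, handoffs = None, []
    for lat, lon in route:
        tile = tile_for(lat, lon)
        if tile != active_tile:  # boundary crossed: hand off to this tile's model
            handoffs.append(tile)
            active_tile = tile
    return handoffs

# Two nearby points in the same tile, then a jump into a neighboring tile.
route = [(52.5200, 13.4050), (52.5201, 13.4051), (52.5300, 13.4200)]
print(drive(route))  # one handoff per distinct tile entered
```

Even this toy version exposes the weakness discussed next: coverage is only as good as the set of tiles for which a stable model exists, and every boundary crossing is a potential failure point.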
Ultimately, though, this approach might still increase training demand dramatically for very similar tasks, with the availability of autonomy still dependent on which tiles in the patchwork have stable models available. Furthermore, it still requires faith that driving within a geographically confined context can truly be condensed down enough to enable fully intervention-free and safe operation at all times, something companies from Tesla to Waymo still struggle with massively.
While a missing city block or region significantly reduces the usefulness of a proclaimed universal autonomous vehicle, for humanoids, skill gaps present opportunities. Imagine, for contrast, a humanoid gaining new skills from specialist models each month, awing and delighting its owners: this month the bot can iron, next month it can fold clothes, and the month after it can prepare a selection of breakfast dishes. Humanoids thus mirror the virtuous cycle of a smartphone app ecosystem, whereas autonomous cars suffer from the same annoyances as a cell phone network with patchy coverage. Add to that the lower risk of killing or maiming humans with humanoids compared to autonomous cars, and the economic incentives and marketing opportunities start to skew heavily toward team human(oid).
These are my thoughts for now; the rest of this decade will reveal whether I was on the right track. This was a fairly long post, so I'll try to keep my next one more manageable; in it, I intend to look into whether humans, once humanoid helpers become pervasive, may become politically more restless.