May 20, 2026
WTF are Robotics Foundation Models!?
In 1966, a professor at MIT assigned a student a summer project: solve computer vision. One student. One summer. Solve seeing. Fifty years later, we were still working on it.

What finally cracked vision was not careful engineering. It was throwing enormous amounts of data at a neural network and watching what happened. That worked so well we spent the next decade doing the same thing for language, then code, then images, then video.

Now we are doing it for bodies.

In 2026, the labs are training foundation models on robot data. Arms, legs, wheels, logged at millions of hours. Then they pour the trained models into physical machines. Figure's humanoids are working factory lines. Google's arms generalize to kitchens they have never seen. Tesla's Optimus walks, with some generosity, upright. The method is the same method that cracked language. Ignore the domain experts, pour in data, let the model figure it out. It is working.

We are not far from the moment when a robot in your house is cheaper than a dishwasher and more useful than one.

A robotics foundation model is not a robot that thinks

Here is what everyone pictures: a brain in a metal body, reasoning about the world, deciding to pick up the cup.

That is not what this is.

A robotics foundation model is one neural network trained on the logged sensor readings and motor commands of many different robots, doing many different tasks. It takes in what the robot sees and the instruction it was given. It predicts the next action. Then the next one. The way a language model predicts the next token, this predicts the next motor command. There is no separate reasoning module. There is one network and a very large pile of other robots' experience.

The interesting word in that sentence is "different." Older robot learning trained one policy per robot per task. A model that could open a specific drawer on a specific arm in a specific lab. Move the drawer, retrain. Change the arm, retrain.
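The "predict the next action, then the next one" loop is easier to see in code than in prose. Here is a minimal sketch, and every name in it is made up for illustration: `toy_policy` is a hand-written stand-in for the trained network, `Observation` is a one-dimensional caricature of what a robot sees, and the "physics" is a single addition. A real model would take camera frames and emit joint commands, but the shape of the loop is the same.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    gripper_x: float  # where the toy gripper currently is
    cup_x: float      # where the cup is, as "seen" by the robot


def toy_policy(obs: Observation, instruction: str) -> float:
    """Hypothetical stand-in for the trained network: one observation plus
    one instruction in, one motor command out."""
    if "cup" not in instruction:
        return 0.0  # this toy only knows one skill
    error = obs.cup_x - obs.gripper_x
    return max(-0.1, min(0.1, error))  # cap the step size per tick


def rollout(policy, obs: Observation, instruction: str, steps: int = 50):
    """Predict the next action, apply it, observe again, repeat."""
    actions = []
    for _ in range(steps):
        action = policy(obs, instruction)
        actions.append(action)
        # Toy "physics": the command moves the gripper by exactly that amount.
        obs = Observation(obs.gripper_x + action, obs.cup_x)
    return obs, actions


final_obs, actions = rollout(toy_policy, Observation(0.0, 1.0), "move to the cup")
print(round(final_obs.gripper_x, 3), len(actions))
```

Notice that `rollout` has no planner and no reasoning step. It calls the policy, applies the command, and asks again. Swapping `toy_policy` for a network trained on many robots' logs changes what comes out of the call, not the loop around it.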
The robot was a craft object, hand-tuned, and the tuning did not transfer.

The thing everyone gets wrong is which part is hard. People assume the bottleneck is the body. Actuators, torque, hands, batteries, the wet mechanical reality of touching the world. Hardware is genuinely hard. But hardware is not what stalled this field for thirty years. What stalled it is the same thing that stalled language before the data era: there was no way to learn something general from a pile of narrow, incompatible examples. Every robot's data was an island.

The shift is that the data stopped being islands. Pool the logs from many robots and many labs into one training set, train one model on all of it, and the model gets better at every robot at once, including ones it was barely trained on. The body is now the easy part. The dataset is the moat. This is the same lesson "WTF is Context Engineering!?" landed for language, arriving in robotics about three years late.

Finding 1: one model on many robots beats specialists at their own tasks

The cleanest evidence is the Open X-Embodiment project. Thirty-four robotics labs pooled their data into one set: more than a million robot trajectories, 22 distinct robot types, hundreds of distinct skills. Then a single model, RT-X, was trained on the pile.

The specialist robots were the control group. Each had a hand-tuned policy, trained only on its own data, by the people who built it. RT-X, trained on everyone's data at once, beat those specialists at their own tasks by roughly 50 percent on average. Skills also transferred across bodies: a behavior present in one robot's data showed up on a different robot that had never performed it.

Read that twice. The generalist did not win because it was cleverer.
It won because it had seen more bodies do more things, and structure that looks specific to one robot turns out to be mostly shared.

Finding 2: the scaling curve bends the same way it did for language

The second finding is less a single result and more a shape. Across vision-language-action models, including Physical Intelligence's pi-zero and Google's RT-2 line, the relationship holds: more diverse robot data plus a bigger model produces better generalization, and it keeps producing it as you add more. No plateau yet. The curve looks like the language curve looked in 2020, before anyone believed it would keep going.

There is a second-order effect that makes this stranger. RT-2 was trained on web images and text alongside robot trajectories. It inherited concepts it never saw a robot do. Ask it to move an object toward "the extinct animal" and it picks the toy dinosaur, because the language half of the model knows what extinct means and the action half just has to point the arm. The robot benefits from data that was never about robots. That is the entire foundation-model bet, now closing a loop through a physical arm.

This matters because it converts robotics from a research problem into a logistics problem. If performance is a predictable function of data and scale, you do not need a conceptual breakthrough. You need a fleet, the pipes to log it, simulation to cover the cheap cases, and the compute to train on the pile. Those are purchase orders, not insights. Insights are hard to schedule. Purchase orders have lead times.

Finding 3: the data now comes from deployment, not demos

The flywheel only spins if real robots are running real hours. They are. Figure put humanoids on a commercial automotive line, doing repetitive material handling, on shift, not in a demo booth. The relevant number is not the polish of the demo video.
It is that every deployed unit logs operating hours on a real task, those hours become labeled training data for free, the next model trains on them, and the next model is the one that ships back to the same units overnight. The robot is collecting the dataset that trains its own replacement, and the replacement arrives as a software update.

This is why the company with the mediocre robot and the deployed fleet beats the company with the brilliant robot and the lab. One of them has a dataset that compounds weekly. The other has a paper. That loop did not exist in 2022. It is the single biggest change in the field, and it is almost invisible from the outside because it looks like nothing. It looks like a robot doing a boring job, badly, on purpose, while logging.

"Robots are nothing like LLMs"

The strongest objection, and a serious one: language is discrete and the internet handed us trillions of free tokens. The physical world is continuous, every sample costs a motor cycle and a chance of breaking something, and a wrong token is a typo while a wrong torque is a snapped wrist. The analogy to language is seductive and the economics are not the same.

All true. Robot data is orders of magnitude more expensive per useful sample than scraping text, and the failure modes are physical, not embarrassing.

But the claim does not require internet scale. It requires that pooled, cross-body data keeps bending the curve, and the evidence so far says it does. You do not need every robot in the world. You need fleet scale, simulation to cover the cheap cases, and real deployment to cover the expensive ones. Language needed the whole internet because no one owned a corpus. Robotics companies own their fleets. The corpus is being manufactured on purpose, by the same machines that will consume it. Expensive is not the same as impossible. It is just a moat with a price tag, which is the most durable kind.

What to actually do

Stop watching demo videos.
A humanoid folding a shirt in a lab tells you almost nothing. The question that predicts who wins is boring: who is logging the most real operating hours across the most different bodies, and who owns that data when the contract ends. Watch fleets and data rights, not choreography.

If you build anything physical, the strategic layer is no longer the controller. Controllers become a downloaded model, the way an operating system kernel became a thing you apt-get. The value moves to whoever owns the data pipeline and the safety contract around a model that now has actuators. Which is last post's argument with a body attached. A runtime with the keys to your files needed an immune system. A foundation model with the keys to a physical arm needs one more. "WTF is the OpenClaw Ecosystem!?" was about software agents without a trust layer. Bolt that argument onto a 30-kilogram machine moving at speed near a person and the missing contract layer stops being a CVE and starts being an incident report. That gap is the work we care about at Dube International, and it is the same gap, one domain over.

My advisor asked, again, when I am going to stop writing about other people's robots and finish the category theory chapter. I told him robots are just functors from intention to motion and watched him decide whether that counted. It did not. It was worth a try.

Here is the compression. For thirty years robotics was a craft, one robot at a time, and the craft did not transfer. The moment the data stopped being islands, the same boring recipe that ate vision and language started eating motion, and the only question left is who owns the dataset the robots are building for their own replacements.

Next week: what we get wrong about what these models actually understand.

See you next Wednesday 👋