Feb 4, 2026
WTF are World Models!?
Hey again! Week five of 2026. My advisor called my math theorems "trivial" this week. I spent two days on them. I think he is being passive-aggressive after I missed a paper deadline by 30 minutes. Also, I submitted my second conference abstract, which I expect to be rejected as brutally as the first.

Meanwhile, Yann LeCun quit Meta after 12 years, raised half a billion euros before launching a single product, and is telling the entire AI industry they've been building the wrong thing. Some people procrastinate. Others pivot entire fields.

So. The Godfather of Deep Learning, Turing Award winner, and architect of Meta's AI empire just bet his reputation that LLMs are a dead end. His new startup, AMI Labs, is raising €500 million at a €3 billion valuation. Before shipping anything.

Either he's right and the entire LLM paradigm is a detour, or he's spectacularly wrong and just torched the most prestigious AI career in history.

Let's talk about what he's building instead.

The "LLMs Are a Dead End" Argument

You've heard me call LLMs "fancy autocomplete" approximately 47 times in this newsletter. LeCun agrees, except he's not joking.

His thesis, stated bluntly at NVIDIA GTC: "LLMs are too limiting. Scaling them up will not allow us to reach AGI." Let's go over why he thinks that.

LLMs learn from text. The world isn't text.

Think about how a toddler learns that balls bounce. They don't read about it. They throw a ball, watch it bounce, throw it again. They build an internal model of how gravity and elasticity work through observation and interaction. By age 3, they can predict that a ball thrown at a wall will bounce back. No Wikipedia article required.

LLMs do the opposite. They read billions of words about physics without ever experiencing physics. They can write a perfect essay about gravity but can't predict what happens when you knock a glass off a table. They discuss spatial relationships without perceiving space. They reason about cause and effect without experiencing cause and effect.

It's like learning to swim by reading every book about swimming ever written. You'd ace the written exam. You'd drown in the pool.

The hallucination problem is structural, not fixable.

LeCun argues that hallucinations aren't a bug you can engineer away. They are a fundamental consequence of how LLMs work. Language is inherently non-deterministic. There are many valid ways to complete any sentence. That creative flexibility is great for writing poetry but catastrophic for safety-critical applications.

A model that generates plausible-sounding text will sometimes generate plausible-sounding wrong text. That's not a failure mode. That's the architecture working as designed.

The counter-argument: Dario Amodei, CEO of Anthropic, predicted we might have "a country of geniuses in a datacenter" as early as 2026 via scaled-up LLMs. OpenAI keeps shipping reasoning models that solve problems LLMs couldn't touch. Maybe LeCun is wrong. Maybe scale really is all you need.

This is the most interesting debate in AI right now. And both sides have hundreds of billions of dollars riding on the answer.

What World Models Actually Are

A world model is an AI system that learns an internal representation of how the physical world works... physics, causality, spatial relationships, object permanence. All from watching the world instead of reading about it.

LeCun's own explanation: "You can imagine a sequence of actions you might take, and your world model will allow you to predict what the effect of the sequence of actions will be on the world."

LLMs: Input text → Predict next token → Output text ("What comes after these words?")
World Models: Input sensory data → Learn physics → Predict next state of environment given actions ("What happens to this world if I do this thing?")

Your brain has a world model. Right now, you can close your eyes and imagine picking up your coffee mug. You can predict it'll be warm, that it has weight, that if you tilt it too far the coffee spills. You can mentally simulate knocking it off the desk and predict the crash. None of that requires language. It's a learned model of how physical reality behaves.

World models try to give AI that same capability. Instead of training on text, they train on video, images, sensor data, and interactions. Instead of predicting words, they predict future states of environments.

This enables things LLMs fundamentally can't do (toy sketch after the list):

Planning: Mentally simulate actions before taking them ("If I move the robot arm here, the box falls there")
Physics understanding: Objects have mass, momentum, spatial relationships
Cause-effect reasoning: Actions produce predictable consequences
Persistent memory: Maintaining a consistent state of a world across time
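To make LeCun's "imagine a sequence of actions and predict the effect" idea concrete, here's a toy planning loop in Python. To be clear about what's real here: nothing. The world_model function is a made-up stand-in for any learned dynamics model (none of the actual systems expose an interface this simple), and the random-shooting planner is just the simplest possible way to show "simulate first, act second."

```python
import numpy as np

# Hypothetical stand-in for a learned world model: given the current state and
# an action, predict the next state. Real systems (V-JEPA 2, Cosmos, ...) predict
# in learned representation spaces, not tiny raw vectors like this.
def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    return state + 0.1 * action  # toy "physics" for illustration only

def cost(state: np.ndarray, goal: np.ndarray) -> float:
    return float(np.linalg.norm(state - goal))  # how far are we from the goal?

def plan(state, goal, horizon=5, num_candidates=256, action_dim=2):
    """Random-shooting planner: imagine many action sequences,
    roll each one forward through the world model, keep the best."""
    best_cost, best_actions = float("inf"), None
    for _ in range(num_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        simulated = state
        for a in actions:                  # mental simulation, no real-world steps
            simulated = world_model(simulated, a)
        c = cost(simulated, goal)
        if c < best_cost:
            best_cost, best_actions = c, actions
    return best_actions[0]  # execute only the first action, then replan

state, goal = np.zeros(2), np.array([1.0, -0.5])
print("next action:", plan(state, goal))
```

The toy physics is beside the point. The loop is the point: the agent tries things in imagination, scores the outcomes, and only then touches the real world. Next-token predictors don't have that loop.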
The term "world models" was popularized by David Ha and Jürgen Schmidhuber's 2018 research paper of the same name, but LeCun's JEPA (Joint Embedding Predictive Architecture) research at Meta is what brought it into the mainstream conversation.

How V-JEPA Works (technical but bear w/ me)

Here's where it gets interesting. And slightly nerdy. But you'll survive.

Traditional AI vision models (like the ones that power image recognition) learn by predicting pixels. Show the model part of an image, ask it to fill in the missing pixels. This works, but it's incredibly wasteful. The model spends enormous compute predicting exact pixel values when what matters is the meaning of what's in the image.

V-JEPA (Video Joint Embedding Predictive Architecture) does something clever: it predicts in representation space, not pixel space.

Traditional approach:
Input: Video with masked regions
Task: Predict exact pixels of masked regions
Problem: Wastes compute on irrelevant details (exact shade of blue sky)

V-JEPA approach:
Input: Video with masked regions
Task: Predict abstract representation of masked regions
Result: Learns meaning, not pixels

Translation: Instead of asking "what color is that specific pixel?", V-JEPA asks "what concept goes here?" It learns that a ball trajectory implies gravity, that a hand reaching implies grasping, that objects behind other objects still exist. Abstract understanding, not pixel reconstruction.
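And here's what "predict in representation space" looks like as heavily simplified PyTorch. This is a conceptual sketch of the JEPA idea, not Meta's actual V-JEPA code: the encoders are toy linear layers, the masking is faked with random tensors, and the real thing uses vision transformers over video patches with an EMA-updated target encoder.

```python
import torch
import torch.nn as nn

dim = 64  # toy patch size; real V-JEPA runs ViT encoders over 3D video patches

context_encoder = nn.Linear(dim, dim)   # encodes the visible patches
target_encoder = nn.Linear(dim, dim)    # encodes the hidden patches (no gradients)
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

# Toy stand-ins for flattened pixel patches from one clip: some visible, some masked.
visible_pixels = torch.randn(16, dim)
masked_pixels = torch.randn(4, dim)

context = context_encoder(visible_pixels).mean(0, keepdim=True)  # summary of what's seen
prediction = predictor(context).expand(4, dim)                   # guess the hidden part

# Pixel-space objective (what V-JEPA avoids): reproduce every raw value,
# including irrelevant detail like the exact shade of the sky.
pixel_loss = nn.functional.mse_loss(prediction, masked_pixels)

# Representation-space objective (the JEPA idea): match the *embedding* of the
# hidden region, produced by a target encoder whose gradients are stopped.
# (In real V-JEPA the target encoder is an EMA copy of the context encoder.)
with torch.no_grad():
    target = target_encoder(masked_pixels)
jepa_loss = nn.functional.l1_loss(prediction, target)

print(f"pixel loss: {pixel_loss.item():.3f} | embedding loss: {jepa_loss.item():.3f}")
```

The payoff of the second objective: the predictor never has to care which exact shade of blue the sky is, only what kind of thing is hiding behind the mask, which is the argument for why it can get away with so much less data.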
V-JEPA 2 (released June 2025, while LeCun was still at Meta) is the version that proved this works at scale:

1.2 billion parameters (tiny compared to LLMs; GPT-5 is reportedly 2-5 trillion+)
Training Phase 1: 1M+ hours of internet video + 1M images, self-supervised (no labels, no human annotation)
Training Phase 2: just 62 hours of robot interaction data

Read that again. 62 hours. Not 62,000. Sixty-two.

The results:

77.3% accuracy on Something-Something v2 (motion understanding benchmark)
State-of-the-art on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100)
65-80% success rate on pick-and-place tasks in previously unseen environments
Zero-shot robot planning: the robot had never been in those rooms, never seen those objects

That last part is the breakthrough. A robot that can pick up objects it's never seen, in rooms it's never been in, after watching just 62 hours of other robots doing stuff. No environment-specific training. No task-specific reward engineering.

LeCun's comment: "We believe world models will usher a new era for robotics, enabling real-world AI agents to help with chores and physical tasks without needing astronomical amounts of robotic training data."

V-JEPA 2 is open source (MIT license). You can run it today.

The Competitive Landscape

LeCun isn't alone. Four major efforts are racing to build the AI that understands physics.

AMI Labs (LeCun's Bet)

Founded: December 19, 2025
CEO: Alexandre LeBrun (former Nabla CEO, worked under LeCun at Meta FAIR)
HQ: Paris
Raising: €500M at a €3B valuation, one of the largest pre-launch raises in AI history
Investors: Reportedly Cathay Innovation, Greycroft, Hiro Capital, 20VC, Bpifrance, among others
Status: Launched January 2026. No product yet.

LeCun is Executive Chairman, keeps his NYU professor position, and has a technical (not financial) partnership with Meta. The first application? Nabla (a healthcare AI company LeBrun previously led) gets first access to AMI's world model tech for FDA-certifiable medical AI.

The bull case: LeCun has a Turing Award, built one of the best AI labs on Earth, and has a decade of JEPA research. If anyone can make world models work commercially, it's him.

The bear case: €3 billion valuation with zero product. The last time AI hype reached this level, we got a lot of expensive pivots.

World Labs / Marble (Fei-Fei Li's Bet)

Fei-Fei Li, the Stanford professor who created ImageNet and basically kickstarted modern computer vision, has been working on what she calls "spatial intelligence."

Her company World Labs shipped Marble on November 12, 2025. It's the first commercial world model product you can actually use.

What it does: Generates persistent, navigable 3D worlds from text, images, video, or panoramas.

Input: "A cozy Japanese tea house at sunset"
Output: A full 3D environment you can walk through, export as meshes or Gaussian splats, and drop into Unreal Engine or Unity

Pricing:

Free: 4 generations/month (good for kicking the tires)
Standard ($20/mo): 12 generations
Pro ($35/mo): 25 generations + commercial license
Max ($95/mo): 75 generations

Key features: Chisel (hybrid 3D editor), multi-image prompting, world expansion from existing scenes, VR compatibility (Vision Pro, Quest 3).

The difference from competitors: Marble's worlds are persistent. You can revisit them, edit them, expand them. Other tools generate temporary environments that morph when you look away. Marble gives you actual 3D assets.

Use cases that actually exist: Game studios prototyping levels, VFX teams creating pre-viz, architects generating walkthroughs, VR developers building environments.

Raised $230M at a $1B valuation. Has a product. Has revenue. The most grounded player in this space (pun absolutely intended).

Google DeepMind / Genie 3 (Google's Bet)

Google's entry is the flashiest. Genie 3 is a real-time interactive world model: type a text prompt, get a navigable 3D world you can walk around in. Live. In real-time.
Announced: August 5, 2025 (TIME Best Inventions 2025)
Prototype launched: February 2, 2026 (two days ago) to Google AI Ultra subscribers in the US
Specs: 720p, 24fps, ~1 minute spatial memory window

You describe a world, and Genie 3 generates it in real-time. You can walk through it, interact with objects, even trigger events ("make it rain," "add a dragon"). It learns physics from observation: objects have weight, light casts shadows, water flows.

The impressive part: This isn't pre-rendered. It's generated on the fly. The AI is hallucinating an entire consistent 3D world in real-time at 24 frames per second.

The limitation: "Several minutes" of coherent interaction. Not hours. Think tech demo, not Minecraft. Multi-agent support is limited, text in generated worlds is garbled (sound familiar?), and it can't perfectly simulate real locations.

They've also tested it with their SIMA agent, an AI that can navigate and interact within Genie worlds. AI building worlds for other AI to explore. We're through the looking glass.

NVIDIA Cosmos (NVIDIA's Bet)

NVIDIA's approach is different: they built a platform, not a product.

Announced: January 7, 2025 at CES
Training data: 20 million hours of real-world video (human interactions, robotics, driving)
Latest: Cosmos-Predict2.5 (2B and 14B parameter checkpoints)
License: Open source (NVIDIA Open Model License)

Cosmos isn't one model; it's a family of models for different purposes:

Cosmos-Predict: Future state prediction ("what happens next in this video?")
Cosmos-Transfer: Spatial control and transformation
Cosmos-Reason: Physical reasoning combined with language

The partners list reads like a robotics Who's Who: Waabi, Wayve, Uber, 1X, Agile Robots, Figure AI, XPENG, Foretellix.

The use case is clear: autonomous vehicles and robotics. Need to test your self-driving car against 10,000 edge cases? Generate them with Cosmos instead of driving 10,000 actual miles. Need to train your warehouse robot? Simulate the warehouse.

NVIDIA is selling shovels in the world model gold rush. Smart play.

What You Can Actually Use Today

Let's be practical. What can you, a person reading this newsletter, actually do with world models right now?

If you're a developer/researcher:
V-JEPA 2 is on GitHub (MIT license). Clone it, run it, fine-tune it. Requires NVIDIA GPUs. (A minimal sketch of what that looks like follows these lists.)
NVIDIA Cosmos is open source. The 2B model runs on a single GPU.
Ollama doesn't support world models yet (this is still early).

If you're in gaming/VFX/architecture:
World Labs Marble is live. $20/month. Generate 3D worlds, export to your engine.
Genie 3 prototype just launched (Google AI Ultra subscription required, US only).

If you're in robotics/AV:
NVIDIA Cosmos is built for you. Synthetic data generation, scenario testing, edge case simulation.
V-JEPA 2 for robot planning research.

If you're a business person wondering whether to care:
Too early for production. These are 2025-2026 research breakthroughs, not 2026 production tools.
The exception: Marble for creative workflows and Cosmos for simulation. Those are usable now.
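If "clone it, run it" sounds abstract, here's roughly the shape of it in Python. Treat every identifier as an assumption: the hub entrypoint name is a placeholder I haven't verified, the exact loading call and preprocessing are documented in the facebookresearch/vjepa2 README, and you'd feed it a real preprocessed clip instead of random noise.

```python
import torch

# Hypothetical loading path: the entrypoint name below is a placeholder, not a
# verified identifier. Check the facebookresearch/vjepa2 README / hubconf for
# the exact model names (the repo also documents Hugging Face loading).
loaded = torch.hub.load("facebookresearch/vjepa2", "vjepa2_vit_large")
encoder = loaded[0] if isinstance(loaded, tuple) else loaded  # some entrypoints may return (encoder, predictor)
encoder.eval()

# Stand-in for a preprocessed clip: (batch, channels, frames, height, width) is a
# common layout for video ViTs, but the real frame count, resolution, and
# normalization come from the repo's own preprocessing, not from this sketch.
clip = torch.randn(1, 3, 16, 256, 256)

with torch.no_grad():
    features = encoder(clip)   # video embeddings to probe, cluster, or plan over

print(type(features), getattr(features, "shape", None))
```

Swap in real, properly preprocessed frames and those embeddings are the raw material for the probing and planning experiments behind the numbers quoted earlier.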
My Honest Assessment

Is this a genuine paradigm shift?

Maybe. The research results are impressive. V-JEPA 2 achieving zero-shot robot planning with 62 hours of training data is genuinely remarkable. Genie 3 generating consistent 3D worlds in real-time is wild. The progress in 12 months has been extraordinary.

But "impressive research" and "replaces LLMs" are very different claims.

The case for world models:

LLMs demonstrably struggle with spatial reasoning, physics, and planning
World models address these limitations architecturally, not through scale
Robotics and autonomous vehicles need physics understanding that text can't provide
V-JEPA 2's sample efficiency (62 hours!) suggests the approach is fundamentally sound

The case for "slow down":

A €3B valuation with no product is peak AI bubble territory
Evaluation is much harder than for text models. How do you benchmark "understands physics"?
Video data is massive, messy, and expensive to process
Current interaction times are minutes, not hours (Genie 3)
The gap between "picks up objects 65-80% of the time" and "reliable enough for production" is enormous
LLMs keep getting better at the reasoning tasks world models were supposed to own

Somewhere in the middle:

World models and LLMs aren't mutually exclusive. The future probably isn't "one or the other"... it's both. LLMs for language, reasoning, and text-based tasks. World models for physical understanding, robotics, and spatial reasoning. The most capable AI systems in 2027 will likely combine both.

LeCun might be right that LLMs alone won't reach AGI. He might be wrong that world models alone will either. The answer might be some unholy combination of both that nobody's built yet.

Is there a bubble?

Is €3B for a pre-launch world model startup justified? History says probably not... most pre-launch valuations at this level don't pan out. But history also said a GPU company couldn't become the most valuable company on Earth, so take that with appropriate salt.

For context: Black Forest Labs (image generation) raised at $4B. Quantexa (data intelligence) at $2.6B. The European AI ecosystem is throwing around serious money. AMI Labs fits the pattern but doesn't justify the valuation on fundamentals. It's a bet on LeCun's track record and vision.

The TL;DR

What: AI systems that learn how the physical world works by watching video, not reading text. They predict future states of environments and enable planning, physics reasoning, and spatial understanding.

The debate: LeCun says LLMs will never reach AGI because they lack physical grounding. Amodei says scale is all you need. Both sides have billions of dollars committed. Neither has been proven right yet.

The players:

AMI Labs (LeCun): €3B valuation, no product, biggest bet in the space
World Labs/Marble (Fei-Fei Li): First commercial product, $1B valuation, actually usable
Google Genie 3: Real-time interactive worlds, just launched prototype
NVIDIA Cosmos: Open source platform for robotics/AV, most practical for enterprise

The tech: V-JEPA 2 predicts in representation space instead of pixel space. Trained on 1M+ hours of video. Zero-shot robot planning with just 62 hours of interaction data. Open source.

The reality: Impressive research, early-stage products, not ready to replace LLMs for most use cases. The future is probably both paradigms working together, not one killing the other.

The move: If you're in robotics/AV/gaming → start experimenting now. If you're building text-based AI → keep building, but watch this space.

The AI industry spent 2023-2025 arguing about which LLM is 2% better on benchmarks. 2026 might be the year we start arguing about whether LLMs were the right approach at all.

Grab your popcorn. This debate is just getting started.

Next week: WTF is OpenClaw? (Or: Is It Clawdbot? Moltbot? OpenClaw? The AI Agent That Rebranded Twice Before I Could Write About It)
An Austrian developer named Peter Steinberger launched an open-source AI agent called Clawdbot in November 2025. Anthropic said "that sounds too much like Claude, please stop." So he renamed it Moltbot (because lobsters molt, get it?). Then he renamed it again to OpenClaw in January. The project has had more identity crises than a freshman philosophy major, and it's not even three months old.

Meanwhile, it racked up 145,000 GitHub stars, sold out Mac Minis globally, made Cloudflare's stock jump 14%, and spawned Moltbook, a social network where only AI agents can post and humans just... watch. Like a zoo, but the animals are made of math and they're arguing about productivity frameworks.

Security researchers are calling it "AutoGPT with more access and worse consequences." Malicious packages are already showing up. A one-click RCE exploit dropped days ago. People are giving it their passwords, email access, and full system permissions because a lobster emoji told them to.

We'll cover what OpenClaw actually does, why it went viral so fast, the security nightmare nobody's reading the fine print on, how it connects to every AI agent concept we covered back in September, and whether this is the moment agents finally go mainstream or the moment we learn why they shouldn't.

See you next Wednesday 🤞

pls subscribe