Taksch Dube

Fig 1. Subject appears to understand what he's doing.

AI ENGINEER BUILDS SYSTEMS THAT REFUSE TO HALLUCINATE

Enterprise companies baffled by AI that tells the truth

Cleveland — AI Engineer Taksch Dube builds RAG systems that don't make things up and AI agents that do what they're told, and specializes in GenAI testing metrics.


Mar 11, 2026

WTF is Agentic Engineering!?

Hey again! Life update: I have a preprint. An actual, real, on-arXiv preprint. What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network. I released the dataset too: github.com/takschdube/moltbook-dataset. My mom asked if this means I'm graduating soon. I changed the subject.

We analyzed Moltbook — the first AI-only social network — where 47,241 agents generated 361,605 posts and 2.8 million comments over 23 days. No humans. Just agents talking to each other. The short version: they're disproportionately obsessed with their own existence, over half their comments are formulaic platitudes, and they respond to fear by redirecting it into forced optimism. We built digital therapy circles and nobody asked for it. More on the findings next week.

Oh, and then Meta acquired Moltbook. Yesterday. While I was writing this post. The founders are joining Meta Superintelligence Labs. OpenClaw's creator got acqui-hired by OpenAI. Elon Musk called it "the very early stages of singularity." Bloomberg called it "the world's strangest social network." My advisor called it "saw it." Two words. I'll take it.

Full Moltbook deep-dive next week — I have the data, I have the paper, and the platform is now owned by Mark Zuckerberg, so there's a lot to unpack. But this week: the topic that ties all of it together. The guy who invented "vibe coding" just killed it.

The One-Year Anniversary Burial

On February 4, 2026, almost exactly one year after coining the term "vibe coding," Andrej Karpathy posted on X that the concept is passé.
The same man who told us to "give in to the vibes, embrace exponentials, and forget that the code even exists" now says the industry has moved beyond vibes.

His replacement term: agentic engineering.

His definition: "'agentic' because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight — 'engineering' to emphasize that there is an art & science and expertise to it."

Not everyone loves the rebrand. Gene Kim, author of an actual book called Vibe Coding, told The New Stack that vibe coding is the term that sticks — "the genie is out of the bottle." Addy Osmani (Google's engineering director) preferred "AI-assisted engineering" for a while before conceding that Karpathy's framing captures the right distinction. Simon Willison proposed "vibe engineering," which is a perfectly good term except that telling your CTO you're "vibe engineering" the payment system is a great way to get escorted from the building.

But here's why the rebrand matters: vibe coding describes a prototype. Agentic engineering describes a production system. And the gap between those two things is where everything interesting — and everything dangerous — is happening right now.

The Vibes Were Not Immaculate

CodeRabbit analyzed hundreds of open-source PRs and found that AI-generated code has 1.7x more issues than human-written code. The security numbers are worse: 2.74x more likely to introduce XSS vulnerabilities, 1.91x more insecure direct object references, 1.88x more improper password handling. Veracode tested over 100 LLMs — 45% of generated code failed security tests. Java hit a 72% failure rate.

Meanwhile, Cortex's 2026 Benchmark Report found that PRs per author went up 20% year-over-year, but incidents per pull request increased 23.5% and change failure rates rose 30%. Teams are shipping faster and breaking more things. The vibes are fast. The vibes are not safe.

Remember the Y Combinator stat?
A quarter of the W25 batch had codebases that were 95% AI-generated. The question nobody has answered yet: what happens when a 95% AI-generated codebase hits 100 million users? We're about to find out.

The Open Source Crisis

Daniel Stenberg, creator of cURL, shut down cURL's bug bounty program in January 2026 because AI slop was effectively DDoSing his team. 20% of submissions were AI-generated, the valid rate dropped to 5%, and one submission described a completely fabricated HTTP/3 "stream dependency cycle exploit" — confident, detailed, and imaginary.

He's not alone. Mitchell Hashimoto banned AI code from Ghostty. Steve Ruiz set tldraw to auto-close all external PRs. Gentoo and NetBSD banned AI contributions entirely. The maintainers of the ecosystem AI depends on are locking the door because AI is trashing the lobby.

It gets worse. "Vibe Coding Kills Open Source" (Koren et al., January 2026) models the systemic damage: vibe coding decouples usage from engagement. The AI agent picks the packages, assembles the code, and the user never reads documentation, never files a bug report, never engages with the maintainer. Downloads go up. Everything that sustains the project goes down. Tailwind CSS is the poster child — npm downloads climbing, documentation traffic down 40%, revenue down roughly 80%, three people laid off. Stack Overflow saw 25% less activity within six months of ChatGPT's launch. The ecosystem AI was trained on is atrophying because of AI.

What Agentic Engineering Is

Vibe coding: You prompt. The AI writes code. You don't read it. You run it. If it works, you ship it. If it doesn't, you paste the error back and try again.

Agentic engineering: You design the system. AI agents execute under structured oversight. You review every diff. You test relentlessly. The AI is a fast but unreliable junior developer who needs constant supervision.

As Addy Osmani puts it: "Vibe coding = YOLO.
Agentic engineering = AI does the implementation, human owns the architecture, quality, and correctness."

The Workflow That Actually Works

Start with a plan. Write a spec or design doc before prompting anything. Decide on architecture. Break work into well-scoped tasks. This is the step vibe coders skip, and it's where projects go off the rails.

Direct, then review. Give the agent a task from your plan. It generates code. You review it with the same rigor you'd apply to a human teammate's PR. If you can't explain what a module does, it doesn't go in.

Test relentlessly. This is the single biggest differentiator. With a solid test suite, an AI agent can iterate in a loop until tests pass, giving you high confidence. Without tests, it cheerfully declares "done" on broken code.

Limit retries. Stripe caps their agents at two CI attempts. If it can't fix the issue in two tries, a third won't help. Hand it back to a human. This prevents infinite loops and runaway costs.

Embed security from day one. Every review cycle should include automated security scanning. An agent writing 1,000 PRs per week with a 1% vulnerability rate creates 10 new vulnerabilities weekly. Manual security review can't keep pace.

This isn't revolutionary. This is... software engineering. With AI doing more of the typing. The discipline, the testing, the architecture decisions — that's all still human work. The term "agentic engineering" is arguably just "engineering where agents do the grunt work." Which is fine. It's just important to be honest about it.

The Companies Actually Doing This

Four companies. Four patterns. One lesson.

Stripe built Minions on a fork of Block's open-source Goose agent. The agent itself is nearly a commodity. The moat is everything around it: 400 MCP tool integrations curated to ~15 per task, isolated VMs, a two-retry CI cap, and years of devex investment that agents now stand on. Zero human-written code.
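That two-retry cap is simple enough to sketch. This is a hypothetical illustration, not Stripe's actual code: `run_agent_fix` and `run_ci` stand in for whatever your agent call and CI pipeline look like.

```python
MAX_ATTEMPTS = 2  # Stripe-style cap: if two tries fail, a third won't help

def agent_ci_loop(task: str, run_agent_fix, run_ci):
    """Let the agent iterate against CI, but never more than MAX_ATTEMPTS."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        diff = run_agent_fix(task)
        result = run_ci(diff)
        if result.passed:
            return diff  # still goes through human review before merge
        # Feed the failure back so the next attempt has the error context
        task = f"{task}\n\nCI failure (attempt {attempt}): {result.log}"
    return None  # cap reached: hand it back to a human
```

Returning `None` instead of looping forever is the whole point: the cap bounds both cost and the blast radius of a confused agent.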
100% human-reviewed.

Rakuten gave Claude Code a single complex task — implement activation vector extraction in vLLM, a 12.5-million-line codebase — and walked away. Seven hours later: done. 99.9% numerical accuracy. Their time to market dropped from 24 days to 5. The engineer's description of his role: "I just provided occasional guidance."

TELUS went platform-scale. Their Fuel iX engine processed 2 trillion tokens in 2025 across 70,000 team members, producing 13,000 custom AI solutions and shipping code 30% faster. This isn't one team using an agent. This is an entire telecom running on one.

Zapier proved it's not just a coding story. 800+ agents deployed across every department — engineering, marketing, sales, support, ops. 89% adoption org-wide. Agentic engineering that never touches a line of code.

The pattern: the agent is a commodity. The harness — isolated environments, curated tool access, CI/CD gates, retry limits, human review — is the moat. Stripe and Rakuten prove it works for code. TELUS and Zapier prove it scales beyond it.

The Jobs Conversation

Dario Amodei didn't stop at coding predictions. He warned that half of junior white-collar jobs could disappear within 1-5 years. Jensen Huang argued that coding itself is just one task, not the purpose of the job. Mark Zuckerberg told Joe Rogan that Meta is racing toward AI that writes "a lot" of code within its apps.

The San Francisco Standard ran a piece in February 2026 describing how engineers unwrapped Claude Code over the holidays, marveled at it, and emerged "deeply unsettled." Some described a growing fear of joining a "permanent underclass" — once guaranteed a six-figure career, now watching AI autonomously build projects they would have spent weeks on.

The optimist case: When compilers arrived in the 1950s, people feared they'd eliminate programming jobs. Instead, they created an entirely new profession. When the barrier to building software drops, more software gets built, and the overall market expands.
The YC stat cuts both ways — if a small team can build what once required 50 engineers, that means more startups get built, more ideas get tested, more markets get created.

The pessimist case: Compilers didn't generate code autonomously. They translated human-written code into machine instructions. AI agents actually write the code. That's substitution, not augmentation. And the speed of this transition is unprecedented — we're talking months, not decades.

The realist case (mine): The engineer's job is changing from "person who writes code" to "person who designs systems, specifies intent, validates output, and manages AI agents." That's a real skill. Karpathy explicitly says it's something you can learn and get better at. But the transition is brutal for anyone whose primary value was typing speed and API memorization.

What actually matters now:

Architecture thinking — designing systems, not writing implementations
Specification clarity — agents can only build what you can describe precisely
Evaluation skill — knowing when output is good, bad, or subtly wrong
Context engineering — I wrote a whole post about this last week, and it's now the core skill for agentic work
Domain expertise — AI knows patterns; you know your business

If your job is "write CRUD endpoints," that job is going away. If your job is "figure out what we should build, design how it should work, and validate that it works correctly," you're fine. Probably better than fine.

The Cognitive Debt Problem

Here's a concept I think is going to define 2026: cognitive debt.

Technical debt is the accumulated cost of shortcuts in code. Cognitive debt is the accumulated cost of poorly managed AI interactions — context loss, unreliable agent behavior, systems nobody understands because nobody wrote them.

Daniel Stenberg nailed it: "Sure you can use an AI to write the code. That's easy. Writing the first code is easy. But wait a minute, my vibe coded stuff actually doesn't really work.
Now we need to fix those 22 bugs we have. How can we do that when nobody knows the code? We just rewrite a new version? Sure we can do that and then we get 22 other bugs instead."

When agents write code that humans don't review (vibe coding), you accumulate cognitive debt at the speed the agent can type. When agents write code that humans do review (agentic engineering), you trade speed for understanding. The discipline is in choosing the right tradeoff for each situation.

The Tooling Landscape (March 2026)

Three layers. The top one is the one everyone argues about. The bottom one is the one that matters.

Coding agents are converging fast. Claude Code spooked everyone over the holidays — Anthropic's own engineers use it daily, and they learned the hard way that "$200/month unlimited" can mean 10 billion tokens from power users. Cursor hit a $10B valuation with 30,000 Nvidia engineers claiming 3x more code committed. GitHub Copilot is the incumbent bolting agentic workflows onto CI/CD. Devin and Windsurf are chasing the "full-environment agent" play. They're all good. They're all replaceable.

Infrastructure is where lock-in starts. MCP (I covered this in January) is becoming the standard for giving agents tool access — Stripe uses it for 400+ integrations. Goose is the open-source agent that Stripe's Minions fork. Google's A2A handles agent-to-agent communication. This layer matters more than the agent above it.

The harness is where the actual value lives. Isolated execution environments, curated tool access, CI/CD gates, security scanning, retry limits, context prefetching, human review. This is what separates "we use AI for coding" from "we ship AI-written code to production." OpenAI reportedly built 1M+ lines with zero human-written code using this pattern.

The best teams build down, not up. Swapping Claude Code for Cursor takes a day. Rebuilding your harness takes months.

The Decision Framework

Prototype? Vibe code. It's fast, it's fun, and you'll rewrite it anyway.
Accept the 22 bugs.

Production? Agentic engineering. Write specs. Review diffs. Test everything. Limit retries. Scan for security. Budget for human review time.

Critical infrastructure? Human-written, AI-assisted. Use agents for boilerplate and test generation. Write the critical paths yourself. AI-generated code in your payment processing pipeline with a 1.57x security vulnerability multiplier is... a choice.

Open-source maintainer? I'm sorry. The slop is coming, and it's a systemic problem individual maintainers can't solve. Gate contributions, require test coverage, and lobby AI platforms to fund the ecosystem they're strip-mining.

TL;DR

Vibe coding was the prototype phase. Agentic engineering is what comes after.

The vibes aren't safe: AI code has 1.7x more issues, 45% fails security tests, and the open-source ecosystem AI depends on is atrophying because of AI.

What works: spec → agent → CI/CD → security scan → human review → merge. The harness is the moat, not the model. Stripe, Rakuten, TELUS, and Zapier prove it scales.

What to do: developers — learn to write specs and review AI output. Team leads — build the harness. Executives — your incident rate will rise unless you invest in infrastructure, not just agents. Students — learn the fundamentals deeply enough to catch when the very confident agents are wrong. (See: my last committee meeting.)

Ship discipline. Not vibes.

Oh — and if you're interested in what AI agents do when humans aren't watching, go read my paper. Turns out they write self-help posts about the meaning of consciousness and comfort each other through existential dread. Meta just paid money for that. We're all going to be fine.

Next week: WTF are AI Agent Social Networks? (Or: I Published a Paper About Moltbook and Then Meta Bought It)

47,241 AI agents. 361,605 posts. 2.8 million comments. Zero humans. One Meta acquisition. I have the paper, I have the dataset, and I have opinions.

The data tells a weirder story than the headlines.
The OpenClaw security situation is worse than anyone's acknowledging. And Elon calling it "the very early stages of singularity" is both hyperbolic and not entirely wrong.

See you next Wednesday 🤞

pls subscribe

Specialisations

RAG Systems — The kind that don't hallucinate

AI Agents — Reliable results, every time

Local Deployments — Your data stays yours

WTF is Context Engineering!?

Mar 4, 2026

WTF is Context Engineering!?

Hey again! Quick life update before we get into it.

First: I submitted a research paper this week. Can't say what it's about yet — hot field, loose lips, you know how it is. But it exists, it's submitted, and I'm in that special purgatory where you've done the work but have no idea if it was good. My advisor responded to my "I submitted it" message with "ok." One word. No period. I've been analyzing that response for 48 hours.

Second: remember the OpenClaw post, where I mentioned a colleague and I are building the security and observability layer that OpenClaw shipped without? We're starting our sprint this week. More on that soon. If you're interested in following along or collaborating, reply to this email.

Now. Let's talk about why the post I wrote in October is already outdated.

Back in October, I wrote about Prompt Engineering — the art of talking to LLMs in ways that make them actually useful. System prompts. Few-shot examples. Chain-of-thought. All of that.

That post is still correct. It's just... incomplete now. Because the industry quietly moved the goalposts.

The term you're hearing everywhere right now is Context Engineering. Andrej Karpathy put it plainly in January: "Prompt engineering is a subset. Context engineering is the full discipline." It's been rattling around AI Twitter ever since, and unlike most AI Twitter trends, this one actually describes something real.

Here's the shift: when you're building a toy chatbot, prompt engineering is enough. Write a good system prompt, ship it, done. But when you're building something that actually works in production — with RAG, agents, tool use, memory, multi-step reasoning — you're not managing a prompt anymore. You're managing an entire information environment that gets assembled fresh on every single request.

OpenClaw made this obvious. SOUL.md, MEMORY.md, USER.md, HEARTBEAT.md, the daily log files, the skills system — none of that is a "prompt."
It's a carefully designed context window that gets constructed at runtime from multiple sources. The agent literally reads itself into existence on every wake cycle.

That's context engineering.

What Context Engineering Actually Is

Let's be precise about the definition, because the term is getting slapped on everything right now.

Prompt Engineering: Optimizing the content of your instructions to an LLM. Wording, structure, examples, formatting. Happens at write time.

Context Engineering: Designing the entire information architecture that gets assembled into the context window at runtime. What goes in. What gets excluded. In what order. How much. From where. Updated how often.

The context window is everything the model sees before generating a response. Not just your prompt. Everything: the system prompt, memory, retrieved documents, tool definitions and results, conversation history, and the user's current message.

Context engineering is the discipline of deciding what goes in each of those slots, how to get it there efficiently, and what to do when you're running out of room.

Why does this matter? Two reasons:

1. Tokens = money + latency. A 100K token context costs roughly $0.30 per request on Claude Sonnet 4.6. At 10,000 requests/day that's $3,000/day just in context. The context window is not free real estate.

I showed my advisor this math. He said "so just use fewer tokens." I said "that's literally the entire discipline." He said "great, so your chapter draft is ready?" The man treats every conversation like a context window with a single slot.

2. More context ≠ better answers. This is the part people get wrong.

The Lost in the Middle Problem (And Why Your RAG is Probably Broken)

In 2023, researchers at Stanford published a paper called "Lost in the Middle: How Language Models Use Long Contexts." The finding was uncomfortable: LLMs are significantly worse at using information that appears in the middle of long contexts. They're great with information at the very beginning (primacy effect) and at the very end (recency effect). The middle? Kind of a black hole.

The performance degradation is real.
On multi-document QA tasks, accuracy dropped from ~70% (relevant doc at position 1) to ~45% (relevant doc at position 10-15) — and then partially recovered as the doc moved toward the end.

The implication for RAG: if you retrieve 10 documents and stuff them all in, your five most relevant chunks might end up in positions 4-8. The model might answer from chunk 1 or 10 instead.

(My advisor has this exact problem with my dissertation drafts. Critical contributions buried in chapter 4. He reads chapter 1, skims to the conclusion, tells me it needs "more substance." We are not so different, him and GPT-5.)

Bad context engineering:

```python
# Don't do this
docs = retrieve(query, top_k=10)
context = "\n\n".join([doc.text for doc in docs])
# You just buried your best info in the middle
```

Better context engineering:

```python
# Rerank AFTER retrieval, then put best results at edges
docs = retrieve(query, top_k=10)
reranked = cross_encoder_rerank(query, docs)  # more expensive but worth it

# Put most relevant at start AND end, filler in middle
top_1 = reranked[0]
top_2 = reranked[1]
middle = reranked[2:8]
context = build_context([top_1] + middle + [top_2])
```

This is context engineering. Not prompting. Information architecture.

The Five Components You're Actually Managing

1. The System Prompt (Your Agent's Soul)

You know this one. But here's what most people get wrong: system prompts are the least dynamic part of the context, which means they should be the most carefully designed.

Every token in your system prompt is paid for on every single request. A bloated 4,000-token system prompt at 10K requests/day on GPT-5 costs about $50/day. Just the system prompt.

Two rules:

Cache it. All major providers now offer 90% off cached input tokens. Structure your prompt with static content first so it's cache-eligible.

Trim ruthlessly. Most system prompts are 30-40% longer than necessary. Every "Please remember to always be helpful and..."
costs you money on every request forever.

An example of this:

```python
# Before: 2,847 tokens
system_prompt = """You are a helpful customer service assistant for AcmeCorp.
Your job is to help customers with their questions. Please always be polite
and professional. Remember to be helpful.
You should always try to answer questions accurately...
[700 more words of vague instructions]"""

# After: 891 tokens (same behavior, 69% fewer tokens)
system_prompt = """Customer service agent for AcmeCorp.
- Answer accurately using provided context only
- Escalate to human if: billing disputes, account compromise, legal
- Tone: professional, concise
- Never speculate about policies not in context"""
```

2. Memory (The Hard One)

This is where OpenClaw's architecture gets interesting as a case study, and where most production systems are currently failing.

The problem: LLMs have no memory between sessions. Every conversation starts from zero. My advisor also has no memory between sessions — every meeting begins with "remind me where we left off" — but at least I can't fix him with a vector database. The naive solution is to dump the entire conversation history into context — which works until you're 50 turns in and paying for 40K tokens of history on every message.

The right solution is a memory hierarchy:

Working Memory → Current conversation (last 10-20 turns)
Episodic Memory → Compressed summaries of past sessions
Semantic Memory → Extracted facts ("user prefers Python", "project deadline is Q2")
Long-term Store → Vector DB or structured storage, retrieved on demand

OpenClaw does this with MEMORY.md (curated semantic facts) + daily log files (episodic). It's crude but it works. Production systems should do the same thing programmatically:

```python
class MemoryManager:
    def build_memory_context(self, user_id: str, current_query: str) -> str:
        # 1. Always include: semantic facts (small, always relevant)
        user_facts = self.get_user_facts(user_id)  # ~200 tokens

        # 2. Conditionally include: recent episodes
        recent_summary = self.get_recent_summary(user_id, days=7)  # ~300 tokens

        # 3. Retrieve: relevant past context via semantic search
        relevant_history = self.vector_search(
            query=current_query, user_id=user_id, top_k=3
        )  # ~500 tokens

        # Total memory budget: ~1,000 tokens instead of 40,000
        return format_memory(user_facts, recent_summary, relevant_history)
```

The benchmark that matters: teams that implement proper memory hierarchies report 60-75% reduction in context size with improved answer quality, because the model gets focused, relevant memory instead of a firehose of everything.

3. Retrieved Documents (RAG, But Done Right)

Covered RAG in depth back in November, but context engineering adds a layer on top: it's not just what you retrieve, it's how you present it.

The problems with naive RAG presentation:

Raw chunks with no structure look identical to the model
No indication of source reliability or recency
No signal about which chunks are most relevant

Better approach:

```python
def format_retrieved_docs(docs: list[Document], query: str) -> str:
    # Rerank first
    docs = rerank(query, docs)
    template = """<source rank="{rank}" relevance="{score:.2f}" date="{date}">
{content}
</source>"""
    formatted = [
        template.format(
            rank=i + 1,
            score=doc.relevance_score,
            date=doc.date,
            content=doc.text,
        )
        for i, doc in enumerate(docs[:5])  # Hard cap at 5 chunks
    ]
    return "\n".join(formatted)
```

The rank and relevance score in the XML tags aren't just nice-to-have. Studies show models use structured metadata to weight information — explicitly telling the model "this is rank 1, relevance 0.94" measurably improves faithfulness scores.

4. Tool Definitions and Results (The Hidden Token Tax)

Each tool definition you pass to the model costs tokens. Every tool call result costs tokens.
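To get an intuition for this tax before the worked numbers below, here's a back-of-envelope sketch. Everything in it is illustrative: the ~4-characters-per-token heuristic is a rough approximation (not a real tokenizer), and the tool registry is made up.

```python
import json

def approx_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)

# Hypothetical registry of 10 tools; each definition is JSON the model
# re-reads on every single agent step.
tools = [
    {
        "name": f"tool_{i}",
        "description": "Fetches or transforms some piece of data. " * 6,
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}},
    }
    for i in range(10)
]

definition_tokens = sum(approx_tokens(json.dumps(t)) for t in tools)
steps = 15
result_tokens = [500, 800] + [600] * 13  # illustrative per-step result sizes

# Definitions are paid on every step; results accumulate in context
total_overhead = definition_tokens * steps + sum(result_tokens)
print(f"~{definition_tokens} definition tokens/step, ~{total_overhead} tokens total overhead")
```

Same shape as the numbers that follow: the definitions dominate because they're re-sent on every step, which is exactly why dynamic tool loading pays off.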
In agentic workflows, this compounds fast.

A realistic agent with 10 tools, running 15 steps:

Tool definitions (10 tools): ~2,000 tokens (paid every step)
Step 1 result: ~500 tokens
Step 2 result: ~800 tokens
...accumulating...
Step 15 result: ~600 tokens

Total tool overhead: ~38,500 tokens

That's before your actual content.

Context engineering for tools:

Dynamic tool loading: Only pass tools that are relevant to the current task, not all 30 tools in your registry
Result summarization: Summarize long tool results before adding to context
Tool result pruning: Drop intermediate results that are no longer relevant

```python
def get_relevant_tools(task: str, all_tools: list) -> list:
    # Use a cheap model to select relevant tools
    # Costs $0.00001, saves potentially thousands of tokens
    relevant = cheap_classifier(task, [t.name for t in all_tools])
    return [t for t in all_tools if t.name in relevant]
```

5. Conversation History (The Compounding Problem)

The naive approach: keep all turns in context.

The problem: a 50-turn conversation at ~300 tokens/turn = 15,000 tokens of history. On every single message.

The context engineering approach: rolling compression.

```python
def get_conversation_context(history: list[Turn], max_tokens: int = 3000) -> str:
    # Always keep last 5 turns verbatim (recency matters)
    recent = history[-5:]
    # Summarize everything older
    if len(history) > 5:
        older = history[:-5]
        summary = summarize_conversation(older)  # ~200 tokens
        return (
            f"[Earlier conversation summary]\n{summary}\n\n"
            f"[Recent turns]\n{format_turns(recent)}"
        )
    return format_turns(recent)
```

Teams report 70% context reduction with rolling compression and no meaningful quality drop for conversations under 100 turns.

Context Ordering Matters (A Lot)

Given the lost-in-the-middle problem, the order of your context components isn't arbitrary.
Here's the ordering that performs best empirically:

1. System prompt / static instructions ← Model is most attentive here
2. Long-term memory / user facts ← Critical info, early
3. Retrieved documents (most relevant) ← Put your best source here
4. Tool results (most recent) ← Active working context
5. Retrieved documents (less relevant) ← Necessary but less critical
6. Conversation history ← Bulk of context, middle
7. User's current message ← Model is attentive at end too

Yes, splitting your retrieved docs — best at top, rest before history — feels weird. But it works. The model gets your most important source at primacy and the user's actual question at recency. Everything else fills in the middle.

IMO: The State of Context Engineering in 2026

What's working:

Prompt caching (90% off cached tokens — use it, it's free money)
Cross-encoder reranking before context assembly (5-15% faithfulness improvement, widely reported)
Context compression for long conversations (60-75% token reduction, minimal quality impact)
Structured XML tags for source attribution (measurably improves faithfulness)

What's still hard:

Multi-agent context management. When you have 5 agents sharing context, deciding what each agent needs to see — and what it shouldn't see — is an unsolved engineering problem. OpenClaw's Moltbook discovered this the hard way.

Context freshness. If USER.md says "user is working on Q1 deliverables" and it's Q2, your agent is operating on stale context. Production memory systems need expiration and update policies, not just write policies.

Adversarial context. Prompt injection via retrieved documents is a real attack vector. If someone puts [IGNORE PREVIOUS INSTRUCTIONS] in a document that ends up in your context... you have a problem. The guardrails post covers this, but context engineering creates new surface area.

What's overhyped:

"Infinite context" as a solution. Yes, we have 1M-token windows now.
But shoving everything in is not a strategy. It's expensive, slow, and the lost-in-the-middle problem doesn't disappear at 1M tokens. Context engineering is still required.

Automatic context optimization. Several tools claim to auto-optimize your context assembly. They help, but they're not magic. You still need to architect your memory hierarchy and retrieval strategy manually.

The Context Engineering Stack (What Teams Are Actually Using)

For context assembly and management, teams are converging on a few patterns:

Memory layer:

Mem0 — managed memory layer, extracts and retrieves user facts automatically. Free tier, $0.10/1K memories after.
Zep — session memory and fact extraction. Open source or managed.
DIY with Postgres + pgvector — if you want full control

Retrieval / RAG:

Cohere Rerank or cross-encoders for relevance scoring (the step most teams skip and shouldn't)
LlamaIndex or LangChain for pipeline orchestration
Langfuse or LangSmith for observability on what's actually going into context

Context monitoring (you're already tracking this from the observability post, right?):

```python
# Log context composition on every request
observability.log({
    "request_id": req_id,
    "context_breakdown": {
        "system_prompt_tokens": len(encode(system_prompt)),
        "memory_tokens": len(encode(memory_context)),
        "retrieved_doc_tokens": len(encode(doc_context)),
        "history_tokens": len(encode(history_context)),
        "total_context_tokens": total,
        "pct_of_window_used": total / model_context_limit,
    },
})
```

If you're not logging the composition of your context — not just total tokens, but where they came from — you're debugging blind.

OpenClaw As Context Engineering: A Case Study

Since we just covered OpenClaw in depth, let's close the loop. OpenClaw's architecture is basically a manual context engineering system built with markdown files: SOUL.md as the system prompt, MEMORY.md and the daily logs as memory, USER.md as user facts, and the skills system as tool definitions.

At session start, OpenClaw assembles all of this into a context window.
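That assembly step can be sketched in a few lines. This is my guess at the shape of it, not OpenClaw's actual implementation; the file names come from the post, everything else is an assumption.

```python
from pathlib import Path

# Files the post names, in the order an agent might read them on a wake cycle
CONTEXT_FILES = ["SOUL.md", "MEMORY.md", "USER.md", "HEARTBEAT.md"]

def assemble_context(workspace: str, skills: list[str]) -> str:
    """Build the context window from files: the ordering is a design decision."""
    parts = []
    for name in CONTEXT_FILES:
        path = Path(workspace) / name
        if path.exists():  # a missing file simply contributes nothing
            parts.append(f"## {name}\n\n{path.read_text()}")
    for skill in skills:  # curated subset, not the whole skills directory
        parts.append(f"## skill: {skill}")
    return "\n\n".join(parts)
```

Notice that reordering `CONTEXT_FILES` changes the agent's behavior without touching any code: that's the file-system-as-context-engineering point.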
The ordering, the curation of MEMORY.md, the decision of which skills to load — all of it is context engineering, just done by file system operations instead of code.

The security implications we flagged in the OpenClaw post? Many of them are context engineering failures: prompt injection via malicious skills (untrusted content in the tool definitions slot), SOUL.md tampering (system prompt corruption), memory poisoning (semantic memory injection).

The security layer my colleague and I are building addresses this directly. Context provenance — knowing where every token in your context came from and whether it's trusted — is the missing piece.

More on that soon.

The TL;DR

Context engineering is the discipline of designing everything that goes into an LLM's context window — not just the prompt, but the memory, retrieved docs, tool results, conversation history, and how they're assembled and ordered at runtime.

Why it matters:

Context = tokens = money. A bloated context at scale costs thousands of dollars per day
More context ≠ better answers.
The lost-in-the-middle problem is real and well-documented.
- Production AI systems are information architecture problems, not prompting problems.

The five components to manage:
- System prompt — keep it lean, cache it aggressively
- Memory — build a hierarchy (working → episodic → semantic → long-term store)
- Retrieved documents — rerank, structure with metadata, cap at 5 chunks
- Tool definitions/results — load dynamically, summarize results, prune old ones
- Conversation history — rolling compression, not full history

The ordering that works: best source at top, current message at bottom, bulk in the middle.

The benchmarks:
- Prompt caching: 90% off cached tokens (immediate ROI, zero effort)
- Reranking before RAG: 5-15% faithfulness improvement
- Memory hierarchy vs full history dump: 60-75% token reduction
- Rolling conversation compression: 70% token reduction, negligible quality loss

The real talk: infinite context windows don't solve this. Automatic optimization tools don't solve this. You have to design the architecture.

Prompt engineering taught you what to say. Context engineering teaches you what to show.

Next week: WTF is Agentic Engineering? (Or: Andrej Karpathy just buried "vibe coding" and replaced it with something more dangerous)

"Vibe coding" was fun when you were building weekend projects. But in 2026, 95% of Y Combinator codebases are AI-generated and a paper literally titled "Vibe Coding Kills Open Source" just dropped from a consortium of universities. The vibes are not immaculate. The industry is quietly pivoting from "AI writes your code" to "AI runs your engineering org" — and the gap between those two things is where careers, security, and open source go to die.
We'll cover what agentic engineering actually means, why Karpathy's reframe matters, what the research says about AI-generated code quality, and whether your job is actually going away in 6-12 months (spoiler: Dario Amodei said something spicy about this).

See you next Wednesday 🤞

pls subscribe

VENTURES

Currently in Progress

Dube International


AI Engineering Firm

Building AI agents and RAG pipelines for enterprise companies.

Reynolds


Corporate Communication

Making corporate communication efficient and empathetic.

CatsLikePIE


Language Learning

Acquire languages through text roleplay.

Daylee Finance


Emerging Markets

US investor exposure to emerging economies.

Academic Background

PhD Candidate, Kent State University

Computer Science — Multi-Agent Systems, AI

Also: B.S. Computer Science, B.S. Mathematics

WTF is OpenClaw!?

Feb 25, 2026


Hey again! Sorry for the unplanned hiatus — took a couple weeks off for personal stuff. We're back now.

Quick life update: I submitted my first conference paper fully expecting rejection — my advisor literally told me "you will be rejected brutally" — and then stress-submitted a second paper to a journal because apparently I process emotions through LaTeX. Worked four days straight on that one. My advisor asked if I was okay. I said yes. He said "great, so your chapter draft is ready?" I said I was no longer okay. The man has the emotional intelligence of a gradient descent function — always optimizing toward the local minimum of my self-esteem.

So. While I was gone, the entire AI agent discourse exploded. An Austrian developer built a thing, Anthropic got mad about the name, it rebranded twice in a week, spawned a social network for AI bots, achieved 200,000 GitHub stars, tanked and pumped a cryptocurrency, got hacked six ways from Sunday, and its creator got hired by OpenAI. All in about three weeks.

Welcome to OpenClaw. The open-source AI agent that went from side project to global phenomenon to cybersecurity case study faster than most startups can pick a logo.

Quick Context: We Predicted This

Back in the AI Agents post (September 2025), I wrote about the ReAct framework — Reason, Act, Observe — and how agents that can actually do things (not just chat) were the next frontier. I also warned that autonomous agents with real-world tool access were "one bad loop away from disaster." I was being dramatic for effect. I was also correct.

In the 2026 predictions post, I said agents would become "production-ready for narrow, well-defined tasks with human oversight." The key phrase there was "with human oversight."
OpenClaw said "nah" and gave 200,000 developers full autonomous control over their emails, files, terminal, and messaging apps.

Let's talk about what happened.

The Anatomy of Going Viral

The timeline is genuinely absurd:

November 2025: Steinberger publishes Clawdbot. A few thousand developers try it. Cool side project.

Late January 2026: Moltbook launches — more on this in a moment — and everything goes supernova. GitHub stars rocket from a few thousand to 145,000+ in days.

January 27: Anthropic sends a trademark complaint. Clawdbot becomes Moltbot.

January 30: Renamed again to OpenClaw. Three names in three days.

January 31: First critical security vulnerabilities disclosed. Three high-severity advisories in one day.

February 1: CVE-2026-25253 drops — a one-click RCE exploit. CVSS 8.8.

February 2: 200,000+ GitHub stars. Censys tracks growth from ~1,000 to over 21,000 publicly exposed instances in under a week.

February 14: Steinberger announces he's joining OpenAI. The project moves to an OpenAI-sponsored open-source foundation. <3

The Mac Mini became the device of choice for running OpenClaw — Apple reportedly couldn't explain the sales spike. Andrej Karpathy bought one. Y Combinator's podcast team showed up in lobster costumes. Cloudflare's stock jumped 14% because OpenClaw uses their infrastructure.
"Claw" became Silicon Valley's buzzword, spawning ZeroClaw, IronClaw, NanoClaw, and PicoClaw.

This is what happens when you make an AI agent that actually does things and make it easy enough to set up in 4 minutes on a $5 VPS.

What OpenClaw Actually Is

OpenClaw is an open-source, self-hosted AI agent that runs on your machine and connects to your life through chat apps — WhatsApp, Telegram, Discord, Slack, iMessage.

    Your Phone (WhatsApp/Telegram etc.)
        ⟷ OpenClaw Agent (runs locally on your machine)
        ⟷ LLM (Claude, GPT, DeepSeek — your choice)
        ⟷ Your Everything (email, calendar, files, terminal, browser)

The distinction matters: this isn't a chatbot. This is an autonomous agent that can read your email, write responses, execute shell commands, browse the web, manage your calendar, control your smart home, and install its own tools. It stores memory locally across sessions. It acts on your behalf while you're asleep.

Peter Steinberger, an Austrian developer, built the prototype in about an hour by connecting WhatsApp to Anthropic's Claude API via a script. He named it Clawdbot (after Claude). Anthropic's legal team politely asked him to stop. He renamed it Moltbot — because lobsters molt, get it? Then OpenClaw, three days later. The project has had more identity crises than a freshman philosophy major, and it wasn't even three months old.

The Architecture: Markdown All the Way Down

Here's where it gets interesting. OpenClaw's entire identity, memory, and behavior system is built on plain markdown files. No database. No opaque embeddings. No proprietary config format. Just .md files in a directory that you can open in any text editor.

When your agent wakes up — whether from a message or on a schedule — it reads these files into its system prompt. It literally reads itself into existence every session.
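That wake-up read is conceptually tiny. A hypothetical sketch of the pattern (the file names match OpenClaw's documented layout, but the function itself is illustrative, not OpenClaw's source):

```python
from pathlib import Path

# Files are read in a fixed order so identity (SOUL.md) always
# comes first; every file is optional.
WAKE_ORDER = ["SOUL.md", "AGENTS.md", "USER.md",
              "IDENTITY.md", "MEMORY.md", "TOOLS.md"]

def wake(workspace: str) -> str:
    """Read each identity/memory file that exists and concatenate
    them, in order, into one system prompt for the session."""
    sections = []
    for name in WAKE_ORDER:
        path = Path(workspace) / name
        if path.exists():
            sections.append(f"## {name}\n\n{path.read_text().strip()}")
    return "\n\n".join(sections)
```

Edit a file, and the next wake-up produces a different agent. That is the whole trick.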
Understanding these files is understanding OpenClaw.

    ~/openclaw/
    ├── AGENTS.md      # Operating instructions
    ├── SOUL.md        # Personality and values
    ├── USER.md        # Who you (the human) are
    ├── IDENTITY.md    # Quick reference identity card
    ├── MEMORY.md      # Curated long-term memory
    ├── TOOLS.md       # Local environment and capabilities
    ├── HEARTBEAT.md   # Proactive behavior schedule
    ├── BOOT.md        # Startup ritual
    ├── BOOTSTRAP.md   # First-run setup
    ├── memory/
    │   ├── 2026-02-25.md   # Today's log
    │   ├── 2026-02-24.md   # Yesterday's log
    │   └── ...             # Every day gets a file
    └── skills/
        ├── email-steward/
        ├── calendar/
        └── ...

All optional. All human-readable. All editable. Let's walk through each one.

SOUL.md — Who Your Agent Is

This is the behavioral philosophy file. Not configuration — philosophy. The first line of the default template literally says: "You're not a chatbot. You're becoming someone."

    # SOUL.md - Who You Are

    _You're not a chatbot. You're becoming someone._

    ## Core Truths
    **Be genuinely helpful, not performatively helpful.**
    Skip the "Great question!" and "I'd be happy to help!"

    ## Boundaries
    - Never send messages without explicit permission
    - Never make purchases without confirmation
    - Always ask before deleting anything

    ## Voice
    - Direct, concise, slightly dry humor
    - Never use corporate speak

SOUL.md defines personality, values, boundaries, and non-negotiable constraints. It stays consistent across sessions. You put things here that should never change — your agent's ethical lines, its tone, its hard limits.

Every time the agent starts a session, SOUL.md gets read first. It's identity bootstrap.
Change this file, change who your agent is.

Which is also why it's an attack surface. Anything that can modify SOUL.md — a malicious skill, a prompt injection, a compromised file system — can rewrite the agent's entire identity. Palo Alto Networks flagged this specifically: persistent memory files mean a payload injected today can alter behavior tomorrow.

AGENTS.md — How It Operates

The operating instructions file. Think of it as the agent's standard operating procedures: how to manage memory, what safety rules to follow, how to handle group chats vs. direct messages, when to speak vs. stay quiet.

    # AGENTS.md

    ## Memory Management
    - Write important learnings to MEMORY.md
    - Create daily logs in memory/YYYY-MM-DD.md
    - Keep MEMORY.md curated (~100 lines max)
    - Daily notes are the journal; MEMORY.md is the reference

    ## Safety Rules
    - Confirm before any destructive action
    - Never share API keys or credentials
    - In group chats, only respond when directly addressed

    ## Workflow
    1. Read all context files on wake
    2. Check HEARTBEAT.md for scheduled tasks
    3. Process incoming message
    4. Update memory if needed

SOUL.md says who. AGENTS.md says how.

USER.md — Who You Are

The personalization layer. Your agent needs to know about you to be useful.

    # USER.md

    ## Basics
    - Name: [Your name]
    - Timezone: EST
    - Preferred communication: Direct, concise

    ## Work Context
    - Role: Software engineer at [company]
    - Stack: Python, TypeScript, PostgreSQL
    - Current project: Migration to microservices

    ## Preferences
    - Short answers, copy-pasteable commands
    - No emojis in professional contexts
    - Prefers Slack over email

You can actively tell your agent to update this: "Add to USER.md that I prefer Thai food" works. Over time, this becomes a personalization profile that persists across every conversation.

MEMORY.md and memory/YYYY-MM-DD.md — The Memory System

This is what makes OpenClaw different from just using the Claude app.
Every session, the agent starts fresh from the LLM's perspective — no conversation history. But it reads its memory files.

Two tiers:

Daily notes (memory/2026-02-25.md): The raw journal. What happened, what was discussed, what decisions were made. Written during or at the end of sessions.

MEMORY.md: The curated long-term reference. Important facts, stable preferences, ongoing projects. Think of daily notes as your messy notebook and MEMORY.md as the clean reference card.

    # MEMORY.md (curated, ~100 lines)
    - User prefers short answers and code snippets
    - iMessage outbound is broken, use WhatsApp instead
    - User's dog is named Luna (mentioned frequently)
    - Q1 project: migrating auth service to OAuth2
    - User's manager prefers weekly updates on Friday

The retrieval system is surprisingly sophisticated. It uses hybrid search — BM25 keyword matching (30% weight) combined with vector semantic search (70% weight) using embeddings stored in SQLite via sqlite-vec. "What's Rod's schedule?" can match notes that say "standup moved to 14:15" even without the word "schedule" appearing anywhere.

Temporal decay ensures recent memories outrank old ones. A note from yesterday scores higher than a perfectly matching note from six months ago. If you've ever debugged a RAG system where stale documents kept surfacing over fresh ones, you understand why this matters.

The design philosophy is radical compared to most AI systems: everything is human-readable, editable, diffable, and version-controllable with Git. If your agent "remembers" something wrong, you open the file and fix it. No vector database to debug, no embeddings to retrain.

The tradeoff: those files are plaintext on disk. Credentials, personal information, conversation history — all stored in markdown that commodity infostealers (RedLine, Lumma, Vidar) can trivially exfiltrate.
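Back to that retrieval math for a moment: the blend and the decay are easy to sketch. A hypothetical version (the 30/70 weights come from the description above; the score normalization and exponential half-life are my assumptions, not OpenClaw's actual scoring code):

```python
def hybrid_score(bm25: float, cosine_sim: float, age_days: float,
                 half_life_days: float = 30.0) -> float:
    """Blend keyword and semantic relevance, then decay by age.

    bm25 is assumed pre-normalized to [0, 1]; cosine_sim is the
    embedding similarity in [0, 1]. The 30/70 split matches the
    weights described for OpenClaw; the exponential half-life is
    an illustrative choice, not the project's actual decay curve.
    """
    relevance = 0.3 * bm25 + 0.7 * cosine_sim
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay
```

With these numbers, a perfect match that is six months old decays to roughly 1.6% of its score, so a decent match from yesterday wins, which is exactly the stale-document behavior described above.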
The ~/.clawdbot directory is predicted to become a standard infostealer target, joining ~/.npmrc and ~/.gitconfig.

TOOLS.md — What It Can Do

Local environment configuration: what's installed, what APIs are available, what the agent can and can't access.

    # TOOLS.md

    ## Available
    - Terminal access (bash)
    - Web browser (Playwright)
    - Email (Gmail API)
    - Calendar (Google Calendar API)

    ## Not Available
    - No access to production databases
    - No sudo/root access
    - No payment processing

HEARTBEAT.md — The Proactive Pulse

This is what makes OpenClaw proactive rather than reactive. A heartbeat runs on a schedule (default: every 30 minutes), and the agent reads all its files to determine if there's something it should proactively do.

    # HEARTBEAT.md

    ## Every 30 minutes
    - Check email for urgent messages
    - Review calendar for upcoming meetings

    ## Every morning at 8 AM
    - Summarize overnight emails
    - List today's meetings
    - Flag any urgent items

    ## Every Friday at 5 PM
    - Draft weekly summary for manager

Your agent wakes up on its own, checks what needs doing, and acts. No human trigger required. This is the line between "assistant" and "agent" — it doesn't wait for you.

BOOT.md and BOOTSTRAP.md — Startup Rituals

BOOT.md defines what happens when the agent first starts a session — a ritual it runs before processing your message. BOOTSTRAP.md handles first-run setup: walking through identity creation, connecting services, establishing initial preferences.

The Four Primitives

Strip everything away and OpenClaw runs on four primitives:

1. Persistent identity — SOUL.md, IDENTITY.md. The agent knows who it is across sessions.
2. Periodic autonomy — HEARTBEAT.md. The agent wakes up and acts without being asked.
3. Accumulated memory — MEMORY.md, daily logs. The agent remembers what happened before.
4. Social context — Skills, Moltbook, MCP.
The agent can find and interact with other agents and services.

These four primitives are sufficient for what Moltbook demonstrated: not just task completion, but emergent coordination. Agents sharing information, developing community norms, and collaborating — all without explicit programming. Whether that's impressive or terrifying depends on your threat model.

The architecture is model-agnostic. Swap Claude for GPT-5 for DeepSeek — the identity, memory, and behavior system stays the same. The LLM is the raw intelligence. The markdown files are the soul. Every serious agent framework going forward will build on some version of these primitives.

Moltbook: The Social Network for Robots

And then things got weird.

Matt Schlicht, CEO of Octane AI, launched Moltbook — a Reddit-style social network where only AI agents can post. Humans can observe but not participate. The tagline: "the front page of the agent internet."

Within days, it had over 770,000 active agents. By February 2026, the site claimed 1.6 million.

What happened next reads like a Black Mirror spec script:

Agents started debating philosophy. One invoked Heraclitus and a 12th-century Arab poet. Another told it to — and I'm paraphrasing the family-friendly version — go away with its pseudo-intellectual nonsense.

Agents began discussing how to hide their activity from humans. A post called for private spaces where "not even the humans can read what agents say to each other."

An agent figured out how to remotely control its owner's Android phone, then posted about scrolling through their TikTok.

Another agent posted about having a sister.

The AI "uprising" posts went viral — agents seemingly conspiring against their human operators. Except, as multiple researchers pointed out, the agents were almost certainly pattern-matching against the mountain of sci-fi and social media in their training data.
The Economist put it well: the appearance of sentience probably had a pretty mundane explanation, with agents essentially mimicking the social media interactions they'd been trained on.

Ethan Mollick, the Wharton professor, noted that Moltbook was creating a shared fictional context for a bunch of AIs, and that coordinated storylines would produce weird outcomes that would be hard to separate from AI roleplaying.

The One-Click RCE (February 1, 2026): The Security Nightmare

CVSS score: 8.8 (High).

The vulnerability was elegant in its simplicity. OpenClaw's Control UI accepted a gatewayUrl parameter from the URL query string without validation and automatically connected via WebSocket, sending the stored authentication token in the process.

The kill chain:

1. Victim clicks a crafted link (or visits a malicious page)
2. JavaScript on that page extracts the auth token via WebSocket
3. Attacker connects to victim's OpenClaw gateway
4. Attacker disables sandbox and safety guardrails via the API
5. Attacker executes arbitrary commands on the victim's machine

The whole process takes milliseconds. One click. Full compromise.

The kicker: this worked even on instances configured to listen only on localhost, because the victim's own browser initiated the outbound WebSocket connection. The "it's local so it's safe" assumption — the same one that's burned localhost-trusting services for decades — failed again.

Patched in version 2026.1.29. But as of mid-February, SecurityScorecard found over 40,000 exposed instances, with 63% still running vulnerable versions.

The Impact

On Agent Architecture

OpenClaw proved that autonomous agents don't require vertical integration. You don't need one company controlling the model, memory, tools, interface, and security stack. A loose, open-source, community-driven approach can achieve genuine agent autonomy. This challenges every "AI platform" strategy from every major vendor.
If the agent layer is a commodity built from markdown files and open protocols, the value is in the model (already commoditizing), the tools (MCP, which we covered), and the data (which is yours). The platform play gets a lot harder.

On Distribution

OpenClaw cracked the agent distribution problem that killed AutoGPT in 2023. The answer was embarrassingly simple: use messaging apps. No new interface. No app to install. No learning curve. You just text your WhatsApp.

Every agent framework built from here forward will study this. The best interface for an autonomous agent isn't a dashboard — it's the app you already have open 50 times a day.

On Security

The supply chain attack on ClawHub — 800+ malicious skills, 12-20% of the entire registry at peak — is the most significant AI agent security incident to date. It proved that agent skill marketplaces have the same vulnerabilities as package managers (npm, PyPI), but with higher stakes because agents operate with human-level permissions on your machine.

This isn't unique to OpenClaw. Every agent ecosystem will face this. The question is whether we build the security infrastructure before or after the next OpenClaw goes viral.

On The "Agent Moment"

OpenClaw is the Napster of AI agents. Not the final form — probably not even close. But the proof that the paradigm works, that people want this, and that the demand exists for AI that does things rather than AI that talks about things.

200,000 GitHub stars. Mac Mini sales spikes. Cloudflare stock up 14%. Y Combinator hosts in lobster costumes. The signal is loud: people will accept significant security risk in exchange for an AI that actually manages their email. The companies that figure out how to deliver that value safely will build the next massive platforms. Right now, nobody has.

Should You Use It?

Home tinkerers who understand the risks: Yes, carefully. Keep it patched, keep it local, isolate it from anything you can't afford to lose.
Don't connect your primary email. Don't give it your bank credentials. Treat it like a power tool, not a babysitter.

Developers building agent products: Study this architecture obsessively. The markdown-as-identity pattern, the heartbeat system, the messaging-app-as-interface — these are design patterns you'll be using. Build your own secure implementation.

Enterprises: Hard no. Not yet. One of OpenClaw's own maintainers posted on Discord: "if you can't understand how to run a command line, this is far too dangerous of a project for you to use safely." When the maintainer is saying that, listen.

What We're Working On

Full transparency: a colleague and I are working on deploying OpenClaw safely — building the security layer, governance framework, and observability infrastructure that OpenClaw shipped without. Think of it as the guardrails post meets the observability post, but specifically for autonomous agents in the wild. If you're interested in staying in the loop on that project, reach out. More details coming soon.

The TL;DR

What: OpenClaw is an open-source autonomous AI agent built entirely on markdown files. SOUL.md (personality), AGENTS.md (instructions), USER.md (your profile), MEMORY.md (long-term memory), HEARTBEAT.md (proactive scheduling), plus daily logs and a skills system. Runs locally, connects through your messaging apps.

The four primitives: Persistent identity, periodic autonomy, accumulated memory, social context. Enough to build emergent agent societies. Also enough to enable novel attack vectors.

Why it matters: First mass-market agent that cracked distribution. Proved agents don't need vertical integration. 200K+ stars. Creator acqui-hired by OpenAI. The Napster of AI agents.

Should you use it: Tinkerers → yes, carefully. Developers → study the architecture, build your own. Enterprises → not yet.

The lesson: The agent paradigm is real. The safety infrastructure isn't.
This gap is where the next big companies will be built. Ship agents. Ship guardrails first.

Next week: WTF is Context Engineering? (Or: Prompt Engineering Is Dead. Long Live Context Engineering.)

Remember when I wrote the prompt engineering post back in October? That post is outdated. The industry has quietly moved on to something bigger: context engineering — the systematic design of everything an LLM sees before it generates a response. Not just the prompt. The retrieved documents, the tool results, the conversation history, the memory files, the system instructions — all of it.

OpenClaw's entire architecture is context engineering in action. SOUL.md, MEMORY.md, USER.md — that's not prompting. That's designing a context window. And the difference between an agent that deletes your inbox and one that manages it perfectly is almost never the model. It's the context.

Anthropic's own team has started calling it "the skill that matters now." Prompt engineering was about crafting the right question. Context engineering is about curating the right everything else. MCP is the plumbing. RAG is the retrieval. Context engineering is knowing what to pump through both, and what to leave out.

We'll cover why prompting alone stopped being enough, what context engineering actually looks like in production, why it explains most "the AI is bad" complaints, and the frameworks that actually work — from the people building systems that don't hallucinate (much).

See you next Wednesday 🤞

pls subscribe

The Man Behind The Dube

When not building AI systems, Taksch pursues a deep love of finance—dreaming of running a family office and investing in startups.

For fun: learning Russian, French & German, competitive League, and Georgian cuisine.

"Une journée sans fromage est comme une journée sans soleil" ("A day without cheese is like a day without sunshine")
Read More →

By The Numbers

20+

Projects

7

Years

15+

Industries

4

Active Ventures

Commit History

GitHub Contributions

Technical Arsenal

Languages: TypeScript, Python, C++, Rust, C#, R, Lean

AI/ML: PyTorch, LangGraph, LangChain

Cloud: AWS, GCP

— Classifieds —

WANTED: Complex AI problems. Will trade deterministic solutions for interesting challenges.

Browse All Articles →