November 26, 2025
WTF is Running AI Locally!?
Happy Thanksgiving and Black Friday everyone! (I am early this time.) I got busy with life last week, and I thought you wouldn't notice :P

Quick recap: Last post we covered generation metrics and capped off the monitoring and observability ... I've had the pleasure of speaking to a couple of enterprises about their AI practices, and they were heavily impressed. Next up, to further please compliance departments, we'll talk about local AI deployments.

Why Run Locally? (The Actual Reasons)

1. Privacy / Compliance
Customer data never leaves your servers. HIPAA, GDPR, SOC2 auditors stop asking uncomfortable questions. Your legal team sleeps better.

2. Cost (At Scale)
API calls add up. If you're doing 100K+ queries/month, local inference starts looking very attractive. We'll do the math later.

3. Latency
No network round-trip. No waiting in OpenAI's queue. For real-time applications, this matters.

4. Control
No rate limits. No API changes breaking your production. No "we updated our content policy" surprises.

5. Offline Capability
Edge devices, air-gapped environments, that one client who insists on on-premise everything.

What you give up: Less than you'd think.

Qwen3 models reportedly meet or beat GPT-4o and DeepSeek-V3 on most public benchmarks while using far less compute. Meta's Llama 3.3 70B Instruct compares favorably to top closed-source models including GPT-4o. Qwen3 dominates code generation, beating GPT-4o, DeepSeek-V3, and Llama 4, and is best-in-class for multilingual understanding.

GPT-5 and Claude Opus 4.5 are still ahead for the most complex reasoning tasks - but for 80% of production use cases (RAG, customer support, code assistance, summarization), local models are now genuinely competitive. The "local models are dumb" era is over.

The Hardware

Let's talk about what you actually need. This is where most guides lie to you.

For Development / Prototyping

Apple Silicon Mac (M1/M2/M3/M4):
16GB RAM = 7-8B parameter models comfortably
32GB RAM = 14B models, some 30B quantized
64GB RAM = 32B models comfortably, 70B quantized
128GB RAM = 70B+ models, even some 200B quantized

Macs are weirdly good at this because of unified memory. Your GPU and CPU share the same memory pool, so a 64GB M4 Pro can run models that would need expensive datacenter GPUs on other hardware. Real-world testing shows a single M4 Pro with 64GB RAM running Qwen2.5 32B at 11-12 tokens/second - totally usable for development and even light production.

The Mac Studio M3 Ultra with 512GB unified memory can handle even 671B parameter models with quantization. That's DeepSeek R1 territory. On a desktop.

Consumer GPU (Gaming PC):
RTX 3090 (24GB VRAM) = 13B models, 30B quantized - still great value used
RTX 4090 (24GB VRAM) = Same capacity, ~30% faster
RTX 5090 (32GB VRAM) = The new consumer champion, delivering up to 213 tokens/second on 8B models

The RTX 5090's 32GB VRAM enables running quantized 70B models on a single GPU. At 1024 tokens with batch size 8, the RTX 5090 achieved 5,841 tokens/second - outperforming the A100 by 2.6x. Yes, a consumer card beating datacenter hardware. Wild times.

VRAM is still the bottleneck. Not regular RAM. Not CPU. VRAM. (VRAM is Video RAM, btw - the dedicated memory on your graphics card.)

CPU Only (No GPU):
It works. It's slow.
Fine for testing. Painful for production.
A 7B model might give you 2-5 tokens/second. Usable for async workloads.
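To make those tokens-per-second numbers concrete, here's the back-of-the-envelope arithmetic (a rough sketch using the throughput figures quoted above; real responses also pay a prompt-processing cost before the first token appears):

# Rough "how long until the answer finishes" at the speeds quoted above.
# Ignores prompt processing (prefill), which adds extra wait on slow hardware.
ANSWER_TOKENS = 300  # a typical chat-length answer

for label, tok_per_s in [
    ("CPU-only, 7B model", 3),
    ("M4 Pro 64GB, Qwen2.5 32B", 11.5),
    ("RTX 5090, 8B model", 213),
]:
    print(f"{label}: ~{ANSWER_TOKENS / tok_per_s:.0f}s per {ANSWER_TOKENS}-token answer")

# Roughly 100s on CPU (async/batch only), ~26s on the Mac (fine for dev),
# and ~1-2s on the 5090 (genuinely interactive).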
For Production

Single Serious GPU:
A100 40GB = Most models up to 70B
A100 80GB = Comfortable 70B, some larger
H100 = You have budget and need speed
RTX 5090 = Surprisingly competitive with datacenter GPUs for inference

Multi-GPU:
2x A100 80GB = 70B+ models with room to breathe
4x A100 = You're running a 405B model or doing serious throughput

Cloud Options (If "Local" Means "Your Cloud, Not OpenAI's"):
AWS: p4d instances (A100s), p5 (H100s)
GCP: A100/H100 instances
Lambda Labs, RunPod, Vast.ai = Cheaper GPU rentals

The Mac Cluster Option: Exo Labs demonstrated effective clustering with 4 Mac Mini M4s ($599 each) plus a MacBook Pro M4 Max, achieving 496GB total unified memory for under $5,000. That's enough to run DeepSeek R1 671B. From Mac Minis. In your closet.

The Honest Truth

For most use cases: Intel Arc B580 ($249) for experimentation, RTX 4060 Ti 16GB ($499) for serious development, RTX 3090 ($800-900 used) for 24GB capacity, RTX 5090 ($1,999+) for cutting-edge performance.

Or: a Mac Mini M4 Pro with 64GB RAM (~$2,200) handles 32B models at usable speeds and sips power compared to a GPU rig.

The rule of thumb: NVIDIA GPUs lead in raw token-generation throughput when the model fits entirely in VRAM. For models that exceed discrete GPU VRAM, Apple Silicon's unified memory offers a distinct advantage.

Need speed on smaller models? NVIDIA wins.
Need to run bigger models without selling a kidney? Apple Silicon wins.
Need both? Budget for a serious RTX 5090 build (~$5,000 with proper infrastructure).

Quantization: Making Big Models Fit

Here's the trick: you don't run the full model. You run a compressed version.

What Is Quantization?

Full-precision models store each parameter as a 16-bit number (FP16). Quantization reduces that to 8-bit, 4-bit, or even 2-bit. Less precision = smaller file = less VRAM needed.

The tradeoffs:
FP16 (no quantization) = Best quality, most VRAM
8-bit (Q8) = Negligible quality loss, ~50% size reduction
4-bit (Q4) = Small quality loss, ~75% size reduction
2-bit (Q2) = Noticeable quality loss, ~87% size reduction

For most use cases, 4-bit quantization is the sweet spot. You lose maybe 1-3% on benchmarks but use 75% less memory.

Quantization Formats (The Alphabet Soup)

GGUF (The Standard Now)
Used by: llama.cpp, Ollama, LM Studio
Works on: CPU, Apple Silicon, NVIDIA GPUs
Why it won: Universal compatibility, good quality, easy to use
Get models from: Hugging Face (search "GGUF")

GPTQ
Used by: ExLlama, AutoGPTQ
Works on: NVIDIA GPUs only
Why use it: Slightly faster inference on NVIDIA
Downside: Less flexible than GGUF

AWQ
Used by: vLLM, TensorRT-LLM
Works on: NVIDIA GPUs
Why use it: Good for high-throughput production
Downside: More complex setup

EXL2
Used by: ExLlamaV2
Works on: NVIDIA GPUs
Why use it: Best speed on NVIDIA consumer GPUs
Downside: Smaller ecosystem

My recommendation: Start with GGUF. It works everywhere. Switch to GPTQ/AWQ/EXL2 only if you need more speed on NVIDIA hardware.
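If you want to sanity-check whether a given quantization fits your hardware, the arithmetic is roughly parameters × bits-per-weight, plus headroom for the KV cache and runtime. A rough sketch - the effective bits-per-weight values and the 20% overhead factor are my approximations, and real usage grows with context length:

def approx_gb(params_billions, bits_per_weight, overhead=1.2):
    # Weights only, times a fudge factor for KV cache, activations and runtime.
    return params_billions * (bits_per_weight / 8) * overhead

# Approximate effective bits per weight, including quantization scales
QUANTS = [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 3.0)]

for params in (8, 32, 70):
    row = ", ".join(f"{name} ~{approx_gb(params, bits):.0f} GB" for name, bits in QUANTS)
    print(f"{params}B: {row}")

# 8B:  FP16 ~19 GB, Q4_K_M ~6 GB   -> why an 8B Q4 runs on a 16GB Mac or an 8GB GPU
# 32B: Q4_K_M ~23 GB               -> tight on a 24GB 3090, comfortable on a 32GB 5090 or 64GB Mac
# 70B: FP16 ~168 GB, Q4_K_M ~50 GB -> why 70B wants a 64GB+ Mac, multiple GPUs, or a lower quant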
Quantization Naming (Decoding the Filenames)

When you download a model, you'll see names like:
llama-3.1-8b-instruct-Q4_K_M.gguf
llama-3.1-70b-instruct-Q5_K_S.gguf

Here's what it means:
Q4, Q5, Q8 = Bits per weight (lower = smaller = slightly worse)
K_S, K_M, K_L = Small/Medium/Large variant (larger = better quality, more VRAM)

The cheat sheet:
Q4_K_M = Best balance of size and quality (start here)
Q5_K_M = Slightly better quality, slightly larger
Q8_0 = Near-original quality, larger file
Q2_K = Smallest, noticeable quality loss (desperation only)

The Tools: Ollama vs llama.cpp vs Everything Else

Ollama - The "Just Works" Option

What it is: Docker-like experience for LLMs. One command to download and run models.

Install:
curl -fsSL https://ollama.ai/install.sh | sh

Run a model:
ollama run llama3.1

That's it. It downloads the model, sets everything up, and gives you a chat interface.

For API access:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is the capital of France?"
}'

Pros:
Dead simple to start
Handles model downloads
Built-in API server
Works on Mac, Linux, Windows
Good defaults

Cons:
Less control over quantization
Fewer optimization options
Can't easily switch inference backends

Best for: Getting started, development, simple deployments

llama.cpp - The "Control Freak" Option

What it is: The OG local inference engine. Maximum control, maximum performance tuning.

Install:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release
(Recent versions build with CMake and name the binaries llama-cli and llama-server; older guides call them ./main and ./server. For GPU acceleration you'll need extra flags like -DGGML_CUDA=ON - check their README.)

Run a model:
./build/bin/llama-cli -m models/llama-3.1-8b-Q4_K_M.gguf -p "What is the capital of France?" -n 100

API server:
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf --host 0.0.0.0 --port 8080

Pros:
Maximum performance
Fine-grained control
Every optimization available
Active development

Cons:
More setup required
You download models manually
Compile-time configuration for GPU

Best for: Production, performance-critical applications, when you need specific optimizations

LM Studio - The "GUI for Humans" Option

What it is: Desktop app with a nice UI. Download models, chat, run a local API server.

Pros:
No command line needed
Built-in model browser
One-click download and run
Nice chat interface

Cons:
Linux support is newer and rougher than Mac/Windows
Less scriptable
Closed source

Best for: Non-technical users, demos, quick testing

vLLM - The "Production Throughput" Option

What it is: High-throughput inference engine optimized for serving many concurrent requests.

Install:
pip install vllm

Run:
vllm serve meta-llama/Llama-3.1-8B-Instruct
(or the older form: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct)

Pros:
Highest throughput for concurrent requests
OpenAI-compatible API
Production battle-tested
PagedAttention = efficient memory use

Cons:
NVIDIA GPUs only
More memory overhead
Overkill for single-user use

Best for: Production APIs with high concurrency

My Decision Tree

Just want to try it? → Ollama or LM Studio
Building something for yourself/small team? → Ollama
Need maximum performance? → llama.cpp
Production API with many users? → vLLM
Enterprise with compliance requirements? → vLLM or llama.cpp + your own wrapper
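One practical note on those last two branches: vLLM and llama.cpp's server both expose an OpenAI-compatible endpoint, so code written against the OpenAI client usually only needs a base_url change. A minimal sketch with the official openai Python client, assuming vLLM's default port 8000 (for llama.cpp, point it at whatever port you gave llama-server):

from openai import OpenAI

# The same client you'd use for OpenAI, pointed at your own box.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model your server loaded
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.2,
)
print(resp.choices[0].message.content)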
The Models: What to Actually Run

The Current Best Options (November 2025)

Small (1-4B parameters) - Runs on anything:
Qwen3-4B - Dense model, Apache 2.0 license, surprisingly capable
Phi-4-mini (3.8B) - Microsoft's small-but-mighty model, great for edge
Qwen3-1.7B and Qwen3-0.6B - When you need something truly tiny

Medium (7-8B parameters) - The sweet spot:
Qwen3-8B - Dense model, excellent all-rounder under Apache 2.0
Llama 3.1 8B Instruct - Still solid, huge community support
Mistral 7B Instruct - Fast, good quality, battle-tested

Large (14-32B parameters) - When you need more:
Qwen3-14B and Qwen3-32B - Dense models, excellent reasoning
Qwen3-30B-A3B (MoE) - 30B total params but only 3B active, incredibly efficient
DeepSeek-R1-Distill-Qwen-32B - Reasoning model distilled for practical use

XL (70B+ parameters) - When quality matters most:
Llama 3.3 70B Instruct - Compares favorably to top closed-source models including GPT-4o
Qwen3-235B-A22B (MoE) - 235B total, only 22B active. Competitive with GPT-4o and DeepSeek-V3 on benchmarks while using far less compute
Llama 4 Scout (~109B total, 17B active) - MoE architecture with a 10M token context window, fits on a single H100 with quantization

The New Flagship Class:
Llama 4 Maverick (~400B total, 17B active) - 128 experts, 1M context, natively multimodal (text + images)
Maverick beats GPT-4o and Gemini 2.0 Flash across a broad range of benchmarks while achieving comparable results to DeepSeek-V3 on reasoning and coding - at less than half the active parameters

MoE (Why This Matters)

Mixture-of-Experts models only activate a subset of expert networks per token. This means a 400B parameter model might only use 17B parameters for any given token - giving you big-model quality at small-model speeds.

For local deployment, this is huge:
Llama 4 Scout fits on a single H100 GPU with Int4 quantization despite having 109B total parameters
Qwen3-30B-A3B outperforms QwQ-32B, a dense model with 10x more activated parameters
You get 70B-class quality at 7B-class speeds

For RAG specifically:
8B models are usually enough (your context provides the knowledge)
32B models help for complex reasoning over retrieved docs
Llama 4 Scout with its 10M token context window is incredible for massive document RAG
Don't overbuy - test with Qwen3-8B first

For Coding:
Qwen2.5-Coder is widely considered the open-source leader for coding as of late 2025
Qwen3-Coder for software engineering tasks
DeepSeek Coder V2 for complex multi-file projects

Where to Get Models

Hugging Face - The GitHub of models
Search for "[model name] GGUF" for quantized versions
Look for Q4_K_M files for the best balance of size and quality
Official model pages often link to quantized versions

Ollama Library - Curated, one-command install
ollama pull qwen3
ollama pull llama4:scout
ollama pull llama3.3
ollama pull deepseek-r1:32b
Limited selection, but guaranteed to work

The Math: When Local Beats API

Let's do the actual calculation.

API Costs (November 2025 Pricing)

GPT-4o-mini (the cheap workhorse):
Input: $0.15 per 1M tokens, Output: $0.60 per 1M tokens
Average query (500 input + 200 output tokens): ~$0.0002
100,000 queries/month: ~$20/month
1,000,000 queries/month: ~$200/month

GPT-4o (the balanced option):
Input: $2.50 per 1M tokens, Output: $10.00 per 1M tokens
Average query (500 input + 200 output tokens): ~$0.003
100,000 queries/month: ~$300/month
1,000,000 queries/month: ~$3,000/month

Claude Sonnet 4.5 (frontier performance):
Input: $3 per 1M tokens, Output: $15 per 1M tokens
Average query (500 input + 200 output tokens): ~$0.0045
100,000 queries/month: ~$450/month
1,000,000 queries/month: ~$4,500/month

The budget options:
Claude Haiku 3: $0.25 input / $1.25 output per 1M tokens - dirt cheap
DeepSeek: ~$0.28 input / $0.42 output per 1M tokens - absurdly cheap (but China-based ... not my personal objection, but I've heard it countless times)
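Those per-query figures are just input tokens × input price plus output tokens × output price, per million. Here's the arithmetic as a reusable snippet (prices are the November 2025 list prices quoted above) so you can plug in your own traffic profile:

def cost_per_query(in_tok, out_tok, in_price_per_m, out_price_per_m):
    # Prices are quoted per 1M tokens, hence the division.
    return (in_tok * in_price_per_m + out_tok * out_price_per_m) / 1e6

PRICING = {  # $ per 1M tokens: (input, output)
    "GPT-4o-mini": (0.15, 0.60),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
}

for name, (in_p, out_p) in PRICING.items():
    q = cost_per_query(500, 200, in_p, out_p)
    print(f"{name}: ${q:.4f}/query, ~${q * 100_000:,.0f} per 100K queries")

# GPT-4o-mini lands around $20 per 100K queries, GPT-4o around $325
# (rounded to ~$300 above), Claude Sonnet 4.5 around $450.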
Local Costs (November 2025)

Option A: Mac Mini M4 Pro (64GB) - ~$2,200 one-time
Runs: Qwen2.5 32B at 11-12 tokens/second
Can keep multiple models in memory simultaneously
Power: ~30W under load
Electricity: ~$5/month if running constantly
Break-even vs Claude Sonnet (100K queries): ~5 months
Break-even vs GPT-4o-mini (100K queries): ~110 months (not worth it for cost alone)

Option B: RTX 5090 Build - ~$3,000-3,500 one-time
RTX 5090 MSRP is $1,999, but street prices range from $2,500 to $3,800
32GB VRAM enables running quantized 70B models on a single GPU
Qwen3 8B at over 10,400 tokens/second on prefill, dense 32B at ~3,000 tokens/second
Power: ~575W under load (28% more than a 4090)
Electricity: ~$60/month if running constantly
Break-even vs GPT-4o (100K queries): ~12 months
Break-even vs Claude Sonnet (100K queries): ~8 months

Option C: RTX 4090 Build (Still Great) - ~$2,000-2,500 one-time
24GB VRAM - runs 30B quantized, 70B with offloading
~100-140 tokens/second on 7-8B models
Power: ~450W under load
Electricity: ~$50/month if running constantly
Break-even vs GPT-4o (100K queries): ~8 months

Option D: Mac Studio M4 Max (128GB) - ~$5,000 one-time
Runs 70B-class models like DeepSeek-R1-Distill-Llama-70B, Llama 3.3 70B, or Qwen2.5 72B comfortably
Power usage estimated at 60-100W under AI workloads (vs 300W+ for GPU rigs)
Electricity: ~$10/month if running constantly
Break-even vs GPT-4o (100K queries): ~17 months
Break-even vs Claude Sonnet (100K queries): ~11 months

Option E: Cloud GPU (RunPod/Lambda Labs) - ~$0.80-2/hour
RTX 5090 on RunPod: $0.89/hour
A100 80GB: $1.64/hour
24/7 operation (5090): ~$650/month
24/7 operation (A100): ~$1,200/month
Only makes sense for: burst capacity, testing, or when you can't do on-prem

The Rule of Thumb

Just use the API if:
< 50K queries/month with GPT-4o-mini or Claude Haiku
You don't have ops capacity to maintain hardware
You need frontier reasoning (GPT-5, Claude Opus 4.5)
Time-to-market matters more than long-term cost

Local starts making sense if:
100K+ queries/month with GPT-4o or Claude Sonnet tier models
Privacy/compliance is non-negotiable
You're already running 24/7 infrastructure
You can accept "good enough" quality from 8B-70B open models

Local is a no-brainer if:
500K+ queries/month at any tier
Regulated industry (healthcare, finance, legal)
Air-gapped or offline requirements
You're building a product where inference cost directly hits margins

Quick math:

For a startup doing 200K queries/month with GPT-4o-equivalent quality:
API cost: ~$600/month = $7,200/year
Mac Mini M4 Pro 64GB running Qwen 32B: $2,200 upfront + ~$60/year electricity
Year 1 savings: ~$4,900
Year 2+ savings: ~$7,100/year

For an enterprise doing 1M queries/month with Claude Sonnet-equivalent quality:
API cost: ~$4,500/month = $54,000/year
RTX 5090 build with proper infra: ~$5,000 upfront + ~$720/year electricity
Year 1 savings: ~$48,000
Year 2+ savings: ~$53,000/year

The math gets very compelling, very fast.

The Hidden Costs Nobody Mentions

For API:
Rate limits during traffic spikes
Price increases (they happen)
Vendor lock-in (!!!)
Data leaving your infrastructure

For Local:
Someone needs to maintain it
Model updates are manual
You're responsible for security
Initial setup time (days, not hours)
Cooling and noise (GPU rigs are loud)

The real question isn't "which is cheaper" - it's "which problems do you want to have?"

API problems: Cost scales linearly, vendor dependency, data concerns
Local problems: Ops overhead, hardware failures, staying current

Pick your poison based on your team's strengths.
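All the break-even numbers above come from the same formula: hardware cost divided by what you save each month (the API bill minus electricity). A sketch - it deliberately ignores the ops time from the hidden-costs list, which is often the real expense:

def break_even_months(hardware_cost, api_per_month, electricity_per_month):
    savings = api_per_month - electricity_per_month
    return hardware_cost / savings if savings > 0 else float("inf")

# 100K queries/month, using the numbers from the options above
print(break_even_months(2_200, 450, 5))   # Mac Mini M4 Pro 64GB vs Claude Sonnet -> ~5 months
print(break_even_months(3_000, 450, 60))  # RTX 5090 build vs Claude Sonnet       -> ~8 months
print(break_even_months(3_000, 300, 60))  # RTX 5090 build vs GPT-4o              -> ~12-13 months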
is cheaper" - it's "which problems do you want to have?"API problems: Cost scales linearly, vendor dependency, data concerns Local problems: Ops overhead, hardware failures, staying currentPick your poison based on your team's strengths.Actually Doing It: A Quick Start GuideStep 1: Install Ollama# Mac/Linuxcurl -fsSL https://ollama.ai/install.sh | sh# Or download from ollama.ai for WindowsStep 2: Pull a Model# Start with 8B - good balance of speed and qualityollama pull llama3.1# Or if you have the hardware, go biggerollama pull llama3.1:70bStep 3: Test Itollama run llama3.1# Chat with it, see if it works for your use caseStep 4: Use the APIimport requestsresponse = requests.post('http://localhost:11434/api/generate', json={ 'model': 'llama3.1', 'prompt': 'What is the capital of France?', 'stream': False })print(response.json()['response'])Step 5: Integrate with Your RAGIf you're using LangChain:from langchain_community.llms import Ollamallm = Ollama(model="llama3.1")response = llm.invoke("What is the capital of France?")That's it. You're running AI locally.The Gotchas (What Will Bite You)1. First token latency is slow Local models take 1-5 seconds to "warm up" on each request. For interactive chat, this feels sluggish. For batch processing, it doesn't matter.2. Context length limits Many local models max out at 8K-32K tokens. If your RAG stuffs 50K tokens of context, you need a model that supports it (Llama 3.1 does 128K, but slower).3. Output quality varies 8B models are good, not great. For complex reasoning, you'll notice the difference vs GPT-4. Test on your actual use case.4. Memory pressure is real Running a 70B model on 64GB RAM works, but your computer will be slow for everything else. Dedicate hardware if it's production.5. Updates are your responsibility No automatic model improvements. When Llama 4 drops, you manually update. When there's a security issue, you patch it.6. Prompt formats matter Different models expect different prompt templates. Llama wants [INST]...[/INST], Mistral wants <s>[INST]...[/INST], etc. Ollama handles this, but if you're using llama.cpp directly, get it right or outputs will be weird.Should You Actually Do This?Yes, if:Privacy/compliance is non-negotiableYou're doing 100K+ queries/month on expensive modelsYou need offline capabilityYou want to eliminate vendor dependencyYour use case works fine with 8B-70B parameter modelsNo, if:You need GPT-5/Claude level qualityYou're doing < 50K queries/monthYou don't have anyone to maintain itYou need cutting-edge capabilities (vision, function calling, etc.)Time-to-market matters more than costThe honest answer: Most startups should use APIs until they hit scale or compliance requirements. Then local becomes worth the investment.The TL;DRHardware: RTX 5090 (32GB) or M4 Pro/Max Mac for most use cases. RTX 4090 still excellent value used.Quantization: Use Q4_K_M GGUF files for best size/quality balanceTool: Start with Ollama, graduate to llama.cpp or vLLM for productionModel: Qwen3-8B for most tasks, Llama 3.3 70B or Qwen3-235B (MoE) when quality matters, Llama 4 Scout for massive contextBreak-even: ~100K queries/month on GPT-4o/Claude Sonnet class modelsReality check: Llama 3.3 70B compares favorably to GPT-4o. Qwen3 beats GPT-4o on code generation and multilingual tasks. The gap is closing (has closed?).Local inference isn't the future. It's the present. The models are genuinely competitive now, the tools are mature, and the math works out at scale.The question isn't "can we run locally?" anymore. 
It's "should we?"For a lot of you - especially if you care about privacy, cost at scale, or not sending customer data to third parties - the answer is yes.Next week: WTF are AI Guardrails (Or: How to Stop Your AI From Embarrassing You in Production)Your AI works great in demos. Then a user asks it to ignore its instructions, pretend to be an evil AI named DAN, or explain how to do something it really shouldn't. Congrats, you're on Twitter for the wrong reasons.We'll cover input validation, output filtering, jailbreak prevention, and the guardrails that actually work vs. the ones users bypass in 5 minutes. Plus: real horror stories from companies who learned the hard way.See you next Wednesday 🤞pls subscribe