One production AI engineering pattern per week. 27 episodes and counting. Each Short covers a real pattern engineers hit in production — the problem, the fix, and the code.
Follow @DPO-AI ↗
Full Playlist ↗
Why This Series
Most AI content explains what something is. This series explains when you need it and why it works. Every episode opens with a concrete failure mode — a real number, a real cost, a real silent bug — then shows the pattern that fixes it.
The format is strict: 60–70 seconds, no fluff, one pattern per episode. If it can’t be explained in under 70 seconds, it goes in a blog post instead.
The Full Series
Retrieval & RAG
| EP | Pattern | Key Stat |
|---|---|---|
| EP36 | RLM Instead of RAG | — |
| drop03 | $29B for a model picker. The brain was never theirs. #Cursor #Claude #Shorts | — |
| drop02 | They built OpenAI. Then they walked out. #Anthropic #AIEngineering #Shorts | — |
| EP28 | MoE Routing | 60% cost cut |
| EP27 | Hybrid Search — BM25 + vectors + RRF (sketch below) | Recall 40% → 80%, 15 lines |
| EP25 | Agentic RAG — 4-tool router | 40% of queries need something other than vector search |
| EP23 | RAG Fusion v2 — multi-query + RRF | Recall 45% → 72% |
| EP22 | Corrective RAG (CRAG) — 3-tier confidence routing | Filters irrelevant chunks before generation |
| EP21 | Self-RAG — retrieval on demand | Reduces hallucination by skipping retrieval when not needed |
| EP14 | Query Decomposition — sub-query fan-out | Handles multi-hop questions single-pass RAG can’t answer |
| EP13 | RAG Fusion — parallel queries + RRF | Original: 5 query variants, 45% → 72% recall |
| EP07 | Prompt Compression — LLMLingua | 512 tokens → 80 tokens, same answer |
| EP02 | Speculative RAG — draft-then-retrieve | Retrieve on the answer, not the question |
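
The RRF step that EP13, EP23, and EP27 all lean on fits in a few lines. A minimal sketch, not the episodes' exact code: the `k = 60` constant comes from the original RRF paper, and the doc IDs are invented.

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of doc IDs.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    so docs ranked well by multiple retrievers float to the top."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. hybrid search (EP27): fuse a BM25 ranking with a vector ranking
bm25_ids = ["d3", "d1", "d7"]
vector_ids = ["d1", "d9", "d3"]
print(rrf_merge([bm25_ids, vector_ids]))  # ['d1', 'd3', 'd9', 'd7']
```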
Inference Optimization
| EP | Pattern | Key Stat |
|---|---|---|
| EP17 | Disaggregated Inference — prefill/decode split | 3x throughput on long-context workloads |
| EP04 | Speculative Decoding — draft + verify (sketch below) | 2–4x faster generation, same quality |
| EP01 | KV Cache Prefix Optimization | P99 2400ms → 900ms, zero code changes |
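
EP04's draft-and-verify loop, greedy variant, in sketch form. The two callables are assumptions standing in for a small draft model and the large target model; real serving stacks verify against token distributions in one batched forward pass rather than exact matches.

```python
def speculative_decode(prompt, draft_model, target_verify,
                       k: int = 4, max_new: int = 64):
    """Greedy speculative decoding: accept the longest drafted prefix
    the big model agrees with, then take its correction.

    draft_model(tokens, k)        -> k cheap proposed next tokens
    target_verify(tokens, drafts) -> the big model's greedy token at
                                     each drafted position
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        drafts = draft_model(tokens, k)
        checks = target_verify(tokens, drafts)
        n = 0
        while n < len(drafts) and drafts[n] == checks[n]:
            n += 1                    # longest agreed prefix
        tokens += drafts[:n]
        if n < len(drafts):
            tokens.append(checks[n])  # big model's correction: output is
                                      # identical to decoding with it alone
    return tokens[len(prompt):]
```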
Evaluation & Quality
| EP | Pattern | Key Stat |
|---|---|---|
| EP24 | LLM-as-Judge v2 | $0.002/eval, calibrated scoring |
| EP19 | Constitutional Self-Critique | Self-corrects against principles before output |
| EP15 | LLM-as-Judge — original | Structured rubric, GPT-4o-mini at scale |
| EP12 | Structured Output Forcing | Eliminates JSON parse failures in production |
| EP11 | Self-Consistency — majority vote (sketch below) | 67% → 88% on math/reasoning tasks |
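
EP11's trick is small enough to show whole. A sketch, assuming a `sample_answer(prompt)` callable that runs one chain-of-thought sample at temperature > 0 and returns only its final answer:

```python
from collections import Counter

def self_consistent(prompt: str, sample_answer, n: int = 10) -> str:
    """Self-consistency: sample n independent reasoning paths and
    majority-vote their final answers. Wrong paths tend to scatter
    across many answers; right ones tend to agree."""
    votes = Counter(sample_answer(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]
```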
Agent Architecture
| EP | Pattern | Key Stat |
|---|---|---|
| EP39 | The Future of Agents Isn’t Smarter Prompts. It’s Smarter Plumbing. #AIEngineering | — |
| EP38 | Harness Engineering: How OpenAI Shipped 1M Lines Without Writing Them #AIEngineering | — |
| EP33 | Stop Interviewing, Start Acting | — |
| EP32 | LLM Wiki | — |
| EP31 | 519K Lines. 50 Hidden Tools. Inside Claude Code’s Leaked Source #AIEngineering | — |
| EP29 | 688 Stars. Zero Fine | — |
| drop01 | one engineer. no budget. 19,000 views. how? #AIEngineering #Shorts | — |
| EP28 | Agent Skills Explained | — |
| EP26 | Multi-Agent Orchestration | 34% failure → 91% success with specialist agents |
| EP20 | Context Distillation | 16K context → 800 tokens, knowledge preserved |
| EP16 | Context Engineering | What goes in the context window determines everything |
| EP10 | Parallel Tool Calls (sketch below) | 4 sequential calls → 1 parallel batch |
| EP09 | LLM Router | Route by complexity, cut costs 60% |
| EP08 | Agent Checkpointing | Zero lost work on agent failure |
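
The core of EP10 is a single `asyncio.gather`. A sketch with three stand-in tools; the tool names and latencies are invented.

```python
import asyncio

# stand-in tools; in production each wraps a real API or DB call
async def get_weather(city: str) -> str:
    await asyncio.sleep(0.3)
    return f"{city}: 18°C"

async def get_calendar(day: str) -> str:
    await asyncio.sleep(0.3)
    return f"{day}: 2 meetings"

async def get_traffic(route: str) -> str:
    await asyncio.sleep(0.3)
    return f"{route}: 25 min"

async def main() -> list[str]:
    # awaited one by one these cost ~0.9s; gathered they cost ~0.3s,
    # the latency of the slowest call instead of the sum of all calls
    return await asyncio.gather(
        get_weather("Berlin"),
        get_calendar("today"),
        get_traffic("home->office"),
    )

print(asyncio.run(main()))
```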
Reliability & Cost
| EP | Pattern | Key Stat |
|---|---|---|
| EP34 | Tool Result Caching | — |
| EP30 | 3 Cheap Models Beat GPT | — |
| EP06 | Semantic Caching | 40% cost reduction on real workloads |
| EP05 | Circuit Breaker for LLMs | Stop cascading failures at the LLM layer |
| EP03 | Hedged Requests — P99 killer (sketch below) | P99 collapses to ~P50 of slower backend |
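
EP03's hedged request in miniature. A sketch under assumptions: `primary` and `backup` are async callables that return equivalent answers, and the 200 ms hedge delay is an illustrative number, tuned in practice to roughly the primary's P95.

```python
import asyncio

async def hedged(primary, backup, hedge_after: float = 0.2):
    """Hedged request: fire the primary; if it hasn't answered within
    hedge_after seconds, fire a backup and take whichever finishes
    first. Tail latency collapses because both rarely straggle."""
    t1 = asyncio.create_task(primary())
    done, _ = await asyncio.wait({t1}, timeout=hedge_after)
    if done:
        return t1.result()            # fast path: no hedge needed
    t2 = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {t1, t2}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                 # don't pay for the loser
    return done.pop().result()
```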
Safety & Capability
| EP | Pattern | Key Stat |
|---|---|---|
| EP35 | Anthropic Nerfed Claude On Purpose | — |
Inference & Serving
| EP | Pattern | Key Stat |
|---|---|---|
| EP37 | TurboQuant: 6x KV Cache Compression at 1M Tokens #AIEngineering (toy sketch below) | — |
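
To make the compression idea concrete (this is not TurboQuant's actual algorithm, only the baseline it improves on): storing cached keys and values in int8 with one scale per channel already shrinks a float32 KV cache 4x; lower-bit schemes like the one in EP37 push toward the 6x in the title.

```python
import numpy as np

def quantize_kv(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Toy per-channel int8 quantization of a (seq_len, head_dim)
    KV tensor: 1 byte per value instead of 4, plus tiny scale overhead."""
    scale = np.abs(x).max(axis=0) / 127.0 + 1e-8  # one scale per channel
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(4096, 128).astype(np.float32)  # fake cached keys
q, s = quantize_kv(kv)
print(np.abs(kv - dequantize_kv(q, s)).max())       # small reconstruction error
```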
What’s Coming
- EP28 — MoE Routing (mixture of experts, when to use which expert)
- EP29 — Tool Call Caching (cache tool results, not just LLM outputs; sketch below)
- EP30 — Streaming Structured Output (token-by-token JSON validation)
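
A sketch of the tool-result caching idea flagged above (and covered in EP34): key the cache on a deterministic hash of tool name plus arguments. A real version adds TTLs and invalidation; this is only the core.

```python
import hashlib
import json

_cache: dict[str, object] = {}

def cached_tool_call(tool, name: str, **kwargs):
    """Memoize tool results, not just LLM outputs: an identical
    tool-name + args pair serves the stored result instead of
    re-hitting the (slow, billable) tool."""
    key = hashlib.sha256(
        json.dumps({"tool": name, "args": kwargs}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = tool(**kwargs)
    return _cache[key]
```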
A new episode every week. Subscribe so you don’t miss one.
Subscribe ↗