01 / Blog · PoVs · Working notes

The blog

Long-form thinking on the engineering and economics of AI-Native systems, published when I have something worth saying, not on a schedule.

Principles of Building AI Agents That Hold in Production

The principles of building AI agents do not live in any framework: bound the autonomy, name what you never delegate, evaluate continuously, and design honest memory.

Agents · Engineering

Blog
Jun 18, 2026
12 min

How to Build an AI Agent (the Loop That Holds)

How to build AI agents that hold: spec the task, give it bounded tools, add guardrails in code, wire evals, and ship behind a human gate.

Agents · Engineering

Blog
Jun 17, 2026
12 min

Agentic AI Frameworks Compared (From Production)

There is no single best agentic AI framework. Compare LangGraph, CrewAI, and the OpenAI Agents SDK by what each costs you in control, observability, and lock-in - not by the feature list.

Agents · Engineering

Blog
Jun 16, 2026
10 min

Agentic AI Examples: What's Genuinely Shipping

The agentic AI shipping in 2026 clusters in four categories: coding, research, customer-ops, and data/ops automation. Concrete, dated examples here.

Agents

Blog
Jun 15, 2026
12 min

Offline vs Online LLM Evaluation: Why You Need Both

Offline evaluation gates a deploy against a frozen set; online evaluation measures real behavior after release. You need both.

Evals · Engineering

Blog
Jun 14, 2026
9 min

Memory Systems for AI Agents: Remember Without Inventing

AI agent memory is what an agent retains across steps and sessions. The hard part is honesty: an agent that misremembers is worse than one that remembers nothing.

Agents · Engineering

Blog
Jun 13, 2026
10 min

LLM Evaluation: Measuring What Will Break

LLM evaluation is the harness that gates a real deploy. Learn what to measure, which metrics lie, when to trust an LLM judge, and who should own it.

Evals · Engineering

Blog
Jun 12, 2026
20 min

Human-in-the-Loop Evaluation That Scales

Human-in-the-loop evaluation scales only when people review the flagged tail - the low-confidence, high-stakes, adversarial slice - not every output.

Evals · Leadership

PoV
Jun 11, 2026
9 min

The Best AI Agents in 2026 (An Honest Roundup)

The best AI agents in 2026 are coding agents, deep-research agents, customer-ops agents, and orchestration frameworks - each strong in a narrow band.

Agents

Blog
Jun 10, 2026
13 min

Agentic RAG: When Your Agent Needs to Retrieve

Agentic RAG lets the agent decide when and what to retrieve, iterate, and verify. It wins on multi-hop and ambiguous queries, and it costs you.

RAG · Agents

Blog
Jun 9, 2026
11 min

Agentic Coding: What Changes When the Machine Writes Code

Agentic coding is the AI-Native SDLC in practice: the machine writes the implementation, the engineer specifies intent and evaluates the diff.

Agents · AI-Native

Blog
Jun 8, 2026
12 min

Agentic AI Use Cases and the Constraint That Picks One

The best agentic AI use cases are repetitive, tool-bounded, and high-volume with a checkable outcome. Match the use case to the constraint, not the hype.

Agents · Strategy

Blog
Jun 7, 2026
11 min

How to Build an LLM Evaluation Framework

A good LLM evaluation framework tests what will break in production: a golden set from real traffic, task metrics, blinded rubrics, and a drift cadence.

Evals · Engineering

Blog
Jun 6, 2026
10 min

AI-Native means the machine does the job

Not assisted. Not augmented. The model does the whole job, and our role narrows to a single thing: judgment.

AI-Native · Strategy

Blog
Jun 5, 2026
11 min

AI Agents and Agentic Workflows: An Honest Field Guide

Agentic workflows let AI agents take actions toward a goal in a loop. They earn their keep in a narrow band - here is exactly where, and where they fail.

Agents · AI-Native

Blog
Jun 4, 2026
18 min

Agentic Design Patterns That Actually Work

The agentic design patterns that survive production are the bounded ones: tool-use with guardrails, plan-then-execute, reflection, and HITL at named decisions.

Agents · Engineering

Blog
Jun 3, 2026
11 min

Agentic AI vs Generative AI: What's Actually Different

Generative AI produces content from a prompt. Agentic AI plans and acts toward a goal - and actions carry consequences generation never does.

Agents · AI-Native

Blog
Jun 2, 2026
11 min

LLM Evaluation Metrics That Matter (and the Ones That Lie)

The LLM evaluation metrics that matter measure what breaks in production. The ones that lie measure what looks good in a deck. Here is how to tell them apart.

Evals

Blog
Jun 1, 2026
10 min

Evals that predict production, not vanity

Most eval suites measure the wrong thing and pass right up until launch. Here is the harness I actually trust before I ship.

Evals · Engineering

Blog
May 31, 2026
11 min

The CRO's case for shipping smaller models

Revenue rarely rewards the biggest model. It rewards the one you can afford to run, ship, and explain to a customer.

Inference · Strategy

PoV
May 30, 2026
10 min

LLM-as-a-Judge: When to Trust It

LLM-as-a-judge is reliable for cheap, scaled, relative grading on tight rubrics. It breaks wherever its own biases contaminate the call. When to trust it.

Evals · Engineering

Blog
May 29, 2026
10 min

RAG Evaluation: Measuring Retrieval Before It Collapses

RAG evaluation works only when you score retrieval and generation separately on a frozen golden set. Here is how to catch recall decay before it ships.

RAG · Evals

Blog
May 28, 2026
10 min

When doing is cheap, deciding is everything

If generation costs approach zero, value migrates to whoever can tell good output from bad. What that does to a company.

Economics · Strategy

Blog
May 27, 2026
11 min

LLM Evaluation Tools Compared (From Production)

The right LLM evaluation tool depends on whether you need offline suites, online monitoring, or human labeling. Most teams need a thin layer they control.

Evals · Engineering

Blog
May 26, 2026
10 min

'A human reviews it' is not a plan

Putting a person in the loop feels safe and scales terribly. The reviewer becomes a bottleneck, then a rubber stamp, then a liability.

Evals · Agents

PoV
May 25, 2026
10 min

Why most RAG pipelines fail in month three

The demo retrieves perfectly. Then the corpus grows, the queries drift, and recall quietly collapses. Here is the gap, and how I close it.

RAG · Evals

Blog
May 24, 2026
12 min

How to Evaluate an AI Agent (Evals for Agents)

AI agent evals score the whole trajectory: tool calls, step efficiency, recovery, and goal state, not just the final answer. The harness that gates a deploy.

Evals · Agents

Blog
May 23, 2026
11 min

How to Measure (and Reduce) Hallucination

Measure hallucination as faithfulness against a source on a frozen set, then reduce it with grounding, constrained decoding, and calibrated abstention.

Evals · Engineering

Blog
May 22, 2026
10 min

An honest accounting of what agents can do today

Between the demos and the disappointment lies a narrow band of tasks where agents genuinely earn their keep.

Agents

Blog
May 21, 2026
11 min

The spec is the program now

When the model writes the implementation, the specification becomes the artifact you actually version and defend.

Engineering · AI-Native

PoV
May 20, 2026
10 min

Eval-Driven Development: The Test Suite Leads

Eval-driven development is TDD for probabilistic systems: write the eval first, gate every deploy on a frozen eval set, and treat the suite as the spec.

Evals · Engineering

PoV
May 19, 2026
10 min

Selling AI to people who have been burned by AI

Three years of inflated claims left buyers skeptical. That skepticism is an asset if you sell to it honestly.

GTM · Strategy

PoV
May 18, 2026
10 min

How to Build a Golden Eval Set From Production

A golden dataset for LLM evaluation is a frozen, versioned slice of real traffic with trusted reference answers, over-weighted toward the adversarial tail.

Evals · Engineering

Blog
May 17, 2026
9 min

What a team is for after the machine does the work

When generation is cheap, the org chart built for production is the wrong shape. Re-drawing it around judgment.

Leadership · AI-Native

Blog
May 16, 2026
10 min

How to Reduce LLM Inference Cost Without Wrecking Quality

Reduce LLM inference cost by right-sizing the model, caching what repeats, quantizing, trimming tokens, and batching. Here is the order to pull those levers, and what each one actually saves.

Engineering · AI-Native

Blog
May 15, 2026
12 min

RAG vs Fine-Tuning: When Each Wins in 2026

RAG vs fine-tuning is the wrong fight. RAG handles knowledge that changes; fine-tuning shapes behavior that persists. Here is when each wins, and why most teams end up shipping both.

RAG · Fine-Tuning

Blog
May 14, 2026
11 min

Prompt Caching: What It Is and When It Saves Money

Prompt caching reuses the already-computed prefix of a prompt so repeated tokens get billed at a deep discount. Here is when it saves money, and when it does not.

Inference · Engineering

Blog
May 13, 2026
11 min

LLM Model Routing: Cheapest Model That Can Do the Job

LLM model routing sends each request to the cheapest model that can handle it, escalating only when needed. Here is how it cuts cost without cutting quality.

Inference · Strategy

Blog
May 12, 2026
11 min

LLM Quantization: When 4-Bit Pays (and When It Bites)

LLM quantization stores a model at fewer bits per weight, cutting memory and cost. The trade-off: quality holds on most tasks and quietly breaks on a few.

Inference · Strategy

Blog
May 11, 2026
11 min

Semantic Caching for LLMs: When It Saves Money

Semantic caching reuses a past LLM answer for a question that means the same thing, even when the words differ. Here is when it saves money, and how it differs from exact prompt caching.

Inference · Engineering

Blog
May 10, 2026
11 min

LLM Token Optimization: Cut Token Cost, Keep Quality

LLM token optimization means cutting the tokens you generate and the tokens you send, in that order of payoff. Start with output, because output tokens are often priced 5x to 6x higher than input tokens.

Engineering · Inference

Blog
May 9, 2026
11 min

Hiring AI Engineers: The Definitive 2026 Guide

AI engineers are the hardest role on the market to fill. Here is what good actually looks like, what it costs, and how the bad hires fail.

AI Engineer Skills: What Actually Separates the Good Ones

AI Engineer Interview Questions That Reveal the Real Ones

AI Engineer Cost: What It Really Takes to Hire One

AI Engineer Job Description: What to Put In It

How to Vet AI Engineers: The Process That Predicts

Senior vs Junior AI Engineer: The Real Difference

In-House vs Outsourced AI Development: The Decision

Staff Augmentation vs Consulting: Who Owns the Outcome

AI Team Structure: The Roles You Need in 2026

When to Hire an AI Engineer (and When to Wait)

AI Engineer Red Flags: How to Spot a Bad Hire

AI Hiring Mistakes That Cost the Most (and the Fixes)

Building an AI Team: The Order You Actually Build It In

What Is an AI Engineer? The Role, Explained by a Hirer

AI Engineer vs ML Engineer: What Actually Differs

AI Engineer vs Data Scientist: Who to Hire When

AI Engineer vs Software Engineer: The Real Difference

What Is an LLM Engineer? The Role, Explained for Hirers

How to Hire an LLM Engineer (and What to Look For)

How to Hire an ML Engineer (and What to Look For)

How to Hire an MLOps Engineer (Without Getting Burned)

How to Hire a RAG Engineer Who Survives Production

How to Hire an AI Agent Developer (and Vet One)

How to Hire a Generative AI Engineer (What to Screen For)

How to Hire a Computer Vision Engineer: What to Look For

How to Hire an NLP Engineer (and What to Look For)

Hire a Prompt Engineer? When You Actually Need One

How to Hire an AI Solutions Architect (Without Regret)

How to Hire an AI Product Manager (What to Look For)

How to Hire a Python Developer for AI (What to Look For)

How to Hire a React Developer for AI Products

How to Hire a Node Developer for AI Products

How to Hire a Full-Stack AI Developer (Without Guessing)

How to Hire a DevOps Engineer for AI Workloads

How to Hire a Data Engineer (the AI Foundation)

How to Hire a Forward Deployed Engineer

How to Choose an AI Development Company