Agentic Coding: What Changes When the Machine Writes Code


Agentic coding is the AI-Native software development lifecycle in practice. The machine writes the implementation. The engineer specifies intent and evaluates the diff. Everything that changes about the job follows from that one swap: you stop authoring mechanism and start authoring constraint, then judging whether the generated code honors it.

I have run this loop in production long enough to be specific about what it does to an engineer's day, what it does to the P&L, and where it breaks. This is not the vendor version where you paste a prompt and ship the output. It is the harder, more durable version, and it has costs people are not pricing in. If you want a team that already runs it this way, you can hire AI engineers who have shipped agentic coding in production.

Agentic coding is not the machine helping you code. It is the machine doing the coding while your job contracts to intent and evaluation.

The reflexes that made you a good engineer in 2022 are now partly liabilities. Reading code for syntax. Writing the for-loop yourself because it is faster than explaining it. Trusting that working code is correct code. Each of those reflexes has to be rebuilt for a world where a model produces plausible code far faster than you can validate it.

Key takeaways

If you read nothing else, read these.

Agentic coding moves your job from writing to specifying and evaluating. The agent plans, writes, runs, and revises; you author intent and audit the diff against it.
The spec is the program, and the bottleneck. A coding agent fills every gap you leave with a plausible wrong answer, so bugs migrate into what you forgot to specify.
Validation, not generation, is the new constraint. Commit volume rises three to four times while review stays serial, so the queue backs up and the quality bar quietly drops.
Faster generation is not faster delivery. A 2025 METR trial found experienced developers were about 19% slower with AI tools while believing they were faster.
Evaluation has to be a funded role. "A senior engineer will skim it" is not a plan when the agent ships candidate bugs at machine speed.

What agentic coding actually means

Agentic coding is software development where an AI coding agent plans, writes, runs, and revises code across multiple steps, while a human specifies the intent and evaluates the result. The word that matters is agentic. The model is not autocompleting your next line. It is taking a goal, breaking it into steps, calling tools, reading errors, and iterating until it believes the task is done.

That is a different thing from AI-assisted coding, where you still drive and the model suggests. The distinction is the whole point of the AI-Native thesis: AI-Native means the machine does the job, not that it helps you do yours. Agentic coding is that thesis applied to the one domain where it has matured fastest, and it inherits the same reliability rules as the broader pattern I lay out in the guide to AI agents and agentic workflows.

The progress is real and worth stating plainly. On SWE-bench Verified, a human-validated subset of real GitHub issues, top coding agents now resolve well over 80% of tasks per the public SWE-bench leaderboard. When the original SWE-bench paper landed in late 2023, the best model it tested resolved 1.96% of the full benchmark. The labs have already moved on to harder benchmarks because the old one is partly saturated.

The machine can write the code. The open question is whether you can evaluate it fast enough to ship it safely.

The spec is the program, and now it is the bottleneck

When the machine writes the implementation, the artifact you actually author is the specification. I have argued this at length in the spec is the program now: the thing you write before the model runs becomes the source of truth you version and defend. Agentic coding is where that stops being a thesis and starts being a daily constraint.

Here is the mechanism that surprises people. A coding agent will produce a plausible implementation for any gap you leave in the spec. It does not pause to ask; it fills in.

So the bugs migrate. They are no longer in the code the model wrote for things you specified. They are in the code the model wrote for things you forgot to specify.

That moves the hard work earlier. The creative act is now the precise statement of intent, including the edge cases and the explicit list of what not to build. A spec written for a model implementer has to be more exact than one written for a senior human, because the human brings context and the model brings priors.

# Intent (what the agent must implement)
Validate a webhook payload before it touches the queue.

# Constraints the agent must honor
- Reject any payload without a verified HMAC signature.
- On a malformed body, return 400 and log; never enqueue.
- Signature check must run before JSON parsing.

# What NOT to build
- Do not add a retry loop here; retries live in the consumer.
- Do not log the raw payload body (it carries PII).

That last block, the explicit exclusions, is the part most people skip and most regressions come from. The model has strong opinions about what a webhook handler "usually" includes. If you do not say "do not log the body," it will helpfully log the body, and you will have shipped a privacy incident with a green test suite.
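To make the spec concrete, here is a minimal sketch of a handler that honors every clause above. It is an illustration, not the implementation from any real system: the function name `handle_webhook`, the in-memory `queue` list, the SHA-256 choice, and the 401 and 202 status codes are all my assumptions, since the spec only fixes the 400 case.

```python
import hashlib
import hmac
import json
import logging

logger = logging.getLogger("webhook")


def handle_webhook(body: bytes, signature_hex: str, secret: bytes, queue: list) -> int:
    """Return an HTTP status code; enqueue only verified, well-formed payloads.

    Hypothetical sketch: names, status codes other than 400, and the list-based
    queue are assumptions made for illustration.
    """
    # Constraint: the signature check runs before any JSON parsing.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        logger.warning("webhook rejected: bad signature")
        return 401  # assumed status; the spec only says "reject"

    try:
        payload = json.loads(body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        # Constraint: malformed body returns 400 and logs; it is never enqueued.
        # Exclusion: log the size only, never the raw body (it carries PII).
        logger.warning("webhook rejected: malformed body (%d bytes)", len(body))
        return 400

    # Exclusion: no retry loop here; retries live in the consumer.
    queue.append(payload)
    return 202  # assumed status for "accepted for processing"
```

Each comment maps to one clause of the spec, which is the point: when you audit the diff, you should be able to draw that line for every constraint and every exclusion.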

Evaluating the diff is the new core skill

The reflex that changes most is how you read a diff. You are no longer hunting for syntax errors. The model does not make those. You are auditing whether plausible-looking code actually does what the spec says, clause by clause. That is a spec audit, not a code review, and it is a harder cognitive task than most engineers expect.

This is the judgment economy showing up in your editor. When generation is cheap, the scarce, defensible skill is telling good output from bad. When doing is cheap, deciding is everything, and in agentic coding the deciding is reading machine-authored diffs with the right posture. The teams that win are not the ones generating the most code. They are the ones who can evaluate it fastest without lowering the bar.

The mechanical move that makes this tractable is invariant-first testing. Write tests against the behavior that must always hold, not against the mechanism. The model regenerates the mechanism constantly. If your tests are coupled to implementation, you rewrite them on every run. If they encode invariants, they survive regeneration and actually tell you something when they go red.
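A small sketch of what that looks like, using a pagination helper I made up for the purpose. The two `paginate_*` functions stand in for two regenerations of the same mechanism; `check_invariants` is the part you actually own. All names here are hypothetical.

```python
from typing import Callable, List, Sequence

Paginator = Callable[[Sequence[int], int], List[List[int]]]


def paginate_slices(items: Sequence[int], page_size: int) -> List[List[int]]:
    """One possible mechanism: slicing."""
    return [list(items[i:i + page_size]) for i in range(0, len(items), page_size)]


def paginate_loop(items: Sequence[int], page_size: int) -> List[List[int]]:
    """A regenerated mechanism: explicit accumulation."""
    pages: List[List[int]] = []
    current: List[int] = []
    for item in items:
        current.append(item)
        if len(current) == page_size:
            pages.append(current)
            current = []
    if current:
        pages.append(current)
    return pages


def check_invariants(paginate: Paginator) -> None:
    """Assert behavior that must hold for any correct implementation."""
    for n in range(0, 12):
        for page_size in range(1, 5):
            items = list(range(n))
            pages = paginate(items, page_size)
            # Invariant 1: nothing is lost, duplicated, or reordered.
            assert [x for page in pages for x in page] == items
            # Invariant 2: no page is empty or larger than page_size.
            assert all(1 <= len(page) <= page_size for page in pages)


# Both mechanisms pass the same suite without any test being rewritten.
check_invariants(paginate_slices)
check_invariants(paginate_loop)
```

Neither test mentions slicing or loops, so the suite survives the swap. A test that asserted "the function calls `range` with a step" would have broken on the second version while telling you nothing about correctness.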

You are not reviewing whether the code looks correct. You are auditing whether plausible code does what the spec said, clause by clause.

Where agentic coding breaks

I will be blunt about the failure modes, because the honest accounting is the differentiator. Agentic coding does not break the way the demo suggests. It breaks downstream, where you are not looking.

Validation becomes the bottleneck. The agent writes code faster than your team can review it. AI-generated pull requests now sit in review far longer than human ones, and code review is already the slowest stage of most pipelines. You did not remove the constraint. You moved it from writing to reviewing, and review does not parallelize the way generation does.

This is the trap that catches most teams in their first quarter. The dashboard looks great and commit volume is up three to four times. Then the review queue backs up, reviewers start rubber-stamping to clear it, and the quality bar quietly drops at exactly the moment throughput went up.

A rubber-stamp review is worse than no review, because it launders unevaluated code as approved. The fix is not more reviewers. It is automated evals that catch the failure modes before a human ever opens the diff.

A staff engineer I worked with, Priya, ran the numbers on her team after a quarter on coding agents. Merged pull requests had roughly tripled, from about 40 a week to 120. The open-review backlog had also tripled, and the median PR now sat 31 hours in review instead of 9. The team felt slower, not faster, and they were right.

We stopped measuring throughput and built a pre-review eval gate: schema and invariant checks plus a security lint that blocked any diff carrying a known vulnerability class. Within three weeks the backlog cleared, because reviewers only opened diffs that had already passed the machine.

Security and quality regress quietly. Independent testing has found that a large share of AI-generated code introduces vulnerabilities. Veracode's 2025 GenAI code security evaluation of over 100 models found that about 45% of generated samples introduced an OWASP Top 10 flaw. The agent optimizes for "passes the test," not "is safe to run." If your evals do not cover the failure mode, the agent will sail right past it.

The fast lane can be slower. A 2025 METR randomized trial found that experienced open-source developers were about 19% slower with AI tools, while believing they were 20% faster. The perception gap is the dangerous part. Time saved generating gets spent, and then some, cleaning up. Agentic coding pays off in specific conditions, not universally.

Skill atrophy is a real second-order cost. When the machine writes the implementation, junior engineers lose the reps that used to build judgment. The two-track outcome is visible already: senior engineers capture most of the gains because they can evaluate output; newer engineers ship agent code they cannot yet judge. That is an organizational problem, and it lands later, because the pipeline that built senior judgment ran on exactly the reps the machine now absorbs.

The revenue lens: faster generation, slower validation

Here is the both-seats version, because the engineering decision is a P&L decision. Agentic coding lowers the cost of producing code toward zero. It does not lower the cost of being wrong. A privacy leak, a CVE, or a silent data-corruption bug costs the same as it always did, and you are now producing candidate bugs at three to four times the rate.

So the unit economics flip. The expensive resource is no longer engineer-hours spent typing. It is senior judgment spent validating, plus the rework and incidents from whatever judgment you skipped. A team that treats agentic coding as "we ship more features now" without funding the validation layer is borrowing against its own reliability. The bill arrives as technical debt and production incidents, on a delay.
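A back-of-envelope way to see the flip. This is my own simplification, not a measured model, and every symbol is an assumption: n is diffs shipped per period, c_g the generation cost per diff, c_v the validation cost per diff, p the probability that a shipped diff carries a defect that escapes, and c_i the average cost of the resulting incident.

```latex
\[
  C_{\text{total}} \;=\; n \,\bigl( c_g + c_v + p \, c_i \bigr)
\]
```

Agentic coding pushes c_g toward zero and pushes n up three to four times. If p and c_i stay where they were, the incident term n p c_i rises by the same three to four times. The only lever left is spending on c_v in ways that actually lower p.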

A fintech founder named Daniel learned this on a delay. His team shipped an agent-written reconciliation job that passed its tests and ran clean for six weeks, then silently mis-rounded a fraction of cents on about 2,000 transactions a day until a customer caught it. The generation had cost almost nothing; the cleanup, the audit, and the trust repair cost weeks.

The bug lived in a rounding rule nobody specified and no test asserted, exactly the gap a coding agent fills with a confident guess. After that, every money-touching diff went through an invariant test and a named reviewer, and the incident did not recur.
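I cannot show the actual rule that job got wrong, so here is a hypothetical stand-in for the same class of bug: splitting an amount into shares. `split_naive` is the confident guess, `split_exact` is one correct alternative, and `conserves_money` is the invariant that tells them apart. All three names are invented for this sketch, and it assumes totals arrive as whole cents.

```python
from decimal import Decimal
from typing import Callable, List

CENT = Decimal("0.01")

Splitter = Callable[[Decimal, int], List[Decimal]]


def split_naive(total: Decimal, shares: int) -> List[Decimal]:
    """Plausible but wrong: rounds each share independently."""
    return [(total / shares).quantize(CENT)] * shares


def split_exact(total: Decimal, shares: int) -> List[Decimal]:
    """Hand leftover cents to the first shares so the parts sum to the total.

    Assumes `total` is already a whole number of cents.
    """
    total_cents = int(total / CENT)
    base, extra = divmod(total_cents, shares)
    return [Decimal(base + (1 if i < extra else 0)) * CENT for i in range(shares)]


def conserves_money(split: Splitter, total: Decimal, shares: int) -> bool:
    """Invariant: no cent is created or destroyed by the split."""
    return sum(split(total, shares)) == total
```

Splitting 10.00 three ways, the naive version produces three shares of 3.33 and loses a cent, while the exact version produces 3.34, 3.33, 3.33. Both look reasonable in a diff. Only the invariant catches the difference, which is why it belongs in the spec and the test suite before the agent runs.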

The teams getting real ROI do the unglamorous thing. They invest the saved generation time into evaluation, observability, and tighter specs, and they keep a senior person accountable for every diff that touches money, customer data, or auth. If you would rather have that observability built in from day one than bolted on after an incident, that is the work Devlyn does on AI observability and monitoring. It is the same discipline I argue for across all agent work in my honest accounting of what agents can do today, and at book length in Agents That Actually Work.

How to run the loop so it holds

The agentic coding loop that survives production is short and strict: specify intent, generate a diff, evaluate against the spec, tighten, repeat. Four practices keep it honest.

Write the spec to the model, not past it. State the invariants and the explicit exclusions. Assume the model fills every gap with a plausible wrong answer.
Bound the agent's blast radius. Least-privilege tools, no production write access, a human gate on anything irreversible. The same discipline that keeps agentic workflows safe applies inside the editor.
Test invariants, not mechanism. Your suite should stay green across regenerations and go red only when behavior actually breaks.
Make evaluation a funded role, not a hope. "A senior engineer will skim it" is not a plan at three to four times the throughput.
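The second practice is the easiest to put in code. A minimal sketch, assuming a harness where every tool call passes through one authorization point; `ToolPolicy`, `ToolDenied`, and the tool names are hypothetical and do not correspond to any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


class ToolDenied(Exception):
    """Raised when the agent requests a tool outside its policy."""


def deny_all(tool: str, arg: str) -> bool:
    """Default approver: nothing irreversible runs without an explicit gate."""
    return False


@dataclass
class ToolPolicy:
    allowed: Set[str]                                     # least-privilege allowlist
    irreversible: Set[str] = field(default_factory=set)   # tools needing a human gate
    approve: Callable[[str, str], bool] = deny_all        # the human gate itself

    def authorize(self, tool: str, arg: str) -> None:
        """Raise ToolDenied unless the call is allowed and, if needed, approved."""
        if tool not in self.allowed:
            raise ToolDenied(f"{tool}: not on the allowlist")
        if tool in self.irreversible and not self.approve(tool, arg):
            raise ToolDenied(f"{tool}: human approval required")


policy = ToolPolicy(
    allowed={"read_file", "run_tests", "git_push"},
    irreversible={"git_push"},
)
policy.authorize("run_tests", "tests/")  # allowed and reversible, so it passes
```

The default approver denies everything, so forgetting to wire up the human gate fails closed rather than open.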

One more reflex worth naming: stop treating a long, fully autonomous agent run as the goal. A short loop with a human checkpoint between steps beats a long one that wanders, because the compounding error across many unsupervised steps is what produces the confident, plausible, wrong result. Bound the run, check the diff, then let it continue. Demos reward the long autonomous run. Production rewards the bounded one.
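Here is one way to express "bound the run, check the diff, then let it continue" as a driver loop. It is a sketch under assumptions: `step` stands in for whatever produces the agent's next diff, `checkpoint` stands in for the human or eval gate, and neither corresponds to a real agent API.

```python
from typing import Callable, List, Optional


def run_bounded(
    step: Callable[[List[str]], Optional[str]],
    checkpoint: Callable[[str], bool],
    max_steps: int = 5,
) -> List[str]:
    """Run agent steps until the agent is done, a diff is rejected, or the budget ends.

    `step` receives the diffs accepted so far and returns the next diff, or None
    when it believes the task is complete. `checkpoint` returns True to accept.
    """
    accepted: List[str] = []
    for _ in range(max_steps):
        diff = step(accepted)
        if diff is None:
            break  # the agent believes it is done
        if not checkpoint(diff):
            break  # stop at the first rejected diff instead of building on it
        accepted.append(diff)
    return accepted
```

Stopping at the first rejection is the deliberate choice. Every later step would be conditioned on a diff a reviewer already refused, which is exactly how errors compound across an unsupervised run.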

The full lifecycle version of this, from intake through incident response, runs on one core move that stays the same everywhere: author intent, generate mechanism, evaluate against intent. The deeper production version of that discipline is the argument of my book Agents That Actually Work. Master the loop and the specific coding agent you use becomes an implementation detail you can swap when a better one ships.

Frequently asked questions

What is agentic coding?

Agentic coding is software development where an AI coding agent plans, writes, runs, and revises code across multiple steps, while a human specifies the intent and evaluates the result. It differs from AI-assisted coding, where the human still drives and the model only suggests. In agentic coding the machine does the implementation; the engineer's job contracts to writing a precise spec and auditing the generated diff against it.

Is agentic coding the same as vibe coding?

No. Vibe coding usually means accepting model output without rigorous evaluation, which is exactly how you ship the kind of OWASP Top 10 flaw that Veracode found in about 45% of generated samples. Agentic coding done well is the opposite: a disciplined loop of explicit intent, bounded tools, invariant-based tests, and a senior human evaluating every diff that touches money, data, or auth.

Does agentic coding actually make engineers faster?

Sometimes, and not universally. Benchmarks show agents resolving most real GitHub issues, but a 2025 METR trial found experienced developers were roughly 19% slower with AI tools while believing they were faster. The gain is real for well-specified, well-bounded tasks and disappears when validation and cleanup eat the time you saved generating. Faster generation is not faster delivery unless your evaluation layer keeps up.

What skills matter most in agentic software development?

Writing precise specifications and evaluating machine-authored diffs. When the model writes the mechanism, value migrates to whoever can state intent exactly and tell correct output from plausible-but-wrong output. Reading diffs against a spec clause by clause, and encoding invariants as tests, are now core engineering skills rather than nice-to-haves.

If you are turning agentic coding into a system that has to hold under real load, with evaluation and security built in rather than bolted on, that is the work my team does. See how a Devlyn AI engineering team ships AI-Native software with specs, bounded agents, and evals from day one. The machine writes the code. Making it safe to ship is still the job.