How to Hire a RAG Engineer Who Survives Production

Most RAG engineers can demo retrieval. Few can keep recall from collapsing in production. Here is how to hire the second kind, what they own, and what it costs.

When you hire a RAG engineer, you are not hiring someone to wire up a vector database and a chat box. You are hiring the one person on your team who owns whether the system retrieves the right evidence before the model ever opens its mouth. That is the job. Everything else, the framework names, the embedding model, the orchestration library, is downstream of it, and most candidates who interview well have it backwards.

I have hired and deployed more than 80 senior AI engineers at Devlyn and shipped over 200 products with them. A large share were retrieval systems. The single most expensive hiring mistake in this role is hiring someone who can build a demo on a clean corpus, then discovering in month three that they have no idea why recall is collapsing on real traffic. The demo is the easy 80%; the 20% that keeps a RAG system working after contact with a messy corpus and real users is the entire reason the role exists. This piece is about how to tell the two apart before you sign an offer. If you want the broader role first, start with my definitive guide to hiring AI engineers.

Key takeaways:
A RAG engineer owns retrieval quality, not the demo. The job is keeping the right evidence in front of the model on real, drifting traffic, not making a clean corpus look good in a sprint review.
Screen for the recall-collapse instinct. Anyone can stand up a vector search. The hire you want is the one who reaches for an eval harness and a recall number before they touch the prompt.
Chunking, embeddings, retrieval, and evaluation are the four surfaces. A strong candidate can reason about trade-offs in each and tell you which one is failing from a symptom.
Retrieval is the lever, not the prompt. When answers are weak, the fix is almost always in what got retrieved, and a strong candidate knows that without being led to it.
Cost varies more by sourcing model than by seniority. A senior in-house hire, staff augmentation, and a scoped pod are different price-and-speed trade-offs, not different quality tiers.

What a RAG engineer actually owns

The clearest way to scope this role is by the surfaces a retrieval engineer is accountable for. There are four, and a candidate who cannot speak fluently to all four is a generalist who has done some RAG, not a retrieval specialist.

Chunking. How the corpus gets split before it is embedded. This sounds trivial and is not. Chunking choice alone can swing retrieval recall by roughly 8 to 9 percentage points, and the default chunkers in popular libraries often underperform purpose-built strategies (Chroma's chunking research). A strong RAG engineer treats chunking as a tunable decision tied to document structure and query patterns, not a setting they accept from a tutorial.
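To make "chunking is a tunable decision" concrete, here is a minimal Python sketch contrasting a fixed-size splitter with one that respects paragraph boundaries. The function names, the sample text, and the character budgets are all hypothetical, chosen only for illustration; a real system would pick a strategy and then measure it against retrieval recall.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split on a fixed character budget, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


# Hypothetical three-paragraph document.
sample = (
    "Refunds are issued within 14 days.\n\n"
    "Shipping takes 3 to 5 business days.\n\n"
    "Contact support by email."
)

fixed = fixed_size_chunks(sample, 40)       # cuts mid-paragraph
structured = paragraph_chunks(sample, 40)   # one chunk per paragraph here
```

With this sample, the fixed-size splitter's first chunk ends partway into the shipping paragraph, while the paragraph-aware splitter keeps each policy statement intact. Which one retrieves better on your corpus is an empirical question, which is the point.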

Embeddings. Which model turns text into vectors, and how that choice degrades at the extremes of chunk size and vocabulary drift. The hire you want knows that an embedding model is not a permanent decision, that the corpus will move underneath it, and that re-embedding cadence is an operational question with a cost attached.

Retrieval. The actual lookup: dense vector search, sparse keyword matching such as BM25, the hybrid of the two, and a reranker to reconcile them. Pure dense retrieval is a fine baseline that can degrade sharply as the corpus grows. A retrieval engineer who only knows cosine similarity is a retrieval engineer who has never watched a system age.
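One common way to combine a dense ranking with a sparse one is reciprocal rank fusion (RRF). I am naming it here as one option, not as the only or required way to do hybrid retrieval. The sketch below assumes each retriever has already returned an ordered list of document ids; the ids and the two input rankings are invented, and the constant k = 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids into one list using RRF.

    Each id scores 1 / (k + rank) per list it appears in (rank starts at 1).
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a dense retriever and a BM25 retriever.
dense_ranking = ["d3", "d1", "d7"]
sparse_ranking = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that both retrievers agree on rise to the top, and a document found by only one retriever still survives into the fused list. A reranker, if you use one, would then reorder the head of this list.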

Evaluation. The part everyone skips and the part that separates the role from a weekend project. Recall@k, context recall, faithfulness, a frozen golden set of query-chunk pairs run on a schedule. Without this, nobody can tell you whether retrieval is working, only whether the demo looked good. I go deeper on the measurement side in how to evaluate RAG.
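For readers who have not built one, a golden-set check can be very small. The sketch below computes recall@k over a frozen mapping of queries to the chunk ids that should be retrieved. Everything in it is an assumption for illustration: the query strings, the chunk ids, and the stub retriever standing in for a real retrieval call.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden query needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    golden_set: dict[str, set[str]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k across every query in the golden set."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in golden_set.items()]
    return sum(scores) / len(scores)


# Hypothetical frozen golden set: query -> chunk ids that should be retrieved.
golden_set = {
    "refund window": {"c1", "c2"},
    "shipping time": {"c5"},
}

# Stub retriever with canned rankings, standing in for the real system.
canned_results = {
    "refund window": ["c1", "c9", "c4"],
    "shipping time": ["c7", "c5", "c2"],
}


def stub_retrieve(query: str) -> list[str]:
    return canned_results[query]


score = mean_recall_at_k(golden_set, stub_retrieve, k=2)
```

Swapping the stub for the production retriever and running this on a schedule is the difference between knowing retrieval works and hoping it does.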

The skills and signals that separate strong from weak

Resumes in this role are nearly useless, because the vocabulary is cheap. Everyone lists vector databases, LangChain, embeddings, RAG. None of those words tell you whether the person can keep a system alive. The signals that matter are about judgment under failure, and you have to dig for them.

The strongest signal is an eval-first instinct. Describe a RAG system giving weak answers and watch where the candidate goes. The weak candidate reaches for the prompt, suggests a bigger model, or proposes adding more context. The strong candidate asks what recall looks like, whether the right chunks are even in the retrieved set, and how you are measuring it. That reflex, to interrogate retrieval before generation, is the whole job in one reaction.

The second signal is comfort with the unglamorous operational layer. Parsing is part of retrieval; a PDF that extracts as garbage will never retrieve well no matter how good the embeddings are. Freshness, reindexing, deletion, and permission-aware retrieval (so users cannot pull documents they should not see) are the parts that break in production and never show up in a demo. A candidate who lights up about parsing and reindexing has shipped real systems. A candidate who treats that work as beneath them has not.
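Permission-aware retrieval is easier to discuss with a concrete shape in front of you. The sketch below is one simple way to do it, assuming each chunk carries a set of groups allowed to read it; the Chunk class, its fields, and the group names are hypothetical. Many vector stores can apply this kind of filter inside the query itself, which is usually preferable to filtering in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    score: float
    allowed_groups: frozenset[str]  # hypothetical ACL attached at index time


def permitted_top_k(candidates: list[Chunk], user_groups: set[str], k: int) -> list[Chunk]:
    """Drop chunks the user may not read, then return the k highest scoring.

    Filtering happens before truncation so restricted chunks never take up
    one of the k slots.
    """
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]


# Hypothetical candidates returned by a retriever.
candidates = [
    Chunk("a", 0.9, frozenset({"hr"})),
    Chunk("b", 0.8, frozenset({"eng"})),
    Chunk("c", 0.7, frozenset({"eng", "hr"})),
    Chunk("d", 0.6, frozenset({"eng"})),
]

results = permitted_top_k(candidates, user_groups={"eng"}, k=2)
result_ids = [c.chunk_id for c in results]
```

The highest-scoring chunk is excluded because the user is not in the group allowed to read it. A useful interview follow-up is asking what happens when a document's permissions change after it has been indexed.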

The third signal is honesty about trade-offs. Hybrid retrieval is more robust but more expensive to tune; reranking improves quality but adds latency; re-embedding the corpus improves recall but costs money and engineering time. A strong candidate names the cost on the other side of every improvement, while a weak one talks about retrieval as if quality were free. If you want the framework I use to separate observable judgment from resume keywords across the whole market, it is in the AI engineer skills breakdown.

How to vet one: signal, test, and what good sounds like

Here is the screening matrix I actually use. Each row is a signal that matters, the test that surfaces it, and the difference between a strong and a weak answer. Run these in a working session against a real or realistic corpus, not as trivia.

Signal	How to test it	Strong answer	Weak answer
Eval-first instinct	"Answers are wrong. What do you check first?"	Looks at recall and whether the right chunks were retrieved before touching the prompt	Suggests a bigger model or a better prompt
Chunking judgment	"How would you chunk these mixed documents?"	Ties chunk strategy to document structure and query patterns; expects to measure it	Names a fixed token size from a tutorial and stops
Retrieval depth	"Dense retrieval is missing results. Now what?"	Reaches for hybrid (BM25 + dense) and a reranker, explains the trade-offs	Adds more dense results or raises the top-k blindly
Operational ownership	"The corpus changes weekly. What breaks?"	Talks reindexing, freshness, deletion, re-embedding cadence, permissions	Assumes the index is set-and-forget
Recall under drift	"Recall was 0.9, now it is 0.7. Diagnose it."	Walks corpus drift, embedding staleness, chunking mismatch, eval gaps methodically	Has never seen recall move and improvises

The bottom row is the one that matters most, and it is the hardest to fake. Someone who has operated a real RAG system has watched recall degrade and had to diagnose it under pressure. Someone who has only built demos has never seen the number move, because demos run on frozen corpora. That gap is visible in seconds once you ask the right question.
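A methodical diagnosis of a recall drop usually starts by slicing the eval results to see where the loss is concentrated. The sketch below shows the idea under stated assumptions: each golden query has been tagged with a slice label and marked as a hit or a miss. The slice names and the counts are invented for illustration.

```python
from collections import defaultdict


def recall_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Hit rate per slice, given (slice_name, was_hit) pairs from an eval run."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for slice_name, was_hit in results:
        totals[slice_name] += 1
        if was_hit:
            hits[slice_name] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical eval run: queries about older documents versus queries
# about documents added after the last full reindex.
eval_results = (
    [("old_docs", True)] * 9
    + [("old_docs", False)] * 1
    + [("new_docs", True)] * 2
    + [("new_docs", False)] * 3
)

per_slice = recall_by_slice(eval_results)
```

In this made-up run, older documents still retrieve well and newer ones do not, which points toward freshness or reindexing rather than the embedding model. Slicing by document type, query pattern, or source system follows the same shape.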

The failure mode to screen hardest for: they can demo RAG but cannot keep recall from collapsing

This is the section I would tattoo on a hiring manager's wrist if I could. The most common, most expensive RAG hiring failure is hiring someone who builds an impressive demo and cannot keep it working three months later. It is so common because the demo is genuinely the easy part now. Tooling has made standing up a retrieval pipeline a weekend exercise. The hard part is everything that happens after real users and a real corpus arrive.

Here is an illustrative composite, NDA-safe, of how it goes wrong. A team hires a sharp engineer who ships a RAG assistant in three weeks, and in the demo it answers internal-knowledge questions cleanly, so everyone is thrilled. Then the corpus grows, documents get edited and deleted, new query patterns show up, and answers quietly get worse, but nobody notices for weeks because there is no eval harness, only vibes. By the time a customer complains, recall has drifted from strong to mediocre, and the engineer's only move is to keep editing the prompt, because retrieval was never the thing they actually understood.

The fix is upstream, in the hire. The engineer who survives this builds the retrieval eval harness before shipping, not after. They put recall@5 on a dashboard next to latency and error rate. They expect the corpus to drift and they instrument for it. This is the central argument of my book on RAG that survives contact with production, and it is also exactly the problem Devlyn's retrieval engineers are hired to solve when an in-house demo has quietly stopped working.
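Putting recall on a dashboard only helps if something fires when it moves. Here is a minimal sketch of a regression check that compares the latest scheduled recall@5 run against the average of earlier runs. The function name, the 0.05 tolerance, and the sample history are assumptions for illustration, not recommended values.

```python
def recall_regressed(history: list[float], current: float, tolerance: float = 0.05) -> bool:
    """Return True when current recall is more than `tolerance` below
    the mean of previous runs. With no history there is nothing to compare."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return current < baseline - tolerance


# Hypothetical recall@5 values from earlier scheduled runs on the golden set.
previous_runs = [0.90, 0.91, 0.89]

alert_on_drop = recall_regressed(previous_runs, current=0.70)
alert_on_noise = recall_regressed(previous_runs, current=0.88)
```

A drop from about 0.9 to 0.7 trips the check, while ordinary run-to-run noise does not. The exact threshold matters less than having one at all.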

If you screen for nothing else, screen for this: has this person operated a retrieval system long enough to watch it degrade, and do they reach for measurement instead of the prompt when it does? The Anthropic team's own work shows how much retrieval engineering moves the needle: combining contextual embeddings, hybrid search, and reranking cut their top-20 retrieval failure rate by 67% (Anthropic's contextual retrieval research). That is the work your hire either does or does not know how to do. The prompt is rarely the lever. Retrieval almost always is.

Where to find a RAG engineer, and what it costs

There are three sourcing models, and they are different trade-offs of price, speed, and commitment, not different quality tiers. The mistake is treating cost as a quality signal. It is not.

In-house senior hire. A full-time senior retrieval engineer in the US market is a substantial commitment, and the loaded cost is well beyond the salary line once you add benefits, equity, and the months of sourcing in the tightest hiring market there is. This is right when retrieval is core to your product and will be for years. It is wrong when you need the system fixed this quarter, because hiring takes longer than the problem will wait. I break down the real loaded numbers in what an AI engineer actually costs.

Staff augmentation. A senior retrieval engineer embedded in your team on a monthly basis. Faster to start, no long-term liability, and you keep direction. The trade-off is that you own the management overhead and the architecture decisions stay yours. This fits when you have a clear plan and need senior hands to execute it.

A scoped pod. A small team that owns the retrieval problem end to end, architecture through eval harness through maintenance cadence. More expensive per month than a single contractor, far cheaper than a bad in-house hire who ships a demo and leaves you with a system nobody can diagnose. This fits when retrieval quality is urgent and you would rather buy the outcome than manage the inputs. It is the model behind Devlyn's retrieval engineering, and it exists precisely because the in-house version of this role takes too long to fill when production is already slipping.

When you actually need one, and when you do not

Not every team that thinks it needs a RAG engineer does. The honest filter is whether retrieval quality is on your critical path to revenue or trust. If a wrong answer costs you a customer, a deal, or a compliance problem, retrieval is core and you need someone who owns it. If you are prototyping and nobody is depending on the answers yet, you do not need a specialist; you need a working baseline and the discipline to instrument it before it matters.

There is also a build-versus-buy question hiding inside this. If your retrieval need is generic (document search over a clean corpus with forgiving accuracy requirements), an off-the-shelf tool may carry you further than a hire. The moment your corpus is messy, permission-sensitive, or high-stakes, generic tools stop being enough and the specialist earns their cost. The decision between owning the retrieval stack and reaching for a model's parametric knowledge is its own question; I cover the adjacent version in RAG versus fine-tuning.

One more honest case: sometimes the right move is neither hire nor tool but a short audit. A two-week retrieval audit by someone senior can tell you whether your problem is chunking, embeddings, retrieval, or evaluation, and that diagnosis is worth more than a year of guessing. It is also a low-commitment way to see how a candidate or a vendor actually thinks before you commit to a longer engagement.

The mistakes that cost the most

The first mistake is hiring for framework names. A resume full of vector databases and orchestration libraries tells you the person has read the docs, not that they can keep a system alive. The frameworks are learnable in a week. The judgment about why recall collapsed is not, and that is the thing you are actually paying for.

The second mistake is hiring someone who blames the prompt. When a RAG system gives weak answers, the reflexive fix is to rewrite the prompt or upgrade the model, because those are the visible knobs. But the evidence the model was handed is upstream of both, and if retrieval handed it the wrong chunks, no prompt will save it. A candidate who instinctively reaches for the prompt is telling you they do not understand where the failure lives.

The third mistake is hiring without an eval plan in place. If you cannot measure recall, you cannot tell whether your new hire is helping or quietly making things worse, and you will find out only when a customer does. Build the golden set and the recall dashboard as part of onboarding, not as a someday project. The cost of skipping it compounds, and I have watched that compounding turn a fine month-one system into a month-three liability, the pattern I unpack in why RAG breaks in month three.

The fourth mistake is optimizing the wrong thing. Teams obsess over which embedding model is two points better on a benchmark while ignoring that their chunking is naive and their corpus is stale, or they pour effort into latency tricks like semantic caching before retrieval quality is even stable. Get retrieval right first; the optimizations matter, but only on top of a system that retrieves the right evidence in the first place. For the architectural version, where the system decides how to query rather than firing one fixed lookup, see agentic retrieval.

Frequently asked questions

What does a RAG engineer do?

A RAG engineer owns whether a retrieval-augmented system finds the right evidence before the model generates an answer. Concretely, that means chunking the corpus, choosing and maintaining embeddings, designing retrieval (dense, sparse, hybrid, reranking), and building the evaluation harness that proves recall is holding up. The role is defined by retrieval quality on real traffic, not by the demo.

How do I tell a strong RAG engineer from a weak one in an interview?

Describe a RAG system giving weak answers and watch where they go first. A strong candidate interrogates retrieval, asking about recall and whether the right chunks were even retrieved, before touching the prompt or the model. A weak candidate reaches for a bigger model or a prompt rewrite. The reflex to measure retrieval before blaming generation is the single most reliable signal.

How much does it cost to hire a RAG engineer?

It depends more on the sourcing model than on seniority. A full-time senior in-house hire carries a high loaded cost and a long time-to-fill in a tight market. Staff augmentation is faster and lower-commitment but leaves management and architecture to you. A scoped pod costs more per month than a single contractor but buys the outcome end to end, and is usually cheaper than a bad in-house hire who leaves you with an undiagnosable system.

Do I need a dedicated retrieval engineer, or can a generalist handle RAG?

If retrieval quality is on your critical path to revenue or trust, hire the specialist. A generalist can stand up a working baseline, but the operational layer (freshness, reindexing, permission-aware retrieval, and the discipline to keep recall from drifting) is where generic experience runs out. The messier and higher-stakes your corpus, the more a specialist earns their cost.

If you have a retrieval system that demos well and is quietly getting worse, or you are about to build one and want it instrumented from day one, that is exactly what Devlyn's retrieval engineers do. And if you want the full operating manual for the role first, RAG That Survives walks through the eval harness, the failure modes, and the maintenance cadence end to end. Hire for the month-three system, not the demo.