AI Engineer Red Flags: How to Spot a Bad Hire

The AI engineer red flags that predict a bad hire: no evals, a demo that never shipped, a resume of buzzwords. Here is how to surface each one before you sign.

The AI engineer red flags that actually predict a bad hire are not the ones recruiters screen for. The strongest single signal is a candidate who has never run an eval and cannot tell you how they knew their system worked. Close behind it: a portfolio that is all demo and no production, a resume that lists every framework and explains none of them, and an inability to walk you honestly through a failure. None of those show up in a keyword match, yet all of them surface in twenty minutes if you ask the right question.

I sit in two seats: I hire AI engineers, and I deploy them on live customer-facing systems, so I see both the interview and the eighteen months after it. That second seat is the one most hiring guides are missing. A candidate who interviews beautifully and then cannot ship a system that holds a p95 latency target inside a real cost budget is not a near-miss; they are the most expensive hire you will make, because you pay the salary, the opportunity cost, and the cleanup. This piece is the negative space around my guide to hiring AI engineers: not what good looks like, but the warning signs that tell you to stop.

A resume is a list of things a person has stood near. The red flags are the gap between standing near a thing and being able to make it work when it breaks.

Key takeaways

If you read nothing else, these are the load-bearing red flags when hiring AI engineers:

No evals is the highest-signal red flag. An engineer who cannot describe how they measured a system's quality has been shipping on vibes, and vibes do not survive production.
Keyword density is not competence. A resume thick with RAG, agents, and vector databases tells you what they have read, not what they can build. Make them explain a tradeoff.
Demos lie; production does not. A candidate who has only ever shipped notebooks has never met latency, cost-per-task, or a 3 a.m. rollback.
The failure story is the interview. An engineer who blames the model, the data, or the deadline for every failure will blame them on your payroll too.
Balance the flags with green flags. A quiet portfolio with an honest test set beats a loud one with none. Do not reject the right person for bad packaging.

Resume and keyword red flags: when the buzzwords are doing the talking

The first red flag when hiring AI engineers shows up before the interview, on the resume itself. You will see a skills section that reads like a glossary: RAG, agentic workflows, vector databases, prompt chaining, fine-tuning, LangChain, four model providers, two orchestration frameworks. It looks like depth. It is usually breadth pretending to be depth, because the market has taught candidates that the keyword is the qualification.

This is not a fringe concern. ManpowerGroup's 2026 survey of 39,000 employers found that AI model and application development is now the single hardest capability to hire for worldwide, harder than traditional engineering (ManpowerGroup). When a skill is that scarce and that hyped, resumes inflate to match the demand. The keyword soup is a rational response to a hot market, which is exactly why it carries no signal.

The way to surface it is to pick one keyword and go three questions deep. If the resume says "RAG," ask how they chose the chunk size, then ask how they measured whether retrieval was actually returning the right passages, then ask what they did when recall was bad. A real practitioner has scars on every layer of that answer. A keyword candidate gets vague by the second question, because they have read the architecture diagram but never had to make it work on messy data.
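For reference, this is roughly what a good answer to the second question looks like in code. It is a minimal sketch of recall@k over a frozen query set, not a full harness. The set contents, the document IDs, and the `toy_retrieve` stand-in are all made up for illustration; a real candidate would plug in their own retriever and labeled queries.

```python
from typing import Callable

# Hypothetical frozen evaluation set: each query is paired with the IDs of the
# passages a human judged relevant. Freeze it once; do not edit it per change.
FROZEN_SET = [
    {"query": "refund window for annual plans", "relevant": {"doc-12", "doc-40"}},
    {"query": "reset two-factor authentication", "relevant": {"doc-07"}},
    {"query": "export invoices as CSV", "relevant": {"doc-31", "doc-32"}},
]


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant:
        raise ValueError("each frozen query needs at least one relevant passage")
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Average recall@k across the whole frozen set."""
    scores = [
        recall_at_k(retrieve(row["query"]), row["relevant"], k)
        for row in FROZEN_SET
    ]
    return sum(scores) / len(scores)


def toy_retrieve(query: str) -> list[str]:
    """Stand-in retriever with canned, ranked results (illustration only)."""
    canned = {
        "refund window for annual plans": ["doc-12", "doc-03", "doc-40"],
        "reset two-factor authentication": ["doc-09", "doc-07"],
        "export invoices as CSV": ["doc-31", "doc-02"],
    }
    return canned[query]


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"recall@{k} = {mean_recall_at_k(toy_retrieve, k=k):.2f}")
```

The number itself matters less than the habit. A candidate who can say which k they tracked, and what they changed when it dropped, has done the work the keyword only names.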

GitHub stars and the length of the framework list belong in the same bucket. A repo with three hundred stars and no test set is a marketing artifact. A quiet repo with a frozen evaluation set and an honest failure log is the actual hiring signal, and it is what my piece on how to vet AI engineers argues you are really buying. Read past the polish to the part that shows judgment.

The biggest red flag: an AI engineer who has never run an eval

If I could keep only one screen, this would be it. Ask the candidate to describe a system they built and then ask the follow-up that decides the interview: how did you know it was working? An engineer who has shipped real AI systems answers immediately and concretely, talking about a held-out test set, an accuracy or faithfulness number, a threshold they had to clear, a failure mode they tracked. An engineer who has been shipping on vibes goes quiet, or worse, says "it looked good in testing."

This matters because AI systems fail silently. A traditional bug throws an exception; a bad AI output is fluent, confident, and wrong, and it sails straight past anyone who is only eyeballing the demo. The discipline that catches it is evaluation, and a candidate who has never built an eval harness has no instrument for the exact failure mode that will hurt you most. I have written the full version of this argument in my work on the skills that actually separate AI engineers, and evaluation judgment is the one at the top.

Here is what a missing-evals red flag sounded like in a real loop, with the details changed to stay NDA-safe. A senior candidate could draw a clean retrieval pipeline on the whiteboard and name every component. When I asked how he measured retrieval quality, he said the team "spot-checked a few queries each week," and there was no frozen set, no recall number, no record of what had regressed. He was not lying; he genuinely did not know that the thing he had built was unmeasured, which meant he had no way to tell a good change from a bad one.

# The question that surfaces the no-evals red flag
You: "How did you know the system was working?"

# Green-flag answer
Them: "Frozen set of 300 real queries. Faithfulness 0.91, human-disagreement under 8%. Anything below blocked the deploy."

# Red-flag answer
Them: "It looked good in testing. We kept an eye on it."
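To make the green-flag answer concrete, here is a minimal sketch of a deploy gate of the kind that answer describes. The two metric names come from the answer above. The threshold values, the `EvalResult` type, and the `gate_failures` helper are my assumptions for illustration, since the candidate's actual cutoffs are not part of the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    faithfulness: float        # mean score on the frozen set, 0.0 to 1.0
    human_disagreement: float  # share of cases where reviewers disagreed with the grader


# Hypothetical thresholds; a real team sets these from its own baseline.
MIN_FAITHFULNESS = 0.90
MAX_HUMAN_DISAGREEMENT = 0.08


def gate_failures(result: EvalResult) -> list[str]:
    """Return the reasons this build may not ship; an empty list means it passes."""
    failures = []
    if result.faithfulness < MIN_FAITHFULNESS:
        failures.append(
            f"faithfulness {result.faithfulness:.2f} is below {MIN_FAITHFULNESS:.2f}"
        )
    if result.human_disagreement >= MAX_HUMAN_DISAGREEMENT:
        failures.append(
            f"human disagreement {result.human_disagreement:.0%} "
            f"is not under {MAX_HUMAN_DISAGREEMENT:.0%}"
        )
    return failures


if __name__ == "__main__":
    result = EvalResult(faithfulness=0.91, human_disagreement=0.07)
    failures = gate_failures(result)
    if failures:
        # A non-zero exit is what lets a CI pipeline block the deploy.
        raise SystemExit("Deploy blocked: " + "; ".join(failures))
    print("Eval gate passed")
```

The point of the sketch is the shape, not the numbers: a fixed set, a recorded metric, and a rule that fails the build when quality regresses. A candidate who has built something like this can usually tell you which threshold they argued about and why.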

If you want the language to run this part of the loop cleanly, the eval and judgment section of my AI engineer interview questions gives you the exact prompts and what a strong answer sounds like.

The "can't explain a failure" red flag

Ask any AI engineer to walk you through a time their system failed in production. The answer tells you more than the rest of the interview combined. You are listening for a specific shape: a clear description of what broke, an honest account of why they did not catch it sooner, and a concrete change they made so it would not happen again. That shape is the signature of someone who has actually owned a system through its bad days.

The red flag is the candidate who cannot produce a real failure, or who produces one and blames it entirely on something outside their control. "The model just wasn't good enough." "The data was messy." "The deadline was unrealistic." Each of those can be true, and a strong engineer will name them as factors, but they will still tell you what they owned in the failure. The candidate who owns nothing has either never been close enough to a production system to feel a failure, or has a habit of externalizing blame that will not improve once they are on your team.

I weight this heavily because AI work is failure-dense by nature. You are building probabilistic systems on shifting model behavior and imperfect data; things break constantly, and the job is largely about catching and containing those breaks. An engineer with no honest failure narrative is telling you they have not done the part of the job that actually is the job.

The demo-not-production red flag

This is the one my second seat sees most clearly. A candidate shows up with an impressive demo: a polished chat interface, a slick retrieval app, an agent that books a meeting on stage. It works, and it is genuinely good work. And it tells you almost nothing about whether they can build the thing you actually need, because a demo and a production system are different disciplines that happen to share a vocabulary.

A demo runs once, for one user, on clean input, with no budget. A production system runs ten thousand times an hour, on adversarial input, against a latency target and a cost ceiling, with a human on call when it breaks. The skills that make a great demo, fast iteration and a good eye for the happy path, are not the skills that keep a system inside its p95 latency target and its cost budget. I made this case at length in my argument for shipping smaller models, where the entire point is that the model that wins the demo rarely wins the margin.

Surface it by asking production questions about the demo: what was your p95 latency, what did each call cost, and how did you handle the request the system got wrong in front of a user? A candidate who has lived in production has crisp answers and probably a war story. A demo-only candidate treats these as someone else's problem, which is the tell, because in the job you are hiring for, they are the engineer's problem.
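If the first two questions sound abstract, this is the arithmetic behind them. It is a small sketch that computes p95 latency with the nearest-rank method and a per-call cost from token counts. The latency samples and the per-1,000-token prices are invented for illustration; real prices vary by provider and model.

```python
def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Dollar cost of one model call, given token counts and per-1,000-token prices."""
    return (
        input_tokens / 1000 * usd_per_1k_input
        + output_tokens / 1000 * usd_per_1k_output
    )


# Made-up request latencies in milliseconds, including a slow tail.
LATENCIES_MS = [
    120, 135, 140, 150, 155, 160, 170, 180, 190, 200,
    210, 220, 240, 260, 300, 340, 420, 600, 950, 2400,
]

if __name__ == "__main__":
    print("p50 ms:", percentile_nearest_rank(LATENCIES_MS, 50))
    print("p95 ms:", percentile_nearest_rank(LATENCIES_MS, 95))
    # Hypothetical prices, for illustration only.
    print("USD per call:", cost_per_call(1200, 300, 0.0005, 0.0015))
```

Notice how far the p95 sits from the median in the sample data. A demo-only candidate tends to quote the typical case, and the tail is the part that production users actually feel.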

The illustrative version: a candidate's take-home agent worked flawlessly in a notebook and fell over the moment we pointed two hundred concurrent requests at it, because every step made a fresh model call with no batching, no caching, and no timeout. The logic was correct. The engineering for production simply was not there, and he had never been forced to notice because a notebook never asked him to.
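For contrast, here is a minimal sketch of the guardrails that take-home was missing, using Python's `asyncio`. It shows a cache that also de-duplicates in-flight requests, a cap on concurrent model calls, and a per-call timeout. Batching is left out to keep the sketch short. `ModelClient`, `fake_model`, and every number here are hypothetical; this is not the candidate's code or any provider's client.

```python
import asyncio
from typing import Awaitable, Callable


class ModelClient:
    """Wraps a model call with a cache, a concurrency cap, and a timeout."""

    def __init__(
        self,
        call_model: Callable[[str], Awaitable[str]],
        max_concurrency: int = 8,
        timeout_s: float = 2.0,
    ) -> None:
        self._call_model = call_model
        self._semaphore = asyncio.Semaphore(max_concurrency)
        self._timeout_s = timeout_s
        # Maps prompt -> task, so identical prompts share one model call.
        self._cache: dict[str, asyncio.Task] = {}

    async def complete(self, prompt: str) -> str:
        task = self._cache.get(prompt)
        if task is None:
            task = asyncio.create_task(self._guarded_call(prompt))
            self._cache[prompt] = task
        try:
            return await task
        except Exception:
            # Do not keep failures (including timeouts) in the cache.
            self._cache.pop(prompt, None)
            raise

    async def _guarded_call(self, prompt: str) -> str:
        async with self._semaphore:  # cap concurrent calls to the model
            return await asyncio.wait_for(
                self._call_model(prompt), timeout=self._timeout_s
            )


async def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (illustration only)."""
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


async def main() -> tuple[int, int]:
    calls = 0

    async def counted_model(prompt: str) -> str:
        nonlocal calls
        calls += 1
        return await fake_model(prompt)

    # Build the client inside the running event loop.
    client = ModelClient(counted_model, max_concurrency=8, timeout_s=1.0)
    prompts = [f"question {i % 20}" for i in range(200)]
    answers = await asyncio.gather(*(client.complete(p) for p in prompts))
    return len(answers), calls


if __name__ == "__main__":
    served, model_calls = asyncio.run(main())
    print(f"served {served} requests with {model_calls} model calls")
```

None of this is sophisticated, which is the point. In this toy run two hundred requests over twenty distinct prompts trigger twenty model calls instead of two hundred, and a slow call fails fast instead of hanging the whole request.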

Communication and ownership red flags

Some of the most expensive red flags are not technical at all. The first is the engineer who cannot scope. You describe a fuzzy problem and ask how they would approach it, and instead of narrowing it into something shippable, they either freeze or sprint straight to the most complex possible architecture. Scoping is the daily work of an AI engineer, because almost every real task arrives vague, and an engineer who cannot turn vague into a first shippable slice will stall every project they touch.

The second is the engineer who cannot explain their work to a non-engineer. In an AI-native product, the model's behavior is a business decision, not just a technical one. When the system makes a call a customer does not trust, someone has to be able to say why in plain language. A candidate who can only describe their system in jargon will not be able to do that, and you will discover it at the worst possible moment, in front of a customer or a board.

The third, and the one I treat as nearly disqualifying, is the candidate who is certain about everything. AI engineering is a field where the honest answer is frequently "it depends, and here is how I would find out." A candidate who has a confident, definitive answer to every question, including the genuinely ambiguous ones, is showing you they cannot tell the difference between what they know and what they are guessing. That is the exact failure mode that ships a confidently wrong system.

Interview behaviors that should make you stop

Beyond the answers, the behaviors in the room carry signal. Watch for the candidate who name-drops models and papers but cannot connect any of them to a decision they made. Reciting the frontier is easy; choosing between two options under a real constraint is the job, and the gap between the two is where weak candidates live.

Watch, too, for how they handle being wrong: push gently on one of their answers and see what happens. A strong engineer engages, reconsiders, and either defends the position with better reasoning or updates it cleanly. A red-flag candidate gets defensive, doubles down, or quietly abandons the point without acknowledging the change. You are not testing whether they are right; you are testing whether they can think in front of you when the ground moves, which is what every hard production day asks of them.

One more, and it is subtle: the candidate who has clearly used an AI assistant to pre-bake answers and cannot go off-script. The tooling is fine, expected even, but if every answer is polished and none of them survives a follow-up, you are interviewing the assistant, not the engineer. The fix is the same as everywhere else in this piece: ask the second and third question, the ones no canned answer covers.

The red-flag table you can run your loop from

Here is the full set in one place: each red flag, why it costs you in production, and the question or check that surfaces it. Run these in the same order on every candidate and you have a structured screen instead of a vibe.

Red flag	Why it matters in production	How to surface it
Never run an eval	No instrument for silent, confident-wrong failures	"How did you know it was working?" Listen for a frozen set and a number.
Keyword-stuffed resume	Breadth posing as depth; no real tradeoff experience	Pick one keyword, go three questions deep on it.
Demo-only portfolio	Never met p95 latency, cost-per-task, or a rollback	Ask the demo's latency, per-call cost, and failure handling.
No honest failure story	Has not owned a system through its bad days	"Walk me through a production failure you owned."
Blames the model or data	Externalizes blame; will not improve on your team	Probe what they owned vs. what they blamed.
Cannot scope a fuzzy task	Stalls every vague project, which is most of them	Give a fuzzy problem; watch for a shippable first slice.
Certain about everything	Cannot separate knowledge from guessing	Ask a genuinely ambiguous question; listen for "it depends."
Cannot explain to a non-engineer	Model behavior is a business decision someone must defend	"Explain your system to a customer who does not trust it."

Green flags: what to weight up so you don't reject the right person

Red flags are only half a hiring instrument. Used alone they turn into a reason to reject everyone, and the cost of a false rejection in a market this tight is real. ManpowerGroup puts overall hiring difficulty at 72%, and CIO has reported a roughly 75% fail rate on basic AI skills assessments, partly because the assessments cannot tell different kinds of AI work apart (CIO). Screen too hard on the wrong signals and you will reject good engineers for bad packaging.

So weight these up. An engineer who says "I do not know, but here is how I would find out" is showing you judgment, not a gap. A quiet portfolio with one project that has a frozen test set and a documented failure beats a loud one with five demos and no measurement. A candidate who narrows your fuzzy problem into a small shippable slice in real time has just demonstrated the single most useful daily skill in the job.

And weight up honesty about limits. The engineer who tells you a model is not good enough for a task yet, and can say precisely how they would measure when it becomes good enough, is worth more than the one who promises everything works. The deeper version of how these signals compound across a whole team lives in my book The AI-Native Team. The short version: hire for judgment, discount the polish, and do not mistake confidence for competence in either direction.

If you would rather not run this gauntlet yourself, every one of these red flags is screened out before you meet a candidate through Devlyn's pre-vetted AI application engineers, who have already cleared a harder version of this loop on live production work.

Frequently asked questions

What is the single biggest red flag when hiring an AI engineer?

An engineer who cannot tell you how they measured whether their system worked. If the answer to "how did you know it was working" is "it looked good in testing," they have been shipping on vibes, and vibes do not survive production. A real practitioner names a frozen test set, a metric, and a threshold without hesitating.

How do I spot a fake AI engineer from their resume?

Look for keyword density with no depth behind it: a long list of RAG, agents, vector databases, and frameworks, with no explanation of a tradeoff they made. Pick one keyword and go three questions deep in the interview. A real engineer has scars on every layer; a keyword candidate gets vague by the second follow-up.

Are GitHub stars a good way to judge an AI engineer?

No. Stars measure marketing reach, not engineering judgment. A repo with hundreds of stars and no test set is a demo; a quiet repo with a frozen evaluation set and an honest failure log is the real signal. Read past the polish to the part that shows how they knew it worked.

What interview questions reveal AI engineer red flags fastest?

Three do most of the work: "how did you know it was working" surfaces the no-evals flag, "walk me through a production failure you owned" surfaces the ownership flag, and asking the latency and cost of their demo surfaces the demo-only flag. The full set lives in my guide to AI engineer interview questions.

The bad AI hire is rarely the one who fails the technical screen. It is the one who passes the demo, talks fluently, and cannot tell a correct output from a confidently wrong one once the system is live. Screen for that, balance it with the green flags, and you will avoid the most expensive mistake in AI hiring. If you would rather skip the loop entirely, the engineers placed through Devlyn's AI application engineer hiring have already been screened against every red flag here, and the fuller picture sits in the hiring AI engineers guide.