What Is an LLM Engineer? The Role, Explained for Hirers

What is an LLM engineer? The specialist who turns foundation models into reliable production features. Here is the role, what they do, and when to hire.

An LLM engineer is the person who takes a large language model someone else trained and turns it into a feature your customers can rely on. They do not build the model. They build everything around it: the prompts that constrain it, the retrieval that grounds it, the evals that catch it when it drifts, and the serving layer that keeps it fast and affordable under real load. That is the role in one paragraph, and most of what gets written about it buries that paragraph under a pile of framework names.

I am writing this from the hiring side of the table. I started as an engineer and now run conversion and revenue, which means I read the traces and I read the P&L in the same week. At Devlyn I have hired specialists for exactly this work and shipped products on top of them. So when someone asks me what an LLM engineer is, I am not answering from a job-board template. I am answering from the cost of getting the definition wrong, because the wrong definition is how a company pays a senior salary for the wrong skill set.

The title is new, it overlaps with two other roles that have "AI" or "ML" in them, and the field changes weekly. Hire an LLM engineer expecting a research scientist, or a data scientist expecting an LLM engineer, and the money is spent while the work stays undone. This piece is the definition I wish more hiring managers had before they opened the req: what the role is, what it does all day, how it differs from the adjacent roles, and when your company actually needs one. If you already know you need the role and just want people who have shipped it, Devlyn places LLM engineers vetted on production judgment rather than a tool list.

An LLM engineer builds on models; they do not train them. Their output is a working production feature on top of an existing foundation model, not a new model and not a paper.
The job is defined by what ships, not by tools. The model names, the vector store, the orchestration library are all learnable in a week. Production judgment is the scarce part.
It is a distinct role from AI engineer and ML engineer. Narrower than "AI engineer," and almost the opposite of "ML engineer." Confusing the three is the most expensive hiring mistake in this category.
The core work is prompting, retrieval, evals, fine-tuning, and serving. Most of the value lives in retrieval and evals, not in clever prompts.
You do not need one until an LLM feature goes in front of customers and the cost of it being wrong is real. Before that, you are hiring ahead of the problem.

What an LLM engineer actually is

Start with what the model already gives you and what it does not. A foundation model arrives knowing a great deal about language and almost nothing about your business, your data, or your customers. It will answer confidently when it should refuse, it will invent facts it was never given, and it will cost you money on every token whether the answer was right or not. None of those problems are solved by picking a bigger model. They are solved by engineering.

That engineering is the job. An LLM engineer wraps the model in the machinery that makes it safe to put in front of a paying customer: instructions that constrain its behavior, a retrieval layer that feeds it your real data, a set of evals that measure whether it is getting better or worse, and an inference path tuned for latency and cost. The model is an input to that system. The system is the product, and the LLM engineer is the person who owns it.

This is why the role is defined by what ships and not by which tools the person has touched. I have interviewed candidates who could recite every orchestration framework released in the last year and could not tell me how they would know their RAG system had quietly started returning stale documents. I have hired others who had used only two libraries and immediately asked what the failure cost was if the model was wrong in front of a customer. The second kind ships products that hold up. That is the standard I hire to: the LLM engineers we place at Devlyn are selected on exactly that judgment, not on a tool checklist.

What an LLM engineer does all day

The day-to-day breaks into five kinds of work, and a good LLM engineer moves between all of them. None of them is the glamorous "talk to the AI" part that the title implies.

Prompting and context engineering. This is writing the instructions and assembling the context the model sees on every call. It sounds soft and it is not. A production prompt is an interface contract: it defines what the model is allowed to do, what it must refuse, and what shape the output takes so the rest of the system can parse it. The skill is in constraining behavior and handling the failure cases, not in clever phrasing.
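
To make "interface contract" concrete, here is a minimal sketch of the parsing side of that contract. It assumes a hypothetical support assistant whose prompt tells the model to reply with a JSON object containing an `action` and a `text` field. The names `Reply`, `ALLOWED_ACTIONS`, and `parse_reply` are mine, for illustration only, and do not come from any library.

```python
import json
from dataclasses import dataclass

# Hypothetical contract: the prompt instructs the model to reply with
# {"action": <one of ALLOWED_ACTIONS>, "text": <string>} and nothing else.
ALLOWED_ACTIONS = {"answer", "refuse", "escalate"}


@dataclass
class Reply:
    action: str
    text: str


def parse_reply(raw: str) -> Reply:
    """Enforce the output contract; any violation becomes an escalation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return Reply("escalate", "unparseable model output")
    if not isinstance(data, dict):
        return Reply("escalate", "output was not a JSON object")
    action = data.get("action")
    text = data.get("text")
    # Check types before set membership so odd values cannot raise.
    if not isinstance(action, str) or not isinstance(text, str):
        return Reply("escalate", "output violated the contract")
    if action not in ALLOWED_ACTIONS:
        return Reply("escalate", "model chose an action it is not allowed")
    return Reply(action, text)
```

The design choice worth noticing is the failure path: the rest of the system never sees raw model text, only a `Reply` that is either valid or an explicit escalation.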

Retrieval, or RAG. Most LLM products are retrieval products wearing a chat interface. Retrieval-augmented generation is, in NVIDIA's words, "a technique for enhancing the accuracy and reliability of generative AI models with information fetched from specific and relevant data sources" (NVIDIA). The engineer chunks your documents, indexes them in a vector store, retrieves the relevant ones at query time, and feeds them to the model. When RAG goes wrong, it goes wrong quietly: the retrieval returns the wrong passage and the model answers fluently from it. Catching that is half the job, and it is the half on which the choice between RAG and fine-tuning turns.
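
The chunk, index, retrieve, and feed steps can be sketched in a few lines. This is a toy under loud assumptions: word counts stand in for a real embedding model, a Python list stands in for a vector store, and the function names and the `min_score` cutoff are hypothetical. What it does show faithfully is the shape of the pipeline and the one guard that matters, which is declining to build a prompt when nothing relevant came back.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def chunk(text: str, size: int = 40) -> List[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k highest-scoring (score, chunk) pairs for the query."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return scored[:k]


def build_prompt(query: str, hits: List[Tuple[float, str]],
                 min_score: float = 0.05) -> Optional[str]:
    """Assemble a grounded prompt, or None if retrieval found nothing usable."""
    context = [c for score, c in hits if score >= min_score]
    if not context:
        return None  # caller should refuse instead of letting the model guess
    joined = "\n".join(f"- {c}" for c in context)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {query}")
```

Returning `None` instead of an empty-context prompt is the cheap defense against the quiet failure described above: a fluent answer built on nothing.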

Evals. An eval is a test suite for a system that does not give the same answer twice. The engineer builds a frozen set of representative inputs, defines what a good output looks like, and scores every model and prompt change against it. Faithfulness is a typical eval metric: Ragas defines it as the fraction of the claims in an answer that are supported by the retrieved context (Ragas docs). Without evals, "the model got better" is a vibe. With them, it is a number you can defend. This is the discipline I cover in depth in my guide to LLM evaluation.
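
Here is a rough sketch of that metric as arithmetic: supported claims divided by total claims, averaged over a frozen set. It is not the Ragas implementation. In practice the claims are extracted and judged by a model; here the judge is passed in as a function, and `naive_judge` is a deliberately crude stand-in that only accepts verbatim matches. All names are hypothetical.

```python
from typing import Callable, Dict, List


def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Supported claims divided by total claims; 1.0 means fully grounded."""
    if not claims:
        # Design choice in this sketch: an answer with no checkable claims
        # scores 0 so that empty answers cannot pass silently.
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    """Toy judge: a claim counts only if it appears verbatim in the context."""
    return claim.lower() in context.lower()


def run_eval(cases: List[Dict], judge: Callable[[str, str], bool]) -> float:
    """Mean faithfulness over a frozen eval set of {'claims', 'context'} cases."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [faithfulness(c["claims"], c["context"], judge) for c in cases]
    return sum(scores) / len(scores)
```

Run the same frozen cases before and after every prompt or model change, and the comparison between the two means is the number you defend.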

Fine-tuning, when it earns its place. Fine-tuning adapts a pre-trained model to a narrower task on your data. It is powerful and it is overused. A strong LLM engineer reaches for retrieval and better prompts first and fine-tunes only when the eval numbers say the cheaper options have run out. Knowing the difference is judgment, not a default.

Serving and cost. The feature has to be fast and affordable at the volume you actually run. The engineer manages latency, caching, and the inference bill, because a feature that is correct but slow or unaffordable does not ship. Prompt caching is one of the levers here, and it matters more than most teams realize once traffic climbs.
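
Two numbers carry most of this conversation: tail latency and cost per request. The sketch below computes both. The p95 uses the nearest-rank method, and the cost function assumes a pricing model where cached prompt tokens are billed at a lower rate than fresh ones. Every price passed in is a placeholder you would replace with your provider's actual rates, and the function names are mine.

```python
from typing import List


def p95(latencies_ms: List[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_request(prompt_tokens: int, cached_tokens: int, output_tokens: int,
                     input_price: float, cached_price: float,
                     output_price: float) -> float:
    """Cost of one call in currency units; prices are per million tokens.

    Assumes cached prompt tokens are billed at `cached_price` instead of
    `input_price`. All prices are hypothetical inputs, not real rates.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    fresh_tokens = prompt_tokens - cached_tokens
    total = (fresh_tokens * input_price
             + cached_tokens * cached_price
             + output_tokens * output_price)
    return total / 1_000_000
```

Multiply the per-request figure by monthly volume, once with `cached_tokens=0` and once with your realistic cache hit, and the caching lever stops being abstract.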

The clever prompt gets the demo. The retrieval and the evals get the renewal.

The responsibilities, in one table

Here is the same work as a table, so you can match a job description or a candidate against it. The left column is the responsibility. The right column is what it actually looks like when the person is doing it well, not the abstract version.

Responsibility	What it looks like in practice
Prompt & context engineering	Writes prompts as interface contracts: constrains behavior, defines output shape, handles refusals and edge cases
Retrieval (RAG)	Chunks and indexes your data, tunes retrieval quality, catches stale or wrong-passage answers before customers do
Evaluation	Builds a frozen eval set from real traffic, scores every change, reports faithfulness and accuracy as numbers, not vibes
Fine-tuning	Reaches for it only when evals show prompts and retrieval have run out; owns the data and the regression risk
Serving & cost	Manages latency at p95, caching, and the inference bill so the feature is affordable at real volume
Production judgment	Knows the cost of being wrong, decides what "good enough" means, and which 5% of cases will hurt you

LLM engineer vs AI engineer vs ML engineer

This is the question that costs the most when it goes unanswered, so here is the clean version. An ML engineer builds and trains models. Their world is data pipelines, training runs, feature engineering, and model architecture. They produce a model. An AI engineer builds products on top of models someone else trained; I have written the full definition in my piece on what an AI engineer is. An LLM engineer is an AI engineer whose models are specifically large language models, and whose daily work centers on prompting, retrieval, and evals rather than, say, computer vision or recommendation systems.

So the relationship is: LLM engineer is a specialization inside AI engineer, and both are nearly the opposite of ML engineer on the build-versus-train axis. An ML engineer who has never shipped a RAG system and an LLM engineer who has never trained a model from scratch are both doing their jobs correctly. They are not interchangeable, and a job post that lists "train LLMs from scratch" alongside "build our chatbot" is describing two different hires. The AI engineer versus ML engineer split covers the build-versus-train distinction in more depth.

I watched a team burn most of a quarter on this. They posted for an "LLM engineer" but the req was written by someone who pictured a research scientist, full of training-from-scratch language. The strong application builders self-selected out, the candidates who stayed oversold their research credentials, and the person they hired spent three months trying to justify a fine-tuning project the product never needed. The fix was one rewrite of the job description to describe building, not inventing. The cost was the quarter.

The skills that actually matter

The technical floor is real and most competent candidates clear it: fluency with LLM APIs, RAG, eval design, prompt and context engineering, and enough software engineering to ship a system that does not fall over. That floor is necessary and it is not what separates a good hire from an expensive one. The skill that separates them is judgment, and it is the same judgment that runs through the AI engineer skills that actually matter.

Judgment shows up as three questions the good ones ask without prompting. When is the model wrong, and how would I know? When is "good enough" actually good enough to ship? Which slice of cases, usually a small one, will hurt the business if it fails? A candidate who frames their work around the cost of being wrong has the judgment. A candidate who frames it around the model they used does not, no matter how impressive the model.

This is also why seniority is not about years. A junior engineer can know every framework and still not know which 5% of cases matter. A senior one has been burned enough times to instrument for them first. If that distinction is load-bearing for your req, the senior-versus-junior breakdown is the one to read before you set the level.

When you actually need an LLM engineer

You need an LLM engineer when an LLM-powered feature is going in front of customers and the cost of it being wrong is real. That is the whole test. Before that point, you are hiring ahead of the problem and paying a scarce salary to a person who does not yet have a production problem to solve.

The signal is not "we want to use AI." Everyone wants to use AI. The signal is that you have a specific feature, a real user touching it, and a concrete cost when the model misbehaves: a refund, a compliance exposure, a support ticket, a churned account. When the wrongness has a price tag, you need someone whose entire job is to drive that price down. When it does not yet, a strong generalist engineer with API access will get you to the demo, and you hire the specialist when the demo becomes a product.

I have also watched the opposite mistake, which is cheaper but still real. A team waited too long, ran a customer-facing LLM feature on a generalist's spare time for two quarters, and accumulated a backlog of quiet failures nobody owned because nobody owned evals. The feature looked fine in every demo and leaked trust in production. The day they hired someone to own it, the first month was just measuring how wrong it already was. Hire when the cost is real, not before, and not a year after.

How to hire one

Once you know you need the role, hiring it well is its own discipline, and it is the subject of my definitive guide to hiring AI engineers. The short version: write the req around what ships, not around tools; interview for judgment by asking how the candidate would know the system was wrong; and check that their stated pay expectations match the role you are actually filling rather than the inflated market headline. On that last point, my breakdown of what an AI engineer actually costs is more reliable than the salary aggregators, which vary wildly because they lump three different roles under one title.

If you would rather not run that gauntlet, that is a legitimate choice. Hiring a permanent specialist for a role this new and this fast-moving is a real bet, and a lot of teams are better served by bringing in proven LLM engineers on an engagement and converting later. That is exactly what Devlyn's LLM engineer hiring is for: people who have already shipped retrieval, evals, and serving in production, vetted on judgment rather than a tool list. The framework I use to think about building the team around them is in Building an AI-Native Team.

Frequently asked questions

What is an LLM engineer in simple terms? An LLM engineer takes a large language model someone else trained and turns it into a reliable product feature. They build the prompts, the retrieval, the evals, and the serving layer around the model. They do not train the model itself, and they do not write research papers. Their output is a working feature, not a new model.

What is the difference between an LLM engineer and an AI engineer? An LLM engineer is a specialization within the AI engineer role. Both build products on top of models someone else trained, but the LLM engineer works specifically with large language models and spends their days on prompting, retrieval, and evals. An AI engineer is the broader title that can also cover vision, recommendation, or other model types.

Do LLM engineers need to know machine learning? They need to understand how LLMs behave well enough to debug and constrain them, but they do not need to train models from scratch, which is the ML engineer's job. The most valuable LLM engineering skills are retrieval, eval design, prompt engineering, and the software engineering to ship a system. Deep model-training expertise is a bonus, not a requirement.

When should I hire an LLM engineer? When an LLM-powered feature is going in front of customers and the cost of it being wrong is real, such as a refund, a compliance risk, or a churned account. Before that, a strong generalist engineer with API access can get you to a working demo. You hire the specialist when the demo becomes a product people depend on.

If you have a feature in front of customers and the wrongness now has a price tag, that is the moment the role pays for itself. Devlyn places LLM engineers who have already done this work in production, and the hiring guide walks through running the search yourself. Build on the model. Measure what breaks. Hire the person who does both.