Hiring for Judgment You Can Observe
Key Takeaways
- AI-native interviews should test supervision of generated work, not blank-page speed alone.
- The strongest signal is calibrated doubt: evidence use, assumptions, risk classification, and verification plan.
- Junior candidates can show coachable judgment even when they do not yet have senior judgment.
- Flawed AI artifacts make better interview material than polished portfolio demos.
Hiring for observable judgment means giving candidates plausible AI output and watching how they verify, reject, revise, or escalate it.
The candidate's portfolio looked spectacular. Every example used the latest tools. Every demo had smooth narration. Every screen showed speed. The hiring panel was impressed until the staff engineer asked a simple question: "Show us a time you rejected an AI-generated answer that looked good." The candidate paused. The answer was vague. They knew how to produce; they could not yet demonstrate judgment.
AI-native hiring must make judgment observable.
This chapter gives hiring managers a practical method for evaluating judgment without falling back on intuition. The key is to test candidates on ambiguous artifacts, not blank-page production. Give them a flawed AI output. Give them incomplete context. Give them a risk boundary. Ask them what they would trust, what they would reject, what they would verify, and how they would operationalize the lesson.
Research spine
This chapter draws on: Forsgren et al., "The SPACE of Developer Productivity"; Edmondson, The Fearless Organization, and related psychological safety research; Brynjolfsson, Li, and Raymond, "Generative AI at Work," NBER Working Paper 31161; GitHub Research, "Quantifying GitHub Copilot's impact on developer productivity and happiness."
The hiring signal changes
In older workflows, a take-home exercise often tested whether a person could produce a complete artifact under time pressure. That signal is weaker now. A candidate can use tools to produce something coherent quickly, and banning tools in the interview gives you less information about how they will actually work. The better signal is how they supervise, constrain, evaluate, and revise tool output.
This requires interview design. Ask for critique, not only creation. Ask for failure diagnosis, not only success demonstration. Ask for assumptions, trade-offs, and risk classification. A strong candidate will name what they do not know and create a verification plan. A weak candidate will polish the artifact and call it done.
The judgment interview
A judgment interview has four parts. First, give the candidate an artifact generated by a model: a product spec, code patch, customer response, data analysis, policy summary, or sales proposal. Second, provide partial context and make the missing context realistic. Third, ask the candidate to review the artifact and produce a decision memo: ship, revise, reject, escalate, or test. Fourth, debrief their reasoning.
The interviewer should score the candidate on evidence use, assumption naming, risk awareness, domain reasoning, evaluation design, and communication. This is not a trick. It is a simulation of actual AI-native work.
Avoiding seniority bias
Judgment is easier to observe in experienced candidates, but AI-native teams cannot become senior-only organizations. The right junior signal is not perfect judgment; it is coachable judgment. Does the candidate notice uncertainty? Can they compare alternatives? Can they explain why an answer might be wrong? Can they accept correction and update the rubric? Can they distinguish personal preference from an external standard?
Apprenticeship becomes a hiring criterion. If the organization has no learning loop, it should not pretend juniors will automatically develop judgment by using tools. The interview should reveal both the candidate's current capacity and the team's responsibility to teach.
Operating table
| Interview exercise | What it reveals | Strong signal | Weak signal |
|---|---|---|---|
| Critique a generated spec | Product and system judgment | Names missing user, eval, risk, and owner | Rewrites wording only |
| Review generated code | Engineering supervision | Checks behavior, tests, security, maintainability | Accepts because tests pass |
| Rank customer replies | Revenue/support judgment | Balances resolution, accuracy, tone, policy | Optimizes for politeness alone |
| Design an eval | Operational maturity | Creates sample set and failure taxonomy | Says human review is enough |
Artifact example: a judgment interview rubric
judgment_interview_scorecard:
  dimensions:
    evidence_use: 1-5
    assumption_naming: 1-5
    risk_classification: 1-5
    evaluation_design: 1-5
    domain_reasoning: 1-5
    communication_clarity: 1-5
  required_candidate_output:
    - decision
    - reasons
    - missing_context
    - verification_plan
    - rollback_or_escalation_path
  automatic_reject_flags:
    - "treats model output as authority"
    - "cannot name uncertainty"
    - "ignores stated risk boundary"
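One way a panel might apply the scorecard consistently is to record each review in a small script. The sketch below is illustrative only. The dimension names and required outputs come from the rubric above, but the `CandidateReview` class, the `summarize` function, and the choice to average the six scores are assumptions, not part of the rubric. The rubric does not say how scores should be combined, so treat the mean as a placeholder for whatever rule your panel agrees on.

```python
from dataclasses import dataclass, field

# Names copied from the scorecard above.
DIMENSIONS = (
    "evidence_use",
    "assumption_naming",
    "risk_classification",
    "evaluation_design",
    "domain_reasoning",
    "communication_clarity",
)
REQUIRED_OUTPUT = (
    "decision",
    "reasons",
    "missing_context",
    "verification_plan",
    "rollback_or_escalation_path",
)


@dataclass
class CandidateReview:
    """Hypothetical record of one interviewer's review."""
    scores: dict                 # dimension name -> integer 1..5
    outputs: dict                # required output name -> candidate's text
    reject_flags: list = field(default_factory=list)


def summarize(review):
    """Validate a review and return a summary dict.

    Assumption: the mean of the six scores is used as the aggregate.
    """
    for name in DIMENSIONS:
        value = review.scores.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5")
    # A required output counts as missing if it is absent or blank.
    missing = [
        key for key in REQUIRED_OUTPUT
        if not review.outputs.get(key, "").strip()
    ]
    mean = sum(review.scores[name] for name in DIMENSIONS) / len(DIMENSIONS)
    return {
        "mean_score": round(mean, 2),
        "missing_outputs": missing,
        "auto_reject": bool(review.reject_flags),
    }


if __name__ == "__main__":
    example = CandidateReview(
        scores=dict(zip(DIMENSIONS, (4, 3, 5, 4, 3, 5))),
        outputs={
            "decision": "revise",
            "reasons": "Spec names no owner for the eval.",
            "missing_context": "Traffic volume and data retention policy.",
            "verification_plan": "",
            "rollback_or_escalation_path": "Escalate to the product lead.",
        },
    )
    print(summarize(example))
```

Keeping missing outputs and reject flags separate from the numeric score is deliberate in this sketch: a high average should not hide a candidate who skipped the verification plan or ignored the stated risk boundary.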
Checklist
- Test candidates with AI tools allowed, but score supervision rather than raw output.
- Use flawed artifacts as interview material.
- Separate current judgment from coachability.
- Require a verification plan in every work-sample exercise.
- Train interviewers to reward calibrated doubt, not theatrical confidence.
Takeaway
The best AI-native hiring exercises ask candidates to judge a plausible artifact, not merely create one.
Internal map
For the larger argument, keep this chapter connected to the AI-Native thesis, Building an AI-Native Team, The Judgment Economy, and Human in the Loop Is Not a Plan.
