Alpesh Nakrani

Appendix A: Source Index

Designing evaluation that scales with autonomy

Research spine: this chapter stays grounded in the NIST AI Risk Management Framework and the NIST Secure Software Development Framework, then applies that evidence to the operating judgment in the book. Read it alongside the Human In The Loop book, the AI-Native thesis, and the full book library when you want the surrounding argument.

Key Takeaways

  • Evaluation has to scale with the autonomy of the system being reviewed.
  • The practical test is whether a team can name the evidence, owner, and failure mode before it changes behavior.
  • Read this with Human in the Loop Is Not a Plan and the adjacent chapters when you need the wider Evals and Evaluation frame.

Human in the Loop Source Index

This source index collects the research base behind the book's argument: oversight has to be designed as an operating system, not treated as an unlimited human safety blanket. The sources below support three recurring claims. First, humans are weakest when automation leaves them passive, late, or under-informed. Second, evaluation has to scale with the autonomy of the system being reviewed. Third, the review loop needs tools, rubrics, sampling, escalation paths, and incident learning rather than heroic manual attention.
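One way to picture the second claim is a review budget that grows with autonomy instead of staying flat. The sketch below is illustrative only: the three autonomy tiers, the percentage rates, and the function name review_sample_size are assumptions made for this example and do not come from the book or from any of the cited sources.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Hypothetical autonomy tiers, ordered from least to most autonomous."""
    DRAFT_ONLY = 0         # system proposes, a human executes
    ACTS_REVERSIBLY = 1    # system acts, actions can be undone
    ACTS_IRREVERSIBLY = 2  # system acts, actions cannot be undone


# Assumed review rates in whole percent. The numbers are placeholders;
# a real team would set them from its own incident data and risk appetite.
REVIEW_RATE_PERCENT = {
    Autonomy.DRAFT_ONLY: 5,
    Autonomy.ACTS_REVERSIBLY: 25,
    Autonomy.ACTS_IRREVERSIBLY: 100,
}


def review_sample_size(n_actions: int, level: Autonomy) -> int:
    """Return how many of n_actions to route to structured human review."""
    if n_actions < 0:
        raise ValueError("n_actions must be non-negative")
    rate = REVIEW_RATE_PERCENT[level]
    # Integer ceiling division, so small batches still get at least one review
    # whenever the rate is non-zero and there is at least one action.
    return -(-n_actions * rate // 100)


if __name__ == "__main__":
    for level in Autonomy:
        print(level.name, review_sample_size(200, level))
```

The point of the design is that the rate table is explicit and owned by someone: raising a system's autonomy tier forces a visible change in how much of its output gets reviewed.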

Use this appendix as the evidence register for the preceding chapters. Bainbridge explains why automation can make human supervision harder instead of easier. The Microsoft Human-AI guidance and toolkit translate that lesson into interaction design. OpenAI Evals, HELM, lm-evaluation-harness, RAGAS, and DeepEval provide evaluation mechanics. NIST AI RMF gives the governance frame. Anthropic's agents guidance connects oversight to tool use, boundaries, and autonomy.

The practical test is simple: if a team cites "human in the loop" as a control, it should be able to point to which source below informs the control, which failure class the control catches, and how the control is measured after deployment.
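The practical test can be turned into a small check on a control register. This is a minimal sketch, assuming a team records each "human in the loop" control as a record with three evidence fields; the class OversightControl, its field names, and the helper missing_evidence are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class OversightControl:
    """One claimed human-in-the-loop control and the evidence behind it."""
    name: str
    source: str = ""              # which source in this index informs the control
    failure_class: str = ""       # which class of failure the control catches
    post_deploy_metric: str = ""  # how the control is measured after deployment


def missing_evidence(control: OversightControl) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(control)
        if not getattr(control, f.name).strip()
    ]


if __name__ == "__main__":
    # Hypothetical register entry: a source is cited, but the failure class
    # and the post-deployment measurement have not been named yet.
    control = OversightControl(
        name="Reviewer approves outbound refunds",
        source="NIST AI RMF",
    )
    print(missing_evidence(control))  # ['failure_class', 'post_deploy_metric']
```

A control that returns an empty list has passed only the naming test; whether the named metric is actually collected after deployment still has to be verified separately.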
