Alpesh Nakrani
Chapter 2 / Field Manuals

The Reference System

The smallest system design that can be owned, tested, and improved.

The Moving Parts

The Reference System is about the mechanics behind the promise. A usable data feedback loop is more than a model call. It has event design, review queues, labeling rules, privacy constraints, and retraining triggers. If any of those pieces are implicit, the team ends up debugging behavior by opinion.

The mechanism matters because it tells you where control exists. Some behavior can be improved by better context. Some needs a stricter policy. Some needs a different workflow. Treating every weakness as a model problem wastes time.

The Control Surface

The control surface is the set of choices the team can change without waiting for a research breakthrough. In this book, that surface includes task definition, input quality, context assembly, routing, review, and release criteria.

The practical move is to document the control surface before debating tools. Once the team can name the levers, it can test which lever changes outcomes and which only changes the story told in meetings.

Where Judgment Enters

Judgment enters at the boundary between evidence and action. A system can suggest, rank, classify, draft, retrieve, or route. It cannot own the consequence. That ownership has to remain visible in the product and in the organization.

The operating question is not whether people stay in every loop. It is where a person must define policy, review exceptions, and decide whether the current evidence is enough for the next release.

A Better Default

A better default is to start with the smallest workflow that proves the mechanism. The workflow should have known inputs, a known output, a known failure state, and a known owner.

Once that workflow is stable, the team can widen the domain. Without that order, complexity arrives before learning, and every later decision becomes harder to isolate.
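A minimal sketch of such a workflow follows, assuming a hypothetical task that decides whether a rated interaction should be kept for training. The names `Interaction`, `KeepForTraining`, `WorkflowFailure`, `triage`, and the owner label `data-quality-oncall` are illustrative and not from the source, and the keep rule (keep low ratings because they expose failures) is an assumption. What matters is that the input, the output, and the failure state are all explicit types, and every failure names an owner.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical owner label; replace with whoever actually owns this workflow.
OWNER = "data-quality-oncall"


@dataclass(frozen=True)
class Interaction:
    """Known input: one user interaction and its optional 1-5 rating."""
    interaction_id: str
    user_rating: Optional[int]


@dataclass(frozen=True)
class KeepForTraining:
    """Known output: whether the interaction enters the training set."""
    interaction_id: str
    keep: bool


@dataclass(frozen=True)
class WorkflowFailure:
    """Known failure state: why no decision was made, and who owns it."""
    interaction_id: str
    reason: str
    owner: str


def triage(interaction: Interaction) -> Union[KeepForTraining, WorkflowFailure]:
    if interaction.user_rating is None:
        return WorkflowFailure(interaction.interaction_id, "no rating to judge against", OWNER)
    if not 1 <= interaction.user_rating <= 5:
        return WorkflowFailure(interaction.interaction_id, "rating out of range", OWNER)
    # Assumed rule: low ratings (1-2) are kept because they expose failures.
    return KeepForTraining(interaction.interaction_id, keep=interaction.user_rating <= 2)
```

Because every call returns either a decision or a named failure, the team can count failures by reason and owner before widening the domain.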

Research Lens

The research base for The Data Flywheel matters because data feedback loops sit between capability and consequence. Papers, benchmarks, and risk frameworks can show what is possible, but production teams still have to translate that evidence into decisions. This chapter treats research as a constraint on judgment, not as decoration.

The most useful research habit is to separate mechanism from outcome. A paper can show that a method improves a benchmark. It does not prove that the same method raises improvement per reviewed interaction in your product. That gap is where evaluation, sampling, and release discipline belong.
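The chapter does not define improvement per reviewed interaction. One plausible reading, offered here only as an assumption, is the change in a product-level quality score divided by the number of interactions people reviewed to produce that change. The function name and its parameters are hypothetical.

```python
def improvement_per_reviewed_interaction(
    quality_before: float, quality_after: float, reviewed_count: int
) -> float:
    """Quality gain per human-reviewed interaction (an assumed definition).

    quality_before / quality_after: the same product-level score, measured on
    the same evaluation set of real user work, before and after the reviewed
    data was used.
    reviewed_count: how many interactions people reviewed to produce the change.
    """
    if reviewed_count <= 0:
        raise ValueError("reviewed_count must be positive")
    return (quality_after - quality_before) / reviewed_count
```

Under this reading, a benchmark gain from a paper fits neither score argument; only measurements on your own users' work do, which is exactly the gap the paragraph describes.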

For this chapter, read external sources as pressure tests. If a source describes a known weakness, ask whether your system can observe that weakness. If a source describes a benchmark gain, ask whether your users send the same kind of work. If a source describes a risk, ask who owns it after launch.

Mechanism method

Start with a written task statement. It should name the user, the input, the expected output, the source of truth, and the action that follows. If any of those pieces is missing, the data feedback loop is not ready for broad automation, because the team cannot tell whether a result is good enough.
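The readiness rule can be made mechanical. This is a sketch under the assumption that a task statement is stored as a simple record; `TaskStatement`, `missing_pieces`, and `ready_for_broad_automation` are hypothetical names, and the example values are invented.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TaskStatement:
    """The five pieces a written task statement should name."""
    user: Optional[str] = None
    task_input: Optional[str] = None
    expected_output: Optional[str] = None
    source_of_truth: Optional[str] = None
    follow_up_action: Optional[str] = None


def missing_pieces(task: TaskStatement) -> List[str]:
    """Names of the fields that are unset or empty."""
    return [f.name for f in fields(task) if not getattr(task, f.name)]


def ready_for_broad_automation(task: TaskStatement) -> bool:
    # Without every piece, nobody can tell whether a result is good enough.
    return not missing_pieces(task)


task = TaskStatement(
    user="support agent",
    task_input="customer refund request",
    expected_output="approve or deny, citing the policy clause",
    source_of_truth="refund policy document",
)
print(missing_pieces(task))  # ['follow_up_action']
```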

Next, define the control surface. For this topic, the control surface includes event design, review queues, labeling rules, privacy constraints, and retraining triggers. Each control should have a reason to exist and a way to be tested. A control that cannot be tested becomes process theater. A control that can be tested becomes part of the operating system.
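One way to keep controls honest is to store each one with its reason and an executable check, and to report any control without a check as untested. The sketch below assumes that shape; `Control`, `audit`, the sample events, and both example controls are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Control:
    name: str
    reason: str                                  # why the control exists
    check: Optional[Callable[[], bool]] = None   # None: no way to test it yet


def audit(controls: List[Control]) -> Dict[str, str]:
    """Classify each control as passing, failing, or untested."""
    report = {}
    for control in controls:
        if control.check is None:
            report[control.name] = "untested (process theater)"
        elif control.check():
            report[control.name] = "passing"
        else:
            report[control.name] = "failing"
    return report


# Hypothetical logged events and controls, for illustration only.
logged_events = [
    {"event": "answer_rated", "rating": 2},
    {"event": "answer_corrected", "field": "delivery_date"},
]
controls = [
    Control("privacy_constraints", "raw email addresses never reach the training store",
            check=lambda: all("email" not in event for event in logged_events)),
    Control("labeling_rules", "reviewers apply one shared rubric"),
]
print(audit(controls))
# {'privacy_constraints': 'passing', 'labeling_rules': 'untested (process theater)'}
```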

Finally, decide what the system does when the answer is not ready. The mature options are ask for more context, return a partial answer with evidence, route to a person, or stop. The immature option is to keep generating until the output sounds confident.
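A sketch of that decision, assuming the system exposes a confidence score, knows whether it has supporting evidence, and knows whether required context is missing. The 0.8 threshold, the order of the checks, and all names are assumptions. The enum deliberately has no "generate again" member.

```python
from enum import Enum


class Step(Enum):
    ANSWER = "answer"
    ASK_FOR_CONTEXT = "ask for more context"
    PARTIAL_WITH_EVIDENCE = "return a partial answer with evidence"
    ROUTE_TO_PERSON = "route to a person"
    STOP = "stop"


def next_step(confidence: float, has_evidence: bool, context_missing: bool,
              high_stakes: bool, threshold: float = 0.8) -> Step:
    """Choose a mature option when the answer is not ready."""
    if confidence >= threshold and has_evidence:
        return Step.ANSWER
    if high_stakes:
        return Step.ROUTE_TO_PERSON      # a person owns the consequence
    if context_missing:
        return Step.ASK_FOR_CONTEXT
    if has_evidence:
        return Step.PARTIAL_WITH_EVIDENCE
    return Step.STOP
```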

Mechanism evidence

Evidence should be collected at the same grain as the decision. If the decision is which interactions deserve permanent memory, the review set should contain examples that force that decision. A broad score is useful only after the team has inspected the cases that carry the most cost.
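A sketch of "inspect costly cases before reporting a broad score", assuming each case carries an estimated cost of getting its memory decision wrong. `Case`, `review_queue`, `broad_score`, and the cost field are hypothetical, and the broad score here is simply the share of judged cases whose decision was correct.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Case:
    case_id: str
    remembered: bool        # the memory decision under review
    estimated_cost: float   # hypothetical cost if that decision is wrong


def review_queue(cases: List[Case], k: int) -> List[Case]:
    """The k cases whose memory decision carries the most cost."""
    return sorted(cases, key=lambda c: c.estimated_cost, reverse=True)[:k]


def broad_score(cases: List[Case], judgments: Dict[str, bool], k: int) -> float:
    """Share of judged cases whose memory decision was correct.

    Refuses to report until the k costliest cases have been judged.
    """
    must_inspect = {c.case_id for c in review_queue(cases, k)}
    missing = must_inspect - set(judgments)
    if missing:
        raise RuntimeError(f"inspect high-cost cases first: {sorted(missing)}")
    return sum(judgments.values()) / len(judgments)
```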

The strongest evidence combines observed user work, known edge cases, recent incidents, and synthetic pressure tests. Synthetic examples are useful when they fill a known gap. They are dangerous when they replace the real distribution the system must serve.
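A small guard can keep synthetic examples in their supporting role. This sketch assumes each evaluation example is tagged with one of four source labels taken from the paragraph; the 20% cap and the function name are assumptions, not recommendations from the source.

```python
from collections import Counter
from typing import Dict, List

KNOWN_SOURCES = {"observed", "edge_case", "incident", "synthetic"}


def evidence_mix_problems(examples: List[Dict[str, str]],
                          max_synthetic_share: float = 0.2) -> List[str]:
    """Reasons an evaluation set may not reflect the real distribution."""
    counts = Counter(example["source"] for example in examples)
    total = sum(counts.values())
    if total == 0:
        return ["empty evidence set"]
    problems = []
    unknown = sorted(set(counts) - KNOWN_SOURCES)
    if unknown:
        problems.append(f"unknown sources: {unknown}")
    if counts["observed"] == 0:
        problems.append("no observed user work")
    share = counts["synthetic"] / total
    if share > max_synthetic_share:
        problems.append(f"synthetic share {share:.0%} exceeds {max_synthetic_share:.0%}")
    return problems
```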

A good review record includes the input, the relevant context, the output, the expected answer, the judgment, and the fix. Without that record, quality work becomes memory work. With it, the team can see whether the system is learning, drifting, or merely changing shape.
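The record itself is simple; the value comes from comparing rounds. In this sketch, which assumes review rounds are matched by exact input text, "learning" means a rejected output is now accepted, "drifting" means an accepted output is now rejected, and "changing shape" means the output changed but the verdict did not. That mapping and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReviewRecord:
    input_text: str
    context: str
    output: str
    expected: str
    judgment: bool   # True if the reviewer accepted the output
    fix: str         # what changed in response, "" if nothing


def compare_rounds(before: List[ReviewRecord],
                   after: List[ReviewRecord]) -> Dict[str, int]:
    """Compare two review rounds on the same inputs."""
    previous = {record.input_text: record for record in before}
    summary = {"learning": 0, "drifting": 0, "changing_shape": 0, "stable": 0}
    for record in after:
        old = previous.get(record.input_text)
        if old is None:
            continue  # new input: nothing to compare against
        if not old.judgment and record.judgment:
            summary["learning"] += 1        # was rejected, now accepted
        elif old.judgment and not record.judgment:
            summary["drifting"] += 1        # was accepted, now rejected
        elif old.output != record.output:
            summary["changing_shape"] += 1  # different output, same verdict
        else:
            summary["stable"] += 1
    return summary
```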

Implementation Notes

Implementation should begin with the smallest useful workflow. The first version should be narrow enough that the team can replay every important failure. If replay is not possible, the system is not observable enough for serious use.
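A replay harness can be very small. This sketch assumes failures are stored as dictionaries with `id`, `input`, and `expected` keys and that the system is a deterministic function compared by exact match; real model calls may need pinned versions and a looser comparison. Cases missing an input or an expected answer are reported as unreplayable, which is the observability gap the paragraph warns about. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def replay_failures(failures: List[Dict[str, str]],
                    system: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Re-run recorded failures against the current system.

    Returns (ids that still fail, ids that cannot be replayed).
    """
    still_failing, unreplayable = [], []
    for case in failures:
        case_id = case.get("id", "<no id>")
        if "input" not in case or "expected" not in case:
            unreplayable.append(case_id)
            continue
        if system(case["input"]) != case["expected"]:
            still_failing.append(case_id)
    return still_failing, unreplayable
```

For example, replaying against `str.upper` as a stand-in system would flag any case whose expected answer is not the uppercased input.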

The second version should add volume without changing the promise. This is where improvement per reviewed interaction should be watched closely. If the metric improves while support tickets, corrections, or handoffs rise, the measurement is missing something important.
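That check can run on every release. This sketch assumes the team tracks period-over-period changes for the anchor metric and for counter-signals where a rise is bad; the function name and the zero tolerance are assumptions.

```python
from typing import Dict, List


def contradicting_signals(anchor_delta: float,
                          counter_deltas: Dict[str, float],
                          tolerance: float = 0.0) -> List[str]:
    """Names of counter-signals that rose while the anchor metric improved.

    anchor_delta: change in improvement per reviewed interaction (positive = better).
    counter_deltas: changes in signals where a rise is bad, e.g. support tickets.
    """
    if anchor_delta <= 0:
        return []  # no improvement claimed, so nothing to contradict
    return sorted(name for name, delta in counter_deltas.items() if delta > tolerance)
```

A non-empty result does not prove the metric is wrong; it marks the release for the kind of case inspection described earlier.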

The third version can expand scope only after the team knows which failures are acceptable, which failures require escalation, and which failures require rollback. Expansion without that knowledge creates a system that appears productive while quietly moving risk to the customer.
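The three categories can be written down as a policy, with expansion blocked while any observed failure type lacks an agreed response. The failure type names and the policy below are hypothetical examples.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Response(Enum):
    ACCEPT = "acceptable"
    ESCALATE = "requires escalation"
    ROLLBACK = "requires rollback"


def unclassified(observed: Iterable[str], policy: Dict[str, Response]) -> List[str]:
    """Observed failure types that have no agreed response yet."""
    return sorted(set(observed) - set(policy))


def may_expand_scope(observed: Iterable[str], policy: Dict[str, Response]) -> bool:
    return not unclassified(observed, policy)


# Hypothetical failure types and policy, for illustration only.
policy = {
    "formatting_error": Response.ACCEPT,
    "wrong_account_data": Response.ESCALATE,
    "privacy_leak": Response.ROLLBACK,
}
observed = ["formatting_error", "stale_answer", "wrong_account_data"]
print(unclassified(observed, policy))  # ['stale_answer']
```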

Decision Review

At the end of the chapter, the team should be able to answer four questions. What promise are we making? What evidence supports it? What happens when the promise fails? Who has authority to change the promise? These questions are simple, but they expose most weak deployments.

The answer should not live only in a meeting note. It should appear in the evaluation suite, the release checklist, the incident process, and the product experience. Users do not need to see the internal machinery, but they do need to feel its discipline.
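As one illustration, the four questions can become a release gate that blocks when an answer is missing or lives only outside the places where it is enforced. The question keys, the place names, and the function are assumptions for this sketch.

```python
from typing import Dict, List, Set, Tuple

QUESTIONS = {
    "promise": "What promise are we making?",
    "evidence": "What evidence supports it?",
    "failure": "What happens when the promise fails?",
    "authority": "Who has authority to change the promise?",
}
# Places where an answer is enforced, not merely remembered.
DURABLE_PLACES = {"evaluation_suite", "release_checklist", "incident_process", "product_experience"}


def release_blockers(answers: Dict[str, Tuple[str, Set[str]]]) -> List[str]:
    """answers maps a question key to (answer text, places the answer is recorded)."""
    blockers = []
    for key, question in QUESTIONS.items():
        text, places = answers.get(key, ("", set()))
        if not text.strip():
            blockers.append(f"unanswered: {question}")
        elif not places & DURABLE_PLACES:
            blockers.append(f"not in a durable place: {question}")
    return blockers
```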

The Data Flywheel is ultimately about replacing vague confidence with accountable practice. The point is not to slow teams down. The point is to make speed repeatable, explainable, and safe enough to build a business on.

Operating table

The Reference System operating table

Area | What to inspect | Decision evidence
--- | --- | ---
Mechanism | Document the control surface for event design, review queues, labeling rules, privacy constraints, and retraining triggers. | Improvement per reviewed interaction
Boundary | Separate product policy from model behavior. | Which interactions deserve permanent memory
Review | Inspect the smallest workflow that proves the mechanism. | Separate learning data from vanity analytics
Chapter notes

What to carry forward

  • Draw the boundary between product behavior and model behavior.
  • Use improvement per reviewed interaction as the anchor metric.
  • Make this decision explicit: which interactions deserve permanent memory?
  • Separate learning data from vanity analytics.
  • Name the control surface before debating tools.
  • Separate model behavior from product responsibility.