If you build on OpenAI or Anthropic today, your application code handles three things that have nothing to do with your product logic: governance (copying compliance rules into system prompts), evaluation (building a separate pipeline to score outputs), and state continuity (managing conversation history between calls). Every team reimplements these. Every implementation drifts.
These are not application concerns. They are infrastructure concerns that ended up in application code because the API layer never claimed them.

We built an AI API that claims them. Governance is a versioned, addressable object the API routes and deduplicates across steps. Evaluation is a toggle that runs a second, independent model on every response. Conversation state is an HMAC-signed payload the client carries, so the server stores nothing. The LLM reasoning itself uses standard function calling; tools execute on your servers, same as every other API. That part is table stakes.
This article describes the three pieces we moved below the application layer, why they belong there, and what changes when they're treated as infrastructure.
TL;DR
- Governance, evaluation, and state continuity are infrastructure concerns currently reimplemented in every AI application. We moved them into the API layer.
- Governance becomes versioned objects referenced by ID, with context-aware routing and session-aware deduplication that reduces token usage by 50% in multi-step workflows.
- Evaluation becomes a boolean toggle: a second, structurally independent model scores every response against configurable criteria. Quality monitoring is built in, not bolted on.
- Conversation state becomes an HMAC-signed payload carried by the client. The server is stateless. No sessions, no Redis, no TTL management.
- Client-side tool execution (standard function calling) is the foundation, not the innovation.
Where governance lives today
Most AI APIs treat governance as a system prompt. You write safety instructions, brand voice guidelines, and compliance rules into a system message. When the rules change, you update the prompt in every place it's used. When different teams need different rules, they maintain different system prompts.
This is governance as copy-paste. It has three failure modes.
Drift. Ten endpoints referencing the same brand voice guidelines will have ten slightly different versions within a month. No one audits system prompts the way they audit config files.
Waste. A five-step orchestrated workflow injects the same 800-token governance preamble on every step. The model has already seen it. You're paying for tokens that carry zero new information.
Rigidity. A sales team's outreach pipeline and a support team's ticket responder share some rules (brand voice) but not others (outreach compliance vs. escalation policies). In the system-prompt model, you either duplicate the shared rules or build your own routing logic. Neither scales.
These are symptoms of governance living in the wrong layer.
Governance as addressable infrastructure
In our API, governance is an addressable parameter:
```typescript
const result = await personize.responses.create({
  steps: [{ prompt: 'Draft a cold outreach email' }],
  personize: {
    governance: {
      guideline_ids: ['brand-voice', 'outreach-compliance', 'gdpr-rules']
    }
  }
});
```

brand-voice, outreach-compliance, and gdpr-rules are not prompt text. They are IDs of versioned governance objects stored in your organization's account. Each one is a structured document with sections, priorities, and scoping rules, managed through the dashboard or API.
When the request arrives, the governance layer retrieves the specified guidelines, selects the relevant sections based on the task context (using SmartGuidelines routing), and injects them into the model's context. The developer references guidelines by ID. The system handles delivery.
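To make the routing concrete, here is a minimal sketch of what a guideline object and context-aware section selection could look like. The shape is hypothetical (the article does not publish the real schema); the `Guideline` and `GuidelineSection` types and the `selectSections` helper are illustrative names, not the actual API.

```typescript
// Hypothetical shape of a versioned governance object. Illustrative only;
// the real schema is defined by the dashboard/API, not shown here.
interface GuidelineSection {
  heading: string;
  content: string;
  priority: number;    // used later for conflict resolution
  appliesTo: string[]; // task-context tags; empty = applies everywhere
}

interface Guideline {
  id: string;      // e.g. 'brand-voice'
  version: number; // bumped on every edit; callers always get the latest
  sections: GuidelineSection[];
}

// Sketch of context-aware routing: keep only the sections whose scope
// matches the current task context.
function selectSections(g: Guideline, taskTags: string[]): GuidelineSection[] {
  return g.sections.filter(
    s => s.appliesTo.length === 0 || s.appliesTo.some(t => taskTags.includes(t))
  );
}
```

The key property is that selection is data-driven: the developer never decides which sections to inject, only which guideline IDs to reference.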
Three properties follow from this design.
Single source of truth. Update a guideline once. Every API call that references it picks up the change. No redeployment, no prompt engineering across fifty endpoints, no coordination between teams.
Session-aware deduplication. Within a session (a series of related requests sharing a session ID), the governance layer tracks which guidelines have already been delivered. Step 3 of a five-step workflow doesn't re-inject the brand voice guidelines that were delivered in step 1. Each step gets only the governance content that's new or newly relevant.
The governance layer knows what the model has already seen and doesn't repeat itself. This alone reduces token usage by 50% in multi-step workflows.

Scoped composition. The sales team's outreach pipeline references brand-voice and outreach-compliance. The support team's ticket responder references brand-voice and support-escalation-rules. The brand voice is shared. The domain-specific rules are scoped. No duplication, no divergence.
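The deduplication logic can be sketched in a few lines. This is an assumption about the mechanism, not the actual implementation (which the article says also tracks relevance, and presumably works at section granularity): track which guideline IDs each session has already received and deliver only the new ones.

```typescript
// Sketch of session-aware deduplication (illustrative, not the real code).
// Maps a session ID to the set of guideline IDs already injected.
const delivered = new Map<string, Set<string>>();

function guidelinesToInject(sessionId: string, requested: string[]): string[] {
  const seen = delivered.get(sessionId) ?? new Set<string>();
  const fresh = requested.filter(id => !seen.has(id)); // only what's new
  fresh.forEach(id => seen.add(id));
  delivered.set(sessionId, seen);
  return fresh;
}
```

Step 1 requesting `['brand-voice', 'gdpr-rules']` gets both injected; step 3 requesting `['brand-voice', 'outreach-compliance']` in the same session gets only `outreach-compliance`.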
Pre-hoc, not post-hoc
This is a distinction worth making explicit. Content filters are post-hoc: they run after the model generates output and decide whether to block it.
Governance as we implement it is pre-hoc. The guidelines are injected into the model's context before generation. The model generates within the constraints, rather than generating freely and being filtered after the fact.
The difference matters in practice. A filtered model told "don't mention competitors" writes an awkward response that dances around the topic. A governed model with the actual competitive positioning guidelines in context handles the topic correctly, because it knows what to say, not just what to avoid.
When guidelines conflict
Guidelines carry priority weights. When two rules apply to the same context and contradict, the routing layer resolves the conflict deterministically based on priority, not by hoping the model figures it out. This is the same principle as CSS specificity: explicit precedence rules, not implicit LLM judgment.
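A minimal sketch of that resolution step, under the assumption that each rule carries a topic and a numeric priority (the `Rule` shape here is hypothetical):

```typescript
// Sketch of deterministic conflict resolution: when multiple rules apply to
// the same topic, the highest-priority rule wins -- explicit precedence,
// like CSS specificity, rather than implicit LLM judgment.
interface Rule {
  guidelineId: string;
  topic: string;
  priority: number;
  text: string;
}

function resolveConflicts(rules: Rule[]): Map<string, Rule> {
  const winners = new Map<string, Rule>(); // topic -> winning rule
  for (const r of rules) {
    const current = winners.get(r.topic);
    if (!current || r.priority > current.priority) winners.set(r.topic, r);
  }
  return winners;
}
```

Because the comparison is a plain numeric ordering, the same set of rules always resolves the same way, which is the property a filter-in-the-prompt approach cannot guarantee.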
Evaluation as a protocol primitive
Most teams bolt evaluation on after building the pipeline. They write test harnesses, sample outputs, and manually review quality. This works during development. It breaks in production, where you need continuous quality monitoring across thousands of requests.
We made evaluation a property of the API call, not a separate system.
```typescript
const result = await personize.responses.create({
  steps: [{ prompt: 'Draft a cold outreach email for {{company}}' }],
  personize: {
    governance: { guideline_ids: ['brand-voice'] }
  },
  evaluate: true,
  evaluation_criteria: 'brand-voice-adherence, personalization-depth, call-to-action-clarity'
});

// result.evaluation:
// {
//   finalScore: 82,
//   criteriaScores: [
//     { name: 'brand-voice-adherence', score: 9, maxScore: 10, reason: '...' },
//     { name: 'personalization-depth', score: 7, maxScore: 10, reason: '...' },
//     { name: 'call-to-action-clarity', score: 8, maxScore: 10, reason: '...' }
//   ]
// }
```

When evaluate: true is set, the API runs a second LLM call after the primary generation. This second call uses a separate model on a separate invocation. It sees the original prompt, the generated response, the tool calls that were made, and the evaluation criteria. It scores each criterion independently using structured output.
Structural independence
This is not the model evaluating itself. A model asked to evaluate its own output consistently rates itself higher than an independent evaluator. Our evaluation is structurally independent: different model, different call, no shared state. It's a second opinion, not a self-assessment.
The closed loop
Here is where governance and evaluation connect. Governance rules shape generation. Evaluation scores the output against configurable criteria, which can include adherence to those same governance rules. The scores accumulate over time in a queryable dataset. When brand-voice-adherence drops from 9.1 to 7.3 after a guideline revision, you see it. When a model upgrade improves personalization-depth but degrades compliance scores, you see that too.
Evaluation is not a feature. It's the feedback loop that makes governance improvable.

This loop does not exist when governance lives in system prompts and evaluation lives in a separate pipeline maintained by a different team. Moving both into the same infrastructure layer is what makes the loop possible.
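On the application side, consuming the loop can be as simple as accumulating `criteriaScores` and flagging drift. This is a hedged sketch of what a consumer might write, not part of the API; the 0.8 threshold and the window size are illustrative choices.

```typescript
// Sketch of drift detection over result.evaluation.criteriaScores.
// Keeps a rolling window of normalized scores per criterion and flags any
// criterion whose rolling average falls below a threshold.
type CriterionScore = { name: string; score: number; maxScore: number };

const history = new Map<string, number[]>();

function recordScores(scores: CriterionScore[], window = 50): string[] {
  const regressions: string[] = [];
  for (const c of scores) {
    const h = history.get(c.name) ?? [];
    h.push(c.score / c.maxScore); // normalize to 0..1
    if (h.length > window) h.shift();
    history.set(c.name, h);
    const avg = h.reduce((a, b) => a + b, 0) / h.length;
    if (h.length === window && avg < 0.8) regressions.push(c.name);
  }
  return regressions;
}
```

The point is not the arithmetic; it's that the scores arrive on every response, so this kind of monitoring is a few lines of application code instead of a separate evaluation pipeline.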
Stateless conversation continuity
Multi-turn workflows require the server to know where it left off. Which step was executing? What was the conversation history? What tool calls are pending?
The standard solution is server-side sessions. Store conversation state in a database, return a session ID, look it up on the next request. This works, but it introduces storage costs, TTL management, cache invalidation, and a scaling bottleneck. Every request hits the session store before reasoning can begin.
We went a different direction: the client carries the state.
When the server returns a requires_action response (indicating a tool needs to be executed), it includes the full conversation history and an HMAC-SHA256 signature. On the next request, the client sends both back. The server verifies the HMAC before processing. If the conversation was modified (messages injected, history rewritten, tool calls altered), the signature check fails and the request is rejected.
This is JWT for conversation state. The server stores nothing. The client carries the full state. The signature guarantees integrity.
The HMAC is computed over the conversation content and a request fingerprint that includes the original step definitions and tool schemas. This prevents a subtle attack: a client cannot take a conversation from one request context and replay it against a different set of steps or tools. The signature binds the conversation to the original request parameters.
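A minimal sketch of the scheme using Node's built-in crypto, under stated assumptions: the exact canonicalization and fingerprint contents are not published, so plain `JSON.stringify` stands in here for whatever stable encoding the real implementation uses.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sketch of the integrity scheme: sign the conversation together with a
// fingerprint of the original request (step definitions + tool schemas),
// so a conversation cannot be replayed against different parameters.
// JSON.stringify is a naive stand-in for a stable canonical encoding.
function sign(secret: string, conversation: unknown, fingerprint: unknown): string {
  return createHmac('sha256', secret)
    .update(JSON.stringify({ conversation, fingerprint }))
    .digest('hex');
}

function verify(
  secret: string,
  conversation: unknown,
  fingerprint: unknown,
  sig: string
): boolean {
  const expected = Buffer.from(sign(secret, conversation, fingerprint), 'hex');
  const provided = Buffer.from(sig, 'hex');
  // Constant-time comparison to avoid timing side channels.
  return expected.length === provided.length && timingSafeEqual(expected, provided);
}
```

Any change to the messages or to the request fingerprint invalidates the signature, which is exactly the replay-across-contexts protection described above.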
The result: the server is stateless and horizontally scalable. Add more instances, put a load balancer in front, and any instance can handle any request. No sticky sessions, no shared state store, no Redis. The SDK handles signing and verification transparently. The developer never thinks about it, but the security property is real.
The foundation: client-side tool execution
The LLM calls tools using standard function calling. The developer defines tool schemas. The model decides when to invoke them. The SDK executes the tool locally and sends the result back. This is the same pattern OpenAI, Anthropic, and Google use. It is not what differentiates this architecture.
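The client-side half of that loop is a dispatcher: match each tool call the API returns to a local implementation, run it, and collect the results to send back. The payload shape below is illustrative (the SDK's actual `requires_action` schema is not reproduced here).

```typescript
// Sketch of client-side tool execution. The API returns tool calls; the
// client runs the matching local function and returns the outputs.
// ToolCall's shape is assumed, not the SDK's actual schema.
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type ToolImpl = (args: Record<string, unknown>) => Promise<unknown> | unknown;

async function executeToolCalls(
  calls: ToolCall[],
  tools: Record<string, ToolImpl>
): Promise<{ id: string; output: unknown }[]> {
  return Promise.all(
    calls.map(async c => {
      const impl = tools[c.name];
      if (!impl) throw new Error(`No local implementation for tool: ${c.name}`);
      return { id: c.id, output: await impl(c.args) }; // runs on your servers
    })
  );
}
```

The sensitive work (database lookups, CRM reads) happens inside `impl`, on the developer's infrastructure; only the tool's output goes back over the wire.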
What it does is establish the boundary that makes everything above possible. Because tools execute on the developer's servers, sensitive data never transits the API. Because the API never runs tools, it has no reason to hold state for long-running executions. Because the server is stateless, governance deduplication and evaluation scoring can operate as pure functions of the request, not as side effects of a session.
The compute split is table stakes. The infrastructure it enables is not.
What moves and what stays
The thesis is simple: governance, evaluation, and state integrity behave like protocol-level concerns, not application-level concerns. When every application reimplements them, you get drift, waste, and fragility. When the infrastructure layer owns them, you get versioned governance with smart routing, built-in quality monitoring with structural independence, and stateless servers with cryptographic integrity.
The right question is not "what should the AI API do?" It's "what layer should each responsibility live in?"

Tool execution, data storage, and business logic belong in the application layer. Governance routing, output evaluation, and state integrity belong below it. The API layer should handle what every application needs and nothing that any single application owns.
The assumption behind most AI API design is that the provider should do more. More tool hosting, more state management, more middleware. We found the opposite: the provider should own fewer things, but own them at the right layer. Governance, evaluation, state integrity. Move these below the application, and the applications above get simpler, more consistent, and easier to improve.