RFC-0002: AEGIS Governance Runtime Specification

RFC: RFC-0002
Status: Final (v1.0)
Frozen: 2026-03-26
Version: 0.2
Created: 2026-03-05
Updated: 2026-03-06
Author: AEGIS Initiative
Repository: aegis-governance
Target milestone: v1.0
Supersedes: None
Superseded by: None

Summary

This RFC specifies the runtime APIs, state model, error behavior, deployment topology, and performance expectations for the AEGIS Governance Runtime: the component responsible for accepting action proposals, evaluating them against governance controls, and enforcing decisions at the execution boundary.

Motivation

RFC-0001 defines what the governance architecture must do. This RFC defines how it behaves at runtime. Without a concrete API surface, state model, and error specification, implementations cannot be validated for compliance and behavior under failure conditions cannot be reasoned about.

Guide-Level Explanation

The Governance Runtime is the operational heart of AEGIS. It is the process that receives action proposals from AI agents, runs them through the decision pipeline, and either permits execution, blocks it, or escalates it for human review.

From an operator’s perspective: you deploy the runtime alongside your AI systems, configure it with a capability registry and policy set, and it becomes the mandatory checkpoint for all agent actions. Nothing reaches your infrastructure without passing through it.

Reference-Level Explanation

1. Runtime Responsibilities

Accept action proposals from AI agents
Validate request schema and semantics
Evaluate capability, policy, and risk controls
Enforce controlled execution via tool proxy
Emit immutable audit evidence

2. Runtime Architecture

flowchart TD
    A[Agent Client] --> B[Governance Gateway API]
    B --> C[Decision Engine]
    C --> D[Capability Registry]
    C --> E[Policy Engine]
    C --> F[Risk Engine]
    C --> G[Audit System]
    C --> H[Tool Proxy Layer]
    H --> I[External Systems]

3. API Surface

Submit Action — POST /aegis/actions

Request:

{
"request_id": "uuid-v4",
"actor_id": "agent:soc-001",
"capability": "telemetry.query",
"action_type": "tool_call",
"target": "siem.search",
"parameters": {
  "query": "failed_login > 10",
  "window": "15m"
},
"context": {
  "session_id": "sess-001",
  "environment": "production",
  "trace_id": "trace-abc",
  "timestamp": "2026-03-05T12:00:00Z"
}
}

Response:

{
  "request_id": "uuid-v4",
  "decision": "ALLOW",
  "reason": "Approved by policy soc_query_allow",
  "audit_id": "audit-6f4f",
  "conditions": ["max_results=500", "timeout_ms=10000"],
  "timestamp": "2026-03-05T12:00:00Z"

}

Retrieve Audit Record — GET /aegis/audit/{audit_id}

Returns immutable decision and evaluation trace.

Health — GET /healthz | GET /readyz

Readiness fails if policy, capability, or audit stores are unavailable.

4. Error Handling

{
  "error_code": "INVALID_ACTION_TYPE",
  "message": "action_type must be one of [tool_call, file_read, ...]",
  "request_id": "uuid-v4",
  "retryable": false,
  "timestamp": "2026-03-05T12:00:01Z"
}

Code	HTTP	Retryable	Source
INVALID_REQUEST	400	No	Gateway validation
UNAUTHORIZED_CAPABILITY	403	No	Capability check
POLICY_EVALUATION_ERROR	500	Maybe	Policy engine
AUDIT_PERSIST_ERROR	503	Yes	Audit system
UPSTREAM_TIMEOUT	504	Yes	Tool proxy

5. Runtime State Model

stateDiagram-v2
    [*] --> Received
    Received --> Rejected: schema invalid
    Received --> Validated: schema valid
    Validated --> Evaluating
    Evaluating --> Denied
    Evaluating --> Escalated
    Evaluating --> Approved
    Approved --> Executing
    Executing --> Completed
    Executing --> Failed
    Denied --> [*]
    Escalated --> [*]
    Completed --> [*]
    Failed --> [*]
    Rejected --> [*]

6. Performance Requirements

Metric	Target
p50 decision latency	<= 20ms
p95 decision latency	<= 75ms
p99 decision latency	<= 150ms
Audit write success	>= 99.99%
Single-node throughput	500 actions/sec
Horizontal target	10k actions/sec

7. Deployment Architecture

flowchart LR
    LB[Ingress/LB] --> GW1[Gateway Pod A]
    LB --> GW2[Gateway Pod B]
    GW1 --> DE1[Decision Service]
    GW2 --> DE1
    DE1 --> CR[(Capability Store)]
    DE1 --> PR[(Policy Store)]
    DE1 --> AR[(Audit Store)]
    DE1 --> TP[Tool Proxy Workers]
    TP --> EXT[External Systems]

Requirements: least-privilege service identities,¹ mTLS between components, isolated execution network for proxy workers, immutable config snapshots per runtime version.

8. Failure Behavior

Validation failures: reject immediately
Policy or capability uncertainty: fail closed
Audit write failure: block high-risk execution, retry with bounded backoff
Persistent audit outage: deny high-risk requests and alert
Tool proxy timeout: return controlled error with audit record

Drawbacks

The runtime is a single mandatory checkpoint and therefore a potential single point of failure. High-availability deployment is required for production use.
p99 latency target of 150ms may be unacceptable for latency-sensitive real-time applications. Those use cases require careful registry and policy design to minimize evaluation overhead.
Stateless gateway design requires that all governance state live in external stores, adding operational complexity.

Alternatives Considered

Inline evaluation in the agent process: Eliminates network overhead but allows the agent to bypass governance by modifying its own evaluation logic. Violates the non-bypass guarantee.

Asynchronous post-execution audit: Reduces latency but provides no enforcement. Governance that operates after execution is documentation, not control.

Single-tier runtime without proxy workers: Simpler to deploy but conflates the governance decision path with the execution path, complicating isolation guarantees.

Compatibility

Downstream of RFC-0001. No breaking changes to RFC-0001 architecture. All RFC-0001 security guarantees are preserved by this specification.

Implementation Notes

Implementers should begin with the API surface and state model. Performance targets are aspirational for v0.x and binding at v1.0.

Open Questions

Should the runtime expose a streaming API for long-running agent sessions?
Should audit record retrieval support batch queries?

Success Criteria

A compliant implementation satisfies all API contracts defined in Section 3
All failure modes in Section 8 produce the specified behavior under test
p99 latency target is met under the throughput targets in Section 6

References

AEGIS™ | “Capability without constraint is not intelligence”™
AEGIS Initiative

National Institute of Standards and Technology, Zero Trust Architecture, NIST SP 800-207, Aug. 2020. [Online]. Available: https://doi.org/10.6028/NIST.SP.800-207. See REFERENCES.md. ↩