Introduction

Sentinel is an autonomous SRE (Site Reliability Engineering) pipeline built entirely on Cloudflare’s Developer Platform. It watches your Cloudflare Workers for errors, diagnoses them, and opens a pull request with a tested fix, with no human involved until the review step.

When a Worker starts throwing 5xx errors at 3 AM, the typical response is:

  1. An alert fires
  2. A human wakes up, reads logs, reproduces the issue
  3. They identify the root cause, write a fix, run tests
  4. They submit a PR, wait for review, merge, deploy

Steps 2–4 can take hours. Sentinel automates everything up to and including submitting the PR; review and merge remain with a human.

Sentinel runs as a set of six Durable Object agents on Cloudflare Workers, connected through Queues:

Agent         Role
LogTailer     Polls Workers Observability every 30s, fingerprints errors, deduplicates
TestGen       Generates a bun:test reproduction test via LLM, runs it in a Sandbox
CodeTriage    Reads source code, performs multi-turn LLM root cause analysis
FixAgent      TDD cycle: confirm bug → generate fix → verify → regression check → PR
Orchestrator  Dashboard API for monitoring and manual overrides
TracingAgent  (Phase 3) Extends pipeline to non-Cloudflare origins
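
To make the queue-connected hand-off concrete, here is a minimal sketch that models it in memory. It assumes incidents move through the agents in the order the table lists them (LogTailer, TestGen, CodeTriage, FixAgent); that order, the `IncidentMessage` shape, and the `InMemoryQueue` class are illustrative assumptions, not Sentinel's actual message format or the Cloudflare Queues API.

```typescript
// Stage names come from the agent table; the hand-off order is an assumption.
type Stage = "LogTailer" | "TestGen" | "CodeTriage" | "FixAgent";

// Hypothetical message shape passed between agents.
interface IncidentMessage {
  fingerprint: string;
  stage: Stage; // agent that should process this message next
}

// Assumed routing: each agent forwards to the next; FixAgent ends with a PR.
const NEXT_STAGE: Record<Stage, Stage | null> = {
  LogTailer: "TestGen",
  TestGen: "CodeTriage",
  CodeTriage: "FixAgent",
  FixAgent: null,
};

// In-memory stand-in for a queue (FIFO); not the Cloudflare Queues API.
class InMemoryQueue<T> {
  private items: T[] = [];
  send(item: T): void {
    this.items.push(item);
  }
  receive(): T | undefined {
    return this.items.shift();
  }
}

// Drives one incident through the pipeline and records which agents saw it.
function runPipeline(fingerprint: string): Stage[] {
  const queue = new InMemoryQueue<IncidentMessage>();
  const visited: Stage[] = [];
  queue.send({ fingerprint, stage: "LogTailer" });

  let msg: IncidentMessage | undefined;
  while ((msg = queue.receive()) !== undefined) {
    visited.push(msg.stage);
    const next = NEXT_STAGE[msg.stage];
    if (next !== null) {
      queue.send({ fingerprint: msg.fingerprint, stage: next });
    }
  }
  return visited;
}
```

In a real deployment each agent would consume from and publish to its own queue, with retries handled by the platform rather than by a loop like this one.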

Sentinel is not fully autonomous end-to-end. The pipeline stops at the PR boundary:

  • Sentinel creates a GitHub Pull Request with a detailed description
  • A human reviews and merges (or rejects) the PR
  • If an agent lacks confidence, the incident is marked needs_human
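
The confidence check in the last bullet could look roughly like the sketch below. The numeric score, the `0.7` threshold, and the `"continue"` status are hypothetical; only the `needs_human` status comes from the description above.

```typescript
// "needs_human" is the status named in the docs; "continue" is a placeholder.
type IncidentStatus = "continue" | "needs_human";

// Hypothetical threshold; the real value (if any) is not documented here.
const CONFIDENCE_THRESHOLD = 0.7;

// Decide whether an agent may proceed or must escalate to a human.
// Non-finite scores (NaN, Infinity) are treated as "no confidence".
function gate(
  confidence: number,
  threshold: number = CONFIDENCE_THRESHOLD,
): IncidentStatus {
  if (!Number.isFinite(confidence) || confidence < threshold) {
    return "needs_human";
  }
  return "continue";
}
```

Failing closed on a missing or malformed score keeps the default behavior on the side of human review.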

This design ensures that AI-generated code is always reviewed before reaching production.

Sentinel currently targets:

  • 5xx server errors — uncaught exceptions, timeout errors, resource exhaustion
  • 4xx client errors — malformed route handlers, missing validation, incorrect status codes
  • Unhandled exceptions — promise rejections, type errors, null reference errors

Each error is fingerprinted using SHA-256 over normalized error messages, route patterns, and stack traces. This means the same root cause generates the same fingerprint regardless of dynamic values like timestamps, UUIDs, or request IDs.
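
A minimal sketch of this kind of fingerprinting follows. The `ObservedError` fields and the normalization rules (the regexes and the `<uuid>`, `<ts>`, `<n>` placeholders) are assumptions for illustration; Sentinel's actual normalization is not shown in this section and is likely more careful, for example about line numbers in stack traces.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of an observed error; field names are assumptions.
interface ObservedError {
  message: string;
  route: string;
  stack: string;
}

// Replace dynamic values with fixed placeholders so they do not affect the hash.
// Order matters: UUIDs and timestamps are replaced before bare digit runs.
function normalize(text: string): string {
  return text
    .replace(
      /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi,
      "<uuid>",
    )
    .replace(
      /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?/g,
      "<ts>",
    )
    .replace(/\d+/g, "<n>"); // crude: also collapses line numbers in stacks
}

// SHA-256 over the normalized message, route pattern, and stack trace.
function fingerprint(err: ObservedError): string {
  const canonical = [err.message, err.route, err.stack]
    .map(normalize)
    .join("\n");
  return createHash("sha256").update(canonical).digest("hex");
}
```

With these rules, two occurrences of the same failure that differ only in a user UUID and a timestamp hash to the same value, while a different message yields a different fingerprint. The sketch uses `node:crypto`, which in a Worker requires the Node.js compatibility flag; `crypto.subtle.digest("SHA-256", ...)` from the Web Crypto API is the asynchronous alternative.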

Sentinel runs entirely on Cloudflare:

  • Workers (compute) — all agents run as Durable Objects
  • D1 (database) — incident tracking, audit trail, configuration
  • R2 (storage) — test cases, patches, PR descriptions
  • Queues (messaging) — reliable inter-agent communication with retries and DLQ
  • Sandbox (containers) — isolated execution of untrusted code
  • Workers AI (inference) — test generation, root cause analysis, fix generation