KES Outbox E2E Runbook (15-min Onboarding)
Last updated: 2026-03-06
Purpose:
- Bring up full local KES outbox flow.
- Run smoke checks.
- Execute DLQ and poison replay workflows.
- Resolve common local failures quickly.
Target DoD:
- A new engineer can run the flow end-to-end in <= 15 minutes without external help.
0) Prerequisites
- Docker Desktop running.
- Node.js + npm installed.
- Repo opened at root (kvary.network).
Windows note:
- In PowerShell, do not use Linux-style inline env assignments (FOO=bar cmd).
- Prefer existing npm scripts (they already use cross-env where needed).
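When a one-off command does need an extra variable, one cross-platform option is to set it from Node instead of the shell. The sketch below is illustrative only: runWithEnv is a hypothetical helper, not something in the repo, and FOO=bar is the placeholder from the note above. (PowerShell's native form is $env:FOO = "bar", which sets the variable for the rest of the session rather than for a single command.)

```javascript
// Sketch: cross-platform stand-in for the Linux-style `FOO=bar cmd`.
// runWithEnv is a hypothetical helper name, not part of the repo.
const { spawnSync } = require('child_process');

// Run `command` with extra environment variables merged over the current ones.
function runWithEnv(extraEnv, command, args) {
  return spawnSync(command, args, {
    env: { ...process.env, ...extraEnv },
    encoding: 'utf8',
  });
}

// Example: the child Node process sees FOO=bar regardless of the parent shell.
const result = runWithEnv({ FOO: 'bar' }, process.execPath, [
  '-e',
  'process.stdout.write(process.env.FOO)',
]);
console.log(result.stdout); // prints "bar"
```

The existing npm scripts remain the preferred route; this only shows what an inline assignment amounts to when no script covers the case.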
1) Start (fresh local)
Run from repo root:
npm run db:up
npm run kafka:up
npm run monitoring:up
npm run migrate:all
Then start app stack:
npm run dev:one
Open a second terminal and start outbox relay:
npm run relay:kes-outbox:dev
Expected healthy endpoints:
curl -sS http://127.0.0.1:4100/health
curl -sS http://127.0.0.1:4020/health
curl -sS http://127.0.0.1:4060/health
curl -sS "http://127.0.0.1:9090/api/v1/query?query=kes_outbox_relay_up"
Expected:
- auth/tenders health return {"ok":true}
- relay health returns {"ok":true,...}
- Prometheus query returns value 1 for kes_outbox_relay_up
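The four checks above can be folded into one script. This is a minimal sketch, not a repo script: the function names are made up for illustration, and it assumes the Prometheus endpoint returns the standard instant-query JSON (status plus data.result, with each sample's value as [timestamp, "string"]).

```javascript
// Sketch: interpret the responses from the endpoints listed above.
// All function names are illustrative and not part of the repo.
const http = require('http');

// True when a /health body is JSON with ok === true (extra fields are allowed).
function isHealthy(body) {
  try {
    return JSON.parse(body).ok === true;
  } catch {
    return false;
  }
}

// True when a Prometheus instant-query response holds at least one sample equal to 1.
function relayIsUp(body) {
  try {
    const parsed = JSON.parse(body);
    if (parsed.status !== 'success') return false;
    return parsed.data.result.some((sample) => Number(sample.value[1]) === 1);
  } catch {
    return false;
  }
}

// GET a URL and resolve with the response body; rejects on network errors.
function get(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

async function main() {
  const checks = [
    ['http://127.0.0.1:4100/health', isHealthy],
    ['http://127.0.0.1:4020/health', isHealthy],
    ['http://127.0.0.1:4060/health', isHealthy],
    ['http://127.0.0.1:9090/api/v1/query?query=kes_outbox_relay_up', relayIsUp],
  ];
  for (const [url, check] of checks) {
    // A refused connection counts as a failed check rather than a crash.
    const ok = await get(url).then(check, () => false);
    console.log(`${ok ? 'OK  ' : 'FAIL'} ${url}`);
    if (!ok) process.exitCode = 1;
  }
}

// Only touch the network when explicitly asked, e.g. `node check-health.js --run`.
if (process.argv.includes('--run')) main();
```

Saved under a name such as check-health.js (hypothetical), it exits non-zero if any endpoint is down or unhealthy.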
2) Smoke
Run outbox live smoke:
npm run tenders:outbox:live-smoke
Expected:
- JSON output with "ok": true
- relay dispatch delta > 0
Optional Kafka smoke:
npm run kafka:kes-smoke
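The live-smoke expectations ("ok": true and a positive relay dispatch delta) can also be checked from captured output instead of by eye. A rough sketch follows, under two assumptions: the script prints a single JSON object, and npm may print banner lines around it. The runbook does not name the property that carries the dispatch delta, so the caller has to pass it in after inspecting real output. Both function names are hypothetical.

```javascript
// Sketch: decide pass/fail from captured live-smoke output.
// Function names are illustrative, not part of the repo.

// Pull the outermost {...} span out of the text, since npm can print
// banner lines before the script's own output.
function parseSmokeOutput(stdout) {
  const start = stdout.indexOf('{');
  const end = stdout.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(stdout.slice(start, end + 1));
  } catch {
    return null;
  }
}

// deltaField is the property holding the relay dispatch delta. Its real name
// is NOT known from the runbook, so it is left to the caller; when omitted,
// only "ok": true is checked.
function smokePassed(stdout, deltaField) {
  const report = parseSmokeOutput(stdout);
  if (!report || report.ok !== true) return false;
  if (deltaField === undefined) return true;
  return Number(report[deltaField]) > 0;
}
```

For example, smokePassed(output) checks only the ok flag, while smokePassed(output, 'someDeltaField') also requires that field to be greater than zero.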
Grafana dashboard:
- URL: http://localhost:3002
- Login: admin/admin
- Dashboard: KES Outbox Overview
3) Replay Operations
DLQ replay (consumer DLQ):
# Dry-run
npm --prefix services/svc-tenders run replay:kes-dlq -- --from-beginning --max-messages 50
# Execute
npm --prefix services/svc-tenders run replay:kes-dlq -- --execute --max-messages 50
Outbox poison replay:
# Dry-run
npm --prefix services/svc-tenders run replay:kes-outbox-poison -- --max-rows 50
# Execute
npm --prefix services/svc-tenders run replay:kes-outbox-poison -- --execute --max-rows 50
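Both replays follow the same pattern: a dry run by default, with --execute added only once the dry-run output looks right. The sketch below builds the four command lines shown above so that executing is always an explicit opt-in. replayArgs and the dlq/poison keys are hypothetical names; the script names and flags are copied from the commands above, including --from-beginning appearing only on the DLQ dry run.

```javascript
// Sketch: build the replay command lines with dry-run as the default.
// REPLAYS and replayArgs are illustrative names, not part of the repo.
const REPLAYS = {
  dlq: { script: 'replay:kes-dlq', limitFlag: '--max-messages' },
  poison: { script: 'replay:kes-outbox-poison', limitFlag: '--max-rows' },
};

// Returns the npm argument list for a replay; `execute` must be set explicitly.
function replayArgs(kind, { execute = false, limit = 50 } = {}) {
  const replay = REPLAYS[kind];
  if (!replay) throw new Error(`unknown replay kind: ${kind}`);
  const flags = [];
  if (execute) {
    flags.push('--execute');
  } else if (kind === 'dlq') {
    // Mirrors the DLQ dry-run command above, which reads from the beginning.
    flags.push('--from-beginning');
  }
  flags.push(replay.limitFlag, String(limit));
  return ['--prefix', 'services/svc-tenders', 'run', replay.script, '--', ...flags];
}

console.log(['npm', ...replayArgs('dlq')].join(' '));
console.log(['npm', ...replayArgs('poison', { execute: true })].join(' '));
```

The returned array can be passed to child_process.spawnSync('npm', args, { stdio: 'inherit' }); on Windows, spawning npm usually also needs the shell: true option.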
4) Stop
- Stop dev:one terminal (Ctrl+C).
- Stop relay terminal (Ctrl+C).
- Stop infra:
npm run monitoring:down
npm run kafka:down
docker compose -f docker-compose.postgres.yml down
5) Common Failures
A) EADDRINUSE (port already in use)
Symptoms:
listen EADDRINUSE ... :3000/:4001/:4010/:4020/:4060/:4100
Fix:
node scripts/free-port.js 3000
node scripts/free-port.js 4001
node scripts/free-port.js 4010
node scripts/free-port.js 4020
node scripts/free-port.js 4060
node scripts/free-port.js 4100
Then rerun startup command.
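Freeing all six ports is the blunt fix; when the log shows which port collided, only that one needs freeing. A small sketch, assuming the message ends with the address and port as Node's listen errors normally do (for example "address already in use :::4100"). portFromError and fixCommand are hypothetical helper names.

```javascript
// Sketch: map an EADDRINUSE log line to the matching free-port command.
// portFromError and fixCommand are illustrative names, not part of the repo.

// Extract the trailing port number from a line that mentions EADDRINUSE.
function portFromError(message) {
  const match = /EADDRINUSE\b.*:(\d{1,5})\s*$/m.exec(message);
  return match ? Number(match[1]) : null;
}

// Suggest the fix from the runbook for the port found in the message.
function fixCommand(message) {
  const port = portFromError(message);
  return port === null ? null : `node scripts/free-port.js ${port}`;
}

console.log(fixCommand('Error: listen EADDRINUSE: address already in use :::4100'));
// prints "node scripts/free-port.js 4100"
```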
B) Auth service down
Symptoms:
- frontend 401 /api/v1/auth/me repeats
- gateway 502 on auth/oidc routes
Checks:
curl -sS http://127.0.0.1:4100/health
Fix:
- Ensure dev:one is running.
- Or start auth only:
npm run dev:auth
C) Migration missing (relation ... does not exist)
Symptoms:
- relay fails with relation "kes_outbox_events" does not exist
- service errors for missing tables
Fix:
npm run migrate:all
If only tenders schema is missing:
npm --prefix services/svc-tenders run migrate
D) Grafana shows No data
Checks:
curl -sS http://127.0.0.1:4060/metrics
curl -sS "http://127.0.0.1:9090/api/v1/query?query=kes_outbox_relay_up"
Fix:
- Start relay (npm run relay:kes-outbox:dev)
- Confirm Prometheus is up (npm run monitoring:up)
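The two checks narrow down where the data stops: at the relay, or between the relay and Prometheus. The sketch below turns the two response bodies into a verdict. It assumes the relay's /metrics page uses the standard Prometheus text format and exposes kes_outbox_relay_up directly, which the runbook implies but does not state. metricValues and diagnose are hypothetical names.

```javascript
// Sketch: locate where the "No data" gap is from the two check responses.
// metricValues and diagnose are illustrative names, not part of the repo.

// Collect the sample values for one metric from Prometheus text-format output.
function metricValues(text, name) {
  const values = [];
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    // Skip blank lines and # HELP / # TYPE comments.
    if (trimmed === '' || trimmed.startsWith('#')) continue;
    // Sample lines look like `name value` or `name{labels} value [timestamp]`.
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(trimmed);
    if (match && match[1] === name) values.push(Number(match[3]));
  }
  return values;
}

// metricsText: body of the relay's /metrics; promBody: body of the Prometheus query.
function diagnose(metricsText, promBody) {
  if (metricValues(metricsText, 'kes_outbox_relay_up').length === 0) {
    return 'relay-not-exporting'; // start or restart the relay
  }
  let result = [];
  try {
    const parsed = JSON.parse(promBody).data.result;
    if (Array.isArray(parsed)) result = parsed;
  } catch {
    // Treat an unreadable Prometheus response the same as an empty one.
  }
  return result.length === 0
    ? 'prometheus-not-scraping' // bring the monitoring stack up
    : 'metric-reaches-prometheus'; // the gap is on the Grafana side
}
```

The first two verdicts map onto the two fixes above; the third means both checks pass, so the problem is more likely in Grafana itself (for example the dashboard's time range).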
6) 15-minute DoD Checklist
- [ ] db + kafka + monitoring started.
- [ ] migrate:all completed with no errors.
- [ ] dev:one running.
- [ ] relay:kes-outbox:dev running and /health is ok:true.
- [ ] tenders:outbox:live-smoke returns "ok": true.
- [ ] Grafana KES Outbox Overview shows relay/pending/dispatch metrics.
- [ ] Dry-run replay commands execute without crash.