Self-Healing Agents
AI workflows that detect failures and fix themselves — automatically.
"The best incident response is no incident at all."
Self-healing systems don't just handle failures — they anticipate and prevent them. By monitoring runtime signals and applying corrective actions automatically, Nika keeps workflows running without human intervention.
From Reactive to Proactive
Traditional: Reactive
- Wait for failure
- Alert human operators
- Manual investigation
- Manual fix
- Manual restart
MTTR: Hours to Days
Nika: Proactive
- Monitor health signals
- Detect anomalies early
- Apply automatic fixes
- Verify recovery
- Resume seamlessly
MTTR: Seconds
The 4-Level Healing Ladder
L1: Retry
AutomatedSimple retry with backoff. Works for transient failures.
L2: Adapt
AutomatedModify execution parameters based on failure type.
L3: Reroute
AutomatedSwitch execution path entirely. Different provider or strategy.
L4: Escalate
Human RequiredHuman intervention required. System preserves state for resume.
Signal → Action Matrix
SHAKA monitors these signals and triggers automatic healing actions:
| Signal | Threshold | Healing Action | Type |
|---|---|---|---|
| Retry Count | > 3 | Switch model | health |
| Latency | > 30s | Reduce context | performance |
| Token Usage | > 90% | Compress history | budget |
| Parse Errors | > 2 | Patch prompt | quality |
| Collapse Risk | > 70% | Trim scope | epistemic |
| Tool Failures | > 50% | Disable tool | health |
In Practice: Healing Policies
shaka:
mode: live_gated_safe # Enable automatic healing
healing:
enabled: true
policies:
- signal: retry_count
threshold: 3
action: switch_model
fallback: claude-haiku
- signal: latency
threshold: 30000 # 30s
action: reduce_context
trim: 50%
- signal: collapse_risk
threshold: 7000 # 70% on 0-10000 scale
action: isolate_scope
preset: minimal
- signal: tool_errors
threshold: 0.5 # 50% failure rate
action: disable_tool
notify: true
tasks:
- id: analyze
agent:
prompt: "Analyze codebase"
tools: [Read, Grep, Glob]
maxTurns: 10
# SHAKA monitors and heals automaticallyPowered by SHAKA
Self-healing is implemented through the SHAKA System — Nika's runtime sidecar that observes, analyzes, and proposes actions. SHAKA detects issues before they cascade and applies L1/L2 healing actions automatically, while respecting the core invariant: "SHAKA proposes. NIKA disposes."
Implementation Status
- Retry with backoff — Implemented
- Timeout handling — Implemented
- SHAKA signal monitoring — In Design
- Automatic healing policies — In Design
- Learning from corrections — Future