Design Phase

Self-Healing Agents

AI workflows that detect failures and fix themselves — automatically.

"The best incident response is no incident at all."

Self-healing systems don't just handle failures — they anticipate and prevent them. By monitoring runtime signals and applying corrective actions automatically, Nika keeps workflows running without human intervention.

From Reactive to Proactive

Traditional: Reactive

  • Wait for failure
  • Alert human operators
  • Manual investigation
  • Manual fix
  • Manual restart

MTTR: Hours to Days

Nika: Proactive

  • Monitor health signals
  • Detect anomalies early
  • Apply automatic fixes
  • Verify recovery
  • Resume seamlessly

MTTR: Seconds

The 4-Level Healing Ladder

L1: Retry

Automated

Simple retry with backoff. Works for transient failures.

Immediate retryExponential backoffJitter

L2: Adapt

Automated

Modify execution parameters based on failure type.

Reduce context sizeSwitch to faster modelSimplify prompt

L3: Reroute

Automated

Switch execution path entirely. Different provider or strategy.

Provider failoverAlternative workflow pathCached fallback

L4: Escalate

Human Required

Human intervention required. System preserves state for resume.

Alert operatorsPreserve stateProvide diagnostics

Signal → Action Matrix

SHAKA monitors these signals and triggers automatic healing actions:

SignalThresholdHealing ActionType
Retry Count> 3Switch modelhealth
Latency> 30sReduce contextperformance
Token Usage> 90%Compress historybudget
Parse Errors> 2Patch promptquality
Collapse Risk> 70%Trim scopeepistemic
Tool Failures> 50%Disable toolhealth

In Practice: Healing Policies

self-healing.nika.yaml
shaka:
  mode: live_gated_safe        # Enable automatic healing

  healing:
    enabled: true
    policies:
      - signal: retry_count
        threshold: 3
        action: switch_model
        fallback: claude-haiku

      - signal: latency
        threshold: 30000        # 30s
        action: reduce_context
        trim: 50%

      - signal: collapse_risk
        threshold: 7000         # 70% on 0-10000 scale
        action: isolate_scope
        preset: minimal

      - signal: tool_errors
        threshold: 0.5          # 50% failure rate
        action: disable_tool
        notify: true

tasks:
  - id: analyze
    agent:
      prompt: "Analyze codebase"
      tools: [Read, Grep, Glob]
      maxTurns: 10

# SHAKA monitors and heals automatically

Powered by SHAKA

Self-healing is implemented through the SHAKA System — Nika's runtime sidecar that observes, analyzes, and proposes actions. SHAKA detects issues before they cascade and applies L1/L2 healing actions automatically, while respecting the core invariant: "SHAKA proposes. NIKA disposes."

Implementation Status

  • Retry with backoff — Implemented
  • Timeout handling — Implemented
  • SHAKA signal monitoring — In Design
  • Automatic healing policies — In Design
  • Learning from corrections — Future