Active Research

Epistemic Awareness

Teaching AI systems to know what they don't know — and to tell us about it.

The Problem We're Solving

Large Language Models hallucinate. They make confident-sounding claims about things that aren't true. In a simple chatbot, this is annoying. In an agentic workflow that's making decisions, writing code, or calling APIs, it can be dangerous.

Current approaches to this problem mostly focus on the model itself — fine-tuning, RLHF, or prompting techniques. But we believe there's a complementary approach that's been underexplored: runtime observation.

Our Approach: Runtime Epistemics

Instead of asking the model "are you sure?", we observe what's happening during execution and compute objective signals about the system's epistemic state.

# Philosophy (from our internal docs)
EpistemicAwareness = SHAKA's sensing layer (NOT standalone)

Key Insight:
- We DON'T ask the LLM to self-evaluate (unreliable)
- We DO observe runtime behavior (objective)
- We compute collapse risk from signals (deterministic)
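
To make "observe, don't ask" concrete, here is a minimal sketch in TypeScript of what a runtime observer could look like. Everything here (the RuntimeObserver class, event names, the 30-second timeout) is illustrative and assumed, not an actual API from our codebase: the point is that signals come from what actually happened during execution, not from the model's self-report.

// Minimal sketch: record objective runtime events as side effects of real behavior.
type RuntimeEvent =
  | { kind: "retry"; op: string }
  | { kind: "toolError"; op: string; message: string }
  | { kind: "timeout"; op: string; elapsedMs: number };

class RuntimeObserver {
  private events: RuntimeEvent[] = [];

  // Wrap any async operation; flag slow runs and capture thrown errors.
  async observe<T>(op: string, fn: () => Promise<T>, timeoutMs = 30_000): Promise<T> {
    const start = Date.now();
    try {
      const result = await fn();
      const elapsedMs = Date.now() - start;
      if (elapsedMs > timeoutMs) {
        this.events.push({ kind: "timeout", op, elapsedMs });
      }
      return result;
    } catch (err) {
      this.events.push({ kind: "toolError", op, message: String(err) });
      throw err;
    }
  }

  recordRetry(op: string): void {
    this.events.push({ kind: "retry", op });
  }

  snapshot(): RuntimeEvent[] {
    return [...this.events];
  }
}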

Runtime Signals We Track

Category  Signal             Description
health    Retries            Number of retry attempts for failed operations
health    Tool Errors        Errors from tool invocations
health    Timeouts           Operations exceeding expected duration
health    Stalls             Periods of no progress or activity
quality   Schema Failures    Output not matching expected schema
quality   Parse Failures     Unable to parse structured output
quality   Repairs Needed     Number of output corrections
evidence  Evidence Coverage  Ratio of claims with supporting evidence
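
As a rough sketch of how these signals could be grouped in code, here is one possible TypeScript shape. The interface and field names are ours for illustration, not a published schema:

// Illustrative shape for the signals above; counts per run unless noted.
interface EpistemicSignals {
  health: {
    retries: number;        // retry attempts for failed operations
    toolErrors: number;     // errors from tool invocations
    timeouts: number;       // operations exceeding expected duration
    stalls: number;         // periods of no progress or activity
  };
  quality: {
    schemaFailures: number; // output not matching the expected schema
    parseFailures: number;  // structured output that could not be parsed
    repairsNeeded: number;  // output corrections applied
  };
  evidence: {
    evidenceCoverage: number; // ratio of claims with supporting evidence (0-1)
  };
}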

Collapse Risk Scoring

These signals feed into a deterministic scoring system that computes a "collapse risk": a basis-point score (0-10000) estimating the probability that the current execution is heading toward failure.

# Collapse Risk Levels (ScoreBp 0-10000)
Low (0-2500): Normal operation
Medium:       Increased monitoring
High:         Mitigation activated
Critical:     Early stop possible
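
To show what a deterministic scorer might look like, here is a sketch that maps the EpistemicSignals shape above to a basis-point score and a level. The weights, the evidence penalty, and the Medium/High/Critical cutoffs are placeholder assumptions for illustration only; the only boundary stated above is the Low band at 2500.

type RiskLevel = "low" | "medium" | "high" | "critical";

// Placeholder per-signal weights in basis points (not our calibrated values).
const WEIGHTS_BP = {
  retries: 400,
  toolErrors: 800,
  timeouts: 600,
  stalls: 600,
  schemaFailures: 700,
  parseFailures: 700,
  repairsNeeded: 300,
} as const;

function collapseRiskBp(s: EpistemicSignals): number {
  let score = 0;
  score += s.health.retries * WEIGHTS_BP.retries;
  score += s.health.toolErrors * WEIGHTS_BP.toolErrors;
  score += s.health.timeouts * WEIGHTS_BP.timeouts;
  score += s.health.stalls * WEIGHTS_BP.stalls;
  score += s.quality.schemaFailures * WEIGHTS_BP.schemaFailures;
  score += s.quality.parseFailures * WEIGHTS_BP.parseFailures;
  score += s.quality.repairsNeeded * WEIGHTS_BP.repairsNeeded;
  // Missing evidence raises risk: zero coverage adds up to 2000 bp (placeholder).
  score += Math.round((1 - s.evidence.evidenceCoverage) * 2000);
  return Math.min(10_000, score); // clamp to the 0-10000 scale
}

function riskLevel(scoreBp: number): RiskLevel {
  if (scoreBp <= 2500) return "low";    // Low band from the table above
  if (scoreBp <= 5000) return "medium"; // placeholder cutoff
  if (scoreBp <= 7500) return "high";   // placeholder cutoff
  return "critical";
}

Because the mapping is a fixed function of observed counts, the same execution trace always yields the same score, which is what makes the risk signal auditable. The open question of how to set these weights per task type is exactly the calibration problem listed below.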

Open Questions (We Don't Have All the Answers)

  • How do we calibrate signal weights across different task types?
  • Can we train a meta-model to predict collapse risk more accurately?
  • What's the right balance between false positives and missed failures?
  • How do epistemic signals differ across model providers (Claude vs GPT vs Gemini)?

Want to explore this with us?

We're looking for collaborators and early testers.