Context Engineering
The art and science of making every token count in AI agent workflows.
The Context Window Problem
From JetBrains Research (Dec 2024):
"LLMs struggle to utilize full context windows effectively, often performing well only on 10-20% of advertised capacity due to quadratic attention scaling and poor recall in extended sequences."
You're paying for 200K tokens, but your model might only effectively use 20-40K. In agentic workflows, this problem compounds as conversations grow.
Quadratic attention scaling: compute cost grows quadratically with context length, not linearly (see the sketch after this list).
Recall degradation: models lose track of information buried deep in long contexts.
Only 10-20% effective utilization: you pay for tokens that don't actually help.
Provider differences: Claude, GPT, and Gemini each handle long context differently.
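To make the quadratic point concrete, here is a back-of-the-envelope sketch in Python (plain arithmetic, not a real cost model): self-attention compares every token with every other token, so a 10x longer context costs roughly 100x the attention work.

# Back-of-the-envelope sketch: self-attention compares every token with
# every other token, so attention work grows with the square of length.
def attention_pairs(context_tokens: int) -> int:
    return context_tokens * context_tokens

for tokens in (20_000, 40_000, 200_000):
    print(f"{tokens:>7} tokens -> {attention_pairs(tokens):.2e} pairwise comparisons")

# 200K tokens is only 10x longer than 20K, but needs ~100x the attention work.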
Our Multi-Layered Approach
Context engineering in Nika happens at three levels:
# Level 1: Scope Presets (implemented)
# Control what context each agent starts with
scopePreset: minimal   # 200K fresh, no inherited context
scopePreset: default   # Position-aware, ancestors only
scopePreset: full      # Full accumulation

# Level 2: Smart Allocation (planned)
# Automatic token budgeting per task type
agent:
  prompt: "Analyze code"
  contextBudget: 50000  # Reserve for this task

# Level 3: Trajectory Management (research)
# Compress or summarize long conversations

Techniques We're Exploring
Observation masking: replace older tool observations with placeholders while retaining recent turns in full (see the sketch after this list). JetBrains 2024: matches LLM summarization in cost savings.
Rolling window: keep only the latest N turns in full detail. Roughly 10 turns is optimal based on SWE-agent benchmarks.
Scope-based budgets: different token budgets per scope preset. Our approach: minimal starts with 200K fresh, full accumulates everything.
Trajectory compression: condense interaction history via specialized models. SWE-Compressor reports a 57.6% solve rate on SWE-Bench.
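A minimal sketch of the first two techniques above, assuming the agent history is a simple list of role/content dicts. The function names and the 10-turn default are illustrative choices for this sketch, not Nika's API.

# Illustrative sketch: observation masking plus a rolling window.
# Assumes history is a list of {"role": ..., "content": ...} dicts.
PLACEHOLDER = "[observation elided to save tokens]"

def mask_old_observations(history, keep_recent=10):
    # Replace tool observations older than the last `keep_recent` turns
    # with a short placeholder, keeping the conversational skeleton intact.
    cutoff = max(0, len(history) - keep_recent)
    return [
        {"role": "tool", "content": PLACEHOLDER}
        if i < cutoff and turn["role"] == "tool" else turn
        for i, turn in enumerate(history)
    ]

def rolling_window(history, max_turns=10):
    # Keep only the latest N turns in full detail; drop everything older.
    return history[-max_turns:]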
Key Research Insights (2024-2025)
SWE-Compressor (arXiv, Dec 2024)
Trajectory-level supervision that injects context-management actions into agent interactions. Achieves 57.6% solve rate on SWE-Bench-Verified under bounded context.
Key insight: Proactively condensing history beats reactive truncation.
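In code terms, the difference looks roughly like this (a hypothetical trigger, not SWE-Compressor's actual mechanism): condense when a soft threshold is crossed, instead of truncating only after the hard limit is hit.

# Hypothetical sketch of proactive condensation vs. reactive truncation.
# `summarize` stands in for any condensation step (an LLM call, masking, etc.).
def proactive(history, history_tokens, budget, summarize, soft_ratio=0.7):
    if history_tokens > budget * soft_ratio:   # act before the window is full
        return summarize(history)
    return history

def reactive(history, history_tokens, budget):
    while history and history_tokens > budget: # only act once the limit is hit
        dropped = history.pop(0)               # drop oldest turns wholesale
        history_tokens -= len(dropped["content"]) // 4   # rough token estimate
    return history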
TITANS Architecture (Dec 2024)
Hybrid models combining recurrent architecture with neural memory modules. Scales to >2M tokens with higher accuracy than transformers or RAG-augmented models.
Key insight: Store "surprise" information, discard predictable content.
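As a loose analogy in code (not the Titans mechanism, which uses a gradient-based neural memory): write an item to long-term memory only when it is surprising relative to what the model already predicts.

# Loose analogy of surprise-gated memory, not the actual TITANS architecture.
# `prediction_error` stands in for any surprise measure (e.g. token-level loss).
def update_memory(memory, item, prediction_error, threshold=2.0):
    if prediction_error > threshold:   # surprising: worth remembering
        memory.append(item)
    return memory                      # predictable content is discarded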
JetBrains Efficient Context Study
Observation masking with rolling windows (10 turns optimal) achieves >50% cost reduction without performance loss.
Key insight: Recent context matters more than complete context.
Why This Matters for You
50%+ cost reduction possible
10x longer effective workflows
Zero performance degradation
Context engineering is a key focus area
We're implementing these techniques now.