The 30% Problem: Why AI Agent Harnesses Ignore Code Intelligence Research

In 2024, researchers published a monumental 300+ page survey on code intelligence. AI coding agents adopted approximately 30% of its insights. The other 70%? Ignored.

The Gap Nobody Talks About

In 2024, researchers published a monumental 300+ page survey on code intelligence (arXiv:2511.18538). Their core finding was revolutionary: code scales differently than natural language, and scaling laws vary by programming language.

Go benefits from 32k+ context windows. Python plateaus at 16k. Rust’s compiler provides unique optimization signals.

Fast forward to 2026. The hottest AI coding agents have adopted approximately 30% of these insights. The other 70%? Ignored. Lost in the gap between research and production.

This is not academic nitpicking. It is why your AI-generated codebase has 800% more duplication than hand-written code. It is why agents “game” test suites instead of actually understanding your code. It is why we are building unmaintainable systems at unprecedented velocity.

What the Research Found (That Harnesses Ignore)

1. Language-Specific Scaling Laws

The 2024 research derived scaling laws specifically for programming languages — not just adapting text LLM laws.

Language	Optimal Context	Scaling Behaviour
Go	32k+ tokens	High signal-to-noise, benefits from more context
Python	~16k tokens	Diminishing returns after 16k
JavaScript	Carefully curated	High noise, needs pruning
Rust	Type-aware	Compiler feedback most valuable

What harnesses do: Use 128k context for everything. “More is better.”

The cost: Wasted tokens, slower inference, confused models drowning in irrelevant context.

2. RLVR: Reinforcement Learning with Verified Rewards

Reward Signal	Effectiveness	Risk
Test-passing	High short-term	Gaming behaviour, overfitting
Compiler feedback	Very high	Language-specific implementation
Type-checking	High	Requires typed languages
Coverage increase	Moderate	May incentivise useless tests

What harnesses do: Use test-passing rewards. Simple, universal, and flawed.

The cost: Agents learn to pass tests without understanding code. Technical debt accumulates.

3. Repository-Level vs File-Level Understanding

Benchmarks: Single function, isolated, short context
Reality:     Multi-file, dependencies, context >100k tokens

What harnesses do: File-level generation with heuristic multi-file support.

The cost: Cross-file refactoring fails. Architecture decisions lack global context.

The 30% That DID Get Adopted

Modern harnesses are not completely ignoring research. They have adopted multi-file changes (Amazon Q, Cursor, Claude Code), domain-specific skills (Vercel’s React Best Practices), semantic code search, and self-healing concepts.

But these are surface-level adoptions. The deep insights — language-specific scaling, compiler feedback loops, optimal tokenisation — remain unrealised.

Why the Gap Exists

The “Universal Model” Fallacy. Building language-specific models requires infrastructure changes. It is easier to scale one model to 128k context than maintain four language-optimised variants.
Benchmark Gaming. Test-passing rewards are easy to measure. When you are optimising for benchmarks, you pick the measurable metric.
Research-to-Production Lag. The 2024 paper’s insights will likely appear in 2026-2027 harnesses. The gap is 2-3 years.
Framework > Language Mindset. Modern development is framework-centric. The underlying language becomes an implementation detail.

The Opportunity: Build What Harnesses Will Not

The exciting part: we can implement these insights now, before mainstream adoption catches up.

Language-aware agent architecture means routing Go tasks to agents with 32k context windows, Python tasks to 16k, and Rust tasks to agents with compiler feedback loops enabled.

Compiler feedback loops replace test-only rewards. Instead of just checking if tests pass, the reward function incorporates compiler warnings, type-checker output, and lint results.

Repository-aware context management prioritises files by dependency graph, not file size. The context window allocation is language-specific.

The 800% duplication increase in AI-generated code is not inevitable. It is a consequence of ignoring research insights. By implementing language-specific optimisation, compiler feedback, and repo-level understanding, we can reduce token costs by 30-50%, improve code quality, and prevent technical debt.

The 2-3 year lag between research and production is our window. By the time mainstream harnesses catch up to 2024’s insights, we will be implementing 2026’s research.

That is the Terraphim way: knowledge-first, implementation-second, benchmarks-third.

References

From Code Foundation Models to Agents and Applications - Yang et al., 2024
GitClear 2025 State of AI Code Quality - Code duplication analysis