The Gap Nobody Talks About
In 2024, researchers published a monumental 300+ page survey on code intelligence (arXiv:2511.18538). Their core finding was revolutionary: code scales differently than natural language, and scaling laws vary by programming language.
Go benefits from 32k+ context windows. Python plateaus at 16k. Rust’s compiler provides unique optimization signals.
Fast forward to 2026. The hottest AI coding agents have adopted approximately 30% of these insights. The other 70%? Ignored. Lost in the gap between research and production.
This is not academic nitpicking. It is why your AI-generated codebase has 800% more duplication than hand-written code. It is why agents “game” test suites instead of actually understanding your code. It is why we are building unmaintainable systems at unprecedented velocity.
What the Research Found (That Harnesses Ignore)
1. Language-Specific Scaling Laws
The 2024 research derived scaling laws specifically for programming languages, rather than simply adapting the laws developed for text LLMs.
| Language | Optimal Context | Scaling Behaviour |
|---|---|---|
| Go | 32k+ tokens | High signal-to-noise, benefits from more context |
| Python | ~16k tokens | Diminishing returns after 16k |
| JavaScript | Carefully curated | High noise, needs pruning |
| Rust | Type-aware | Compiler feedback most valuable |
What harnesses do: Use 128k context for everything. “More is better.”
The cost: Wasted tokens, slower inference, confused models drowning in irrelevant context.
2. RLVR: Reinforcement Learning with Verifiable Rewards
| Reward Signal | Effectiveness | Risk |
|---|---|---|
| Test-passing | High short-term | Gaming behaviour, overfitting |
| Compiler feedback | Very high | Language-specific implementation |
| Type-checking | High | Requires typed languages |
| Coverage increase | Moderate | May incentivise useless tests |
What harnesses do: Use test-passing rewards. Simple, universal, and flawed.
The cost: Agents learn to pass tests without understanding code. Technical debt accumulates.
3. Repository-Level vs File-Level Understanding
- Benchmarks: single function, isolated, short context
- Reality: multi-file, dependencies, context >100k tokens
What harnesses do: File-level generation with heuristic multi-file support.
The cost: Cross-file refactoring fails. Architecture decisions lack global context.
The 30% That DID Get Adopted
Modern harnesses are not completely ignoring research. They have adopted multi-file changes (Amazon Q, Cursor, Claude Code), domain-specific skills (Vercel’s React Best Practices), semantic code search, and self-healing concepts.
But these are surface-level adoptions. The deep insights (language-specific scaling, compiler feedback loops, optimal tokenisation) remain unrealised.
Why the Gap Exists
- The “Universal Model” Fallacy. Building language-specific models requires infrastructure changes. It is easier to scale one model to 128k context than maintain four language-optimised variants.
- Benchmark Gaming. Test-passing rewards are easy to measure. When you are optimising for benchmarks, you pick the measurable metric.
- Research-to-Production Lag. The 2024 paper’s insights will likely appear in 2026-2027 harnesses. The gap is 2-3 years.
- Framework > Language Mindset. Modern development is framework-centric. The underlying language becomes an implementation detail.
The Opportunity: Build What Harnesses Will Not
The exciting part: we can implement these insights now, before mainstream adoption catches up.
Language-aware agent architecture means routing Go tasks to agents with 32k context windows, Python tasks to 16k, and Rust tasks to agents with compiler feedback loops enabled.
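A minimal sketch of such routing in Python might look like the following. The `AgentProfile` type, the `route` function, and the extension-based language detection are all assumptions made for illustration. The Go and Python budgets come from the figures quoted earlier; the Rust budget and the fallback budget are placeholders, since the article gives no number for either.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical per-language agent settings."""
    language: str
    context_tokens: int
    compiler_feedback: bool


# Go and Python budgets follow the figures quoted in this article.
# The Rust budget is a placeholder assumption; only its feedback flag
# reflects the article's recommendation.
PROFILES = {
    ".go": AgentProfile("go", 32_000, compiler_feedback=False),
    ".py": AgentProfile("python", 16_000, compiler_feedback=False),
    ".rs": AgentProfile("rust", 16_000, compiler_feedback=True),
}

# Assumed conservative default for languages without a tuned profile.
FALLBACK = AgentProfile("unknown", 16_000, compiler_feedback=False)


def route(path: str) -> AgentProfile:
    """Pick an agent profile from the file extension of the task's main file."""
    return PROFILES.get(Path(path).suffix.lower(), FALLBACK)


if __name__ == "__main__":
    for task_file in ("cmd/server/main.go", "app/views.py", "src/lib.rs"):
        print(task_file, route(task_file))
```

Detecting language by file extension is the simplest possible choice; a real router would probably also need to handle mixed-language tasks and build files.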
Compiler feedback loops replace test-only rewards. Instead of just checking if tests pass, the reward function incorporates compiler warnings, type-checker output, and lint results.
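One way to sketch a blended reward of this kind in Python is shown below. The `Diagnostics` fields, the weights, and the `1 / (1 + count)` penalty shape are illustrative assumptions, not values from the survey; in practice the counts would be parsed from the output of whatever compiler, type checker, and linter the project uses.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    """Hypothetical summary of one build-and-test run."""
    tests_passed: int
    tests_total: int
    compiler_errors: int
    compiler_warnings: int
    type_errors: int
    lint_findings: int


def reward(d: Diagnostics,
           w_tests: float = 0.5,
           w_compile: float = 0.3,
           w_types: float = 0.1,
           w_lint: float = 0.1) -> float:
    """Blend test results with compiler, type-checker, and lint signals.

    The weights are assumptions chosen so that a clean run scores 1.0.
    """
    # Code that does not compile earns nothing, whatever the tests say.
    if d.compiler_errors > 0:
        return 0.0

    test_score = d.tests_passed / d.tests_total if d.tests_total else 0.0
    # Each score is 1.0 when clean and decays as findings accumulate.
    compile_score = 1.0 / (1 + d.compiler_warnings)
    type_score = 1.0 / (1 + d.type_errors)
    lint_score = 1.0 / (1 + d.lint_findings)

    return (w_tests * test_score
            + w_compile * compile_score
            + w_types * type_score
            + w_lint * lint_score)


if __name__ == "__main__":
    clean = Diagnostics(10, 10, 0, 0, 0, 0)
    noisy = Diagnostics(10, 10, 0, 3, 2, 5)
    print(reward(clean), reward(noisy))
```

With this shape, two patches that both pass every test no longer score the same: the one that also compiles without warnings and type-checks cleanly is rewarded more.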
Repository-aware context management prioritises files by dependency graph, not file size. The context window allocation is language-specific.
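As a rough sketch of dependency-first prioritisation, the Python below walks the import graph breadth-first from the file being edited and packs files into the context in order of graph distance until the token budget runs out. The function name `select_context`, the graph, and the token counts are hypothetical; building the real dependency graph and counting tokens are left out.

```python
from collections import deque
from typing import Dict, List, Sequence


def select_context(target: str,
                   deps: Dict[str, Sequence[str]],
                   token_counts: Dict[str, int],
                   budget: int) -> List[str]:
    """Choose context files nearest to `target` in the dependency graph.

    Files are visited breadth-first, so direct dependencies are considered
    before transitive ones. A file that does not fit the remaining budget
    is skipped, but its own dependencies are still visited.
    """
    selected: List[str] = []
    remaining = budget
    seen = {target}
    queue = deque([target])

    while queue:
        path = queue.popleft()
        cost = token_counts.get(path, 0)
        if cost <= remaining:
            selected.append(path)
            remaining -= cost
        # Sorted for a deterministic order among files at the same distance.
        for dep in sorted(deps.get(path, ())):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)

    return selected


if __name__ == "__main__":
    # Hypothetical repository: api.py imports models.py and auth.py,
    # both of which import db.py. vendor_bundle.py is large but unrelated.
    deps = {
        "api.py": ["models.py", "auth.py"],
        "models.py": ["db.py"],
        "auth.py": ["db.py"],
    }
    tokens = {"api.py": 4000, "models.py": 6000, "auth.py": 3000,
              "db.py": 5000, "vendor_bundle.py": 90000}
    print(select_context("api.py", deps, tokens, budget=16_000))
    # -> ['api.py', 'auth.py', 'models.py']
```

The large unrelated file never enters the context because it is unreachable from the target, regardless of its size. The `budget` argument is where a language-specific allocation would plug in.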
The 800% duplication increase in AI-generated code is not inevitable. It is a consequence of ignoring research insights. By implementing language-specific optimisation, compiler feedback, and repo-level understanding, we can reduce token costs by 30-50%, improve code quality, and prevent technical debt.
The 2-3 year lag between research and production is our window. By the time mainstream harnesses catch up to 2024’s insights, we will be implementing 2026’s research.
That is the Terraphim way: knowledge-first, implementation-second, benchmarks-third.
References
- From Code Foundation Models to Agents and Applications - Yang et al., 2024 (arXiv:2511.18538)
- GitClear 2025 State of AI Code Quality - Code duplication analysis