Gated Differentiable Working Memory for Long-Context Language Modeling

Abstract

Long contexts break transformers: attention scores dilute across thousands of tokens, critical information gets lost in the middle, and the model cannot adapt to novel patterns at inference time. We reframe test-time adaptation as a budget-constrained memory consolidation problem and propose GDWM (Gated Differentiable Working Memory), a framework that introduces a Write Controller to gate the memory consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of how much each region depends on long-range context, and allocates gradient steps accordingly, subject to a coverage constraint that ensures global representation. Experiments on the ZeroSCROLLS and LongBench v2 benchmarks show that GDWM matches or exceeds the performance of uniform baselines with 4× fewer gradient steps, establishing a new efficiency-performance Pareto frontier for test-time adaptation.
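The allocation idea described above can be sketched in a few lines. This is only an illustrative sketch, not the paper's Write Controller: the function name `allocate_steps`, the per-region floor `min_steps` (standing in for the coverage constraint), and the proportional split with largest-remainder rounding are all assumptions made for this example. The utility scores are taken as given inputs, since the abstract does not specify how Contextual Utility is computed.

```python
from typing import List


def allocate_steps(utilities: List[float], budget: int, min_steps: int = 1) -> List[int]:
    """Split a total gradient-step budget across context regions.

    Hypothetical scheme (an assumption, not the paper's algorithm):
    every region first receives `min_steps` (a simple coverage floor),
    and the remaining steps are divided in proportion to utility.
    """
    n = len(utilities)
    if n == 0:
        return []
    if any(u < 0 for u in utilities):
        raise ValueError("utilities must be non-negative")
    if budget < n * min_steps:
        raise ValueError("budget too small to satisfy the coverage floor")

    remaining = budget - n * min_steps
    total = sum(utilities)

    # Ideal (fractional) share of the remaining budget for each region.
    if total > 0:
        shares = [u * remaining / total for u in utilities]
    else:
        # No utility signal at all: fall back to a uniform split.
        shares = [remaining / n] * n

    # Largest-remainder rounding so the integer steps sum exactly to `budget`.
    floors = [int(s) for s in shares]
    steps = [min_steps + f for f in floors]
    leftover = remaining - sum(floors)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - floors[i], reverse=True)
    for i in by_remainder[:leftover]:
        steps[i] += 1
    return steps


if __name__ == "__main__":
    # Three regions; the middle one depends most on long-range context.
    print(allocate_steps([1, 6, 3], budget=13))  # [2, 7, 4]
```

With a floor of one step per region, no part of the context is skipped entirely, while high-utility regions absorb most of the budget. A uniform baseline corresponds to passing equal utilities for every region.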

Publication
Proc. of the Association for Computational Linguistics, ACL Main, 2026, pages 31885–31913