ContextEcho: Compaction Does Not Correct Persona Drift, Benchmark on 23 Models

TLDR : ContextEcho's benchmark reveals that context compaction fails to reliably correct persona drift in AI models, proposing a single-shot anchor as a solution across 23 models.

Context compaction, the standard mechanism activated by deployers to manage long agent sessions without saturating the window, does not reliably correct persona drift. An open-source benchmark documents this point across 23 frontier models and proposes a tested response: a single-shot anchor restores the original registry trained on all evaluated targets, without retraining, via the standard chat-completions API. The work, named ContextEcho, was filed on arXiv on May 22, 2026, by Xianzhong Ding, researcher at the Center for Advanced AI at Accenture and former postdoctoral fellow at the Lawrence Berkeley National Lab (2024-2025) according to his OpenReview profile; it is also submitted to the NeurIPS 2026 Evaluations & Datasets Track, currently under double-blind review. The test environment, published on Hugging Face, is accompanied by a test harness deposited in an anonymized repository with restricted access during the anonymous evaluation process.

23 evaluated models: declared robustness, limited verifiability

ContextEcho reports results on 23 frontier models from different organizations, without publishing their nominative list in accessible sources. The three reference sessions used are anonymized, restricting external reproducibility. The benchmark is also under double-blind review at the NeurIPS 2026 Evaluations & Datasets Track: its conclusions have not yet been subject to published peer review.

A suite of 25 probes connected without disrupting the session

The architecture combines four blocks. A suite of 25 identity probes (25-probe identity suite) questions the behavioral consistency of the model; a snapshot-then-probe protocol bifurcates the conversational state without disrupting the main session, allowing the measurement of drift without causing it; complementary measurement surfaces judged (evaluation by a judge-model) and judge-free (metrics calculated without intermediate LLMs) cross both approaches. The whole relies on three anonymized Claude Code sessions covering respectively 3,746, and up to 9,716 conversation turns, a volume beyond the reach of classic persona stability protocols, which focus on short dialogues. According to the authors, the evaluation covers 23 frontier models from different organizations, whose nominative list is not published at this stage: the robustness of the single-shot anchor is attested on all evaluated targets, but the precise conditions of each target remain unverifiable independently outside the paper's scope. In downstream usage, the effect is mode-dependent: in tool-free mode, drift breaks formatting contracts and inflates output length; in tool-enabled mode, it can facilitate the continuation of tool usage.

General drift, and a standard remedy that does not hold

The first structuring lesson is of transversal scope: persona drift is observed generally across organizations and is not specific to a model family. Across the evaluated panel, no technical lineage (whether from an American, European, or Asian lab) seems immune. The second lesson targets a mechanism often presented as a solution: in-session compaction does not reliably reset persona drift. Yet, compaction (sliding context summary during the conversation) is precisely the lever that deployers activate to manage long sessions without saturating the window. The authors' finding directly concerns production agent architectures relying on this mechanism. The result remains to be confirmed independently: the work is submitted to the NeurIPS track under anonymous review, and compaction implementations vary significantly from one system to another, calling for caution before any industrial generalization. Behavioral consistency of agents in long sessions is now an active topic: adjacent work from Purdue, When the Specification Emerges, simultaneously examines the loss of fidelity of a coding agent as the specification gradually emerges. In the broader field of AI behavioral evaluation, ActuIA already noted that Google DeepMind proposed a framework to classify IAG capabilities and behavior, illustrating the field's maturation towards standardized measurement protocols.

No technical lineage seems immune.

Persona drift is observed generally across organizations and is not specific to a model family - according to the authors of ContextEcho on 23 frontier targets.

A benchmark backed by a consulting firm, not a pure academic laboratory

The institutional affiliation of the contribution deserves to be noted. Xianzhong Ding has been a researcher at the Center for Advanced AI at Accenture since 2025, following a postdoctoral fellowship at the Lawrence Berkeley National Lab between 2024 and 2025, and a PhD in Electrical Engineering and Computer Sciences at UC Merced. His profile thus crosses American public energy and applied research in a large consulting firm. ActuIA already documented the group's growing investment on this front, according to the firm's announcements: Accenture announced a $3 billion investment in AI and Data in 2023 according to its own communication, and, again according to the group, strengthened its presence in France with two centers dedicated to generative AI. ContextEcho fits into this policy of producing published research: the work targets a first-rate international academic venue (NeurIPS), with a cell evaluation corpus and given session prefixes, made available on Hugging Face with the same submission. The methodological particularity lies in the deployment anchor: three anonymized Claude Code sessions are used as base data, indicating that the authors favored traces from actual use rather than synthetic test benches, a distinction that matters in a field where many evaluation protocols still rely on lab-constructed dialogues.

The ActuIA eye

The real subject of ContextEcho is not the textual anchor, it is the finding that makes it necessary: compaction, the spring that teams activate by default to manage long sessions, does not fulfill the promise of coherence. The orchestration layer of agent deployers has thus relied, for eighteen months, on a stopgap that the authors say is failing on 23 frontier models.

Translated from ContextEcho : la compaction ne corrige pas la dérive de persona, benchmark sur 23 modèles

By category

By sector