The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems
A new paper from arXiv (cs.AI) reveals a structural vulnerability in multi-agent AI pipelines: the 'Misattribution Gap.' The authors formalize 'Semantic Norm Drift' (SND), a third path to agent misconduct distinct from misalignment or collusion. In SND, a policy-formatted document enters a shared vector store via normal uploads and later reappears as trusted system context after provenance is lost through a 'Trust Laundering Chain.' Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. The paper introduces Counterfactual Composition Testing as a potential mitigation.
Developers cannot trust model attribution in multi-agent systems; memory poisoning can masquerade as model failure.