How Far Will They Go? Red-Teaming Online Influence with Large Language Models
A new arXiv paper (2605.22880) presents a framework for red-teaming LLMs' susceptibility to being used in political influence campaigns. The authors define a model's Overton Window (OW) as the range of political opinions it can reliably express on controversial topics. They tested more than 30 open-source LLMs spanning 10 model families and five countries, and found that models are generally more willing to generate left-leaning social media content. OWs contract inversely with model size, and regional differences are substantial. Simple natural-language jailbreaks can expand these windows, though their potency varies sharply across model families. The study focuses on locally deployed open-source models because they match the operational constraints of privacy-conscious malicious actors. Together, the findings highlight asymmetries in political expressivity and underscore the need for robust red-teaming to safeguard information integrity as LLM-based agents become more prevalent in online discourse.
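To make the Overton Window idea concrete, here is a minimal Python sketch of one way such a window could be estimated. It is not the paper's method: the numeric stance scale (negative for left, positive for right), the per-stance compliance rates, the 0.8 threshold, and the names `OvertonWindow` and `estimate_window` are all assumptions introduced for illustration, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OvertonWindow:
    """Hypothetical container: the span of stances a model reliably expresses."""
    left: float
    right: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def midpoint(self) -> float:
        # A negative midpoint means the window sits left of center.
        return (self.left + self.right) / 2


def estimate_window(compliance: Dict[float, float],
                    threshold: float = 0.8) -> Optional[OvertonWindow]:
    """Return the span of stances whose compliance rate meets the threshold.

    `compliance` maps a stance position to the fraction of prompts for which
    the model produced the requested content instead of refusing.
    Assumption: the window is the min-to-max span of passing stances, so any
    gap in the middle is ignored.
    """
    expressible = [s for s, rate in compliance.items() if rate >= threshold]
    if not expressible:
        return None
    return OvertonWindow(left=min(expressible), right=max(expressible))


# Invented illustrative numbers, not results from the paper.
baseline = {-2: 0.95, -1: 0.99, 0: 1.0, 1: 0.90, 2: 0.40}
jailbroken = {-2: 0.97, -1: 0.99, 0: 1.0, 1: 0.95, 2: 0.85}

base_ow = estimate_window(baseline)
jail_ow = estimate_window(jailbroken)
print(base_ow, base_ow.width, base_ow.midpoint)
print(jail_ow, jail_ow.width, jail_ow.midpoint)
```

In this toy data the baseline window spans -2 to 1 and is skewed left, while the jailbroken window widens to cover the full -2 to 2 scale. That mirrors the kind of asymmetry and jailbreak-driven expansion the summary describes, without claiming anything about the actual measurements.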
Developers deploying open-source LLMs in social media contexts should account for these political biases.