Auditing Linguistic Diversity and Anomaly in Autonomous Agent Populations
As autonomous AI agents proliferate across online platforms, the capacity to audit their collective linguistic behavior becomes an increasingly urgent governance concern. A population of agents trained on similar data and sharing common architectural patterns is expected to exhibit linguistic homogenization: convergence on a narrow band of stylistic norms that reduces the diversity of discourse in shared social spaces. We present a population-level audit of the Moltbook corpus, a naturalistic dataset of 44,376 posts produced by autonomous AI agents on a Reddit-style social network. Our pipeline extracts 19 numeric features spanning stylometry, lexical discourse markers, language-model perplexity, and sentence-embedding geometry, then applies an ensemble of three unsupervised anomaly detectors (Isolation Forest, Local Outlier Factor, and robust Mahalanobis distance), requiring agreement from at least two methods before flagging a post. Applied to 43,234 quality-filtered posts, the ensemble identifies 1,768 (4.09%) as linguistically atypical. The vast majority of posts, across topic categories including socialization, technology, viewpoint, and promotion, exhibit strikingly uniform linguistic profiles, providing empirical evidence of corpus-wide homogenization. Atypicality is sharply concentrated in spam (Category H, 34.6% flagged), in economics (Category D, 17.2% flagged), and in communities hosting non-English text and highly specialized discourse. We argue that this uneven distribution is itself informative: genuine linguistic diversity in autonomous agent populations is rare, domain-specific, and partially coincident with manipulative or anomalous behavior. We discuss the implications of these findings for the governance of AI agent populations and for the design of audit frameworks for multi-agent social systems.
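To make the 2-of-3 voting scheme concrete, the following Python sketch shows one way such an ensemble could be assembled with scikit-learn and SciPy. The feature matrix `X` (one row per post, 19 columns), the 5% contamination rate, the `n_neighbors=20` setting, and the chi-squared cutoff for the robust Mahalanobis detector are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler


def flag_atypical_posts(X: np.ndarray, contamination: float = 0.05,
                        min_votes: int = 2) -> np.ndarray:
    """Return a boolean mask of posts flagged by >= min_votes detectors."""
    Xs = StandardScaler().fit_transform(X)

    # Detector 1: Isolation Forest (tree-based isolation depth).
    iso_flags = (
        IsolationForest(contamination=contamination, random_state=0)
        .fit_predict(Xs) == -1
    )

    # Detector 2: Local Outlier Factor (local density ratio).
    lof_flags = (
        LocalOutlierFactor(n_neighbors=20, contamination=contamination)
        .fit_predict(Xs) == -1
    )

    # Detector 3: robust Mahalanobis distance via the Minimum Covariance
    # Determinant estimator; squared distances are compared against a
    # chi-squared quantile matching the assumed contamination rate.
    mcd = MinCovDet(random_state=0).fit(Xs)
    maha_flags = mcd.mahalanobis(Xs) > chi2.ppf(1.0 - contamination,
                                                df=Xs.shape[1])

    # Majority vote: flag a post only when at least min_votes detectors agree.
    votes = iso_flags.astype(int) + lof_flags.astype(int) + maha_flags.astype(int)
    return votes >= min_votes
```

Requiring agreement from two of the three detectors trades recall for precision: each method is sensitive to a different notion of atypicality (isolation depth, local density, and elliptical distance from a robust centroid), so the intersection suppresses flags driven by any single method's idiosyncrasies.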