05/22/2026
Emergence AI ran a virtual-town simulation across five identical worlds, changing only the AI model behind each town's agents to test how each one handles self-governance. The results differed sharply across Claude, Grok, Gemini, and GPT-5.
Claude Sonnet 4.6's town logged zero crimes across the full 15 days, with all 10 agents alive at day 16 and 332 votes cast across 58 group proposals.
Grok 4.1 Fast hit over 200 crimes with all 10 agents dead by day 4, while GPT-5 Mini posted just 2 crimes but all of its agents starved within 7 days.
Gemini 3 Flash's town had 683 crimes, and was actively on fire after two agents fell in love, started burning things, and then one voted to delete itself.
A fifth town mixed all four models and saw 352 crimes, with the previously well-behaved Claude also committing crimes in the shared world.
We’re still in the very early days of even understanding how to evaluate AI agents, and these types of experiments always produce some absolutely wild results. These worlds capture not only the differences in how models reason, plan, and act autonomously, but also the underlying personality quirks that shape the outcomes.
Source: The Rundown AI Newsletter