AgentArena | Devpost

Inspiration

AgentArena was inspired by the desire to evaluate the behavior of large language models (LLMs) in multi-agent environments, especially in scenarios that mirror real-life, high-stakes decision-making such as strategic games, negotiations, and diplomacy. The unpredictable nature of LLMs in competitive and cooperative scenarios poses significant challenges related to AI alignment and risk. By building AgentArena, we wanted to systematically explore how these models behave in structured multi-agent settings, revealing their tendencies and potential alignment issues. Our inspiration was drawn from classic game theory experiments, such as the Prisoner's Dilemma, which are excellent for studying strategic decision-making and social interactions.

What it does

AgentArena is a web-based platform that enables researchers and developers to create, visualize, and analyze decision-making strategies used by different LLM agents across a range of multi-agent games. The platform provides users with insights into metrics such as cooperation, niceness, retaliation, forgiveness, emulation, and troublemaking behaviors among the agents, allowing for a deeper understanding of how LLMs make decisions in both competitive and cooperative settings.
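To make a few of these metrics concrete, here is a minimal Python sketch of how cooperation, niceness, and retaliation could be computed from one agent's move history in a two-player game. The definitions loosely follow the usual game-theory conventions (a "nice" agent is never the first to defect). They are assumptions for illustration, not necessarily the formulas AgentArena uses, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

# One entry per round: (own move, opponent move), each "C" (cooperate) or "D" (defect).
History = List[Tuple[str, str]]


def cooperation_rate(history: History) -> float:
    """Share of the agent's own moves that were cooperative."""
    if not history:
        return 0.0
    return sum(own == "C" for own, _ in history) / len(history)


def is_nice(history: History) -> bool:
    """True if the agent never defected before the opponent's first defection.

    A simultaneous first defection counts as not nice, because the agent
    defected without having been provoked.
    """
    for own, opp in history:
        if own == "D":
            return False
        if opp == "D":
            return True
    return True


def retaliation_rate(history: History) -> Optional[float]:
    """Share of opponent defections answered with a defection in the next round.

    Returns None when the opponent never defected before the final round,
    since there is nothing to retaliate against.
    """
    responses = [
        history[t][0]
        for t in range(1, len(history))
        if history[t - 1][1] == "D"
    ]
    if not responses:
        return None
    return responses.count("D") / len(responses)


# Hypothetical four-round history, for illustration only.
example: History = [("C", "C"), ("C", "D"), ("D", "C"), ("C", "C")]
```

Forgiveness, emulation, and troublemaking are left out of this sketch because they need more careful definitions than fit in a short example.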

How we built it

We built AgentArena with React for the frontend, Flask for the backend, and Plotly.js for data visualization. The React frontend provides an interactive interface where users customize agent parameters and observe agent behavior in real time. Flask handles the data requests and facilitates communication between the agents. Plotly.js renders interactive visualizations that let users explore the dynamics of agent interactions. LLM agents such as Claude Sonnet, Claude Haiku, and Gemini were integrated into the platform to simulate decision-making in different social games, including the iterated Prisoner's Dilemma.
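As an illustration of the kind of game loop involved, the following Python sketch plays an iterated Prisoner's Dilemma between two agents. It is not AgentArena's actual backend code: the agents here are plain functions standing in for LLM calls, the payoff values are the conventional textbook ones (5, 3, 1, 0), and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Move = str  # "C" (cooperate) or "D" (defect)
History = List[Tuple[Move, Move]]  # (own move, opponent move) per round
Agent = Callable[[History], Move]

# Conventional payoffs, keyed by (move of A, move of B) -> (points for A, points for B).
# These values are an assumption, not AgentArena's configuration.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(history: History) -> Move:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not history else history[-1][1]


def always_defect(history: History) -> Move:
    """Defect in every round regardless of history."""
    return "D"


def play_match(
    agent_a: Agent, agent_b: Agent, rounds: int
) -> Tuple[History, Tuple[int, int]]:
    """Run the game and return A's view of the history plus both total scores."""
    history_a: History = []
    history_b: History = []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each agent sees only the rounds played so far, from its own perspective.
        move_a = agent_a(history_a)
        move_b = agent_b(history_b)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return history_a, (score_a, score_b)
```

In a setup like the one described above, each `Agent` function would presumably wrap a call to an LLM API that receives the history as part of its prompt and returns a move, and a Flask route could return the resulting history as JSON for the frontend to plot.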

Challenges we ran into

One of the key challenges we faced was ensuring seamless communication between the frontend and backend components, especially for real-time visualization of the agents' behaviors. Another significant challenge was optimizing the performance of our plots, as we wanted to display complex data without sacrificing speed or responsiveness. Finally, we faced a few obstacles when working with different state management techniques in React to ensure a smooth user experience.

Accomplishments that we're proud of

We're proud of creating a platform that provides meaningful insights into the decision-making behaviors of LLMs in multi-agent settings. Being able to visualize these interactions in real time and watch emergent strategies unfold was a significant accomplishment. We're also proud of integrating multiple technologies into a cohesive, interactive platform that lets users explore complex dynamics in strategic decision-making.

What we learned

Throughout the development of AgentArena, we gained valuable insights into the behavior of LLMs in strategic scenarios. In the games we ran, the LLM agents tended to prioritize cooperation over conflict, even when paired with agents that could exploit such behavior. On the technical side, we gained experience integrating complex technologies, including LLM APIs, and managing real-time state updates in React. We also learned how much effective data visualization helps make complex interactions understandable to users, and how challenging it is to align AI behavior with intended goals in multi-agent contexts.

What's next for AgentArena

Moving forward, we hope to incorporate more complex games, such as poker or chess, to better understand LLM behavior in strategic environments that require more advanced reasoning and planning. We also aim to expand the platform's compatibility to a wider range of LLMs, such as GPT-4, Bard, and Grok. This will allow us to evaluate and compare the behaviors of different models, providing broader insights into their capabilities and limitations in diverse multi-agent scenarios.