Inspiration
- The original WebArena is diverse and powerful, but hard to integrate with
- It is a good end-to-end standard for evaluating agents
- We wanted to expand beyond web agents
- It gives us a solid baseline framework to build on
What it does
- Integrates AgentOps to better visualize agent evals
- Allows integration with other agents
- Adds various minor improvements (e.g. testing with GPT-4)
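
To illustrate what "integration with other agents" could look like, here is a minimal sketch rather than the project's actual code. It assumes a hypothetical `Agent` interface with a single `act()` method, a hypothetical `Task` record, and exact-match scoring; none of these names come from WebArena or AgentOps.

```python
from dataclasses import dataclass
from typing import Protocol


class Agent(Protocol):
    """Hypothetical interface: any object with an act() method can be evaluated."""

    def act(self, observation: str) -> str: ...


@dataclass
class Task:
    """Hypothetical task: an observation plus the action the harness expects."""

    task_id: str
    observation: str
    expected_action: str


class EchoAgent:
    """Toy agent used only to exercise the harness; it always clicks."""

    def act(self, observation: str) -> str:
        return f"click[{observation}]"


def run_eval(agent: Agent, tasks: list[Task]) -> dict[str, bool]:
    """Run each task once and record pass/fail by exact match on the action."""
    return {t.task_id: agent.act(t.observation) == t.expected_action for t in tasks}


tasks = [
    Task("t1", "login", "click[login]"),
    Task("t2", "search", "type[search]"),
]
results = run_eval(EchoAgent(), tasks)
```

Because `Agent` is a structural protocol, an agent from another framework would only need a thin wrapper exposing `act()`; it would not have to inherit from anything in the harness.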
How we built it
- Did a deep dive into the current WebArena architecture
- Refactored parts of it and integrated AgentOps
Challenges we ran into
- Very complex architecture
- Hard to integrate new agents
- Hard to create new test environments
- Hard to visualize all benchmark evals
Accomplishments that we're proud of
- Added AgentOps for better observability
- Broke the framework down so that new environments and tests can be added
- Improved testing to make it more robust
What's next for AutoArena
- Easy connection with any agent framework
- Add new web environments
- Automatically generate new test sets based on what fails
- Automatically run regression tests on every PR to an agent framework
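
The last two items could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing AutoArena feature: results are assumed to be a plain mapping from task ID to pass/fail, and the helper names `update_regression_set` and `find_regressions` are hypothetical.

```python
def update_regression_set(regression_set: set[str], results: dict[str, bool]) -> set[str]:
    """Add every task that failed in this run to the regression set."""
    failed = {task_id for task_id, passed in results.items() if not passed}
    return regression_set | failed


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return tasks that passed on the baseline but do not pass on the candidate.

    A task missing from the candidate run is treated as a failure (an assumption).
    """
    return sorted(
        task_id
        for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


baseline = {"t1": True, "t2": True, "t3": False}
candidate = {"t1": True, "t2": False, "t3": False}

regression_set = update_regression_set(set(), candidate)
regressions = find_regressions(baseline, candidate)
```

A CI job on each PR could run the benchmark, call `find_regressions` against the results from the main branch, and fail the check when the returned list is non-empty.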
Built With
- gpt
- python