Inspiration

Jared has a background in low-level software and hardware, and wasted compute power is offensive to him. There are lots of optimizations that can be made to compute systems, especially at the cloud level. However, Site Reliability Engineers are often unaware of them, wary of them, or too busy to implement them. By creating a system of AI agents, we can get more out of the world's computers.

What it does

Our agent, Derek, uses small AI agents to apply specific optimizations to your cloud resources while watching for unintended consequences, so it can claim responsibility and revert harmful changes. Right now we've implemented an optimization for cloud object storage that operates as follows:

  • A small custom AI watches your cloud object usage, automatically migrating objects to cheaper services when appropriate.
  • Derek watches a Glue thread for any mention of issues with the site's services.
  • When an issue is reported, Derek asks the smaller agent whether it might have changed something that service was using.
  • If a change was made, Derek reverts it.
  • Derek then reports back, explaining that a change that might be at the root of the issue was made and has been reverted.
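
The steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the project's actual code: the `Migration` record, the `StorageAgent` class, and `handle_issue_report` are hypothetical names, and the "revert" here only flags a log entry where a real system would move the object back.

```python
from dataclasses import dataclass, field


@dataclass
class Migration:
    """One object move made by the storage agent (hypothetical record shape)."""
    object_key: str
    service: str  # the service that reads this object
    from_tier: str
    to_tier: str
    reverted: bool = False


@dataclass
class StorageAgent:
    """Stand-in for the small storage agent; it only keeps a change log here."""
    log: list[Migration] = field(default_factory=list)

    def migrate(self, object_key: str, service: str, from_tier: str, to_tier: str) -> None:
        self.log.append(Migration(object_key, service, from_tier, to_tier))

    def changes_for(self, service: str) -> list[Migration]:
        # Answers Derek's question: did I touch anything this service uses?
        return [m for m in self.log if m.service == service and not m.reverted]

    def revert(self, migration: Migration) -> None:
        # A real revert would move the object back; this sketch only flags it.
        migration.reverted = True


def handle_issue_report(service: str, agent: StorageAgent) -> str:
    """Ask the agent about its changes, revert them, and draft a reply."""
    changes = agent.changes_for(service)
    if not changes:
        return f"No recent storage changes touched {service}."
    for change in changes:
        agent.revert(change)
    keys = ", ".join(c.object_key for c in changes)
    return (
        f"A storage change may be behind the {service} issue: {keys} had been "
        f"moved to a cheaper tier. The change has been reverted."
    )


agent = StorageAgent()
agent.migrate("reports/2023.csv", "billing", "standard", "archive")
print(handle_issue_report("billing", agent))
```

Keeping a log of every migration is what makes the accountability step possible: Derek can only own up to a change if the smaller agent can say what it changed and for which service.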

How we built it

There are three components: Derek, the smaller storage agent, and an API.

Derek

Built with LangChain and the OpenAI API in Python. Its job is to understand messages from Glue and turn them into commands and calls to the API.

Storage agent

A custom ML model created in tinygrad, all in Python, using test data generated during the competition and trained there as well. Its job is to make data-driven decisions about whether objects can be migrated to save money.
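
To make the kind of decision described here concrete, the following is a small stand-in written in plain Python rather than tinygrad. Everything in it is an assumption for illustration: the feature set in `ObjectStats`, the hand-picked `WEIGHTS` and `BIAS`, and the 0.5 threshold. The real agent learns its parameters from training data instead of having them written by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class ObjectStats:
    """Usage features for one stored object (hypothetical feature set)."""
    days_since_last_access: float
    reads_per_month: float
    size_gb: float


# Illustrative weights only; a trained model would learn these from data.
WEIGHTS = {"days_since_last_access": 0.05, "reads_per_month": -0.8, "size_gb": 0.01}
BIAS = -1.0


def migration_score(stats: ObjectStats) -> float:
    """Return a score between 0 and 1; higher means the object looks colder."""
    z = BIAS + sum(w * getattr(stats, name) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing of the weighted sum


def should_migrate(stats: ObjectStats, threshold: float = 0.5) -> bool:
    """Recommend a move to cheaper storage when the score clears the threshold."""
    return migration_score(stats) >= threshold


cold = ObjectStats(days_since_last_access=120, reads_per_month=0, size_gb=50)
hot = ObjectStats(days_since_last_access=1, reads_per_month=30, size_gb=50)
print(should_migrate(cold), should_migrate(hot))
```

An object that has not been read in months scores high and is flagged for migration, while a frequently read one scores low and stays where it is.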

API

This is a simple REST API, written in Python with FastAPI, that ties the other components together.

Challenges we ran into

Pearl: A lot of the code I had to write was relatively simple and painless. The Glue API, despite being so new, worked well, but it relies on GraphQL, which I had never used before, so it took me a second to figure out. Wrapping my head around the flow of the whole system was the tricky part: I had to write out where each piece would fit to make sure I didn't forget anything.

Jared: Even simple models are hard to train.

Accomplishments that we're proud of

Demonstrating how quickly we can develop simple AI agents and getting the Glue API to work were both significant.

What we learned

Though we started with the general idea of building Derek, ideas for abilities we could give it started flowing pretty fast, and the project changed from the original plan. Having an agent that is smart enough to make optimizations wasn't nearly as interesting as having one that could be accountable for the unintended consequences of its actions.

What's next for Derek

Jared is going to continue developing this concept into an SRE replacement.
