Aetòs: From Chaos to Engineering Excellence — A 3-Year Transformation
How we transformed engineering productivity by building an Internal Developer Platform (IDP) that now processes ~50M API calls per day, manages 14,000 VMs, and powers 80+ releases per year, and what you can learn from our journey.
Before I get into Aetòs, a quick clarification: this isn’t a post about picking the “right” language, framework, or toolchain. You’ll figure that out. The hard part isn’t the tech — it’s everything around it.
Most engineers have lived with broken workflows and manual toil for so long that the pain has become a fact of life. Changing that is fundamentally a cultural problem, not a technical one. Nobody is paying for your internal platform, and nobody is obligated to use it. Adoption is earned, not mandated.
You win only when they win. This blog is about my experience going through that journey with Aetòs.
Aetòs By The Numbers
Scale & Performance
- ~50M API calls per day — Orchestrating the engineering lifecycle
- 14,000+ VMs managed — With 3,000–4,000 VMs churned daily
- 80+ releases per year — ~2.5× improvement from pre‑Aetòs days
- 99.9% uptime — Platform reliability at scale
Business Impact
- $2.5M saved in cloud cost annually — Three years running, ~70% cloud cost reduction
- 10,000+ engineering hours saved per quarter — Time redirected from toil to innovation
- 90%+ reduction in triage time — AI‑assisted failure analysis
- ~50% faster release cycles — From commit to production
- “Run the business” work reduced from 76% → 36% — Most engineering time now goes into new initiatives instead of repetitive operational work
Engineering Efficiency
- 8,000+ kernel modules per release — Fully automated qualification
- 50+ platform configurations — Continuously certified and tested
- 500M+ database records — Deep historical analytics
- 90% reduction in manual intervention — Automation at every layer
The Breaking Point
April 2022. After more than a decade at VMware, I had just joined Pure Storage’s Portworx team. I was staring at my laptop, looking at the list of productivity challenges our Director of Platform Engineering, Rajan Yadav, had handed me.
Our DevOps team had spent the last few quarters manually coordinating test infrastructure, tracking down flaky failures, and trying to get consistent results across an increasingly complex matrix of Kubernetes distributions, Linux distros, and cloud providers.
The problem wasn’t our engineers; we had brilliant people working extremely hard. The problem was that we had outgrown our tools. Every release felt like pushing a boulder uphill.
That night I started sketching what would become Aetòs (Greek for “eagle”), a platform that wouldn’t just change how we shipped software, but how we thought about engineering efficiency itself.
The Plan That Almost Wasn’t
Portworx was operating like a lean startup, with most investment rightly focused on customer‑facing features. Dedicated headcount for an internal platform wasn’t realistic yet, so we agreed to bootstrap the first version alongside our existing work.
That constraint forced a choice: wait for perfect conditions, or start small and prove the value of the platform.
The 20% Solution
We had to start small, but in a way that could scale, and we had to get the foundation right.
A few of us in the DevOps team carved out about 20% of our time to build the first pieces of Aetòs and, within three weeks, we had our first microservice in place: Aetòs Private Cloud.
Not getting dedicated budget turned out to be a blessing. It forced us to:
- Build only what engineers actually needed
- Validate every feature through real usage
- Earn buy‑in by solving concrete problems, one after another
No big‑bang funded initiative. Just steady, compounding wins.
Three Years Later
Today, Aetòs is the single orchestration layer for all platform engineering operations at Portworx. Most importantly, it gave our engineers back something invaluable: time — time to innovate instead of coordinate.
Three years later, what started as a small DevOps group has intentionally evolved into a Platform Engineering team, with Aetòs as the primary product we build and operate for the rest of engineering.
Because Aetòs handles the heavy lifting, we now run a high‑performance Platform Engineering organization with roughly a 20:1 engineer‑to‑DevOps ratio. One platform engineer can effectively support around twenty product engineers without slowing them down — which would have been impossible without this IDP foundation.
And by design, Aetòs can support any product with minimal customization.
The Five-Pillar Solution: Building Aetòs
Aetòs now rests on five major pillars, each solving a critical part of the engineering efficiency puzzle.
Pillar 1: Private Cloud Management — Infrastructure at Scale
Aetòs today manages 14,000+ VMs with roughly 700 deployments per day, handling churn of 3,000–4,000 VMs per day. We operate both vSphere‑based and KubeVirt‑based private clouds, and we’re progressively rolling out a KubeVirt + Portworx model for new workloads.
The Impact
- Stable pipelines with a 95%+ infra success rate
- Cloud cost reduced by ~70%
- Roughly $2.5M/year in savings, three years in a row
What We Built
Aetòs Private Cloud is a software service that abstracts KubeVirt and vSphere behind a single, opinionated API and UI.
Key capabilities:
- Resource, quota, and lease management with built‑in guardrails
- Auto‑purge and reclamation for expired environments
- Notifications and extension workflows for engineers who need more time
- Consistent metadata about every VM and environment
With thousands of VMs churning daily, failures are guaranteed. So we built retry logic and graceful degradation into every component. The system expects things to fail and handles it without waking humans unnecessarily.
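To make that concrete, here’s a minimal sketch of the retry‑with‑graceful‑degradation pattern; the function names and parameters are illustrative, not Aetòs internals:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0, fallback=None):
    """Run fn, retrying transient failures with exponential backoff and jitter.

    If every attempt fails, return a fallback instead of raising, so the
    caller can degrade gracefully rather than page a human.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                return fallback  # degrade gracefully; surface via metrics, not pages
            # exponential backoff with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Example: a flaky operation that succeeds on the third try.
calls = {"n": 0}
def flaky_provision():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("hypervisor busy")
    return "vm-ready"

print(with_retries(flaky_provision, base_delay=0.01))  # → vm-ready
```

The important design choice is the fallback: an exhausted retry loop reports failure through the platform’s telemetry rather than crashing the workflow.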
For Your Organization
There’s a big difference between running an on‑prem datacenter and running a private cloud. Many enterprises spend millions on on‑prem hardware but still treat it as static inventory.
The takeaway: treat on‑prem infrastructure the same way you treat public cloud resources, with leases, quotas, tagging, and automation.
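As a sketch of what “leases, quotas, tagging” can mean in code (the `Lease` fields and `sweep` helper are hypothetical, not the Aetòs schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Lease:
    """A lease attached to every environment: who owns it and when it expires."""
    owner: str
    team: str
    expires_at: datetime

    def is_expired(self, now=None):
        now = now or datetime.now(timezone.utc)
        return now >= self.expires_at

def sweep(leases, now=None):
    """Return the leases whose environments should be reclaimed (auto-purge)."""
    return [l for l in leases if l.is_expired(now)]

now = datetime.now(timezone.utc)
leases = [
    Lease("alice", "storage", now - timedelta(hours=1)),   # expired
    Lease("bob", "platform", now + timedelta(days=2)),     # still active
]
print([l.owner for l in sweep(leases)])  # → ['alice']
```

An expired lease doesn’t have to mean immediate deletion; this is where the notification and extension workflows hook in before reclamation runs.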
Pillar 2: Pipelines & Governance — Toward Releasable Builds Daily
The heart of Aetòs is our pipeline orchestration engine. It gives engineers end‑to‑end visibility and the ability to debug from the browser, and it’s the reason we can ship 80+ releases every year.
The Problem We Solved
Before Aetòs:
- Pipelines ran in silos on different change sets, so “green” didn’t always mean the same code was actually tested.
- Test matrices were crafted by hand
- Engineers babysat long‑running jobs
- Results were scattered across multiple systems
A single release used to consume 20–40 hours of engineering time in coordination alone.
Today — Key Capabilities
- Intelligent test orchestration — The system analyzes build metadata and automatically triggers the right pipelines
- Multi‑level testing:
  - L1 — Smoke tests (≈2 hours)
  - L2 — Functional tests (≈8 hours)
  - L3 — System tests (24+ hours)
- Build recommendation engine — Promotion between levels only happens when quality gates are met
- Aetòs AI integration (powered by our internal engine, LongClaw):
  - 90%+ reduction in triage time
  - Automated log classification and clustering
  - Human‑readable failure summaries and suggested owners
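The promotion logic between test levels can be sketched roughly like this; the gate thresholds and data shapes are illustrative, not our actual configuration:

```python
# Hypothetical quality gates; the thresholds are illustrative.
GATES = {
    "L1": {"min_pass_rate": 1.00},   # smoke: must be fully green
    "L2": {"min_pass_rate": 0.98},   # functional
    "L3": {"min_pass_rate": 0.95},   # system
}
LEVELS = ["L1", "L2", "L3"]

def next_level(current, results):
    """Promote a build to the next test level only if its gate is met.

    results maps a level to (passed, total) counts for the build.
    Returns the next level to run, or None if the gate blocks promotion
    or the build has already cleared every level.
    """
    passed, total = results[current]
    if total == 0 or passed / total < GATES[current]["min_pass_rate"]:
        return None  # gate not met: hold the build at this level
    idx = LEVELS.index(current)
    return LEVELS[idx + 1] if idx + 1 < len(LEVELS) else None

print(next_level("L1", {"L1": (120, 120)}))  # → L2
print(next_level("L1", {"L1": (118, 120)}))  # → None
```

The point of encoding gates as data rather than tribal knowledge is that “releasable” becomes a computed property of a build, not an opinion.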
The Impact
- 80+ releases per year across multiple products
- ~90% reduction in manual intervention in pipelines
- ~50% faster build‑to‑release cycle time
For Your Organization
If you’re wondering where to start, start here. Release orchestration has the highest ROI because it touches everyone, every day. Build your pipeline engine first, and make it observable from day one. If engineers can’t see what’s happening and why, they won’t trust automation.
Pillar 3: EaaS (Environment as a Service) — One-Click Deployments
EaaS completely changed how engineers think about test environments. What used to take days now takes minutes.
We maintain a catalog of deployment recipes called Starting States. Each template encapsulates years of tribal knowledge: the exact configuration needed to support specific scenarios across AWS, GCP, Azure, IBM, Oracle, and on‑prem vSphere or bare‑metal environments.
Before EaaS
- An engineer filed a ticket with the infrastructure team
- Waited 1–3 days for access to a preconfigured environment (often not exactly what was needed)
- Manually configured Portworx and surrounding components
- Finally started testing
With EaaS
- Engineer selects a product and a Starting State
- Clicks Deploy
- Gets a ready‑to‑use environment, with lifecycle managed by leases and auto‑teardown
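A deploy request in this model might look something like the following sketch; the payload fields and Starting State name are hypothetical, not the EaaS API:

```python
import json

def build_deploy_request(product, starting_state, requester, ttl_hours=48):
    """Assemble a self-service deploy request with a lease baked in."""
    return {
        "product": product,
        "starting_state": starting_state,   # a recipe from the catalog
        "requester": requester,             # powers cost allocation later
        "lease": {"ttl_hours": ttl_hours},  # auto-teardown when it expires
    }

req = build_deploy_request("portworx", "px-on-eks-3node", "alice")
print(json.dumps(req, indent=2))
```

Notice that the lease and requester are part of the request itself: lifecycle and attribution are not afterthoughts bolted on later.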
The Impact
- 10,000+ engineering hours saved per quarter
- Testbeds available across vSphere, AWS, Azure, GCP, and bare metal
- Automated cleanup prevents resource sprawl
- “Instant” testbeds for common workflows (think: talking to a customer and spinning up an environment with one click to demo product features)
For Your Organization
Start with templates for your three most common scenarios. Don’t attempt to model every permutation at first. Let usage guide what you add next.
Pillar 4: Linux Distro Qualification — Automated Kernel Support at Scale
If you ship kernel modules, you know this pain: the universe of Linux distributions and kernels keeps expanding, and customers expect fast support.
For Portworx, this is mission‑critical.
The Challenge
For each release, we must support our FUSE module across:
- 8,000+ kernel modules
- Multiple distributions: RHEL, Oracle Linux, Rocky, Ubuntu, Photon OS, Amazon Linux, SUSE, and more
- Continuous kernel updates, sometimes weekly
Quality requirements are strict: every kernel must be tested and qualified before we claim support.
The Solution: Unattended Distro Pipelines
We built an Unattended Distro Qualification Engine that runs end‑to‑end without human intervention: it detects new kernels, triggers qualification workflows, and maintains a live database of the supportability matrix.
Intelligent Recommendation
- Analyzes pass rates and regression patterns
- Automatically recommends kernels for support when all checks pass
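The recommendation decision can be sketched as a simple gate; the record shape and thresholds below are illustrative, not the Aetòs schema:

```python
# Illustrative sketch of the "recommend for support" decision.
def recommend_kernel(record, min_pass_rate=1.0):
    """Recommend a kernel for support only when every check passes.

    record: {"kernel": str, "distro": str, "passed": int, "total": int,
             "regressions": int}
    """
    if record["total"] == 0 or record["regressions"] > 0:
        return False  # any regression blocks the recommendation outright
    return record["passed"] / record["total"] >= min_pass_rate

qualified = {"kernel": "5.14.0-432", "distro": "rhel9",
             "passed": 64, "total": 64, "regressions": 0}
print(recommend_kernel(qualified))  # → True
```

Keeping the decision pure and data-driven is what makes the pipeline safe to run unattended: a human reviews the recommendation feed, not every kernel.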
The Impact
- SLA improved from 7 days → 48 hours for kernel qualification
- A previously dedicated team’s manual work is now handled by Aetòs
- Real‑time visibility across all supported Linux distributions
For Your Organization
This was a fairly unique problem, and we couldn’t find an off‑the‑shelf solution. If you’re wrestling with something similar, feel free to reach out before reinventing this wheel.
Pillar 5: Unified Dashboard and Analytics — Data-Driven Decisions
You can’t improve what you can’t see. Our dashboards turned gut‑feel decisions into data‑driven ones.
The Visibility Revolution
Real‑time test results
For every execution, we can answer:
- What’s running right now?
- What passed or failed in the last build?
- Which tests are consistently flaky?
- What’s the historical pass rate of a given test?
Engineers find out far faster whether their change broke something, not hours or days later.
Historical trend analysis
With historical data, we discovered patterns we would have missed, such as:
- Some suites had ~80% pass rates because of infrastructure flakiness, not product issues. Fixing infra saved thousands of engineer‑hours previously wasted on chasing false failures.
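One way to separate flaky tests from real regressions is to look at how often outcomes flip between consecutive runs; here’s a minimal sketch (the thresholds and classification names are illustrative):

```python
def flip_rate(history):
    """Fraction of consecutive runs where the outcome flipped.

    A test that alternates pass/fail on the same code is likely flaky;
    a test that fails consistently points at a real regression.
    """
    if len(history) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return flips / (len(history) - 1)

def classify(history, flaky_threshold=0.3):
    rate = history.count("pass") / len(history)
    if rate == 1.0:
        return "healthy"
    if flip_rate(history) >= flaky_threshold:
        return "flaky"          # intermittent: suspect infra, quarantine it
    return "regression"         # consistent failure: suspect the product

print(classify(["pass", "fail", "pass", "fail", "pass"]))  # → flaky
print(classify(["pass", "pass", "fail", "fail", "fail"]))  # → regression
```

With per-test history in a database, this kind of classification is a query, not a debugging session.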
Metadata is King
We tag everything: who requested it, which product, which team, which project, plus kernel version, git commit, build number, and much more. This powers:
- Cost allocation
- Capacity planning
- Usage analytics
If you can’t measure it, you can’t optimize it.
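In practice this starts with enforcing a mandatory tag set at resource creation time, so analytics never have gaps; a minimal sketch, with hypothetical field names:

```python
# Illustrative mandatory tag set; field names are hypothetical.
REQUIRED_TAGS = {"requester", "product", "team", "project",
                 "kernel", "git_commit", "build"}

def validate_tags(tags):
    """Return the set of missing mandatory tags (empty set == valid)."""
    return REQUIRED_TAGS - tags.keys()

vm_tags = {"requester": "alice", "product": "portworx", "team": "platform",
           "project": "eaas", "kernel": "5.14.0-432",
           "git_commit": "abc123", "build": "9041"}
print(validate_tags(vm_tags))  # → set()
```

Rejecting untagged resources at the API boundary is far cheaper than back-filling metadata for cost reports later.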
The Impact
- Regression detection time: 24–48 hours → <2–6 hours
- 100% test traceability
- Decisions about releases and infrastructure changes are now 3× faster
For Your Organization
Don’t start with a giant dashboard project. Start with the one metric that matters most. For us, it was pipeline visibility with flakiness and error categorization. Add visualizations only when you repeatedly find yourself answering the same questions manually.
Lessons From a Three-Year Journey
- Platform engineering requires saying “no” more than “yes.” Every team wants customization. The platform must enforce standards, or complexity wins.
- Observability is not optional at scale. Our ability to debug infra failures, networking inconsistencies, and CI instability came directly from deep observability.
- Cost governance must be engineered, not mandated. Aetòs became the enforcement layer for cost‑optimized decisions; leaks show up in dashboards, not quarterly reviews.
- Build governance is the most underrated accelerant of release velocity. When engineers trust the test signal, everything else speeds up.
- Culture changes only when the new path is easier than the old one. Clean UX and reliable automation displaced tribal workflows — not policy documents.
Conclusion
Aetòs has fundamentally changed how we build, test, and release software at Portworx. What started as a bootstrapped side project has evolved into a comprehensive engineering efficiency platform that now touches every part of our development lifecycle.
Today, less than 35% of our time goes into “keeping the lights on,” down from 76% when we started. Every quarter, more of our engineering capacity becomes available for new initiatives instead of repetitive operational work.
But beyond metrics, Aetòs represents a cultural shift. It’s about:
- Empowering engineers with self‑service capabilities
- Using data, not gut feel, to make engineering decisions
- Automating the repetitive work so humans can focus on the creative work
You will hear “no” along the way; cultural change takes time. You don’t need a big team or a huge budget to start. What you do need is a clear vision, the discipline to solve one problem at a time, and the patience to let trust in the platform grow.
🚀 Ready to Build Your Own Platform? Start Here
You don’t need to recreate Aetòs to see value. Here’s a lightweight roadmap based on what worked for us.
Month 1 — Identify & Validate
Week 1–2: Find Your Biggest Pain Point
- [ ] Survey your engineers to understand their pain points
- [ ] Pick one problem to solve (pipelines, infra, or environments)
- [ ] Define success metrics (time saved, error reduction, adoption)
Week 3–4: Build the MVP Following a Design‑Thinking Approach
- [ ] Design an architecture that can scale, even if your first use case is small
- [ ] Automate the most repetitive, high‑impact task first
- [ ] Go API‑first; UI can come later
- [ ] Instrument everything from day one: logs, metrics, traces
Month 2 — Prove Value
Week 5–6: Get Real Users
- [ ] Find 3–5 early adopters; these are your partners
- [ ] Sit with them while they use it; don’t just collect feedback, co‑design with them
- [ ] Fix issues quickly to build credibility
Week 7–8: Build Trust
- [ ] Make results visible: dashboards, chat notifications, simple reports
- [ ] Provide safe override mechanisms (trust grows when people feel in control)
- [ ] Document the top workflows
- [ ] Share concrete wins with numbers attached
Month 3+ — Scale Gradually
Week 9–12: Expand Adoption
- [ ] Open to a slightly larger group
- [ ] Add the second most requested feature, based on actual usage
- [ ] Add a simple UI for the most common workflows
Quarter 2 and Beyond — Let Users Pull, Don’t Push
- [ ] Hold regular office hours or feedback sessions
- [ ] Build only what users actually request and are willing to adopt
- [ ] Measure and share impact metrics every month
Tech Stack:
- Portworx Enterprise with DR (Disaster Recovery)
- Pure Storage (FlashArray / FlashBlade) for backend storage
- Kubernetes (Red Hat OpenShift)
- Portworx + KubeVirt & vSphere for Private Cloud
- Python 3.12+
- FastAPI
- MongoDB
- Redis
- Elasticsearch, Logstash & Kibana (Elastic / ELK stack)
- Prometheus
- Grafana
- Jenkins
- Slack
- ChatGPT (used for Aetòs AI–driven triage and summarization)
- Open Policy Agent
- And a lot of coffee
Learn More
- Watch the Aetòs Demo (an older demo, but it gives a glimpse of what we have built)
- Portworx Website
- Portworx Disaster Recovery
- Portworx Documentation
- PureStorage Website
