Aetòs: From Chaos to Engineering Excellence — A 3-Year Transformation
How we transformed engineering productivity by building an Internal Developer Platform (IDP) that now processes ~50M API calls per day, manages 14,000 VMs, and powers 80+ releases per year, and what you can learn from our journey.
Before I get into Aetòs, a quick clarification: this isn’t a post about picking the “right” language, framework, or toolchain. You’ll figure that out. The hard part isn’t the tech — it’s everything around it.
Most engineers have lived with broken workflows and manual toil for so long that the pain has become a fact of life. Changing that is fundamentally a cultural problem, not a technical one. Nobody is paying for your internal platform, and nobody is obligated to use it. Adoption is earned, not mandated.
You win only when they win. This blog is about my experience going through that journey with Aetòs.
Aetòs By The Numbers
Scale & Performance
- ~50M API calls per day — Orchestrating the engineering lifecycle
- 14,000+ VMs managed — With 3,000–4,000 VMs churned daily
- 80+ releases per year — ~2.5× improvement from pre‑Aetòs days
- 99.9% uptime — Platform reliability at scale
Business Impact
- $2.5M saved in cloud cost annually — Three years running, ~70% cloud cost reduction
- 10,000+ engineering hours saved per quarter — Time redirected from toil to innovation
- 90%+ reduction in triage time — AI‑assisted failure analysis
- ~50% faster release cycles — From commit to production
- “Run the business” work reduced from 76% → 36% — Most engineering time now goes into new initiatives instead of repetitive operational work
Engineering Efficiency
- 8,000+ kernel modules per release — Fully automated qualification
- 50+ platform configurations — Continuously certified and tested
- 500M+ database records — Deep historical analytics
- 90% reduction in manual intervention — Automation at every layer
The Breaking Point
April 2022. After more than a decade at VMware, I had just joined Pure Storage’s Portworx team. I was staring at my laptop, looking at the list of productivity challenges our Director of Platform Engineering, Rajan Yadav, had handed me.
Our DevOps team had spent the last few quarters manually coordinating test infrastructure, tracking down flaky failures, and trying to get consistent results across an increasingly complex matrix of Kubernetes distributions, Linux distros, and cloud providers.
The problem wasn’t our engineers; we had brilliant people working extremely hard. The problem was that we had outgrown our tools. Every release felt like pushing a boulder uphill.
That night I started sketching what would become Aetòs (Greek for “eagle”), a platform that wouldn’t just change how we shipped software, but how we thought about engineering efficiency itself.
The Plan That Almost Wasn’t
Portworx was operating like a lean startup, with most investment rightly focused on customer‑facing features. Dedicated headcount for an internal platform wasn’t realistic yet, so we agreed to bootstrap the first version alongside our existing work.
That constraint forced a choice: wait for perfect conditions, or start small and prove the value of the platform.
The 20% Solution
We had to start small, but in a way that could scale, and we had to get the foundation right.
A few of us in the DevOps team carved out about 20% of our time to build the first pieces of Aetòs and, within three weeks, we had our first microservice in place: Aetòs Private Cloud.
Not getting dedicated budget turned out to be a blessing. It forced us to:
- Build only what engineers actually needed
- Validate every feature through real usage
- Earn buy‑in by solving concrete problems, one after another
No big‑bang funded initiative. Just steady, compounding wins.
Three Years Later
Today, Aetòs is the single orchestration layer for all platform engineering operations at Portworx. Most importantly, it gave our engineers back something invaluable: time — time to innovate instead of coordinate.
Three years later, what started as a small DevOps group has intentionally evolved into a Platform Engineering team, with Aetòs as the primary product we build and operate for the rest of engineering.
Because Aetòs handles the heavy lifting, we now run a high‑performance Platform Engineering organization with roughly a 20:1 engineer‑to‑DevOps ratio. One platform engineer can effectively support around twenty product engineers without slowing them down — which would have been impossible without this IDP foundation.
And by design, Aetòs can support any product with minimal customization.
The Five-Pillar Solution: Building Aetòs
Aetòs now rests on five major pillars, each solving a critical part of the engineering efficiency puzzle.
Pillar 1: Private Cloud Management — Infrastructure at Scale
Aetòs today manages 14,000+ VMs with roughly 700 deployments per day, handling churn of 3,000–4,000 VMs per day. We operate both vSphere‑based and KubeVirt‑based private clouds, and we’re progressively rolling out a KubeVirt + Portworx model for new workloads.
The Impact
- Stable pipelines with a 95%+ infra success rate
- Cloud cost reduced by ~70%
- Roughly $2.5M/year in savings, three years in a row
What We Built
Aetòs Private Cloud is a software service that abstracts KubeVirt and vSphere behind a single, opinionated API and UI.
Key capabilities:
- Resource, quota, and lease management with built‑in guardrails
- Auto‑purge and reclamation for expired environments
- Notifications and extension workflows for engineers who need more time
- Consistent metadata about every VM and environment
With thousands of VMs churning daily, failures are guaranteed. So we built retry logic and graceful degradation into every component. The system expects things to fail and handles it without waking humans unnecessarily.
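To make that concrete, here’s a minimal sketch of the retry‑with‑graceful‑degradation pattern; the function names and parameters are illustrative, not Aetòs internals:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0, fallback=None):
    """Run fn, retrying transient failures with exponential backoff and jitter.

    If every attempt fails, return a fallback instead of raising, so the
    caller can degrade gracefully rather than page a human.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                return fallback  # degrade gracefully; surface via metrics, not pages
            # exponential backoff with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Example: a flaky operation that succeeds on the third try.
calls = {"n": 0}
def flaky_provision():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("hypervisor busy")
    return "vm-ready"

print(with_retries(flaky_provision, base_delay=0.01))  # → vm-ready
```

The important design choice is the fallback: an exhausted retry loop reports failure through the platform’s telemetry rather than crashing the workflow.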
For Your Organization
There’s a big difference between running an on‑prem datacenter and running a private cloud. Many enterprises spend millions on on‑prem hardware but still treat it as static inventory.
The takeaway: treat on‑prem infrastructure the same way you treat public cloud resources, with leases, quotas, tagging, and automation.
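As a sketch of what “leases, quotas, tagging” can mean in code (the `Lease` fields and `sweep` helper are hypothetical, not the Aetòs schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Lease:
    """A lease attached to every environment: who owns it and when it expires."""
    owner: str
    team: str
    expires_at: datetime

    def is_expired(self, now=None):
        now = now or datetime.now(timezone.utc)
        return now >= self.expires_at

def sweep(leases, now=None):
    """Return the leases whose environments should be reclaimed (auto-purge)."""
    return [l for l in leases if l.is_expired(now)]

now = datetime.now(timezone.utc)
leases = [
    Lease("alice", "storage", now - timedelta(hours=1)),   # expired
    Lease("bob", "platform", now + timedelta(days=2)),     # still active
]
print([l.owner for l in sweep(leases)])  # → ['alice']
```

An expired lease doesn’t have to mean immediate deletion; this is where the notification and extension workflows hook in before reclamation runs.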
Pillar 2: Pipelines & Governance — Toward Releasable Builds Daily
The heart of Aetòs is our pipeline orchestration engine. It gives engineers end‑to‑end visibility and the ability to debug from the browser, and it’s the reason we can ship 80+ releases every year.
The Problem We Solved
Before Aetòs:
- Pipelines ran in silos on different change sets, so “green” didn’t always mean the same code was actually tested.
- Test matrices were crafted by hand
- Engineers babysat long‑running jobs
- Results were scattered across multiple systems
A single release used to consume 20–40 hours of engineering time in coordination alone.
Today — Key Capabilities
- Intelligent test orchestration — The system analyzes build metadata and automatically triggers the right pipelines
- Multi‑level testing:
  - L1 — Smoke tests (≈2 hours)
  - L2 — Functional tests (≈8 hours)
  - L3 — System tests (24+ hours)
- Build recommendation engine — Promotion between levels only happens when quality gates are met
- Aetòs AI integration (powered by our internal engine, LongClaw):
  - 90%+ reduction in triage time
  - Automated log classification and clustering
  - Human‑readable failure summaries and suggested owners
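The promotion logic between test levels can be sketched roughly like this; the gate thresholds and data shapes are illustrative, not our actual configuration:

```python
# Hypothetical quality gates; the thresholds are illustrative.
GATES = {
    "L1": {"min_pass_rate": 1.00},   # smoke: must be fully green
    "L2": {"min_pass_rate": 0.98},   # functional
    "L3": {"min_pass_rate": 0.95},   # system
}
LEVELS = ["L1", "L2", "L3"]

def next_level(current, results):
    """Promote a build to the next test level only if its gate is met.

    results maps a level to (passed, total) counts for the build.
    Returns the next level to run, or None if the gate blocks promotion
    or the build has already cleared every level.
    """
    passed, total = results[current]
    if total == 0 or passed / total < GATES[current]["min_pass_rate"]:
        return None  # gate not met: hold the build at this level
    idx = LEVELS.index(current)
    return LEVELS[idx + 1] if idx + 1 < len(LEVELS) else None

print(next_level("L1", {"L1": (120, 120)}))  # → L2
print(next_level("L1", {"L1": (118, 120)}))  # → None
```

The point of encoding gates as data rather than tribal knowledge is that “releasable” becomes a computed property of a build, not an opinion.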
The Impact
- 80+ releases per year across multiple products
- ~90% reduction in manual intervention in pipelines
- ~50% faster build‑to‑release cycle time
For Your Organization
If you’re wondering where to start, start here. Release orchestration has the highest ROI because it touches everyone, every day. Build your pipeline engine first, and make it observable from day one. If engineers can’t see what’s happening and why, they won’t trust automation.
Pillar 3: EaaS (Environment as a Service) — One-Click Deployments
EaaS completely changed how engineers think about test environments. What used to take days now takes minutes.
We maintain a catalog of deployment recipes called Starting States. Each template encapsulates years of tribal knowledge: the exact configuration needed to support specific scenarios across AWS, GCP, Azure, IBM, Oracle, and on‑prem vSphere or bare‑metal environments.
Before EaaS
- An engineer filed a ticket with the infrastructure team
- Waited 1–3 days for access to a preconfigured environment (often not exactly what was needed)
- Manually configured Portworx and surrounding components
- Finally started testing
With EaaS
- Engineer selects a product and a Starting State
- Clicks Deploy
- Gets a ready‑to‑use environment, with lifecycle managed by leases and auto‑teardown
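A deploy request in this model might look something like the following sketch; the payload fields and Starting State name are hypothetical, not the EaaS API:

```python
import json

def build_deploy_request(product, starting_state, requester, ttl_hours=48):
    """Assemble a self-service deploy request with a lease baked in."""
    return {
        "product": product,
        "starting_state": starting_state,   # a recipe from the catalog
        "requester": requester,             # powers cost allocation later
        "lease": {"ttl_hours": ttl_hours},  # auto-teardown when it expires
    }

req = build_deploy_request("portworx", "px-on-eks-3node", "alice")
print(json.dumps(req, indent=2))
```

Notice that the lease and requester are part of the request itself: lifecycle and attribution are not afterthoughts bolted on later.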
The Impact
- 10,000+ engineering hours saved per quarter
- Testbeds available across vSphere, AWS, Azure, GCP, and bare metal
- Automated cleanup prevents resource sprawl
- “Instant” testbeds for common workflows (think: talking to a customer and spinning up an environment with one click to demo product features)
For Your Organization
Start with templates for your three most common scenarios. Don’t attempt to model every permutation at first. Let usage guide what you add next.
Pillar 4: Linux Distro Qualification — Automated Kernel Support at Scale
If you ship kernel modules, you know this pain: the universe of Linux distributions and kernels keeps expanding, and customers expect fast support.
For Portworx, this is mission‑critical.
The Challenge
For each release, we must support our FUSE module across:
- 8,000+ kernel modules
- Multiple distributions: RHEL, Oracle Linux, Rocky, Ubuntu, Photon OS, Amazon Linux, SUSE, and more
- Continuous kernel updates, sometimes weekly
Quality requirements are strict: every kernel must be tested and qualified before we claim support.
The Solution: Unattended Distro Pipelines
We built an Unattended Distro Qualification Engine that runs end‑to‑end without human intervention: it detects new kernels, triggers qualification workflows, and maintains a live database of the supportability matrix.
Intelligent Recommendation
- Analyzes pass rates and regression patterns
- Automatically recommends kernels for support when all checks pass
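The recommendation decision can be sketched as a simple gate; the record shape and thresholds below are illustrative, not the Aetòs schema:

```python
# Illustrative sketch of the "recommend for support" decision.
def recommend_kernel(record, min_pass_rate=1.0):
    """Recommend a kernel for support only when every check passes.

    record: {"kernel": str, "distro": str, "passed": int, "total": int,
             "regressions": int}
    """
    if record["total"] == 0 or record["regressions"] > 0:
        return False  # any regression blocks the recommendation outright
    return record["passed"] / record["total"] >= min_pass_rate

qualified = {"kernel": "5.14.0-432", "distro": "rhel9",
             "passed": 64, "total": 64, "regressions": 0}
print(recommend_kernel(qualified))  # → True
```

Keeping the decision pure and data-driven is what makes the pipeline safe to run unattended: a human reviews the recommendation feed, not every kernel.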
The Impact
- SLA improved from 7 days → 48 hours for kernel qualification
- A previously dedicated team’s manual work is now handled by Aetòs
- Real‑time visibility across all supported Linux distributions
For Your Organization
This was a fairly unique problem, and we couldn’t find an off‑the‑shelf solution. If you’re wrestling with something similar, feel free to reach out before reinventing this wheel.
Pillar 5: Unified Dashboard and Analytics — Data-Driven Decisions
You can’t improve what you can’t see. Our dashboards turned gut‑feel decisions into data‑driven ones.
The Visibility Revolution
Real‑time test results
For every execution, we can answer:
- What’s running right now?
- What passed or failed in the last build?
- Which tests are consistently flaky?
- What’s the historical pass rate of a given test?
Engineers find out far faster whether their change broke something, not hours or days later.
Historical trend analysis
With historical data, we discovered patterns we would have missed, such as:
- Some suites had ~80% pass rates because of infrastructure flakiness, not product issues. Fixing infra saved thousands of engineer‑hours previously wasted on chasing false failures.
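One way to separate flaky tests from real regressions is to look at how often outcomes flip between consecutive runs; here’s a minimal sketch (the thresholds and classification names are illustrative):

```python
def flip_rate(history):
    """Fraction of consecutive runs where the outcome flipped.

    A test that alternates pass/fail on the same code is likely flaky;
    a test that fails consistently points at a real regression.
    """
    if len(history) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return flips / (len(history) - 1)

def classify(history, flaky_threshold=0.3):
    rate = history.count("pass") / len(history)
    if rate == 1.0:
        return "healthy"
    if flip_rate(history) >= flaky_threshold:
        return "flaky"          # intermittent: suspect infra, quarantine it
    return "regression"         # consistent failure: suspect the product

print(classify(["pass", "fail", "pass", "fail", "pass"]))  # → flaky
print(classify(["pass", "pass", "fail", "fail", "fail"]))  # → regression
```

With per-test history in a database, this kind of classification is a query, not a debugging session.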
Metadata is King
We tag everything: who requested it, which product, which team, which project, plus kernel version, git commit, build number, and much more. This powers:
- Cost allocation
- Capacity planning
- Usage analytics
If you can’t measure it, you can’t optimize it.
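In practice this starts with enforcing a mandatory tag set at resource creation time, so analytics never have gaps; a minimal sketch, with hypothetical field names:

```python
# Illustrative mandatory tag set; field names are hypothetical.
REQUIRED_TAGS = {"requester", "product", "team", "project",
                 "kernel", "git_commit", "build"}

def validate_tags(tags):
    """Return the set of missing mandatory tags (empty set == valid)."""
    return REQUIRED_TAGS - tags.keys()

vm_tags = {"requester": "alice", "product": "portworx", "team": "platform",
           "project": "eaas", "kernel": "5.14.0-432",
           "git_commit": "abc123", "build": "9041"}
print(validate_tags(vm_tags))  # → set()
```

Rejecting untagged resources at the API boundary is far cheaper than back-filling metadata for cost reports later.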
The Impact
- Regression detection time: 24–48 hours → <2–6 hours
- 100% test traceability
- Decisions about releases and infrastructure changes are now 3× faster
For Your Organization
Don’t start with a giant dashboard project. Start with the one metric that matters most. For us, it was pipeline visibility with flakiness and error categorization. Add visualizations only when you repeatedly find yourself answering the same questions manually.
Lessons From a Three-Year Journey
- Platform engineering requires saying “no” more than “yes.” Every team wants customization. The platform must enforce standards, or complexity wins.
- Observability is not optional at scale. Our ability to debug infra failures, networking inconsistencies, and CI instability came directly from deep observability.
- Cost governance must be engineered, not mandated. Aetòs became the enforcement layer for cost‑optimized decisions; leaks show up in dashboards, not quarterly reviews.
- Build governance is the most underrated accelerant of release velocity. When engineers trust the test signal, everything else speeds up.
- Culture changes only when the new path is easier than the old one. Clean UX and reliable automation displaced tribal workflows — not policy documents.
Conclusion
Aetòs has fundamentally changed how we build, test, and release software at Portworx. What started as a bootstrapped side project has evolved into a comprehensive engineering efficiency platform that now touches every part of our development lifecycle.
Today, less than 35% of our time goes into “keeping the lights on,” down from 76% when we started. Every quarter, more of our engineering capacity becomes available for new initiatives instead of repetitive operational work.
But beyond metrics, Aetòs represents a cultural shift. It’s about:
- Empowering engineers with self‑service capabilities
- Using data, not gut feel, to make engineering decisions
- Automating the repetitive work so humans can focus on the creative work
You will hear “no” along the way; cultural change takes time. You don’t need a big team or a huge budget to start. What you do need is a clear vision, the discipline to solve one problem at a time, and the patience to let trust in the platform grow.
🚀 Ready to Build Your Own Platform? Start Here
You don’t need to recreate Aetòs to see value. Here’s a lightweight roadmap based on what worked for us.
Month 1 — Identify & Validate
Week 1–2: Find Your Biggest Pain Point
- [ ] Survey your engineers to understand their pain points
- [ ] Pick one problem to solve (pipelines, infra, or environments)
- [ ] Define success metrics (time saved, error reduction, adoption)
Week 3–4: Build the MVP Following a Design‑Thinking Approach
- [ ] Design an architecture that can scale, even if your first use case is small
- [ ] Automate the most repetitive, high‑impact task first
- [ ] Go API‑first; UI can come later
- [ ] Instrument everything from day one: logs, metrics, traces
Month 2 — Prove Value
Week 5–6: Get Real Users
- [ ] Find 3–5 early adopters; these are your partners
- [ ] Sit with them while they use it; don’t just collect feedback, co‑design with them
- [ ] Fix issues quickly to build credibility
Week 7–8: Build Trust
- [ ] Make results visible: dashboards, chat notifications, simple reports
- [ ] Provide safe override mechanisms (trust grows when people feel in control)
- [ ] Document the top workflows
- [ ] Share concrete wins with numbers attached
Month 3+ — Scale Gradually
Week 9–12: Expand Adoption
- [ ] Open to a slightly larger group
- [ ] Add the second most requested feature, based on actual usage
- [ ] Add a simple UI for the most common workflows
Quarter 2 and Beyond — Let Users Pull, Don’t Push
- [ ] Hold regular office hours or feedback sessions
- [ ] Build only what users actually request and are willing to adopt
- [ ] Measure and share impact metrics every month
Tech Stack:
- Portworx Enterprise with DR (Disaster Recovery)
- Pure Storage (FlashArray / FlashBlade) for backend storage
- Kubernetes (Red Hat OpenShift)
- Portworx + KubeVirt & vSphere for Private Cloud
- Python 3.12+
- FastAPI
- MongoDB
- Redis
- Elasticsearch, Logstash & Kibana (Elastic / ELK stack)
- Prometheus
- Grafana
- Jenkins
- Slack
- ChatGPT (used for Aetòs AI–driven triage and summarization)
- Open Policy Agent
- And a lot of coffee
Learn More
- Watch the Aetòs Demo (an older demo, but it gives a glimpse of what we have built)
- Portworx Website
- Portworx Disaster Recovery
- Portworx Documentation
- PureStorage Website
