Site Reliability Engineer (SRE) with over 4 years of experience in managing critical infrastructure, optimizing system performance, and ensuring high availability across complex, global environments.
Recognized for designing and implementing robust, scalable, and secure cloud solutions that boost uptime and resilience. Demonstrated expertise in incident response, technical troubleshooting, and root cause analysis, minimizing downtime through proactive monitoring and automation.
As AI and ML become core production components, SRE is evolving from managing deterministic systems to ensuring the reliability of dynamic, learning systems. New metrics, workflows, guardrails and cross-disciplinary practices are redefining ...
Discover how context-aware reliability contracts (CARC) redefine SLOs for hyperscale systems—optimizing uptime, reducing infrastructure spend by 33%, and aligning reliability with business value across user tiers, regions, and workloads ...
Discover how redefining service level objectives (SLOs) around business impact — not vanity uptime metrics — reduced incidents by 75% and saved $2.3M in lost revenue ...
An AWS us-east-1 outage exposed how automation can backfire. Learn why autoscaling failed, how pinning ASGs saved uptime, and what to do in future outages ...