engineering-udaan - Medium

How udaan cut SQL database costs by 50% without sacrificing performance

Rajesh Kumar — Thu, 30 Apr 2026 07:37:35 GMT

Overview

At udaan, India’s largest B2B e-commerce platform, SQL databases power everything critical: payments, shipments, fulfillment, warehouse management and core business operations.

What made this harder than a generic SaaS setup was the shape of our eB2B workload: seasonal procurement spikes, multi-tenant retailer and supplier data growth, and credit-driven transaction flows. We were also not tuning one kind of system. Our fleet included latency-sensitive transactional paths, analytical and reporting-heavy stores, and batch-oriented workloads with very different performance and cost profiles.

As we scaled, so did our database bill. Like most fast-moving engineering teams, we made conservative provisioning decisions:

we picked “safe” SQL server tiers.
we over-allocated vCores.
we enabled replicas because it appeared to be the responsible thing to do.
we carried performance headroom we didn’t actually need.

Eventually, while auditing our database fleet, we realized the uncomfortable truth: we were paying premium prices for capacity and replicas we weren’t using. After audits, modeling and staged migrations, we clawed back that waste through a structured optimization program:

Reduced database spend by ~50%
Improved operational agility
Removed several scaling bottlenecks
Increased resilience in real incidents

This post breaks down what we learned and a tier-selection framework you can reuse.

The Problem: Over-Provisioning is the Default

Database provisioning decisions are rarely revisited. Once a database is created in a premium tier with a replica setup, it tends to stay there indefinitely, even if:

workload characteristics change
storage grows unexpectedly
replicas are never actually used
scaling becomes painful

In our case, we discovered multiple patterns:

Several Business Critical databases were paying for premium SSD performance, yet were still approaching the 4 TB storage ceiling, creating a scaling and cost risk.
Multiple General Purpose databases were running replicas that were never actually used for read-only traffic or recovery, effectively doubling cost.
In both General Purpose and Business Critical, storage growth often forced us to scale up vCores, even when CPU wasn’t the bottleneck, leading to structural over-provisioning.
Many databases were simply oversized: vCores allocated far above real CPU and I/O demand, with utilization staying low for long periods.

A Simple Mental Model of Database Tiers

Before we got to decisions, we aligned on what these three tiers really mean operationally.

General Purpose (GP): cheapest, but remote storage

General Purpose decouples compute and storage: the database engine runs on compute nodes while the data lives on remote storage. This gives good flexibility, but it introduces storage I/O latency (typically ~5–10 ms).

Key characteristics

Compute: 2 to 128 vCores
Storage: 1 GB to 4 TB (paid on allocated size)
Replicas: no built-in read replica
IOPS: ~320 IOPS/vCore (max 16,000 IOPS)
HA: backup replica (not a read replica)

The big gotcha: If you add a replica for availability, you’re effectively paying for two databases, yet still not getting Business Critical performance.

Business Critical (BC): fastest I/O, but scaling is painful

Business Critical uses local SSDs: compute and storage run on the same machine. I/O latency is typically ~1–2 ms thanks to this tightly coupled, SSD-backed architecture. It also includes a free read replica via an Always On availability group.

Key characteristics

Compute: 2 to 128 vCores
Storage: 1 GB to 4 TB (paid on max assigned)
IOPS: ~4,000 IOPS/vCore (max 327,680 IOPS)
Replicas: 1 read replica included
HA: multiple replicas managed by the platform
Scaling: mostly vertical

The biggest limitation:

Scaling (especially storage) can be operationally heavy. If you need to move to a larger configuration, you may need hours-long migration windows.

The 4 TB cap becomes a real bottleneck at scale.

General Purpose and Business Critical force a fixed vCore-to-storage ratio — as storage grows, compute must grow even when CPU demand stays flat, which leads to structural over-provisioning.
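To make the per-vCore figures above concrete, here is a small sketch that computes the approximate IOPS ceiling for a given tier and vCore count. It only encodes the numbers quoted in the two lists above; the names `TIER_IOPS` and `iopsCeiling` are ours for illustration, and real limits should be checked against current platform documentation.

```javascript
// Per-tier figures as quoted in the characteristic lists above (approximate).
const TIER_IOPS = {
  GP: { perVCore: 320, max: 16000 },
  BC: { perVCore: 4000, max: 327680 },
};

// Approximate IOPS ceiling: per-vCore allowance, capped at the tier maximum.
function iopsCeiling(tier, vCores) {
  const limits = TIER_IOPS[tier];
  if (!limits) throw new Error(`unknown tier: ${tier}`);
  return Math.min(limits.perVCore * vCores, limits.max);
}

console.log(iopsCeiling("GP", 8));  // 2560
console.log(iopsCeiling("GP", 64)); // 16000 (hits the tier cap)
console.log(iopsCeiling("BC", 8));  // 32000
```

At 8 vCores the gap between the two tiers is more than 10x, which is why I/O-bound General Purpose databases tend to get scaled up on compute they do not otherwise need.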

Hyperscale (HS): the tier that changed everything for us

Hyperscale decouples compute and storage like General Purpose, but adds a multi-tier caching architecture with local SSD caches on compute nodes. This gives:

General Purpose-like elasticity (compute and storage scale independently)
Business Critical-like performance for hot data (2-level SSD cache)
Drastically higher storage limits

Key characteristics

Compute: 2 to 128 vCores
Storage: 10 GB to 100 TB (pay for what you use)
Replicas: configurable high-availability replicas
Scaling: fast scale up/down (minutes, not hours)
Supports multiple read scale-out replicas (named replicas)
Unlike GP and BC, it can serve the full 100 TB with as few as 2 vCores.

In practice, Hyperscale removed the three biggest pain points we had with Business Critical: the 4 TB ceiling, slow scaling during incidents, and compute being tied to storage.

Our Tier-selection Decision Framework

After auditing our database fleet, we arrived at a simple framework that held up across most workloads.

When General Purpose makes sense

General Purpose is the right choice when:

workload is non-critical
workload is mostly batch, back-office, or low-QPS internal traffic
you don’t need read replicas
you can tolerate 5–10 ms storage latency
cost is the dominant constraint
your data size is small

Rule of thumb: If you need a replica, General Purpose stops being “cheap”. The moment you add replicas to GP, you’re often paying 2× for performance that still isn’t great.

When Business Critical makes sense

Business Critical is the right choice when:

you have strict low-latency requirements (1–2 ms)
workload is primarily transactional and user-facing
you need a built-in read replica
your database is safely under the 4 TB cap
you want the simplest high-performance operational model

Note: It’s great until you hit scaling constraints.

When to skip Business Critical and go straight to Hyperscale

This was our biggest insight. We initially assumed Business Critical was the “correct” tier for production workloads that required performance and availability. In 80%+ of production workloads we evaluated, Hyperscale provided equal or better latency while materially reducing cost.

Hyperscale gives:

Business Critical-like latency for hot data (two-tier SSD cache)
Far better scaling characteristics
Dramatically higher storage headroom up to 100 TB
Fast vertical scaling during incidents: scale-up takes at most 5 minutes
Flexible HA replica configuration

It worked especially well for mixed workloads where transactional traffic, reporting reads, and growth in multi-tenant datasets were all pulling in different directions.

And most importantly: it was consistently cheaper for us.
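The framework in this section can be sketched as a small decision function. This is only an illustration of the rules listed above, not our internal tooling: the workload fields, the `SMALL_DATA_TB` threshold, and the `SAFE_FRACTION` margin under the 4 TB cap are all assumptions chosen for the example.

```javascript
// Assumed thresholds, for illustration only.
const BC_STORAGE_CAP_TB = 4;   // Business Critical storage ceiling from the text
const SAFE_FRACTION = 0.5;     // "safely under the cap" interpreted as under half of it
const SMALL_DATA_TB = 1;       // what counts as "small" data is an assumption

// `w` is a hypothetical workload descriptor.
function suggestTier(w) {
  const fitsGP =
    !w.critical &&
    !w.needsReplica &&
    w.latencyBudgetMs >= 5 &&
    w.storageTB <= SMALL_DATA_TB;
  if (fitsGP) return "GP";

  const fitsBC =
    w.latencyBudgetMs <= 2 &&
    w.userFacingTransactional &&
    w.projectedStorageTB < BC_STORAGE_CAP_TB * SAFE_FRACTION;
  if (fitsBC) return "BC";

  // Everything else, including GP workloads that need a replica.
  return "HS";
}

console.log(suggestTier({
  critical: false, needsReplica: true, latencyBudgetMs: 10,
  userFacingTransactional: false, storageTB: 0.2, projectedStorageTB: 0.3,
})); // "HS"
```

Even when the function returns "BC", our experience suggests benchmarking the same workload on Hyperscale before committing.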

The Replica Trap (and why it was costing us lakhs)

We discovered several General Purpose production databases running with replicas. This doubled our billing because a replica is essentially a second, independently billed database. The painful part was that even after paying 2X, we still had General Purpose limitations:

remote storage latency (higher I/O time)
no built-in read scale-out
higher vCore requirements to compensate for I/O bottlenecks

As these databases grew, we often ended up scaling vCores just to stay stable, which made the “cheap” tier surprisingly expensive.

The Migration Playbook That Worked for Us

This is the part that determines whether you actually save money: how we executed cost reduction without breaking production.

Step 1: Audit everything

We collected for every database: tier (GP / BC / HS), replica configuration, vCore utilization trends, storage usage and growth rate, I/O and latency patterns, read vs write mix, scaling history (how often we had to scale during incidents).

Immediate candidates are:

Business Critical databases approaching the 4 TB limit, or cases where we were forced to scale up vCores purely to unlock more storage (even when CPU wasn’t the bottleneck).
General Purpose databases running replicas, effectively doubling cost without delivering proportional performance or availability benefits.
Any database with consistently low CPU utilization but high monthly spend, indicating clear over-provisioning.
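As a rough sketch, the three candidate rules above can be expressed as a filter over the audit data. The record fields and the default thresholds (`nearCapFraction`, `lowCpuPct`, `highSpend`) are hypothetical; pick values that match your own fleet and currency.

```javascript
const BC_CAP_TB = 4; // Business Critical storage ceiling from the text

// Returns the reasons (possibly none) a database is a migration candidate.
function auditReasons(db, opts = {}) {
  const { nearCapFraction = 0.8, lowCpuPct = 20, highSpend = 1000 } = opts;
  const reasons = [];
  if (db.tier === "BC" && db.storageTB >= BC_CAP_TB * nearCapFraction) {
    reasons.push("BC near 4 TB storage cap");
  }
  if (db.tier === "GP" && db.replicaCount > 0) {
    reasons.push("GP with replica");
  }
  if (db.avgCpuPct < lowCpuPct && db.monthlySpend >= highSpend) {
    reasons.push("low CPU, high spend");
  }
  return reasons;
}

// Made-up sample fleet, purely for illustration.
const fleet = [
  { name: "payments", tier: "BC", storageTB: 3.6, replicaCount: 0, avgCpuPct: 45, monthlySpend: 5000 },
  { name: "reports", tier: "GP", storageTB: 0.5, replicaCount: 1, avgCpuPct: 10, monthlySpend: 2000 },
  { name: "catalog", tier: "HS", storageTB: 6, replicaCount: 1, avgCpuPct: 50, monthlySpend: 3000 },
];

const candidates = fleet
  .map((db) => ({ name: db.name, reasons: auditReasons(db) }))
  .filter((c) => c.reasons.length > 0);

console.log(candidates.map((c) => c.name)); // [ 'payments', 'reports' ]
```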

Step 2: Migrate to Hyperscale

Online migration to Hyperscale is possible with minimal downtime. The migration itself is fairly well documented; the real win comes from what you do after migration.

One migration did not go as smoothly as planned: we treated it like a straight tier switch, but the workload had a heavier reporting pattern than we had modeled. We saw temporary latency regression until caches warmed and had to slow down the cutover, re-check the read pattern, and right-size more carefully after the move. That experience is what pushed us to treat workload shape, not just database size, as a first-class input.

Step 3: Right-size after migration (where most savings come from)

This is the step most teams miss. Hyperscale’s architecture often delivers significantly higher performance for the same workload due to distributed storage, multi-level caching, and higher log throughput.

That means the same workload can often run on fewer vCores. Our workflow was: migrate, monitor for a tuning period, validate p95/p99 latency, CPU, IO stalls, query time, and reduce vCores gradually until we hit a stable efficient point. This is where the majority of our savings came from.
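The step-down loop can be sketched as follows. This is a simplified model, not our actual tooling: `isHealthyAt` is a hypothetical callback standing in for "scale to this size, observe for the tuning period, and validate p95/p99 latency, CPU, IO stalls and query time", and `allowedSizes` is an assumed list of available vCore sizes.

```javascript
// Walk down the allowed vCore sizes and stop at the last size that stayed healthy.
function rightSize(currentVCores, allowedSizes, isHealthyAt) {
  const smallerSizes = allowedSizes
    .filter((size) => size < currentVCores)
    .sort((a, b) => b - a); // largest first, so each step is gradual

  let chosen = currentVCores;
  for (const size of smallerSizes) {
    if (!isHealthyAt(size)) break; // the previous size is the efficient point
    chosen = size;
  }
  return chosen;
}

// Example: a workload that only stays healthy at 8 vCores or more.
const result = rightSize(32, [2, 4, 8, 16, 32], (size) => size >= 8);
console.log(result); // 8
```

Stopping at the first unhealthy size, instead of searching further down, mirrors the gradual and reversible way we reduced compute in production.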

Step 4: Use fast scaling during incidents

One operational advantage we underestimated: Hyperscale scaling speed is a game-changer in system degradations or outages. With Business Critical, scaling could take hours in certain scenarios.

With Hyperscale:

scaling up vCores is fast
scaling down is fast
you pay only for the burst window

During unexpected load spikes, this turned scaling from a stressful multi-hour operation into a reversible configuration change.

Results

The impact was immediate and measurable.

Business Critical → Hyperscale: These migrations accounted for roughly 30 percentage points of the ~50% monthly cost reduction, while simultaneously removing the 4 TB storage ceiling and dramatically improving our ability to scale during incidents.
General Purpose + replica → tuned Hyperscale: These contributed the other ~20 percentage points of the ~50% monthly cost reduction, while simplifying our replica strategy and actually improving read performance.
Post-migration vCore right-sizing: Real optimization isn’t just picking the right tier, but tuning within it. This remained the biggest lever per database once migrations completed.

In total: ~50% savings, with better performance, better scalability, and simpler operations.

A Decision Cheat Sheet

If you’re evaluating database tiers today, here’s the simplified framework we now use internally:

What’s Next

The cost program is still evolving, and we see four areas worth exploring next:

Automate the fleet audit by wiring telemetry feeds (CPU, IO, replica lag, storage growth) into a weekly report so rightsizing candidates are surfaced continuously instead of ad hoc.
Explore elastic-pool style approaches for smaller multi-tenant workloads where isolated provisioning is inefficient.
Connect operational databases more deliberately to analytics and data lake layers so reporting and batch demand do not distort transactional sizing decisions.
Add cost-governance guardrails and alerts directly into the provisioning workflow so fast scaling does not quietly become recurring over-provisioning.

Key Takeaways

The most expensive database is one you haven’t audited recently. Over-provisioning compounds quietly — unused replicas, oversized tiers, vCores allocated “just in case” — and the bill grows without anyone noticing because no single decision looks unreasonable in isolation.

Here’s what we learned the hard way:

Hyperscale has matured enough to be the default for most production workloads in our fleet. Its combination of performance, scalability, and cost efficiency made it the right choice for 80%+ of our databases.
Replicas on General Purpose are a cost trap. The moment a GP database needs replicas, you should seriously evaluate whether Hyperscale is a better fit. It almost always is.
Real savings happen after migration. Migrating to the right tier is necessary but not sufficient. The majority of our cost reduction came from post-migration vCore right-sizing by methodically stepping down compute until we found the efficient operating point for each workload.

Treat tier selection as an ongoing economic decision, not a one-time architectural choice. Workloads evolve. Storage grows. Traffic patterns shift. The tier that was right 18 months ago may be costing you lakhs today. If your organization runs SQL databases at meaningful scale, a tier + replica audit is likely one of the highest-ROI infrastructure optimizations you can do this quarter.

Authored by Engineering team at udaan. If building and scaling complex backend systems interests you, this is the kind of work we do every day.

How udaan cut SQL database costs by 50% without sacrificing performance was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we solved a 2s+ stutter caused by re-rendering React components

Divjot Singh — Wed, 30 Jul 2025 08:45:07 GMT

TL;DR

The Add to Cart interaction in our React Native app developed a 2-second UI lag due to accumulating tech debt and improper usage of context and Redux.
react-scan, while really amazing, reported more detail than we could digest given our app's chaotic re-render patterns; using React DevTools' "Highlight Updates" with video recording helped pinpoint the issue better.
A careful refactor of context usage into multiple granular custom hooks in one file, followed by replacing those with useSelector hooks and batching, helped us reduce the frequency of re-renders 14-fold.
React Compiler further reduced the problem horizontally by reducing the breadth of components that re-rendered.

Disclaimer: This post is touched up using LLM tools.

Before we begin, let’s first understand the problem we are trying to solve. We started receiving user reports that our add to cart button felt slow and unresponsive. Originally the cart APIs had high latency, so the problem was written off as a “backend issue” and we added a loading bar to give immediate feedback to the user. However, as the latency was reduced, another problem surfaced. Notice how the loading bar stutters and gets stuck and the footer doesn’t update for a good 2 seconds? Yeah, this post is about solving that with minimal code changes.

https://medium.com/media/14489f5ff43cd3169e098177bc94d322/href

Making Our React Native App Faster

Ever notice your app slow down, even for a moment? That little pause is a gateway to your users bouncing off of your app. This is especially bad when your user is trying to actually transact, like adding items to their cart or completing a purchase. These delays, often caused by too many screen updates and heavy calculations, make users frustrated, reduce how much they use the app, and can even make them think less of your product. In this post we go over how we made key interactions 2850ms faster by controlling re-renders of our React application.

How did we get here: Compounding works both ways

Small design choices and incremental feature additions can quietly compound into significant performance problems over time.

At first, we used Redux for managing our app’s data. To simplify interactions, an abstraction layer was introduced using React’s Context API. This was not a logical, well-thought-out architectural choice; rather, it was a decision made in haste, driven by the immediate pursuit of shipping features quickly.

However, this decision began to obscure the underlying complexity. The problem was that for every re-render caused by a Redux store update, the context added another re-render to update its own value.

Over two years, as new features were added and existing logic grew, it became increasingly straightforward to add more state-related logic directly within this Context-wrapped Redux setup. The result? Instead of a single, atomic update to our Redux store, a seemingly simple action like “Add to Cart” could trigger 14 back-to-back store updates and re-renders. Each dispatch, though individually small, contributed to a cascade of UI updates, monopolizing the main thread and leading to noticeable stutters. This illustrates a critical lesson: taking the easy route now can lead to major headaches later.

This insidious degradation is a bit like the “boiling frog” story. Maybe an interaction first had a tiny 100ms lag. Then it slowly grew to 250ms, then 800ms. Developers might have felt something was a bit off, but with constant deadlines for new features, it was hard to argue for time to fix a small 100–200ms problem.

Measure twice, cut once

Finding exactly what was causing these problems and how big they were was crucial. We first tried using react-scan for performance insights. While react-scan is a very powerful and robust tool, the sheer volume and chaotic nature of re-renders occurring across our application meant its detailed output, though accurate, was too overwhelming to quickly pinpoint specific culprits in our particular scenario (oof!).

https://medium.com/media/3474d2524bab5073e811184747aa4800/href

Our breakthrough came from a more direct, visual approach. We used React DevTools’ “Highlight Updates” feature. This tool visually showed us what was updating on screen. We combined this with recording a video of the “Add to Cart” task on a test phone. By looking at the exact moment right after the action, we clearly saw which parts of the screen were redrawing and, most importantly, how many times.

Screenshot of the app showing every element with a bright golden label “View x14” “Path x13” etc. where the first part is the name of the component and the x14 is the number of times that component re-rendered.

We could also see in the Chrome Performance Dev Tools undeniable visual proof of a 2246ms delay on moderately slow phones during critical user flows.

Chrome Performance Dev Tools showing a massive 2.2s stutter, all that time spent in scripting.

This gave us the clear reason we needed to dedicate time to fix it. That annoying, jagged mountain in our performance graph, showing 14 back-to-back updates, told the whole story.

Crucially, having this reliable measurement technique not only exposed the problem but also gave us a precise target for our code changes. We’ll see how this simple janky measurement ended up being a great metric.

How we fixed it with minimal changes

It’s easy to give in to the desire to grab shiny tools when posed with such issues. Maybe we should migrate to zustand? Maybe we should upgrade to React 19 and use use? Maybe we should enable React Compiler everywhere? Maybe we should rewrite the entire state logic in Redux Toolkit? But when dealing with production code, it’s often in your best interest to find minimal fixes for your problems to reduce the maintenance and QA load, and actually ship the fixes in realistic timelines.

Our plan involved three key steps:

1. Cleaner Code and Switching to Redux: Setting Up for Speed

For context, whenever the value in a React Context changes, all the Context.Consumers and all components that call useContext for that context re-render, regardless of which part of that value updated. You may find this excellent post by Nadia to be quite valuable in understanding the nuances of React re-renders. If you really have to use context, consider splitting each key of the object into its own individual Context.

In our case, not only did we have a huge object in context, we were also saving it to Redux. This meant that we were paying 2x the price for state updates due to the unnecessary context layer in between. We didn’t really need to “fix” context usage; we could just remove it, as we already had Redux in place.

To make our lives easy, our first step to solving the problem was a big code cleanup. We started by finding all the places where useContext() was used directly and made sure they all went through specific, smaller custom hooks. For example, lines like

const {
  cartItems,
  activeCategoryId,
  cartCoupons,
  // and so on...
} = useContext(CartContext);

became

const cartItems = useCartItems();
const activeCategoryId = useCartActiveCategoryId();
const cartCoupons = useCartCoupons();

This careful work limited all direct useContext calls to just one single, dedicated file. It’s a nice refactoring technique that helps you isolate your problem to one file. This didn't change how our data was managed right away, but it created a clear, traceable path for data flow and paved the way for more impactful optimizations. Putting the problem code into one file was a major reason we could fix it so fast.

After this cleanup, the next logical step was to replace the useContext calls with Redux's useSelector hook.

function useCartItems() {
  const cartItems = useContext(CartContext).cart.cartItems;
  return cartItems;
}

function useCartActiveCategoryId() {
  const activeCategoryId = useContext(CartContext).activeCategoryId;
  return activeCategoryId;
}

function useCartCoupons() {
  const cartCoupons = useContext(CartContext).cartCoupons;
  return cartCoupons;
}

// became

function useCartItems() {
  const cartItems = useSelector(store => store.cart.cartItems);
  return cartItems;
}

function useCartActiveCategoryId() {
  const activeCategoryId = useSelector(store => store.cart.activeCategoryId);
  return activeCategoryId;
}

function useCartCoupons() {
  const cartCoupons = useSelector(store => store.cart.cartCoupons);
  return cartCoupons;
}

Switching from context to redux alone reduced the re-renders two-fold, and the granular hooks now only asked for the specific bits of Redux data they needed, and had no React Context to worry about. This stopped unnecessary redraws when other parts of the data changed.

One could argue that it would be nice if the React team added selectors to the context API itself, but it looks like they’re taking a different approach by letting you use use(context) inside a useMemo, which also means it could get optimized by React Compiler in the future.

As of July 2025, use(context) has shipped in React 19 but not the part where you can use it within useMemo and get it to recompute automatically. I think the idea has merit as it bypasses the code-smell of selectors and lets compilers take care of it. Alternatively, selectors too could be written by compilers, so there’s that.

Key takeaway: By limiting Context use and switching to specific useSelector hooks, we halved the re-renders caused by improper usage of React context.
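To illustrate why selector-based subscriptions cut re-renders, here is a toy model with no React involved. It is not how react-redux or React Context are implemented; `createStore` and `subscribeWithSelector` are hypothetical helpers written for this example. One subscriber reacts to every change of the whole value (like a context consumer), the other only reacts when its selected slice changes by reference (like a useSelector hook).

```javascript
// Minimal store: every setState notifies every listener.
function createStore(initialState) {
  let state = initialState;
  const listeners = new Set();
  return {
    getState: () => state,
    setState(patch) {
      state = { ...state, ...patch };
      listeners.forEach((listener) => listener());
    },
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
  };
}

// Calls onChange only when the selected slice changes by reference.
function subscribeWithSelector(store, selector, onChange) {
  let last = selector(store.getState());
  return store.subscribe(() => {
    const next = selector(store.getState());
    if (next !== last) {
      last = next;
      onChange(next);
    }
  });
}

const store = createStore({ cartItems: [], activeCategoryId: null, cartCoupons: [] });
let wholeValueRenders = 0; // stands in for a component reading the whole context value
let cartItemsRenders = 0;  // stands in for a component using a granular selector

store.subscribe(() => { wholeValueRenders += 1; });
subscribeWithSelector(store, (s) => s.cartItems, () => { cartItemsRenders += 1; });

store.setState({ activeCategoryId: "c1" });
store.setState({ cartCoupons: ["WELCOME"] });
store.setState({ cartItems: [{ id: 1 }] });

console.log(wholeValueRenders); // 3
console.log(cartItemsRenders);  // 1
```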

2. Using Redux Batching: Combining Updates

With the Redux selectors properly in place, we quickly found situations where several related actions were being dispatched in rapid succession on the add to cart action. Like before, the problem grew over time.

// day 1
function onAddToCart () {
  dispatch(someAction());
}

// 2 years later
function onAddToCart () {
  dispatch(defaultValues());
  // someRandomCommentedCodeCozWhyNot();
  dispatch(resetSomeOtherValues());
  someTelemetryCode();
  dispatch(someAction());
  if (someCondition) {
    dispatch(someOtherAction());
    if (userCondition) {
      someOtherTelemetryCode();
      dispatch(newUserConditionAction());
    }
  }
  someFunctionThatCallsDispatchInternally();
  dispatch(newFeatureAction());
}

Ideally all your business logic should be written in your reducer, or even better, in a separate service file, but in practice this falls apart if not properly enforced, especially with changing business requirements that prevent you from creating perfect abstractions. Luckily, we can solve this without rewriting any business logic.

While modern React (version 18 and newer) and recent react-redux updates automatically batch state updates, our older React Native version in the app meant we couldn't just wait for a big platform update to benefit. The urgency of fixing this massive stutter without an equally massive QA pass prompted us to specifically use Redux Batching with the batch function from react-redux. Simply wrapping onAddToCart in batch was enough to substantially bring down the re-renders.

This resulted in a 6x reduction in screen redraws compared to just using Redux before, and an amazing 14x reduction from our original, unoptimized state. Our performance graph, which once showed a long, bumpy line from 14 fast updates and redraws, now had one short, sharp peak for a single, larger redraw. We literally watched the bottleneck shrink.

Before vs After of Chrome Dev Tools. We can see a much smaller mountain of “Scripting” time in the graph, with the time reduced from 2246ms to 321ms. There’s still scope for improvement but this is a dramatic W.

Key takeaway: Explicit Redux batching was a crucial manual intervention for our React Native application. It let us deliver immediate, critical performance improvements via OTA updates, greatly improving user interaction responsiveness.
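The effect of batching can be shown with a toy store. This is a simplified model of the outcome, not how react-redux implements `batch`: here a hypothetical `batchUpdates` helper defers subscriber notifications until the wrapped function finishes, so three dispatches produce one notification. In the app itself we used the real `batch` function from react-redux around `onAddToCart`.

```javascript
// Toy store with an assumed batchUpdates helper (names are illustrative).
function createBatchingStore(reducer, initialState) {
  let state = initialState;
  let batchDepth = 0;
  let dirty = false;
  const listeners = new Set();
  const notify = () => listeners.forEach((listener) => listener());

  return {
    getState: () => state,
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
    dispatch(action) {
      state = reducer(state, action);
      if (batchDepth > 0) dirty = true; // defer notification while batching
      else notify();
    },
    batchUpdates(fn) {
      batchDepth += 1;
      try {
        fn();
      } finally {
        batchDepth -= 1;
        if (batchDepth === 0 && dirty) {
          dirty = false;
          notify(); // one notification for the whole batch
        }
      }
    },
  };
}

const reducer = (state, action) => ({ ...state, [action.type]: true });
const batchingStore = createBatchingStore(reducer, {});
let renders = 0;
batchingStore.subscribe(() => { renders += 1; });

function onAddToCart() {
  batchingStore.dispatch({ type: "defaultValues" });
  batchingStore.dispatch({ type: "someAction" });
  batchingStore.dispatch({ type: "newFeatureAction" });
}

onAddToCart();
const unbatchedRenders = renders; // 3

renders = 0;
batchingStore.batchUpdates(onAddToCart);
const batchedRenders = renders; // 1

console.log(unbatchedRenders, batchedRenders); // 3 1
```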

3. Getting Ready for React Compiler: Automatic Optimizations

So far we’ve been tackling the frequency of updates, but there’s another issue that’s quite apparent in the screenshot of re-renders: components that shouldn’t re-render were re-rendering. Think of all the Svg Icons, the titles. Traditionally this is fixed by manually polluting your codebase with useMemo, useCallback and memo. Any sane person can tell you that this is a crazy whack-a-mole game. Enter React Compiler. This babel plugin (it’s being ported to swc soon) automatically detects the dependencies of variables within your React components and “memoizes” them. You can play with it here. In fact, that’s what we did first to identify potential gains before setting up react-compiler for our project.

It’s important to know that while it stops unnecessary screen redraws by smartly optimizing components, it doesn’t remove the fundamental reasons for redraws (like when data or properties legitimately change). In our specific case, our original code was intentionally triggering 7 separate redux dispatches, and then doubling that with setting it in context, effectively compounding the redraw problem. I doubt the compiler can fix that big error of judgement, but it can surely reduce the blows of that suboptimal architectural decision.

By adding the React Compiler to our Bolt Design System (our internal library of components), we unlocked the potential for automatic speedups. For some of our Bolt components, this has already led to zero redraws in certain update situations, greatly reducing their contribution to the overall work the app has to do. We were careful, though. Adding a tool that changes how our app runs at the compilation stage can be risky in a large, existing codebase. To lower this risk, we decided to enable the React Compiler only within our separate, isolated Bolt component library. This allowed us to thoroughly test its effects on a small set of frequently used components, building confidence before thinking about using it more widely. This also paves the way for devs to simply drop reusable components in the library without explicitly adding useMemo or useCallback . To be honest, it isn’t even practical to ensure referential stability when you’re dealing with 10+ props for some components. This is where React Compiler shines, and fine-grain reactivity frameworks outshine React.

Key takeaway: The React Compiler offers powerful, automatic optimizations to stop unnecessary redraws, however it can’t solve logic that deliberately causes re-renders. Now that we’ve fixed the fundamental issue, future gains will come from effective usage of the React compiler.

The Result: A More Responsive User Experience

Four screenshots of the search page with “highlighted updates” enabled in react dev tools. First image says “On Project Start” and shows almost every element in the screen re-rendering 13–14 times with a bright yellow color, implying costly re-renders. Second image says “Context -> Redux, 14 -> 6 re-renders” and has reduced frequency of re-renders (from 14 to 7) with a darker shade of yellow, implying less costly re-render. Third image says “react-compiler for bolt-components, 0 re-renders for some components”, and with this some icons, images and text fields have disappeared from re-render labels and the color is still darker shade of yellow. Fourth image says “redux batching 6 -> 1 re-renders”, here everything is re-rendering only once and the color has changed to teal color, implying cheap re-renders. There is still scope for improvement as many elements are still re-rendering, but it’s a long way from “Project start”.

Let’s see the difference in action now.

https://medium.com/media/9d5a59553666e5ecf97878c6bd103a26/href

All these optimizations together have greatly improved how fast and smooth our app feels. While our app isn’t universally lightning-fast, eliminating a debilitating 2-second stutter in the add to cart action was extremely gratifying.

What’s Next: Always Making Things Better

Even though we’ve made big progress in making our frontend faster, our work isn’t over. We’ve found more areas to improve:

Optimizing other components: Right now, our listing widget depends on the entire list of cart items due to some edge case. Similarly, the footer depends on cart value and the progress of coupon ladders. By breaking or isolating these dependencies, we will further reduce redraws when cart items change. We definitely can get to the point where only the Add to Cart button and the footer re-render. But we are looking at diminishing returns for the amount of code changes required. The next point is probably a better investment.
More React Compiler Use: As we learn more and gain confidence with how the React Compiler works in our Bolt components, we will look for chances to use it in other important parts of our code, aiming to get the most out of its automatic optimization features.

Improving performance is an ongoing process, because performance keeps degrading as more code is checked in. It’s important to audit your app from time to time to find gaps, and for every highly dynamic state, it’s better to take the time to make the right choices early.

What are your experiences with performance problems that grew from small changes? Share your thoughts below!

How we solved a 2s+ stutter caused by re-rendering React components was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.

Taming the Ebb and Flow: Cost Efficient Time Based Autoscaling

Aditya Bennur — Thu, 01 Aug 2024 14:26:50 GMT

The Challenge: Predictable Peaks and Unpredictable Costs

At Udaan, we operate a complex microservices architecture with over 200 Kubernetes deployments. Analysis of our application traffic patterns reveals distinct “ebbs” and “flows”.

During business hours, our servers are bustling city squares; by night, they’re quiet ghost towns. This cyclical nature of our workloads presented us with an opportunity to optimize our compute resource utilization and reduce operational costs.

Graph of requests per second (y-axis) over time (x-axis).

Primer: Autoscaling

Kubernetes autoscaling capabilities play a pivotal role in its widespread adoption across enterprises. Among the various autoscaling methods available for microservices, such as Vertical Pod Autoscaling (VPA), Horizontal Pod Autoscaling (HPA), and Event-Driven Autoscaling, HPA stands out as the most commonly adopted solution.

HPA dynamically adjusts the number of replicas for a Deployment or ReplicaSet based on metrics like CPU and memory utilization, or custom metrics. This reactive scaling responds to incoming traffic patterns without relying on predefined time-based components.

In contrast, time-based scaling is a modern approach that allows applications to scale based on user-defined schedules. This method proves advantageous for scenarios with predictable traffic patterns.

While HPA excels in handling dynamic and unpredictable traffic, it may not be the ideal choice for environments where traffic patterns are well-understood and predictable. Time-based autoscaling empowers organizations to efficiently scale resources up or down ahead of expected demand.

Approach: Time Driven Scaling

After evaluating and comparing the available autoscaling solutions against our requirements and limitations, we opted to implement time-based autoscaling. Enter “Service Scaler”, a home-grown Kubernetes operator that proactively monitors and controls the HPA object of a corresponding deployment and gradually scales workloads according to a specified configuration.

The Configuration (CRD)

CRD definition

Time can be specified in the following formats:

ZonedTime — HH:MM
ZonedDateTime — rfc3339 format Ex: 2023-01-11T08:00:00+05:30

If no time range matches, the default configuration will be applied.
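The fallback rule can be pictured with a small sketch. This is not the operator's code or its CRD schema: the `TimeRange` and `ScalerSpec` shapes and the `resolveMinReplicas` name are hypothetical, only the HH:MM form is handled, and the caller is assumed to have already expressed "now" in the configured time zone.

```typescript
// Hypothetical shapes for illustration only; not the operator's real schema.
interface TimeRange {
  from: string; // "HH:MM", inclusive
  to: string; // "HH:MM", exclusive
  minReplicas: number;
}

interface ScalerSpec {
  ranges: TimeRange[];
  defaultMinReplicas: number;
}

// Convert "HH:MM" into minutes since midnight.
function toMinutes(hhmm: string): number {
  const parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

// True when nowMinutes falls inside the range. A range whose end is not
// after its start is treated as wrapping past midnight (e.g. 16:00 to 00:00).
function inRange(range: TimeRange, nowMinutes: number): boolean {
  const start = toMinutes(range.from);
  const end = toMinutes(range.to);
  if (start < end) return nowMinutes >= start && nowMinutes < end;
  return nowMinutes >= start || nowMinutes < end;
}

// First matching range wins; otherwise fall back to the default.
function resolveMinReplicas(spec: ScalerSpec, now: string): number {
  const nowMinutes = toMinutes(now);
  const match = spec.ranges.find((range) => inRange(range, nowMinutes));
  return match ? match.minReplicas : spec.defaultMinReplicas;
}
```

With a spec that only covers 08:00 to 00:00, a lookup at 03:00 matches nothing and returns `defaultMinReplicas`.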

Operator “Mechanics”

Let’s explore the inner workings of the Service Scaler through visual representations:

state machine

architecture

Operational Overview

The Controller monitors and reacts to three types of events within the ServiceScaler CRD:
1. Create
2. Update
3. Delete
Reconciliation:
1. Verifies if the “time-range” matches.
2. Assesses “early-exit” conditions.
3. Executes HPA operations accordingly.
Given the dependency on the “time” dimension, reconciliation is forced every 5 minutes to maintain the desired state of the HPA consistently, even when Create/Update/Delete events are not triggered.
Invalid HPAs (those with minReplicas == maxReplicas) are removed automatically to keep the configuration sane.
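As a rough illustration of the reconciliation steps above, the decision can be modeled as a pure function. The `HpaView` and `Decision` types and the `decide` name are hypothetical, and the only early-exit condition modeled here (the HPA already being at the desired value) is an assumption, since the post does not list the real ones.

```typescript
// Hypothetical, trimmed-down view of an HPA: only the fields used below.
interface HpaView {
  minReplicas: number;
  maxReplicas: number;
}

type Decision =
  | { action: "delete"; reason: string }
  | { action: "patch"; minReplicas: number }
  | { action: "none" };

// The post states that reconciliation is forced every 5 minutes.
const REQUEUE_AFTER_MS = 5 * 60 * 1000;

function decide(hpa: HpaView, desiredMinReplicas: number): Decision {
  // An HPA that cannot scale (min == max) is treated as invalid and removed.
  if (hpa.minReplicas === hpa.maxReplicas) {
    return { action: "delete", reason: "minReplicas == maxReplicas" };
  }
  // Assumed early exit: nothing to do when the HPA already matches.
  if (hpa.minReplicas === desiredMinReplicas) {
    return { action: "none" };
  }
  return { action: "patch", minReplicas: desiredMinReplicas };
}
```

A controller loop would call `decide` on every Create/Update/Delete event and again after `REQUEUE_AFTER_MS`, so that a time boundary passing with no events still gets picked up.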

Ramp up-down “mechanics”

Scaling activities commence 30 minutes in advance of both anticipated traffic increases and decreases, ensuring systems are primed and ready.

ramp-up down module

Enough talk! Let’s take a look at a scenario where it would begin scaling down from 14 to 2 replicas prior to midnight (00:00), ensuring the 2 replicas are fully provisioned and ready to handle the expected lower traffic load at the start of the day.

y=replicas; x=time
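To make the 14-to-2 scenario concrete, here is one possible ramp schedule. The post only says that scaling is gradual and starts 30 minutes ahead; the linear interpolation, the 5-minute step, and the `rampPlan` name are assumptions made for illustration.

```typescript
interface RampStep {
  minutesBeforeBoundary: number; // how long before the boundary (e.g. 00:00)
  replicas: number; // minReplicas to apply at that point
}

// Linearly interpolate from `from` to `to` over the window before the boundary.
function rampPlan(
  from: number,
  to: number,
  windowMinutes = 30,
  stepMinutes = 5,
): RampStep[] {
  const steps = Math.max(1, Math.floor(windowMinutes / stepMinutes));
  const plan: RampStep[] = [];
  for (let i = 1; i <= steps; i++) {
    plan.push({
      minutesBeforeBoundary: Math.max(0, windowMinutes - i * stepMinutes),
      replicas: Math.round(from + ((to - from) * i) / steps),
    });
  }
  return plan;
}

// Scaling down from 14 to 2 ahead of midnight yields the steps
// 12, 10, 8, 6, 4, 2 at 25, 20, 15, 10, 5 and 0 minutes before 00:00.
const rampDown = rampPlan(14, 2);
```

The same function covers ramp-up by swapping the arguments, for example `rampPlan(2, 14)` ahead of the morning interval.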

Kill Switch

For those rare instances when things might not go as planned, we’ve built in a manual override. By adding a simple annotation to the HPA, the Service Scaler can be bypassed, putting control back in the hands of our operators.

service-scaler.kubernetes.io/managed: "false"
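A check for this annotation might look like the sketch below. The `isManaged` helper is hypothetical, and treating a missing annotation (or any value other than the string "false") as "managed" is an assumption about the default.

```typescript
const MANAGED_ANNOTATION = "service-scaler.kubernetes.io/managed";

// Returns false only when the kill-switch annotation is explicitly "false".
function isManaged(annotations: Record<string, string> | undefined): boolean {
  return annotations?.[MANAGED_ANNOTATION] !== "false";
}
```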

Game Plan: patch “minReplicas”

With the ServiceScaler framework in place, we proceeded to adjust the minReplicas field of the Horizontal Pod Autoscaler (HPA) based on the time of day for 200+ deployments. We divided the 24-hour period into three distinct intervals:

Morning (8am — 4pm)
Evening (4pm — 12am)
Night (12am — 8am)

The corresponding minReplicas values for each interval were calculated statistically, proportional to the observed load patterns.
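The post does not spell out the statistic used, so the following is only one way to read "proportional to the observed load": scale the replica count needed at peak by each interval's share of the peak load. The `proportionalMinReplicas` name, the load figures, and the floor of 2 are all invented for illustration.

```typescript
// loadByInterval: an observed load figure per interval (for example, a high
// percentile of requests per second). replicasAtPeak: replicas needed for
// the busiest interval. floor: never go below this many replicas.
function proportionalMinReplicas(
  loadByInterval: Record<string, number>,
  replicasAtPeak: number,
  floor = 1,
): Record<string, number> {
  const peak = Math.max(...Object.values(loadByInterval));
  const result: Record<string, number> = {};
  for (const [interval, load] of Object.entries(loadByInterval)) {
    const scaled = peak > 0 ? Math.ceil((load / peak) * replicasAtPeak) : floor;
    result[interval] = Math.max(floor, scaled);
  }
  return result;
}

// Invented numbers: morning is the peak, night carries 10% of it.
const example = proportionalMinReplicas(
  { morning: 1000, evening: 500, night: 100 },
  14,
  2,
);
// example is { morning: 14, evening: 7, night: 2 }
```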

Aftermath: More Than Just Promising Numbers

We achieved a 25% reduction in VM costs whilst ensuring our infrastructure was breathing in sync with our business demands. Let’s take a look at some visuals portraying the reduction.

node-count reduction

vm-quantity reduction

We’ve open sourced the Service Scaler, so be sure to give it a spin! Contributions and feedback are always welcomed.

Taming the Ebb and Flow: Cost Efficient Time Based Autoscaling was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unveiling Innovation: udaan’s Inaugural Hackathon 2023

Gaurav Bhalotia — Sun, 20 Aug 2023 02:15:26 GMT

udaan celebrated its first ever tech meet-up ‘Techverse’ in the first week of August 2023, where engineers from various locations flew down to Bangalore for a week of fun, collaboration, learning and reconnecting with our customers. As part of the meet-up week, we hosted our inaugural Hackathon. This event offered a unique platform for our talented engineers to elevate their skills and channel their creativity, engaging in a 24-hour marathon exploration of the product lifecycle, from ideation to implementing a working prototype; and finally delivering a sales pitch to a panel of judges.

The underpinning mantra of the Hackathon was a simple yet potent phrase: ‘Unleash Your Inner Geek.’ With this rallying call, participants were invited not merely to demonstrate their technical prowess, but to delve into the depths of their creativity and problem-solving skills and to build for Bharat. Engineers were encouraged to push their boundaries, break conventions, and craft solutions to any real-world problem, not restricted to udaan. We witnessed a remarkable 45+ submissions, with 38 teams crossing the finish line, sleep-deprived but fuelled by adrenaline and the iconic Red Bull.

Red bull team supplying energy potions to the hackers :)

Mentoring: Nurturing Innovators

Each registered team was paired with a mentor from our esteemed executive team (TAB team). These TAB team leaders dedicated their time to collaborate with the teams both prior to and during the event, skilfully guiding them through the conceptual landscape and assisting in transforming ideas into concrete solutions. This constructive collaboration with our leadership team created a coming together of collective wisdom, enriching the journey towards innovation.

In the true sense of celebrating the spirit of technology at udaan, this mentorship not only helped hone ideas but also created a good tech connect for our leadership teams.

Recap of an Unforgettable Experience

As the dust settled and the final lines of code were written, the Hackathon had etched a memorable experience in the hearts of all involved. The collective energy, sleepless nights, and relentless pursuit of excellence culminated in an event that was more than a competition — it was a celebration of innovation.

Participants delved deep into their wellsprings of creativity, ignited by an unbridled passion for technology and innovation. Among the highlights, udaan’s trailblazing engineers — Siddartha Reddy, Tarun Mangal and Shashwat Agarwal — reunited to evoke cherished memories from seven years ago when they forged the initial fragments of udaan’s technological journey. Embodying an entrepreneurial spirit, we even witnessed the convergence and divergence of ideas, with teams engaging in idea mergers and splits. This flexible approach saw teams pivoting their focus after identifying shared problem spaces, a testament to the adaptive ethos of the event.

Fun moments on the hack floor

Judging and Awards: Celebrating Excellence

The judging process commenced with a comprehensive floor walkthrough led by a panel of judges, aimed at handpicking the most promising ideas for further consideration. From the initial pool, a curated selection of ten standout ideas advanced to the final round. During this crucial stage, these shortlisted ideas were presented before an esteemed panel of judges, in addition to a larger audience spanning the entire company.

The criteria for evaluation encompassed multiple factors, including the degree of innovation, user-friendliness, potential impact, intricacy of the solution, and quality of presentation itself. This ensured that the ideas not only demonstrated technical prowess but also the ability to resonate with audiences on multiple levels.

Floor walkthroughs by the Judging panels

Artificial Intelligence (AI), along with Generative AI, spurred a multitude of innovative ideas, including:

AI-powered voice agents for enhanced customer support.
Multilingual voice interface amplifying app accessibility.
Extracting patterns from customer support calls for insights.
Utilising phone cameras for seamless inventory replenishment.
Human-like chat interface to udaan analytics.
Decoding investment algorithms and trust from YouTube content.

Three of these groundbreaking concepts secured their place in the esteemed league of final winners.

Five winners emerged, as highlighted below:

In summary, the udaan Hackathon 2023 was not just an event; it was a hub of innovation, a showcase of creativity, and a testament to effective teamwork. As we move towards a tech-driven future, I invite all engineers to keep their passion alive and contribute as architects in this evolving eB2B commerce landscape.

A sincere thank you to the participants, mentors, judges, and organizers who made Hackathon 2023 a reality. We look forward to next year’s edition of the event, brimming with even more groundbreaking innovations brought to you by team udaan.

Unveiling Innovation: udaan’s Inaugural Hackathon 2023 was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.

Reacting to Change: Tale of a Web Developer Improving Startup Performance for a large React Native…

Divjot Singh — Tue, 15 Aug 2023 04:50:51 GMT

Reacting to Change: Tale of a Web Developer Improving Startup Performance for a large React Native App | udaan

Two years into my stint at udaan, I was tasked with “improving the startup performance” of our react-native app. I asked myself, what does that even mean and how do I even begin tackling this?

AI-Generated image of a person scratching their head while looking at a large billboard with React’s atom logo at night with a starry sky

Join in my journey of understanding react-native’s startup flow, identifying bottlenecks, using simple optimization techniques, discovering common pitfalls with startup time telemetry, and realizing the importance of perceived performance. This isn’t a “one trick to boost your app performance by 10x”, but rather a long journey of figuring things out, finding patterns in a mess, measuring impact and delivering a much improved experience to your users. My hope is that this helps you build intuition and tackle daunting problems with your software, and at the same time learn from the larger community about similar challenges faced by developers in the react-native ecosystem.

It’s a long one, but don’t worry, there’s a tl;dr at the bottom.

Background

I know, as a web developer, what optimization of page load performance feels like. You go to web.dev/measure, get a score for various metrics, and follow the recommendations mentioned over there. The usual suspects are large asset files, lack of good compression (brotli/gzip), lack of code splitting, lack of appropriate resolutions for images, lack of font compression, lack of inlining of assets, lack of inlining of HTML (SSR), lack of preloading/prefetching, lack of a CDN, lack of optimization of waterfalls, and so on.

If you squint hard enough, everything I just mentioned is basically trying to fit as much relevant information to that page’s initial render in the tiniest network footprint. Just taking care of that usually takes care of everything else. Even alternatives to ReactJS, like Preact, are touted for their 3kb size, and not necessarily for their runtime performance improvements.

Fast 3kB alternative to React with the same modern API
— Preact’s tag line

Of course, there are caveats to the above: fine-grained reactivity has been winning people over lately, thanks to efforts by the teams behind Svelte and SolidJS. You also see a plethora of ReactJS-like alternatives that regularly outperform ReactJS at runtime. In fact, ReactJS without hooks stands at ~95th position as of the time of writing on Krausest’s js-framework-benchmark, while ReactJS with hooks is further behind, yet that’s the default choice.

Speaking of React, React Server Components, the next big thing in the React world also tries to primarily solve for the network bandwidth and bundle sizes.

Put another way, the server, far more powerful and physically closer to your data sources, deals with compute-intensive rendering and ships to the client just the interactive pieces of code.
— Understanding React Server Components

This tells us that improving the roundtrip performance matters far more when it comes to the web world rather than improving runtime/interactive performance. Maybe it’s thanks to browser engine vendors who consider the real-world performance of these frameworks when optimizing their hot code paths. Maybe it’s the fact that web-apps are usually very simple compared to native apps in terms of interactivity.

Needless to say, react-native is completely different (or is it? vsauce.mp3). There is no one “engine” per se. Host platforms have their own complex web of APIs, that introduce breaking changes annually. React Native adds its own layers of abstraction. And finally, there’s your application code using all sorts of third-party libraries. All the assets are bundled with the app and are accessible locally, so there’s no network roundtrip optimization to consider here.
When I was tasked with this, very little of my web performance knowledge could easily translate, or so I thought. I had to dive deeper.

The article is divided into three parts that demarcate three distinct parts of this journey. They deal with identifying the root cause via measurements and benchmarking, deploying fixes and measuring the impact, and lastly further improving the perception of startup loading times.

Part 1: Let’s dive deep

Quick Google searches left me with even more doubts. There’s no one answer to why your particular app is slow to start. Could it be the react-native version we were using (we already had Hermes enabled so everything post that wouldn’t necessarily bring large improvements)? Could it be the react-navigation version we were using? Should we migrate to react-native-navigation instead (yes, they’re two different things)? Maybe we should move to Flutter? What about running a PWA inside a native shell? All of these were wild guesses with heavy costs. Unlike websites, you can’t just upgrade every single thing about your app and expect to not have to worry about your old code base at all. Even though we heavily use CodePush, upgrading the react-native version and pushing a new native release to respective application stores would still mean we would have to push bugfixes and maybe even major features to the old version until the new version is fully adopted, which can take anywhere from 1–3 months. Upgrading native dependencies would also lead to similar caveats as their JS APIs may not always be backward compatible. And moving to Flutter is a different story altogether. Even after going through all these steps, there’s no guarantee that we would’ve solved the problem.

Here’s how I see it: you seemingly have very little control over how things boot up when you use a complex stack like this. You can’t really remove all your modules, you can’t really delete all your JS logic, and you can’t really remove all your UI code. Largely, what you do between the JavaScript VM kicking in and your first component rendering is where your control lies as someone who’s proficient only in JavaScript.

An Excalidraw screenshot that shows 3 rectangles, “MainActivity”, “react-native” and “App Code”, left to right. “react-native” has two more rectangles within it; “modules” and “JS Bundle” left to right, implying that “react-native” phase has a module initialization phase and a “JS Bundle” execution phase. “App Code” has two rectangles within it; “Boot” and “<App />”, implying that “App Code” has a bootstrap phase where things happen before React components are mounted, and an “<App />” phase where React components and their lifecycles (componentDidMount, useEffects) dictate where time is spent.

A more formal graphic comes from Ram Narasimhan of react-native’s own team, who has excellent videos on the subject from their Chain React talks (2018) (2019) (2020).

A screenshot from the “Chain React 2019 — Ram Narasimhan — Performance in React Native” video around 9-minute mark, where Ram is describing all the steps that take place to go from app launch to first View to be rendered. It has a yellow bar, denoting the app startup time timeline. Beneath there are two bars horizontally laid down; “native” in blue and “JavaScript” in green, implying the first bar is comprised of these two bars. Then there are furthermore bars beneath “native” and “javascript”, like a flame chart. “native” has “init” taking 1/4th of space, “native modules” taking half space, and “jsvm” takes the remaining quarter. “Javascript” has “init”, “network” and “components”, all taking roughly a third of space. “components” has unlabeled bars underneath, implying various application components would be taking their individual times.

A screenshot from React Native EU 2020: Parashuram N — React Native Performance — Take 2 around the 2-minute mark. Parashuram is recapping his previous talk with the same screenshot mentioned above, this time it has labeled arrows to summarize the previous talk. The yellow bar is labeled as “app start”, this time around the bar is much shorter from its original length, and the original length is expressed as a dotted rectangle, implying we have reduced the startup time from what it was originally. The leftmost edge of the bar has an arrow labeled “ReactMarker”, denoting that it can be measured using React Markers, as explained in the previous talk. The rightmost edge has two arrows pointing to it. One says “End Marker”, denoting when we stop measuring the “startup time”, this could be the presence of a particular component, measured by an event listener on the native side where the nativeId matches. The other arrow says “Better App Startup time”, implying that we’ve reduced the bar length via optimizations discussed in the previous talk. This time around “network” has moved from being underneath “JavaScript” to being right beneath the “native” bar, but starting at the same time as “init”. The arrow pointing to “network” says “Native network, Mock Data, Native Loading Screen”, implying we can move the “network” required for app launch from “javascript” and do it in “native” land, to prevent stalling. “init”, “native modules” and “jsvm” are right where they were in the previous screenshot, just that “network” is now parallel to these three, implying concurrency. “TurboReactPackage” labeled arrow points to “native modules”, and “Hermes, JS Profiler” points to “jsvm”, implying strategies to improve these phases of startup. “init” underneath “javascript” has an arrow pointing towards itself labeled as “inline requires”, a strategy used to reduce module resolution times. 
“React Profiler” points towards the bars underneath “components”, implying “React Profiler” can be used to optimize these.

If I’m being honest, I don’t understand every single thing here. It was a bit intimidating at first. Don’t get me wrong, Ram explains it really well and it makes great sense. However, I could see large gaps in my understanding of the stack and the domain knowledge required to tackle this. And I think that’s okay. It’s okay to be intimidated by something that runs your regular-looking JavaScript and ReactJS code on a custom, highly optimized JavaScript VM, on a framework that glues native platform APIs to that VM, with fancy stuff like TurboModules, Hermes, Bridge, or JSI, in a handful of languages that you may have only a rusty familiarity with. Yup, it’s supposed to be hard.

However, one clear takeaway for me was that I need to measure before I can optimize anything. I need to understand how my app actually boots up, and where all the time is spent. Before I understand ReactMarkers, maybe I can take a hard look at the diagram I drew above and find the bottlenecks there. At least we’ll be somewhere in the ballpark. And trust me, the kinds of optimizations our app would require might not really reside in the native side of things, otherwise I’ll be having a hard time writing this post.

That’s my first learning to share. If your solution to the problem is to upgrade various dependencies, then the locus of control is never on your side. Sure, you may find great improvements every now and then (Hermes single-handedly saved so many react-native projects!), but at what opportunity costs? What if you’re waiting for react-native’s new architecture for years and even when it lands, you may still not necessarily get 5x improvements in startup performance? Was it worth the wait?

Measure twice, cut once

This is when I realized I shouldn’t approach this problem by finding packages to update in my package.json. I need to understand how our app even boots up. I was also tasked to evaluate react-native’s viability for our use case, and see if moving to alternate technologies is indeed a smart move for us or not.

Before I looked at any code, I wanted to truly see how slow our app is. Previous developers (Kartik Ukhalkar, Bhavya Rawal) had already done great work by adding excellent telemetry. We already knew how much time our app spent in the real world to go from MainActivity to the first screen. The numbers looked something like this: 4.2 seconds on P50, 12.2s on P75, 59s on P90, and ~7 hours on P99 (Press F for our poor users waiting for hours to load our app). Yeah, something isn’t right here. Let’s put a pin in it for now.
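For readers less familiar with percentile notation, here is a minimal nearest-rank sketch showing how a median of a few seconds can coexist with a P99 of several hours. The sample durations are invented and are not udaan's telemetry; `percentile` is a hypothetical helper.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Invented startup durations in milliseconds; the last one is 7 hours.
const startupMs = [3000, 4000, 4200, 5000, 7 * 60 * 60 * 1000];

const p50 = percentile(startupMs, 50); // 4200
const p99 = percentile(startupMs, 99); // 25200000
```

A single extreme sample is enough to dominate the tail while leaving the median untouched, which is why a tail that large is worth investigating separately.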

Now that I know real-world performance, I wanted to compare our app to other apps. I launched all remotely resembling apps, and even highly performant ones, and tried to compare their performance Digital Foundry style by making a video and comparing the start and end manually. I warmed up my phone by opening them all first after a fresh install and bypassing their onboarding step. Not the most scientific, but the most user behavior-like I would say.

This is what I found on my mid-range Android phone (Realme X, Snapdragon 710):

https://medium.com/media/052cd46d91f944b2b4eedbb52268c89b/href

I never expected our app to be near the top of the list! I was surprised to learn that Swiggy was technically slower to load, even though I use it regularly and have never had any issues with its loading times. Clearly, the designers and engineers at Swiggy understand the importance of perceived performance, something we’ll talk about later in the article.

Coming to our question of react-native’s viability: I know from my previous stint at Swiggy that it is not a react-native app, and if our app can “beat” it, then react-native is a fairly safe bet. Flipkart is also known for using react-native, and its incredible engineering team has built one of the finest solutions in the react-native world, so if our app can match its performance as is in my unscientific test, surely we can squeeze out a bit more performance while staying with react-native. Note that, unlike most B2C apps, our users are business owners who do not necessarily have high-end phones.

Another thing to note is that not all apps are comparable. BigBasket shows a relatively simple first page, Notion’s shell is interactive much earlier but its webview takes time to load the page, and Slack shows a list of channels by default (which is much easier to cache). The user’s intent also plays a role: when you open Swiggy, you probably have a very high intent to order something and your fingers quickly tap the right buttons, while when you open Slack, you are probably dreading a list full of red bubbles. So please don’t draw incorrect conclusions from this; the goal is simply to understand how apps load, how they feel, and where udaan stands, at least in this unscientific test.

Coming to udaan: if the absolute numbers show our app to be quite competent, why do our team and users feel it is really slow? I guess it’s a mix of perception, our user base, and a lack of optimizations. No matter how fast the app loads in this test, the real-world median is still 4.2s, some ~30% slower than what we observed above, which probably means the majority of our users have even slower phones than my mid-range one.

Taking matters into your own hands

So far I’m convinced our app is actually fairly performant and that our bottlenecks are probably specific to our app code, not the technologies we are using. This meant I needed to dig deeper. I spent the next few days adding logs, right from MainActivity to the first meaningful screen we show to our users. I could have integrated several debugging tools, learned systrace, and made sense of Flipper, but nothing beats the simplicity of the mighty logger. I can log the time between two points in any language and call it a day. By understanding the mechanics of how our app loads, I could truly understand the bottlenecks in the startup journey. This is what I came up with in debug mode:
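The logger approach needs very little code. Below is a minimal sketch of such a timeline on the JavaScript side; `createStartupTimeline` and the mark labels are hypothetical and are not the helper used in the app. The clock is injectable so the arithmetic can be checked without real timing.

```typescript
type Clock = () => number;

interface TimelineRow {
  label: string;
  sinceStartMs: number; // time since the first mark
  sincePreviousMs: number; // time since the previous mark
}

function createStartupTimeline(now: Clock = Date.now) {
  const marks: { label: string; at: number }[] = [];
  return {
    // Record a named point in time.
    mark(label: string): void {
      marks.push({ label, at: now() });
    },
    // Turn the raw marks into durations that can be logged or charted.
    report(): TimelineRow[] {
      return marks.map((current, i) => ({
        label: current.label,
        sinceStartMs: current.at - marks[0].at,
        sincePreviousMs: i === 0 ? 0 : current.at - marks[i - 1].at,
      }));
    },
  };
}
```

Calling `mark("Configuration.init start")` and `mark("Configuration.init end")` around a step, then logging `report()` once the first screen is visible, gives the rows needed for a Gantt-style chart.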

A Gantt Chart with the title “App Startup Timeline (Before)”. The raw data can be found under “Appendix” at the bottom of the post in tabular format.

Legend:

Native Package Registration is the time spent for each packages.add() in react-native template code for non-autolinked packages.
JS Bundle is a step in a debug mode where Metro bundler would build and stream a bundle to the emulator. This would be a file-read operation in the production build. This took ~15s on my setup, but I’ve trimmed it to 0.5s to prevent it from overshadowing everything else.
CodePush Wrapper + Redux is one of the first React components we render. The time is representative of react-native giving control to JS execution and the time spent going from AppRegistry.registerComponent to the render function of these components.
A few other small React components sit in the initial React tree.
<LoadingBanner /> is the first React screen visible to the user after the splash screen. It is visible while we “load” our app.
Configuration.init() is the first operation outside of ReactJS where we await various promises and refresh our session tokens with our auth service.
Another small React component follows.
<NavigationBuilder /> is the component that builds our react-navigation stack, tab, and drawer navigators.
<InitialRoute /> is the first react-navigation screen. Its componentDidMount waits for the navigation focus event and for DeepLinkHelper.init() to finish before navigating to the actual first screen (HomePage/Login).

This activity, while tedious, helped me truly understand our app better. It doesn’t even matter whether it’s accurate; we only need to find relatively large tasks that we could move around or even eliminate. And I could see that most of the time is spent AFTER control flow reaches JS, another sign that the problem lies in our app code. Most of these components are part of the critical path, and it doesn’t seem like we’re doing anything wasteful here. CodePush and Redux need to be initialized before anything else, a LoadingBanner is shown as soon as a long-running task starts, NavigationContainer has to set up all our screens during boot, and we need to handle deep links before showing the first screen to avoid a flash of incorrect screens. Everything looks relatively alright.

Next, I tried to break down Configuration.init() and InitialRoute’s mounting phase.

A Gantt Chart with a title “Configuration.init() Breakdown”. The raw data can be found under “Appendix” at bottom of the post in tabular format.

A Gantt Chart with a title “<InitialRoute /> breakdown”. The raw data can be found under “Appendix” at bottom of the post in tabular format.

I have simplified the pieces of the code to just a simple component with useEffect, however this was buried behind 5 components and different helper functions. So depending on your codebase and its age, expect finding such a pattern to not be so easy and obvious.

/**
 * Imagine this is how Configuration.init
 * NavigationBuilder and LoadingBanner are
 * used in the app startup
 */
function App() {
  const [status, setStatus] = useState({
    loading: true,
    config: null,
    error: null,
  });

  useEffect(() => {
    let mounted = true;
    Configuration.init()
      .then((config) => {
        if (!mounted) return;
        setStatus({ loading: false, config, error: null });
      })
      .catch((error) => {
        if (!mounted) return;
        setStatus({ loading: false, config: null, error });
      });
    return () => {
      cleanup();
      mounted = false;
    };
  }, []);

  if (status.loading) return <LoadingBanner />;

  if (status.error) throw status.error;

  return <NavigationBuilder />;
}

Hmm, at this point these charts started resembling the Network panel of a website. I asked myself: can’t I just “prefetch” or “preload” some of these tasks and reduce the stalling phases in the pipeline? Clearly Configuration.init() is a critical task that blocks everything else, so doing it as the very first step would speed everything up. Similarly, DeepLinkHelper.init() can be done before React components are even mounted. These functions return promises, so I thought it would be easier to fire them early and “cache” the promises in their pending state, so that by the time they’re actually required, they may already be resolved.

There are other obvious optimizations, such as loading the session on the native side and sending it over the bridge during boot, considering TurboModules to reduce package registration times, or refactoring how we use react-navigation. All of these appeared to be large breaking changes that may come with their own set of bugs, and may require segmenting our users and forcing the development team to maintain two branches of our app for months. They didn’t even seem that impactful: starting Configuration.init() ~500ms sooner is a much larger win than shortening a ~180ms step to, say, ~100ms. So I shortlisted these two init functions to begin with. Here’s how you can go about “caching” these promises early:

// before
export class Configuration {
  static init() {
    return somePromiseReturningFunction();
  }
}

// after
export class Configuration {
  // Cached promise (pending or settled) shared by every caller of init().
  private static promiseValue: ReturnType<
    typeof somePromiseReturningFunction
  > | null = null;

  static init() {
    if (this.promiseValue) {
      return this.promiseValue;
    }
    this.promiseValue = this.actualInit();
    return this.promiseValue;
  }

  static actualInit() {
    return somePromiseReturningFunction();
  }
}

The above changes allow us to call Configuration.init() at any moment. So we can do this right before ReactJS is even in the picture, without changing its actual usage in App.

/**
 * Warm up Configuration.init for later use in <App />.
 * Make sure it is idempotent to avoid side-effects.
 */
Configuration.init();

AppRegistry.registerComponent(appName, () => cp(Root));

Say what you may, but classes are sometimes good tools for encapsulating data and behaviour. If you’re more of a functional purist, I guess one could write it like this:

// before
export function initConfiguration() {
  return somePromiseReturningFunction();
}

// after
/**
 * Should be called once per session. The returned function
 * hands every caller the same cached promise.
 */
function createConfigurationInitializer() {
  let promiseValue: ReturnType<typeof somePromiseReturningFunction> | null =
    null;

  return () => {
    if (promiseValue) {
      return promiseValue;
    }
    promiseValue = somePromiseReturningFunction();
    return promiseValue;
  };
}

export const initConfiguration = createConfigurationInitializer();
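One caveat with caching a promise is that a rejected promise is cached too, so every later caller would receive the same failure. A possible variant, not taken from the app's code, clears the cache when the attempt fails so that the next call retries. `memoizeAsync` is a hypothetical helper name.

```typescript
// Wrap an async task so concurrent and repeated callers share one promise,
// but a failed attempt is forgotten and can be retried.
function memoizeAsync<T>(task: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null;

  return () => {
    if (cached) return cached;
    const attempt = task();
    cached = attempt;
    attempt.catch(() => {
      // Only clear the cache if it still points at this failed attempt.
      if (cached === attempt) cached = null;
    });
    return attempt;
  };
}
```

Because a rejection handler is attached internally, a fire-and-forget warm-up call made before the app renders should not surface as an unhandled rejection; callers that await the result still see the error and can decide whether to call again.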

Moving on, we can do the same with DeepLinkHelper.init(). Even though it depends on the Session being loaded, it’s still possible to “warm up” this async task before it’s required in <InitialRoute />. It was also waiting on three other deep link handlers (CleverTap, AppsFlyer and react-native’s own Linking package), but sequentially. I made them concurrent to optimize it further. Keep in mind, though, that the first few seconds of app launch tax the hardware quite a bit: CPU, GPU, and RAM are all kick-started to dump huge payloads all over the place, so it may not always be easy to find gains via concurrency.

I’ve simplified the code once again, so expect it to be nested and not so easily laid out in the real world. Nobody wants to write unoptimized code. In this particular case it probably started off with Linking.getInitialURL(), where using async/await made sense, but as new requirements came in, developers just followed the established pattern and added one more async call.

It’s simply impossible to anticipate future problems and prematurely write complex, optimized code that actually scales well, but it is relatively easy to identify simple, unoptimized code once a problem actually arises as you scale.

// before
const linking = await Linking.getInitialURL();
const cleverTap = await CleverTap.getInitialUrl();
const appsFlyer = await AppsFlyer.getInitialUrl();

/* once all promises resolve, one with a url gets picked */
const url = linking ?? cleverTap ?? appsFlyer;
if (url) {
    handler(url);
}

// after
let handled = false;
const handleOnce = (url?: string) => {
    if (!url) return;
    if (handled) return;
    handled = true;
    handler(url);
};

/* first promise to resolve with a url gets handled */
Linking.getInitialURL().then(handleOnce);
CleverTap.getInitialUrl().then(handleOnce);
AppsFlyer.getInitialUrl().then(handleOnce);

And yup, just these two pipeline reordering steps were enough to get substantial performance wins in my local tests. We went from 2335ms to 1862ms, a ~20% improvement without actually touching any business logic! Also observe how Configuration.init() and DeeplinkHelper.init() are now running concurrently with other tasks. Even though some of the task durations have actually increased, there’s still a net improvement in overall startup time. A ~470ms win is quite noticeable even in debug mode and should stand out in production as well. That said, I wouldn’t read too much into the absolute figures from a local setup in debug mode; the story may be different in production. Foreshadowing.

A Gantt Chart with a title “App Startup Timeline (After)”. The raw data can be found under “Appendix” at bottom of the post in tabular format.

How to deploy scary changes?

I was afraid of releasing it as is. What if some race condition somewhere could break something completely different? Sure our app loads ~500ms slower presently, but at least it loads! Also, how do I truly quantify these results in production? Can we do some sort of A/B?

Well, this all happens before the session is loaded, so we need to bucket our users, or more accurately user sessions, on device. Fear not, Math.random() is here!

I refactored the above code into something like this to begin with:

/**
 * Housekeeping before we mount and render the root react-native component.
 * Use this to eagerly prefetch/initialize/warm critical I/O sources.
 * Make sure these initializations aren't blocking main-thread,
 * as that can lead to delays in startup times.
 */
bootstrap.init();

AppRegistry.registerComponent(appName, () => cp(Root));

This allowed me to do my shenanigans in an encapsulated bootstrap.init() method. I created a getLocalExperimentByPercent utility function that uses Math.random() to segment each session. I pushed my changes with conservative experiment values like 1% and regularly monitored the impact until I was confident about its stability. I actually found some bugs along the way. For example, this optimization breaks when the app is started by a notification, due to some race condition. It wasn’t hard to disable the optimization for that case, and we don’t care about startup performance for background notifications anyway.

/**
 * This is where we bootstrap all startup related dependencies.
 * Goal is to fire things asynchronously ASAP and then use them when they're actually needed.
 * To do this, we'll save the reference to promise and keep them idempotent.
 */
export class bootstrap {
  static config = {
    /**
     * warms up Configuration.init and DeeplinkHelper.init.
     * Observed benefit has been around 20-40%
     */
    shouldWarmUp: getLocalExperimentByPercent(1),
  };

  static init() {
    // ... other init tasks like setting up telemetry module, error reporting module, etc.
    const { shouldWarmUp } = this.config;
    /**
     * Skip the warm up for background app launches or when shouldWarmUp is disabled
     */
    if (AppState.currentState !== "active" || !shouldWarmUp) {
      return;
    }
    /**
     * Warm up Configuration.init for later use in App.
     * Make sure it is idempotent to avoid side-effects.
     */
    Configuration.init();
    /**
     * Warm up DeeplinkHelper.init for later use.
     * Make sure it is idempotent to avoid side-effects.
     */
    DeeplinkHelper.init();
  }
}

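/**
 * Simple util to provide local experiment system using Math.random
 * This won't be sticky, i.e. each session is unique as opposed to each user.
 * Declared as a hoisted function so that it is already defined when the
 * static config above is evaluated, even if both live in the same module.
 */
function getLocalExperimentByPercent(percent: number): boolean {
  return Math.floor(Math.random() * 100) < percent;
}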

Was it worth it?

So far, whatever I’ve shared with you has very little to do with react-native, platform APIs or third party dependencies. One may even wonder if any of this was worth it; after all, how efficacious could these ~100 lines of code truly be? Thanks to the telemetry already put in place by my predecessors, it wasn’t hard to put it to the test. The changes were deployed in late May 2022 as a CodePush update (no native release required!), and I couldn’t have been happier with the results. We actually saw P75 startup performance double. This directly translated to a 35% increase in conversions to our search page!

When I think about it, I guess for our users every bit of latency would have an exaggerated effect. If my auth handshake took ~400ms on my WiFi connection, it might even take ~1200ms in their highly congested marketplace with appalling 4G performance. Similarly, if my mid-range phone took under 500ms for most tasks, their phones would’ve greatly benefited from this pipeline reordering to better utilize the constrained hardware.

A bar graph showing msApp for 75th Percentile. Jan 2022–12.59K, Feb 2022–12.98K, Mar 2022–13.15K, Apr 2022–10.37K, May 2022–12.49K, Jun 2022–6,637, Aug 2022–5,841, Sep 2022–5,722, Oct 2022–5,443, Nov 2022–5,354, Dec 2022–5,335.

I performed my Digital Foundry-esque test once again, and yup, it was faster. We jumped a spot and were now in the top 3 apps from my weirdly curated list, with a ~400ms improvement. That said, other apps also improved and some even regressed, and the two tests were done a few weeks apart, so it’s probably not wise to read too much into it. However, our app still “feels” slow, despite being objectively faster. Clearly we need to work on perceived performance at some point.

https://medium.com/media/f826d2d303bf80c5b2828c02a4fc65aa/href https://medium.com/media/de4af8fe2d7ba3beb00e79c15ba93ed3/href

Promising stuff. Now let’s look at the P90 numbers, which were already more than a minute! These changes could’ve saved them minutes of their lives every day. As I excitedly selected P90 aggregate group on Mixpanel, the startup performance of my frowns was beyond measure. I was utterly disappointed.

A bar graph showing msApp for 90th Percentile. Jan 2022–59.27K, Feb 2022–59.47K, Mar 2022–62.74K, Apr 2022–54.53K, May 2022–77.38K, Jun 2022–90.89K, Jul 2022–79.05K

“No, this can’t be right, I made it worse for them?! Shouldn’t my hypothesis only get proven right at the fringes of our user base with even slower devices and slower network types?” I said to myself in despair. It was also curious to see how July numbers were better than June; this was my hint that these numbers were probably caused by events during June and not by the code directly.

Part 2: Question everything, even the evidence

I couldn’t believe it, so I dug deeper. I knew that the P99 numbers were absolutely nonsensical. Nobody would wait for our app to load for hours, so something must be wrong with our measurements. I approached this from multiple directions.

One of my colleagues recommended that I call some of the folks from this cohort. I requested access to their business phone numbers and shortlisted ~5 people who were witnessing >5 minute startup times in the last week. They had all sorts of concerns regarding promotions, delivery times and discovery of past orders, but they never mentioned app performance. When I specifically asked about app loading times, some said it’s alright, some said it’s okay, it loads up in 10–20 seconds. One said that even if it loads after 30 seconds it’s okay, but the order details page is loading very slowly for them. This was my first hint that our telemetry was reporting wrong numbers.
I looked at P90 closely over a span of a day, hour by hour in a tier 3 city. I was expecting to see a smooth line of ~60s, but that wasn’t the case. I saw numbers like 2.5s, 3.7s, 3.6s in most hours, but at some irregular intervals the number would spike up to 21s, and even 187000s (51 hours). These sporadic large spikes were turning an otherwise ~3s P90 number into an averaged 60s. 51 hours, damn, that’s over 2 days. These high peaks were not consistent: 12pm on one day, 3pm on another. All the cities I gathered data for showed similar trends.
We have other react-native apps under udaan that were forked from the b2b app at some point in the past. I looked at their P90 numbers, which are measured similarly, and they actually were much better. Something must be unique about our app.
I also computed order conversion rates broken down by app launch. I made cohorts of 0–1s, 2–3s, 3–5s, 10–60s, 60s+ startup times. One would assume that if the app loads in 3–5 seconds vs 100–200 seconds, the user just might not care to place an order in the latter case. But I didn’t find any correlation. In fact, it was better in the 10–60s group, which was also where the majority of the reported startup times were. This was my second hint that these numbers were all nonsense.
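To see why a handful of spikes can dominate an aggregated figure, here is a minimal sketch that compares the mean and the median of the hourly P90 values quoted above. The helper functions are illustrative only and say nothing about how Mixpanel aggregates data internally.

```typescript
// Hourly P90 startup times in seconds, taken from the values quoted above.
const hourlyP90Seconds: number[] = [2.5, 3.7, 3.6, 21, 187000];

/** Arithmetic mean: every outlier pulls it up proportionally. */
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Median: the middle value after sorting, largely immune to a few outliers. */
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

console.log("mean:", mean(hourlyP90Seconds), "median:", median(hourlyP90Seconds));
```

For these five values the median stays at 3.7s while the mean lands above 37,000s, which is the same shape of distortion described above: one bogus hour is enough to swamp an otherwise healthy series.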

A screenshot showing line graph for Mar 20, 2022 to Apr 5, 2022 for msApp, P90. It’s largely near 0 but with peaks going as high as 500M. There’s a table beneath it, showing numbers ranging from 3000–5000 and some in ~100 million. The average is 12.39M.

I started to wonder what could make a simple subtraction go so wrong. The way we measure this msApp value is that we record the time on the native side (t0) and pass it over the bridge to the JavaScript side. Then, once we reach a particular phase of our startup timeline, we subtract t0 from the current system time (t1) and send the result to our telemetry service.
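As a minimal sketch, the measurement boils down to the function below. It assumes both timestamps are epoch milliseconds, and the name computeMsApp is hypothetical; the sanity check is an addition for illustration, not something the original code is known to do.

```typescript
/** Milliseconds since the Unix epoch (an assumption about the timestamp format). */
type EpochMs = number;

/**
 * Startup duration: t1 (JS-side "now") minus t0 (native-side start time).
 * Returns null when the inputs cannot describe a real launch.
 */
function computeMsApp(t0: EpochMs, t1: EpochMs): number | null {
  const elapsed = t1 - t0;
  if (!Number.isFinite(elapsed) || elapsed < 0) {
    return null;
  }
  return elapsed;
}
```

The arithmetic itself cannot go wrong; the result is only as trustworthy as t0 and t1, which is where the questions below come from.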

🚨 React Native 0.72 adds performance.reactNativeStartupTiming with similar measurements, though the pitfalls shared ahead might still be relevant.

How can this number be so huge?

What if t0 is from a previous app session? This can explain a number like 51 hours.
What if t1 occurs much after the app actually starts? This can explain sporadic events.
How come my changes regressed this number? What if t1 is actually that huge in certain cases?

Case 1

After various experiments and a lot of logging, I found that each time a CodePush update is applied and the JS context refreshes, the startup time is counted again. The issue is that the native side never truly restarted, so t0 was stale. So if I started the app at 1pm, received an update at 1:30pm, and actually applied it at 2pm, t1 would be ~1 hour ahead of t0. This was easily fixed by adding an alreadyInvoked boolean on the native side that we flip to true after sending the initial t0 payload over the bridge. If JS requests t0 again, say due to a CodePush update, we can simply return early if alreadyInvoked is true, as we don’t have to measure this startup time. I did a small A/B using the above Math.random() system and yes, there was a substantial difference.

A bar graph comparing two sets of data labeled “Without Fix (seconds)” and “With Fix (seconds)” for Sep 6, 2022 — Sep 13, 2022. “Without Fix” has erratic numbers, minimum value of 73.83 and maximum value of 784.7. “With Fix” is largely stable with a downward trend, with minimum value of 15.38 and maximum value of 23.06.
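The real guard lives in native code, but its logic can be modelled in a few lines of TypeScript. This is only a sketch: the class and method names are hypothetical, and only the alreadyInvoked flag comes from the description above.

```typescript
/**
 * Models the native-side guard for illustration only.
 * The first request for t0 succeeds; later requests (for example after a
 * CodePush reload of the JS context) get null and should not be reported.
 */
class StartupClock {
  private alreadyInvoked = false;
  private readonly processStartMs: number;

  constructor(processStartMs: number) {
    this.processStartMs = processStartMs;
  }

  getStartTimeOnce(): number | null {
    if (this.alreadyInvoked) {
      return null;
    }
    this.alreadyInvoked = true;
    return this.processStartMs;
  }
}
```

Because the flag lives for as long as the native process does, a JS reload cannot reset it, which is exactly the property the fix relies on.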

Case 2

The second point also turned out to be true. I got this hint from the previously mentioned race condition I fixed by removing the optimization when our app was launched by a notification. So, t1 is measured when a particular component is mounted, but it doesn’t get mounted in case of a push notification. However, the native system and some parts of our bootstrap timeline do get executed. Imagine I receive a notification at 1pm, and I actually open my app the next day. This would make t1 roughly 24 hours ahead of t0. This explains 51 hours. Users may receive notifications over the weekend but they may open the app only when they open their shops on Monday, for example. The fix was simple: we skipped these measurements if the app was in the background, using AppState.currentState !== "active".

Case 3

What if t0 and t1 were truly that far apart? None of the above cases explain the regression that I might have caused with my simple optimizations. It was surely interesting to see that right when I was performing all these comparisons and aggressively logging the app, the P90 numbers increased. Wait a minute, do we send logs in debug mode!? Yup. That was it. Can we please have a moment of laughter at the stupidity of this? Each time I was investigating our app performance, or a developer was running the emulator, we were inflating the P90 numbers! If only we didn’t work this hard, the numbers would’ve been so much better! The fix was simple: I just added an early return using __DEV__.
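Taken together, the JS-side checks from the three cases can be sketched as a single predicate. The function and type names here are hypothetical, and the app state and debug flag are passed in as plain values so the sketch runs outside React Native; in the app they would come from AppState.currentState and __DEV__.

```typescript
// Simplified local stand-in for React Native's app state values.
type AppStateStatus = "active" | "background" | "inactive";

interface StartupContext {
  appState: AppStateStatus; // in the app: AppState.currentState
  isDev: boolean;           // in the app: the __DEV__ global
  msApp: number | null;     // null when native declined to hand over t0 again
}

/** Decides whether a startup measurement is worth sending to telemetry. */
function shouldReportStartup(ctx: StartupContext): boolean {
  if (ctx.isDev) return false;                 // Case 3: debug builds add noise
  if (ctx.appState !== "active") return false; // Case 2: background launches
  return ctx.msApp !== null;                   // Case 1: stale t0 after a JS reload
}
```

Keeping the checks in one place makes it easier to audit what is excluded from the metric the next time the numbers look suspicious.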

Did we get it right this time?

We pushed all these fixes under similar A/B experiments and took them to 100% by late September 2022. Now that we had spent so much time fixing our telemetry, it was time to observe the impact, and it was once again incredible. Note that these improvements are largely on paper, as users were never actually observing 51 hour long app launches, but it doesn’t hurt to have accurate telemetry the next time we do another round of optimizations. Also note that P75 numbers immediately improved in June 2022, but improvements in P90 only started to appear around August-September 2022. This is partly down to slow adoption rates at the fringes. The number came down to ~11s by the end of the year.

A bar graph showing msApp for 90th Percentile. Jan 2022–59.27K, Feb 2022–59.47K, Mar 2022–62.74K, Apr 2022–54.53K, May 2022–77.38K, Jun 2022–90.89K, Jul 2022–79.05K, Aug 2022–39.02K, Sep 2022–20.33K, Oct 2022–13.86K, Nov 2022–11.71K, Dec 2022–11.66K. A clear downwards trend, settling around ~11 seconds.

This also shows that the P75 improvements weren’t the result of fixing the telemetry. However, we did see the numbers slightly improve (~200ms) and stabilize as these fixes were rolled out, making them much more deterministic. And as you can see, we have been consistently performant for almost a year now.

A bar graph showing msApp for 90th Percentile. Jun 2022–194.8K, Jul 2022–78.6K, Aug 2022–39.06K, Sep 2022–20.26K, Oct 2022–13.5K, Nov 2022 to May 2023 are all around ~11K. A sign of stability and consistency of data.

Fun fact, we did upgrade from react-native 0.63.5 to 0.70.6 sometime in December 2022. My incredibly talented teammate Mihir Karandikar handled that migration for our app with ~300 screens. The adoption took ~4 months, during which he and Anupam Prakash, another exceptional developer, maintained two tracks of CodePush, one for 6.x and the other for 7.x versions of our app. This meant most of the business logic had to be deployed twice, for both Android and iOS. Luckily, we have a layer of separation between our business logic (ui-screens) and app shell (udaan-fe/app), so it was as simple as updating a version in package.json of the two git branches (maybe we should write a blogpost on the same).

As you can imagine, this was a huge collaborative effort between release managers and developers, and we finally made it by April 2023 when both iOS and Android apps had reached ~95% adoption. We benefited a lot by modernizing our codebase and several third party dependencies. This added much more polish to otherwise neglected parts of our app. It was definitely much needed, as 0.63.5 was released more than 2 years ago at the time of the upgrade.

However, simply updating react-native didn’t translate into a substantial startup performance gain (~400ms). This only validated my initial gut feeling that simply updating package.json dependencies ain’t gonna cut it. Hermes was definitely the exception though.

Part 3: Sometimes perception can matter more than reality

The proverbial hotel elevator story comes to my mind when we talk about perceived performance. This is how it goes:

A hotel manager was receiving complaints from guests that the elevators were running too slowly. She looked into speeding them up and installing destination dispatch, a multi-elevator optimization system, but the cost prohibited her from implementing either. The complaints continued to pour in and she was at a loss for what to do. She knew that she had to come up with a solution before the guests started leaving negative reviews. After much thought and consideration, she finally came up with an idea. She installed mirrors in the elevator. The guests, busy admiring themselves, actually were quite satisfied by the elevator’s speed, in fact they felt it was a bit too fast, if at all.

I don’t know the validity of this story, but it seems reasonable. In fact, it is common almost everywhere you have to wait. Magazines at the Dentist’s, cheery music in office lobbies, useful tips and player stats, or sometimes even squeezing through tunnels in video games as the engine loads up the next level.

Software usually masks loading using splash screens and animations. This was quite evident in BlinkIt’s startup timeline. Kudos to their team for doing a fantastic job. Though BlinkIt’s actual startup time is slower in absolute terms, the experience actually feels quite delightful and reasonably fast.

https://medium.com/media/e044b7c8737b0db3b9aa73008b76c8a1/href

I teamed up with our design head, Gaurav Sharma, and we deployed a simple yet effective startup animation. The effort was primarily in fine tuning the durations. We kept it long enough to cover our P75 startup duration, with an extended loading screen in case even more time elapsed. Apologies for using an emulator for the comparison and for minor stutters in the video, I couldn’t find time to find the old APK to do side by side comparisons on same resolution on real device, but you get the idea I guess.

https://medium.com/media/c4b163dc563fcba8734c8fb675ff39a6/href

It’s really hard to measure the impact of perceived performance as it’s purely subjective, but the general consensus within the team was that the app felt really smooth and fast after this change. One of our Program Managers shared this with me.

“The startup animation is really slick. The app genuinely feels so much faster. This was a great change!” — A happy udaan user

I think there’s a lot one can do with perceived performance. As we saw above, some of the apps loaded a second or two later than udaan, but still the perception was that the udaan app feels slower. Fixing this bias is very difficult, but there’s definitely a strong argument to be made here that investing time in a better loading experience can pay better dividends than squeezing out a few hundred milliseconds using complex engineering. At the same time, it isn’t a silver bullet. Ajio Business, for example, tries to mask its ~10s startup time with an elaborate animation, and I wonder how less than ideal that could be for someone opening the app multiple times a day.

Coming back to udaan, we achieved the above by converting an After Effects animation into an Animated Vector Drawable on Android, along with react-native-bootsplash. You can actually use different splash screens for different Android versions by simply naming your “values” folder differently. As a web developer I found this pattern to be pretty cool.

A screenshot of VS Code’s folder tree view. It shows two expanded folders named “values” and “values-v31”. Both of them contain a “styles.xml” file. “values” folder has 2 other files “colors.xml” and “strings.xml”, implying these are used by default, and “styles.xml” from “values-v31” overrides the “styles.xml” from “values” when API_31 is detected, i.e. on Android 12 and above “values-v31/styles.xml” is used and on Android 11 or below “values/styles.xml” is used.

Here’s a part of our styles.xml for android. Pretty straightforward. If and when we wish to improve this animation, all we’ve to do is update the drawable/animated_logo.xml file and push a native release. It would be fully backwards compatible and wouldn’t require segmenting users.

What’s next?

There’s a lot that we can still do, though I think we were lucky to discover the above pipeline optimization.

For starters, we have a lot of cleanup to do. A fast-paced, high-velocity culture naturally accumulates a non-trivial amount of tech debt. As previously mentioned, we have roughly 300 screens, though not all of those might be in use, and some could be clubbed together and refined. Such tasks usually have a high cost with seemingly low impact, but doing them for hundreds of screens definitely adds up. Reducing the JS bundle size should make JS VM init times faster. Reducing the number of screens should also improve navigation performance (react-navigation). Lazy loading those screens might help us further.

Similarly, using TurboModules should in theory lazily initialize every single native package that isn’t part of the critical path. Imagine a QR Code scanner package that’s buried 5 screens deep after launch. The 330ms was only for non-autolinked packages, the auto linked ones are in the majority.

Streamlining all the necessary network calls (auth session and data for first screen for example) can also help us a lot. I am not sure how fragile moving this to native would be, coz we suddenly lose the benefits of colocating data requirements and UI/business logic, but it is definitely something to think about.

And we haven’t touched any third party logic or business logic yet. What if we looked closely at how CodePush works and optimized its code paths? What if we explored an even better way to register our screens with priority (critical screens are registered immediately and the rest are registered after the fact)? What if udaan’s first page wasn’t an infinitely loading feed but a “launcher” screen, akin to what BigBasket or Swiggy does? What if the feed is pre-populated with cached data so that the user instantly gets to use our app? There’s so much one can do!

Closing Thoughts

We have reached the end. I started with poor understanding of React Native and am now left with substantially faster startup times with minimal business logic changes, reliable deployment via experiments, and even a non-broken telemetry system with consistent reporting. So if you too are scratching your head trying to improve your app’s performance, I hope this article shares some insights as to how to tackle this problem beyond the general “enable this flag, use that package, delete that code” advice. All of this was achieved by an intimidated and scared web developer who simply added log statements all over the place to understand where all the time was spent.

Here’s your tl;dr

Comparative analysis suggested that our app wasn’t actually that slow to begin with, but it “felt” slower.
Major bottlenecks were found in the JS side of our react-native app by simply adding logs, everywhere from MainActivity to first screen we rendered.
Two of those were asynchronous tasks, ~500ms each, that were running much later than they needed to.
Moving them to run right at the very start of JS’s control flow and reusing “cached” promises helped us reorder the startup pipeline. This is more akin to preloading assets in a network waterfall on a website. The change was ~100 lines without impacting any business logic.
Math.random() was used to segment user sessions on device, for a reliable, safer and gradual rollout. The improvements were pushed as CodePush updates, without any native release.
Several issues in the telemetry were found by analyzing hourly data for a city at P90. High peaks were causing large P90s:
CodePush was skewing our startup times due to rerunning JS while native timers were still intact.
Push Notifications were messing up with our startup times as the “start” and “end” were happening when the notification was sent and when the user actually opened the app (sometimes days later) respectively.
We were recording startup telemetry even in the debug mode, adding noise to high percentile figures.
Perceived performance was improved by using an animated splash screen. An argument can be made that this is as important, if not more important, as you start seeing diminishing returns after a certain threshold.

I also wanted to say that this wouldn’t have been possible without the environment created by my peers. I learned to build things in a way we can measure, to release things behind experiments, to do cost-benefit analysis before jumping to code, to problem solve one step at a time, to be driven to excellence and to use first principles, only due to the tech culture created by the folks of udaan. If I wasn’t pushed in the right direction (shout out to Kaushik Mukherjee for the same), we might have blindly gotten into blackhole tasks without any meaningful improvements. So if you are in a position to lead others, make sure you create a safe place where your peers can work and learn by experimentation without fear of judgement and come up with tangible, measurable solutions that don’t require you to perform an engine transplant while the car is still running.

Finally, I wanted to exclaim how awesome Web Development tooling is. You can just drop any URL into web.dev/measure and come up with more solutions than you can count on your hands in a matter of seconds. While React Native tooling may not be as robust as Web Development tooling, it’s only improving, thanks to the efforts of the team behind Maestro and other community members (FlashLight is the mobile Lighthouse-equivalent). I don’t have any tool to share, but I hope the caveats and general approaches I shared might help you solve problems with your app, and maybe even build a tool for the same!

Appendix

Here’s the raw data for all the Gantt Charts above. Note the absolute values don’t matter much. The order and relative length of tasks is critical in identifying spots for optimization.

https://medium.com/media/b9c31571aff437e612041b69fb730f96/href https://medium.com/media/d7f791bfaa6753e602a75c5e4500e487/href https://medium.com/media/815b679bcabe951b898e6ecb6d948fc6/href https://medium.com/media/916cdba0d5ebedbd951aef28a53a430c/href

Reacting to Change: Tale of a Web Developer Improving Startup Performance for a large React Native… was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.

Deviations from the norm

Kaushik Mukherjee — Sun, 13 Aug 2023 02:29:37 GMT

An object at rest will stay at rest, and an object in motion will stay in motion with a constant velocity unless acted upon by an external force.
- Newton’s first law of motion

An external stimulus is often necessary to change the status quo. Sometimes deviations occur without a deliberate impetus, because of environmental triggers: intensified competition, political upheaval and a pandemic are all examples of such extraneous impulses. Organizations tend to depend heavily on “metrics” to make sense of these changes. However, the purist in me is not always fully convinced by the attributions. Attributions are sometimes incomplete and, on occasion, incorrect or misleading. This, coupled with our need to interpret metrics based on our biases, can compound the problem.

Deviation over a timeline

For instance, while environmental awareness has increased in the past century, so has global warming. Does that mean a growth in environmental awareness increases the likelihood of global warming? Correlations tempt us to believe that actions lead to consequences (causation). Comprehensive evidence-based attributions are necessary to lead to sufficiently deterministic conclusions. Not convinced? Here’s another example: If you launch a new feature on your app that replaces an existing feature in a similar real estate (say the top section of the app), just by virtue of it being new, the feature will end up getting a lot more engagement. That does not imply that the feature is inherently better than the one it replaced. Over time, it is quite possible that the new feature’s engagement tapers off as the novelty wears off, and it becomes as good or worse than the feature it replaced. This phenomenon represents a regression towards the mean.

While it is easy to pontificate, it is in fact incredibly hard to accurately identify and attribute a root cause to a change. In this context, would it not be easier if a framework or tool could unravel some of these vagaries?

An illustration

Imagine this simple use case of adding an item to the cart in the context of an e-commerce app.

Different paths to get to the cart

The illustration above considers a destination like the cart and the various paths through which one can get to it. In this case p1 is the busiest or most popular path, followed by p2 and p3, hence p1>p2>p3. Furthermore, regardless of what the external circumstances were, the norm would continue to be p1>p2>p3. There could of course be changes in the ratios, but the overarching relationship would not change.

Now imagine a change was introduced in the system that deviated from this norm, for example, if the relationship p1>p2>p3 no longer held true or the ratios of p1:p2:p3 changed. Such deviations are usually very hard to identify let alone attribute unless they are obvious. The obvious ones would be removing the add to cart button from the search page or an outage on the product page. Maybe there was an additional path p4 that was introduced?
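To make the idea concrete, here is a minimal sketch of how such a deviation could be flagged from per-path counts. Everything in it (the function names, the 5% tolerance, the shape of the input) is an assumption for illustration and not how any udaan tool actually works.

```typescript
type PathCounts = Record<string, number>;

/** Path names sorted from busiest to quietest. */
function rankPaths(counts: PathCounts): string[] {
  return Object.keys(counts).sort((a, b) => counts[b] - counts[a]);
}

/** Share of total traffic per path, each between 0 and 1. */
function shares(counts: PathCounts): Record<string, number> {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const result: Record<string, number> = {};
  for (const [path, n] of Object.entries(counts)) {
    result[path] = total === 0 ? 0 : n / total;
  }
  return result;
}

interface Deviation {
  orderChanged: boolean;  // e.g. p1>p2>p3 no longer holds
  newPaths: string[];     // e.g. a newly introduced p4
  shiftedPaths: string[]; // paths whose share moved by more than the tolerance
}

function detectDeviation(baseline: PathCounts, current: PathCounts, tolerance = 0.05): Deviation {
  const baseShares = shares(baseline);
  const currentShares = shares(current);
  const newPaths = Object.keys(current).filter((p) => !(p in baseline));
  const shiftedPaths = Object.keys(baseline).filter(
    (p) => Math.abs((currentShares[p] ?? 0) - baseShares[p]) > tolerance,
  );
  // Compare the ranking only over paths present in both snapshots.
  const common = (paths: string[]) => paths.filter((p) => p in baseline && p in current);
  const orderChanged =
    common(rankPaths(baseline)).join(">") !== common(rankPaths(current)).join(">");
  return { orderChanged, newPaths, shiftedPaths };
}
```

A check like this only says that something deviated. It does not say why, which is the harder attribution problem.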

While there are tools available that can figure out the anomalous patterns with some degree of accuracy, the attribution as to why this occurred or what caused this change needs analysis that is almost always very time consuming.

Further dissections

Despite the advances in software technology, AI, networking, data engineering…we still do not have efficient mechanisms to make attributions conclusive. It’s not as though the benefits of proper attributions are not understood. Yet, we uneasily accept the output metrics, often relying on our intuition to tell us if the analysis done was sufficient. In my view, the reasons behind this could be the following:

1. Lack of proper coverage when it comes to instrumentations

2. Incomplete data governance

3. Limited analytical tools that focus on the problem of attributions

A possible solution

Given the problem statements above are fairly universal, even more so in an ecommerce organisation where there are many different influencing factors, both internal and external, udaan considered building a tool that attempted to reduce some of these ambiguities.

The way we went about doing it was to figure out event-based dependencies and have them both instrumented and observed. For instance, if an event like tapping a button led to adding to the cart, the source of that event generator would be captured. As a result, attributions started to become clearer. Later, relationships were graphed and pictorially represented. The next logical step was to integrate all of this into a tool. We realised there wasn’t a tool which addressed all of the concerns listed above, so we went about building something in house. We called it percept insight.
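As a rough illustration of what capturing the source buys you, the sketch below tags each event with where it was generated and then groups a given event by source. The event shape and function name are hypothetical and are not percept insight’s actual data model.

```typescript
interface TrackedEvent {
  name: string;   // what happened, e.g. "add_to_cart"
  source: string; // where it was triggered from, e.g. "search_page"
}

/** Counts occurrences of one event name, grouped by the source that generated it. */
function countBySource(events: TrackedEvent[], eventName: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of events) {
    if (event.name !== eventName) continue;
    counts.set(event.source, (counts.get(event.source) ?? 0) + 1);
  }
  return counts;
}
```

Without the source field, all add-to-cart events would collapse into a single number and a shift between entry points would be invisible.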

I will perhaps discuss percept insight (pi) in a separate blog later, as this post isn’t about pi. What pi helped with in this context, though, was showing the variances in relationships as new features got introduced. We would see some features cannibalising existing ones while the net result remained unchanged. While this was no doubt interesting, what it also told us is that no specific feature alone can move the needle significantly. It takes time for habits to build, for features to gain acceptance and for other features to wear out. But we already knew all of that, didn’t we? pi also showed us exactly how those feature sets changed over time and allowed clear attributions to form, leading to more specific actions.

For example, let us assume that we start seeing a sudden dip in orders placed from a certain geography. This could be for a plethora of reasons:
Was there a sudden change in pricing? (internal or external factor)
Are the warehouses in that region not operational? (external factor)
Did the tech infrastructure in that geography have issues? (internal factor)
Was there a new feature or workflow that got introduced recently? (internal factor)

There could be others as well….

pi is powered by netra, udaan’s observability platform. Built on top of it are data structures that allow for specific attributions, which lead to better explainability. How? That is a topic for later.
For now, though, I am happy to have a little more clarity into what is actually causing the deviations from the norm.

In conclusion

In an environment where variances are large, it becomes important to have leaf-level events instrumented and observed. Also, the input metric definitions should add up to the macro output metrics so that deviations can be observed and attributed accurately. While it is good to conduct A/B experiments and the like, there are inherent issues in looking at metrics in isolation. Also, unless it is a long-running A/B test, the attributions can be suspect due to recency biases.
Lastly, a tool or framework that takes care of instrumentation, visualisation and detection of these deviations helps in ensuring proper explainability.
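
To make the point about input metrics adding up to the output metric concrete, here is a minimal sketch in Python. It is not how pi works internally; the function name, the geography labels and the order counts are all made up for illustration. It assumes the macro metric (total orders) is the plain sum of the leaf-level inputs (orders per geography), so each leaf’s change accounts for a share of the overall deviation.

```python
from typing import Dict, List, Tuple


def attribute_deviation(
    baseline: Dict[str, float], current: Dict[str, float]
) -> List[Tuple[str, float, float]]:
    """Split the change in a summed output metric across its leaf-level inputs.

    Returns (leaf, delta, share_of_total_delta) rows, largest absolute delta first.
    """
    leaves = sorted(set(baseline) | set(current))
    deltas = {
        leaf: current.get(leaf, 0.0) - baseline.get(leaf, 0.0) for leaf in leaves
    }
    total_delta = sum(deltas.values())
    rows = []
    for leaf, delta in deltas.items():
        # Share is undefined when the output metric did not move at all.
        share = delta / total_delta if total_delta else 0.0
        rows.append((leaf, delta, share))
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows


# Hypothetical daily orders per geography (illustrative numbers only).
baseline = {"north": 400, "south": 350, "east": 250}
current = {"north": 395, "south": 240, "east": 255}
report = attribute_deviation(baseline, current)
```

With these made-up numbers the first row of `report` is the south, which is where questions like the ones listed earlier (pricing, warehouses, infrastructure, new features) would be asked first. The decomposition only works because the leaves sum to the output; a metric defined as a ratio would need a different breakdown.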

Deviations from the norm was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.

PerceptInsight — Intelligent Insights Platform | Release

Sai Sharan Tangeda — Wed, 14 Jun 2023 10:29:21 GMT

PerceptInsight — Intelligent Insights Platform | Release Note

PerceptInsight (or pi) is an Intelligent Insights Platform that provides real-time insights across various cuts and dimensions of your customers’ journey. pi not only helps you understand your users, but also lets you act in near real time on the basis of those insights and data.

Why pi? What sets us apart?

There are multiple alternatives and competitors in this market. So why did we have to come up with another product? What sets pi apart?

Performance: pi is up to 2x faster than competitors.
Data Availability: Unlike existing competitors, we are available across all major cloud providers and cloud regions in the world.
Security: We take data security very seriously, and to that end we ensure that our enterprise customers have visibility into where we store their data. We take pride in our “Control Plane - Data Plane” architecture, which allows our customers to keep data in their own cloud or on-premises, thereby avoiding compliance risks.
Affordability: We are very competitive on pricing and aim to keep it affordable for all start-ups. Fine-grained control over the scale of events allows start-ups to leverage our pay-as-you-scale model.
Intelligence: We are ahead of the curve in leveraging cutting-edge technology to provide a better and more seamless experience to our customers. To that end, our first milestone is the beta release of the “Interact” capability (more about it below).

Vision

Our vision is to create an affordable, easy to use, cutting edge, real-time analytics platform for everyone. We aim to truly bring the power of analytics to the decision makers and domain experts.

For data administrators, our constant effort is toward improving the governance of the analytics eco-system, providing simpler ways to instrument, manage and publish events from any application.

With the dawn of the new AI era, a lot of previously complex problems are now solvable with LLMs, so we are leaving no stone unturned to improve the analytics experience for our customers.

Features

1. Insights

Insights help our customers understand the trends over various dimensions and cuts.

Beyond trends, Insights also enable easy visualisation of data with a powerful charting library.

https://medium.com/media/77cda235c9bda59e9c2f4b13d673a02c/href

2. Funnels

Funnels are the simplest way to understand user conversion at each step. With integrated breakdowns, you have visibility into the user journey across different demographics.

Funnels

3. Cohorts

Segment your users based on any property to build sophisticated user cohorts. Cohorts are evaluated in real-time based on the filter or condition thus keeping you up-to-date about the user behaviour.

Cohorts make experimentation on groups of users very easy. With cohorts integrated into Insights, you can easily monitor the trends for a single cohort of users.

cohort builder

4. Dashboards

Dashboards is designed to streamline collaboration and enable efficient access to critical data, making it the ultimate one-stop destination for all your reporting needs.

Dashboards will enable you to create stories about user behaviour and improve data driven business decisions.

5. Interact (Private Preview)

Interact is our cutting-edge and most recent capability that revolutionises the delivery of reports.

Interact enables real-time generation of Insights from natural language! Yes, that’s right. All you have to do is ask the platform, and we get you your reports.

We strongly believe that the power of interacting with data using natural language will improve visibility and decision-making for business and ops teams, and Interact is the first step towards that.

Example of Interact

Apart from the ones we have already released, we have more exciting features in the pipeline, dedicated to providing a more seamless analytics experience for our customers.

Check it out!

Visit our website (https://perceptinsight.com) to know more about pi.

Try out our platform at https://demo.perceptinsight.com.

Reach us at sales@perceptinsight.com for more information or a free demo.


Securing udaan with HashiCorp Vault

Sai Sharan Tangeda — Wed, 17 May 2023 17:29:29 GMT

At udaan, we had always relied on Kubernetes secrets for storing application secrets, which were accessible by our Infrastructure Gatekeepers. But as the use cases and the number of engineers grew, it became more and more challenging to continue with this process.

We started observing scaling challenges that affected productivity and maintainability of the systems, such as

Handling cloud secrets expiry
Manual rotation of secrets
Granting database accesses to engineers
Local secrets sharing
Updating secrets across all micro-services

These shortcomings prompted us to explore and leverage HashiCorp Vault. Vault’s documentation (https://developer.hashicorp.com/vault/docs) is descriptive and detailed enough to help anyone get started. Therefore, in this blog we will skip the standard Vault setup and focus on how we operationalised Vault at scale at udaan, along with practices around securing Vault.

1. Authentication & Authorization

For authentication and access control, we classified the different ways of accessing Vault into the following categories:

a. User involved login

b. Containerised Services & Jobs

1.1. User Involved Login

Vault supports many authentication methods such as OIDC, GitHub, LDAP, tokens etc. We used OIDC backed by Azure AD to enable authentication and authorization.

1.1.1. Access Control

Once the Azure Service Principal was integrated, we were easily able to integrate Azure User Groups into Vault and configure each as an External Group. To further simplify the creation and integration of an external group, we wrote a shell script that uses Vault’s CLI to auto-generate the External Group in Vault.

1.2. Containerised Services & Jobs

Vault provides multiple ways to integrate with Kubernetes, containerised systems, Databricks jobs or traditional VM-based deployments.

We wanted to adopt a single environment-agnostic solution, which led us to the AppRole authentication method. Through this, we are now able to completely isolate secret access for micro-services from the type of compute they use.

1.2.1. Vault for Kotlin & Python

Once the auth method was locked in, we wanted the integration into our application framework to be very straightforward for our engineering teams. Hence we built our own client SDKs for Vault, which could:

a. Cache secrets based on use case to avoid repeated API calls. Unless real-time secret updates are needed, this is recommended for all use cases.

b. Seamlessly fit into different environments like Local, Dev, Stage & Production without any code change.
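
A minimal sketch of the caching idea in (a), in Python. This is not udaan’s SDK: the class name, the `fetch` callable and the default TTL are assumptions made for illustration, and `fetch` stands in for whatever call actually reads a secret from Vault.

```python
import time
from typing import Callable, Dict, Tuple


class CachedSecretReader:
    """Caches secrets per path for a fixed TTL to avoid repeated Vault API calls."""

    def __init__(
        self,
        fetch: Callable[[str], dict],           # hypothetical: reads one secret from Vault
        ttl_seconds: float = 300.0,             # assumed default, tune per use case
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}

    def read(self, path: str, force_refresh: bool = False) -> dict:
        now = self._clock()
        cached = self._cache.get(path)
        # Serve from cache while the entry is younger than the TTL.
        if not force_refresh and cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        secret = self._fetch(path)
        self._cache[path] = (now, secret)
        return secret
```

Callers that need real-time secret updates can pass `force_refresh=True` (or a TTL of zero); everything else gets at most one Vault read per path per TTL window.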

1.2.2. Secret Engines Setup:

For isolating secrets and permissions across environments, we created four secret engines namely kv-local, kv-dev, kv-stage, kv-prod. Based on individual organizational needs, this can change.

Once this was done, the next step was isolating permissions and providing space for the plethora of services hosted within udaan. For this we wrote another shell script, as part of our Vault IaC, that automates the creation of secrets and corresponding AppRoles for every new micro-service that is built.

https://medium.com/media/6d46b448200d763028c2d78044686dff/href

Based on an organisation’s policy, we can configure which Azure Groups can access secrets in each of the four secret engines. For example, kv-prod may contain sensitive secrets that are shared only on demand, whereas other environments can be configured in a more lenient fashion.

We also set up a common secret in each of these engines, containing keys that are used by multiple services; these could be the keys to your private Maven repository, PyPI server, etc.

Now that the secret engines are set up, it’s just a matter of having a client that allows the dev team to seamlessly integrate Vault into their applications while being infrastructure and environment agnostic.

1.2.3. Vault Clients:

To achieve this, we built our own Vault clients that abstract away the process for the development team. The code snippet below shows how we created an environment-agnostic Vault client object.

https://medium.com/media/a8dedc73c70f8147e049ffe8865fe875/href

In a local environment, when VAULT_APP_ROLE_TOKEN is not available, the user’s Vault token kicks in automatically, allowing the dev team to run applications in test/local environments without any configuration or code change.
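
The fallback described above can be sketched roughly as follows. This is an illustrative Python sketch, not the client from the embedded snippet: `VAULT_APP_ROLE_TOKEN` and the four secret engine names come from this post, while the `APP_ENV` variable, reading the user token from `VAULT_TOKEN`, and the `VaultAuth` type are assumptions.

```python
from dataclasses import dataclass
from typing import Mapping

# Secret engine per environment, as named earlier in this post.
SECRET_ENGINES = {
    "local": "kv-local",
    "dev": "kv-dev",
    "stage": "kv-stage",
    "prod": "kv-prod",
}


@dataclass(frozen=True)
class VaultAuth:
    mode: str   # "approle" for services and jobs, "user" for an engineer's own login
    token: str  # opaque credential handed to the Vault client
    mount: str  # secret engine to read from


def resolve_vault_auth(env: Mapping[str, str]) -> VaultAuth:
    """Pick credentials and secret engine from the environment, with no code change."""
    environment = env.get("APP_ENV", "local")  # assumption: hypothetical variable name
    mount = SECRET_ENGINES[environment]

    app_role_token = env.get("VAULT_APP_ROLE_TOKEN")
    if app_role_token:
        return VaultAuth("approle", app_role_token, mount)

    # Assumption: the engineer's own token is exposed via VAULT_TOKEN.
    user_token = env.get("VAULT_TOKEN")
    if user_token:
        return VaultAuth("user", user_token, mount)

    raise RuntimeError(
        "No Vault credentials found: set VAULT_APP_ROLE_TOKEN or log in as a user"
    )
```

In an application this would typically be called as `resolve_vault_auth(os.environ)`; taking the mapping as a parameter keeps the decision easy to test without touching the real environment.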

2. Credentials Management & Password Sharing

To avoid password sharing between employees over different channels, we moved the responsibility for all secret creation and sharing to Vault.

2.1. Postgres DB Management:

We leverage Vault’s Database Secrets Engine capability to manage our Postgres Databases.

Vault ensures dynamic secret creation/revocation while providing us the capability to manage TTLs and rotation periods.

Example of Dynamic Role

For dev databases, we allow a more lenient 90-day secret rotation period, whereas production databases have read-only and read-write access profiles with a 1-2 day secret rotation.

Even with secret rotation, it is important and recommended to set up firewall rules on your databases to allow connections through trusted networks only.

This process has allowed us to maintain an audit of all database accesses and helped us move away from the insecure method of sharing passwords over threads.

2.2. Service Principals & Azure Credentials:

Creating Azure Service Principals and following up on their timely secret rotation is a huge hassle; even a single miss could cause a service disruption across systems. Hence we now leverage Vault’s Azure Secrets Engine, which allows our applications to lease permissions for accessing Azure resources while managing key rotation and dynamic access revocation at the micro-service level.

We configured each service/job to have its own AppRole, which is configurable with our IaC script, and we manage the permissions needed by this AppRole via Vault. Once this setup is done, as shown below, Vault seamlessly manages and leases credentials for accessing Azure.

Image Credit: https://developer.hashicorp.com/vault/tutorials/secrets-management/azure-secrets

3. Reliability & Security of Vault

Now that Vault has streamlined and is managing all our operations, the most important problem left to solve is securing Vault itself against abuse and vulnerabilities.

3.1. High Availability & Storage

We configured Vault in High Availability mode (https://developer.hashicorp.com/vault/docs/concepts/ha) with enough redundant deployments across geo-regions to ensure availability even during a major network or datacenter failure.

Unseal keys, which are required to unlock (unseal) a Vault instance, should be distributed strategically as well. They should be kept safely with the gatekeepers so that Vault can be unsealed again in case of a disaster.

3.2. Firewall & Network Security

One can never be too safe, so we have taken the important extra step of ensuring that Vault is accessible only from within our private, trusted networks.

This adds a further layer of protection against unauthorised access even if unseal keys or tokens are leaked.

4. Conclusion

There are several ways to adopt HashiCorp Vault. Adoption strategies will vary with organisational needs, and what we have shared above is one such way. As Vault and our use cases evolve, we will continue to improve and further optimise our processes.

Feel free to reach out to us with any questions or to discuss the HashiCorp Vault implementation in further detail.


What is agility in tech ? (Doing 10 deployments an hour @udaan )

Kaushik Mukherjee — Tue, 16 May 2023 03:06:44 GMT

Prologue

I am often stumped when confronted with the classic “what’s agility like in tech?” type of question. While I am fully aware of the level of agility and the rigorous control over metrics at udaan, as well as at some of my previous stints leading engineering and product orgs, the answer to such a question is almost always never straightforward. Perhaps because the question itself is ambiguous.
Agility in what? Solving a complex problem, adapting to a dynamic environment or producing a blueprint? Or is it about agility in coding, in design, in build times, or in release cycles? What about quality? Can it all be extrapolated to a lucid number?

Startups sometimes consciously choose to compromise on certain software engineering principles for the sake of speed. But, it is equally important to remember that stacking up debt can only result in bankruptcy over time.
Then there is the entire gamut of product debt — abandoned code from failed experiments that continues to live in the system. This mounting debris can cause problems with readability, maintenance and performance if left unaddressed.
The question, hence, is how to stay agile and yet ensure sanity.
BTW, this blog isn’t about tech or product debt, although they do influence agility. It is mostly about what needs to be done to remain agile despite incurring debts all the time, while making ~10 deployments an hour.
But first, let us chat first principles and talk a little bit about SDLC.

What is SDLC?

Building a house requires a set series of stages: site preparation, floor slab, framing, etc. Similarly, in software engineering there are 7 stages to building out an end product. Now imagine building out tens of hundreds of those end products that are all interconnected and talk to each other. This requires intricate orchestration of the stages and seamless replication of the same. A smart SDLC mechanism enables just that.

Stages of building a house

SDLC, or Software Development Life Cycle, consists of a set of tools and processes that assist in producing software of the highest quality at the lowest cost in the shortest time possible. SDLC provides a well-structured flow of phases that helps an organisation quickly produce high-quality software that is well tested and ready for production use.

So what are those phases?

Plan — in this phase, thorough research is conducted on the product. Then, depending on the span of the product, cross deliberation is required across sub orgs.

Pros and cons of the current processes, software, methods get identified. An outcome of this would be a Software Requirement Specification (SRS).

Design /Code / Build — Once the SRS is completed, design considerations (as applicable) come next including how the design will cater to the requirements.

From here on it’s about coding and building: the implementation phase.

Test — Once the product is developed, the software development life cycle testing phase follows. Traditionally, this was done by a QA team but more evolved organisations leverage various tool sets that allow a developer to do things end to end. More on this in later sections.

Release — The tested product is rolled out into a different environment. Let’s call it staging or pre-prod. Now it’s time to see if this product, when exposed to this different environment, functions as expected. A/B tests are also done to validate hypothesis.

Deploy — Once all the errors are removed, the product is rolled out to the market.

Monitor — After deployment, there is an observation phase wherein the market reacts to the product. Sometimes A/B tests are conducted here as well. Based on the feedback received, improvement analysis is conducted.

Operate — The software is now achieving what it was targeted for, and is responding to the feedback cycle. “Does the software version need an upgrade? Are new features needed? Should the interface be simpler and more intuitive?” And so on.
Every team then replicates these steps using some methodology. It could be waterfall, agile etc. Methodology is out of scope for this blog.

Why was doing SDLC the right way important for udaan?

E-commerce in itself is a complex domain. Hundreds of micro-services across thousands of API endpoints need to intricately orchestrate across multiple subsystems for the platform to function 24x7, 365 days a year.
Now imagine building such systems to power the fastest to unicorn company in India!

Photo by Ronnel Ramos on Unsplash

It is very easy for chaos to reign if building systems the right way (SDLC) is not thought of at the inception stage.
At udaan, as systems continued to increase in complexity, as well as volume, it became important to ensure there was a methodical approach where:

Productivity could be measured conclusively
Capabilities were built that allowed us to scale without compromising on quality
The understanding of the overall quality of the digital artefacts deepened

To put things a little more into perspective, while the number of services at udaan grew by 41% Y/Y, and so did our commits, our Speed to Stability (S2S) ratio saw an improvement of 1%.

Speed to Stability ratio (S2S) is quite simply (# production deployments - # system degradations and outages) / # production deployments

This metric, as the name suggests, provided a sense of the true velocity. For instance, while the number of deployments could be looked at separately, an increase in the number of deployments accompanied by a decrease in stability was suboptimal, and vice versa.
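
As a quick illustration of the ratio, here is a small Python sketch. The function name and the monthly numbers are hypothetical; only the formula comes from the definition above.

```python
def speed_to_stability(deployments: int, incidents: int) -> float:
    """S2S = (production deployments - degradations and outages) / production deployments."""
    if deployments <= 0:
        raise ValueError("deployments must be a positive count")
    if incidents < 0:
        raise ValueError("incidents cannot be negative")
    return (deployments - incidents) / deployments


# Hypothetical months: B ships more often than A but is less stable.
month_a = speed_to_stability(deployments=200, incidents=4)
month_b = speed_to_stability(deployments=300, incidents=15)
```

Looking at deployment counts alone, month B looks better (300 against 200); the ratio drops from roughly 0.98 to 0.95, which is the suboptimal case described above.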
Now, an improvement in S2S would in all probability be possible if the following areas of software development were actually improving:

A predictable CI/CD pipeline
Clearer contracts between systems
Intelligent tools that continuously reduced the likelihood of a broken build / deployment
Observability, alerting and monitoring that enabled a faster MTTI (mean time to identify)
And smart processes and tools that allowed a faster MTTR (mean time to resolve)

SDLC at udaan

Before I go into the details of how we went about implementing SDLC at udaan, I want to spend some time discussing the landscape.

SaaS solutions are amongst the fastest growing segments in the software industry. While a majority of this pie is held by Cloud offerings like Infrastructure (GCP, AWS etc) the SDLC tools and Dev-Ops are showing significant growth as well.

Image credit : Better Cloud https://www.bettercloud.com/wp-content/uploads/sites/3/2017/05/2017stateofthesaaspoweredworkplace-report-1.pdf

This is an indicator of how organisations are adopting some of the latest Dev-Ops and SDLC tools to remain competitive. According to Gartner, the SaaS industry will continue to grow in 2022 and beyond. The Business Research Company predicts SaaS will grow from 270B USD in 2022 to 435B USD in 2025.

Companies worldwide have started leveraging SaaS in a huge way. Of course, the adoption has been accelerated significantly by the pandemic. Adoption of Dev-Ops and SDLC tools offered as SaaS has been especially high among small and medium-scale organisations.

So why is this upward trend in using tool sets for SDLC emerging?

The stages of SDLC as explained at the start of the blog need to be carried out in order to ensure the right outcomes are being engineered. Traditionally organisations have invested in people as gatekeepers for each of the stages. As systems and complexities grow it requires more and more people to be deployed in order to maintain the sanity of the overall outcomes.

This has two major problems:

The quality of each stage now becomes dependent on people and is therefore susceptible to human error.
As complexities grow, they start impacting execution speed.

All of this has a compounding effect and leads to a significant negative impact on velocity.
What is required, hence, is to bring in systemic interventions in all of these stages and build a platform for engineers that enables them to focus on what they do best, i.e. write awesome code and design systems that evolve with emerging requirements. This removes distractions and automates the mundane, repetitive tasks as far as possible, so engineers can focus on their craft and have a great time doing it.

Achieving upwards of 10 deployments / hour

At udaan, various frameworks and tools were interspersed throughout the SDLC.

Here is a peek into all the tools that were either built or customised internally (except for a few which are paid)

Each of the tools above is an important cog in the SDLC wheel.
Together they continue to help increase S2S at udaan. For example, the SQL dashboard has a tool called snorql that monitors and diagnoses SQL-related problems. It helps engineers write durable queries that scale with time. The engineer does not need to go back and fix things as requirements evolve because snorql, apart from implementing best practices, also provides recommendations on how to optimise a given query.
Or, for example, the Netra tool alerts on and explains possible system degradation, which business metric might be impacted as a result of that degradation, and which of the hundreds of services might have gone awry. This helps in reducing MTTR as well as stopping faulty deployments.
Or, for that matter, the binary compatibility detector ensures binaries are backward compatible. If not, it fails the build, thus saving time and the heartache of an inevitable production disaster. There are posts on what these tools do on the udaan engineering blog.

Guiding principles

Create a measurable plan for each step of the SDLC cycle
Incorporate incidental human learnings into systems, making them smarter
Have a genuine interest in enhancing engineer experience

Creating a measurable plan:
Each track in the SDLC was deliberated and we almost forced ourselves to put metrics on the vectors, like so:

Tracks in SDLC

This results in honest conversations on current states and measurable movements thereof.

Learnings from past mistakes:
This aspect is quite fundamental to udaan and I highly recommend it regardless of what stage of the maturity curve an organisation is at. Being the pioneers in e-B2B commerce, udaan did not always have a playbook to follow and often had to create one. Thus, mistakes were made, and that was completely OK, but it was extremely important to learn from them and innovate.
Not only is udaan diligent about RCAs, the RCAs at udaan are also guiltless. The focus is on the incident and not the individual. There is analysis around “how did this issue escape the current guardrails?” and “what can we do to avoid this in the future?”
This brings valuable insights that go into creating some of the most effective tools and processes in-house. Most action items from the RCAs at udaan make their way into one of the tracks of SDLC described above. This means continuous systemic improvements that avoid similar mistakes without the need to rely heavily on processes.

Caring about engineering experience:

udaan encourages curiosity, questioning everything and critical thinking

Fellow engineers regularly talk to each other about technology, its advancements and our tech stack, and do not hesitate to provide feedback when they feel something is not working, regardless of who they are talking to (or at least that is what I would like to believe). Any problem that looks like a recurring pattern is a fair candidate to be systemised. Also, the culture of surveys and feedback is passionately nurtured. Every tool that gets developed in-house goes through both qualitative and quantitative feedback to measure its effectiveness. Sometimes the feedback can be brutal, but it leads to honest conversations around what is working and what is not, which in turn leads to creating frameworks that promote efficiency, creativity and productivity.
Also, there are no artificial walls. Any engineer can contribute to the SDLC tech stack. If engineers see a problem, there is a good chance they will build a tool to solve it, and it will go right into the SDLC stack for others to use. There are several special interest groups inside the organisation, for example on SQL, Redis and Cosmos, to name a few. Many cool tools have emerged out of these groups.

SDLC stack maturity at udaan

What’s next?

Tech at udaan is FOSS-first when it comes to consumption.
Likewise, I feel that the tech team has produced useful artefacts that can be leveraged by the larger community, so we have started open sourcing some of what we produced. snorql is one such example, which has already started seeing contributions from the community. Several other tools are in the works towards being open sourced.

That said, there are still unexplored areas that lie ahead to transform the landscape, towards building intuitive software applications. Especially so, when it comes to replicating real production behaviour, complete with its fair share of jitters, network partitions and varied handheld device responses based on their configurations. Last but not least, there is the largely uncharted, hallowed ground of self healing mechanisms that we will soon explore.


Index redundancy analysis on steroids

Prakul Jain — Sun, 07 May 2023 17:07:09 GMT

Revolutionising the Process of Identifying Redundant Indexes in SQL Server Databases

redundant index
index that becomes useless and unnecessary due to the presence of another index or set of indexes.

Have you ever heard of the saying “too much of a good thing can be bad”? Well, that applies to SQL Server indexes as well. I mean, sure, you can create as many non-clustered indexes as you want, but why would you? It’s like trying to impress your crush by wearing ten layers of cologne — it’s not gonna work and it just makes things worse.

Index overload: Finding the real culprit behind your sluggish SQL database

Double Trouble

The Perils of Duplicate Indexes in SQL Server

In SQL Server, it’s possible to create duplicate indexes on the same object. While everything from the index keys to the properties can be identical, this practice comes with several drawbacks.

Duplicate indexes take up extra storage space. And who likes to waste precious storage space? Not us! By performing an Index Redundancy Analysis (fancy words for analysing for redundant indexes), we were able to identify redundant indexes and save almost 10% of storage space. That’s like finding a hidden 400GB treasure chest in your 4TB database!
Redundant indexes can slow down DML statements (inserting, updating, deleting data). Imagine having to update multiple copies of the same index every time you make a change. Talk about a time-waster! But fear not, deleting redundant indexes can improve performance and make your database feel like the Incredible Hulk. Plus, you’ll save money on the CPU cores you can reduce.
Rebuilding or reorganising indexes can be a real pain with duplicate indexes. By getting rid of redundant indexes, you can speed up this process and get back to more important things (like binge-watching your favourite show).
Having multiple indexes on the same object can make it more difficult for the query optimiser to do its job. And nobody wants a cranky optimiser! By simplifying your indexes, you can make the optimiser’s life easier and improve query performance.

3 Shades of Redundancy

The Index Dilemma in SQL Server

Indexes can be considered redundant even if they are not identical. In an index, columns are divided into two categories: Index Columns and Include Columns.

Don’t Play Jenga with Your Indexes: Why Order Matters!

In SQL Server, the order of index columns is critical. SQL Server can use an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on. It’s like following a recipe: you can’t skip steps or add ingredients in the wrong order.

On the other hand, the order of non-key columns (include columns) in the index doesn’t matter at all. It’s like building a sandwich — you can put the lettuce before the tomato or vice versa, and it won’t affect the taste or texture of the sandwich. Including non-key columns in your index can significantly improve query performance because the query optimizer can locate all the column values within the index, resulting in fewer disk I/O operations.
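
The leftmost-column rule can be illustrated with a small Python sketch. This is a deliberate simplification rather than a model of the SQL Server optimiser: it considers equality predicates only, and the function name is made up for illustration.

```python
from typing import Iterable, List, Sequence


def seekable_prefix(
    key_columns: Sequence[str], predicate_columns: Iterable[str]
) -> List[str]:
    """Return the leading key columns that the query's predicates can make use of.

    Walks the index keys left to right and stops at the first key column that
    the query does not filter on; later key columns cannot narrow the search.
    """
    filtered = set(predicate_columns)
    prefix: List[str] = []
    for column in key_columns:
        if column not in filtered:
            break
        prefix.append(column)
    return prefix


index_keys = ["ColumnA", "ColumnB", "ColumnD"]

# The order of predicates in the query does not matter, only the order of the keys.
both = seekable_prefix(index_keys, ["ColumnB", "ColumnA"])

# Filtering only on the second key column leaves nothing to narrow the search with.
second_only = seekable_prefix(index_keys, ["ColumnB"])
```

Here `both` contains ColumnA and ColumnB, while `second_only` is empty because the leftmost key column is not specified.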

Classification of redundant indexes

There are three types of redundant indexes:

Duplicate Indexes: Two indexes have exactly the same key columns in the same order and the same set of include columns (i.e. identical indexes). The include columns may be listed in a different order, because the order of include columns does not matter.
Eg:
CREATE INDEX idx1 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnC, ColumnD);
CREATE INDEX idx2 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnD, ColumnC);
Both indexes have the same key columns “ColumnA” and “ColumnB” in the same order and include the same non-key columns “ColumnC” and “ColumnD”, making them duplicate indexes.

Overlapping Indexes: Two indexes overlap when the key columns of one form a left-ordered subset (a leading prefix) of the key columns of the other, and its non-key columns are a subset of the other’s non-key columns. The key columns must match from the left, starting with the first key column, because SQL Server can use an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on.
Eg:
CREATE INDEX idx1 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnX, ColumnY);
CREATE INDEX idx2 ON MyTable (ColumnA, ColumnB, ColumnD) INCLUDE (ColumnX, ColumnY, ColumnZ);
In this example, index “idx1” has key columns “ColumnA” and “ColumnB”, which are a left subset of the key columns “ColumnA”, “ColumnB” and “ColumnD” in index “idx2”. The include columns of “idx1” are also a subset of those in “idx2”. Therefore, “idx2” overlaps “idx1” and these two indexes are overlapping indexes.

Similar Indexes: Indexes that have identical key columns in the same order, but different included columns. To solve for similar indexes, one index should be updated to contain include columns of both redundant indexes.
Eg:
CREATE INDEX idx1 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnX, ColumnY);
CREATE INDEX idx2 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnP, ColumnQ);
Solution Index:
CREATE INDEX idx1 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnX, ColumnY, ColumnP, ColumnQ);
In this example, index “idx2” has key columns “ColumnA” and “ColumnB”, which are identical to those of “idx1”. The include columns, however, are different, so the resulting index must contain the distinct union of the include columns of both indexes. Once it is in place, “idx2” can be dropped.
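As a rough illustration of that merge, the Python sketch below builds the distinct union of two include lists and renders the resulting statement. The helper names `merge_includes` and `create_index_sql` are hypothetical and are not part of snorql. The sketch only produces text; it does not talk to a database.

```python
def merge_includes(first, second):
    """Distinct union of two include-column lists, keeping first-seen order."""
    merged = []
    for column in list(first) + list(second):
        if column not in merged:
            merged.append(column)
    return merged


def create_index_sql(name, table, key_columns, include_columns):
    """Render a CREATE INDEX statement as a string (illustrative only)."""
    sql = f"CREATE INDEX {name} ON {table} ({', '.join(key_columns)})"
    if include_columns:
        sql += f" INCLUDE ({', '.join(include_columns)})"
    return sql + ";"


includes = merge_includes(["ColumnX", "ColumnY"], ["ColumnP", "ColumnQ"])
print(create_index_sql("idx1", "MyTable", ["ColumnA", "ColumnB"], includes))
# CREATE INDEX idx1 ON MyTable (ColumnA, ColumnB) INCLUDE (ColumnX, ColumnY, ColumnP, ColumnQ);
```

Because “idx1” already exists, applying the statement on a real database would mean dropping the old index first or rebuilding it in place, for example with `WITH (DROP_EXISTING = ON)`.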

Uncovering the Hidden Treasure of Redundant Indexes

The Hunt for Efficiency in SQL Server

As we delved deeper into the mysterious world of SQL Server databases, we discovered a strange phenomenon — unused indexes! These indexes were like the ghosts of databases past, haunting our storage and gobbling up our precious computing power with unnecessary index updates.

We noticed some serious issues with our big databases — our CPUs and Data IO were frequently spiking, which is never a good sign for a healthy database. The culprits were mainly pesky queries, but in a few cases, the queries seemed completely innocent and the indexes looked fine. After diving deeper, we discovered that the real problem was multiple unused indexes on primary columns. These redundant indexes were causing an excessive amount of updates and creating multiple query plans, thus wreaking havoc on our database performance.

So, we began our quest to delete these redundant indexes, one by one. But alas, the process of deleting redundant indexes turned out to be a real snooze-fest. As a database administrator, hunting down redundant indexes is like playing a real-life game of Where’s Waldo. You have to sift through a sea of SQL scripts, execute a script that lists all indexes and their indexed and included columns, and then comb through each individual index to identify the redundant indexes. The irony is not lost on us — we’re trying to boost database efficiency by doing something that’s incredibly inefficient. And just when you think you’ve identified the redundant indexes, you still have to delete them one by one, which can make you second-guess your career choice and dream of trading in your laptop for a beach chair and a piña colada.

And oh boy, we were in for a treat: it turns out our database problems were just getting started! We figured out that getting rid of useless indexes was pretty low on the list of priorities for our dear DBAs. It was like trying to get them to give up their beloved caffeine fix, which is near impossible! But we knew we had to make the process more efficient if we wanted to have any hope of fixing our database woes.

But fear not, for we remembered our trusty sidekick, snorql! With its help, we created a supercharged optimization metric — the Index Redundancy Metric. Thanks to this handy tool, we were able to streamline the process and get rid of those useless indexes in no time flat. This bad boy made DBAs so fast, they could make a cheetah look like a sloth.

DBAs at udaan using Index Redundancy Analysis Tool on snorql

What is snorql?

diagnosing databases like a boss!

Developed at udaan, snorql is an open-source and completely free-to-use framework aimed at diagnosing, resolving, and optimizing SQL metrics. snorql is pluggable and can be easily implemented with any database and is here to make your life easier. Get started with our easy-to-follow guide on “Getting started with snorql”.

Snorql Index Redundancy Analysis at database table level

But that’s not all — Snorql also identifies unused tables and unused indexes, giving you even more insight into potential space savings. With Snorql’s optimisation metrics, you can confidently make informed decisions about which indexes and tables to remove, reducing clutter and improving organisation in your database.

Snorql Repositories:
1. snorql (framework): https://github.com/udaan-com/snorql
2. snorql-frontend: https://github.com/udaan-com/snorql-frontend

In case you need further convincing, be sure to check out the fantastic article snorql — diagnosing databases like a boss

Snorql Index Redundancy Analysis for a particular table

Indexing Follies

Lesson Learned in the Quest for Database Efficiency

To tackle redundant indexes, we came up with an algorithm to classify our indexes. Sounds simple enough, right? Just compare and contrast, and voila — unused, duplicate, overlapping, and similar indexes will be neatly sorted. But hold your horses, my friend — this was no walk in the park! It took multiple iterations to perfect our metric, and we gained valuable insights along the way.

1. Inviting the Read-Replica and Geo-Replica databases to the party
We realised that just because an index exists in the primary database, it doesn’t mean it’s being treated the same way in the Read-Replica and Geo-Replica instances. And to make things even more complicated, an index that’s being used heavily in the Read-Replica or Geo-Replica database might be marked as an unused index in the primary database. So, to get accurate statistics for our index classification algorithm, we had to gather and summarise both the Index Usage and Index Updates across all instances.

2. Making sure Unique Indexes stay in the party
So, here’s the deal: unique indexes and UNIQUE constraints are like siblings who enforce uniqueness in the same way. When you create a UNIQUE constraint, SQL Server is like a magician who creates a unique index out of thin air. And because of this sorcery, you can’t just drop such an index directly with DROP INDEX; the constraint itself has to be dropped. So, to avoid any mishaps, we had to exclude unique indexes from our classification algorithm.

3. How not to do UX
In the first version of our Index Redundancy Analysis, we tried grouping redundant indexes at the index level. It made it hard to understand the relationships between parent and child indexes without looking at the whole table. Plus, the algorithm we developed was as complicated as a game of 3D chess, and we quickly realised this was not the way to go.

Version 1 of Index Redundancy Analysis

4. Identifying Unused Tables
Thanks to index statistics, we were able to determine which tables in our database were going unused. If we found that all indexes in a particular table were going unused, we marked it as “unused” — simple as that!
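A minimal sketch of that rule in Python, assuming each index has already been flagged as unused or not. The function name `find_unused_tables` and the table and index names are hypothetical.

```python
def find_unused_tables(index_unused_by_table):
    """Return tables for which every index is flagged unused.

    index_unused_by_table maps a table name to {index_name: is_unused}.
    Tables with no index entries are left out, since there is no evidence
    either way.
    """
    return sorted(
        table
        for table, flags in index_unused_by_table.items()
        if flags and all(flags.values())
    )


# Hypothetical flags, one entry per index on each table.
flags_by_table = {
    "OrdersArchive": {"idx_archive_buyer": True, "idx_archive_date": True},
    "Orders": {"idx_orders_buyer": True, "idx_orders_date": False},
}

print(find_unused_tables(flags_by_table))  # ['OrdersArchive']
```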

5. How to Handle Similar Indexes
So, imagine you have two friends who look alike, but one of them has a cool hat and the other has a funky pair of glasses. They’re like similar indexes — duplicates or overlapping for indexed columns, but with different include columns. But here’s the catch — you can’t just swap them out like a pair of shoes without causing some serious performance issues. Deleting them is not the answer either — it’s like breaking up with one of your friends just because they both have the same name! The best approach is to update one of the indexes to have all the include columns from both, and then wave goodbye to the other one. This way, we can save up on similar indexes and keep the peace between our database friends.

Similar Indexes in SQL Server Database

Alright folks, it’s time to roll up our sleeves and get down and dirty! We’ve identified those pesky edge cases, and now it’s time to dive headfirst into the implementation phase.

So, who’s ready to get their hands dirty? Let’s do this thing!

Diving into the Index Redundancy Metric

It’s time to focus on the magician behind all this: the algorithm!

Note: Index Redundancy Analysis is currently available for SQL Server databases, and the implementation is specific to this database system.

Github Issue ✅ #79 New Metric — Index Redundancy Metric
Github PR ⛓ #84 [New metric] Index Redundancy Metric

I have broken down the algorithm into steps:

1. Get detailed data of all indexes in a database using the SQL query below

https://medium.com/media/165f2498f70a54d385de47098f4d5522/href

2. If a Read Replica/Geo Replica exists, then summarise the Index Updates and Index Usage statistics

https://medium.com/media/aa6f970b54962cac502b84cd20af13a7/href
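One way to picture this summarisation step is the Python sketch below, which adds up usage and update counts per index across instances. It is an assumption-laden illustration rather than snorql’s code: the function name `summarise_usage`, the `(usage, updates)` tuple layout and the index name are all hypothetical.

```python
def summarise_usage(per_instance_stats):
    """Sum (usage, updates) for each index across all database instances.

    per_instance_stats is a list with one dict per instance (primary,
    read replica, geo replica), each mapping index name -> (usage, updates).
    """
    totals = {}
    for stats in per_instance_stats:
        for index_name, (usage, updates) in stats.items():
            prev_usage, prev_updates = totals.get(index_name, (0, 0))
            totals[index_name] = (prev_usage + usage, prev_updates + updates)
    return totals


# Hypothetical counters: the index is never read on the primary,
# but it is read heavily on the read replica.
primary = {"idx_orders_buyer": (0, 5000)}
read_replica = {"idx_orders_buyer": (1200, 0)}

print(summarise_usage([primary, read_replica]))
# {'idx_orders_buyer': (1200, 5000)}
```

Looking at the primary alone, this index would appear unused. The summed figures show that it is not.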

3. Group indexes by table and iterate over each table. Filter out indexes whose name is NULL (these are heaps), and sort the remaining indexes in descending order by the number of indexed columns.
We also maintain a list of indexes that are to be skipped. These are the indexes that are already classified and the unique indexes.
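The grouping, filtering and sorting in this step might look roughly like the following Python sketch. The dict keys (`table`, `name`, `key_columns`) and the function name `group_and_sort` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def group_and_sort(indexes):
    """Group index rows by table, drop heaps, and sort widest key first."""
    by_table = defaultdict(list)
    for index in indexes:
        if index["name"] is None:  # heaps are reported without an index name
            continue
        by_table[index["table"]].append(index)
    for table_indexes in by_table.values():
        # Most key columns first, so likely parent indexes are visited
        # before the narrower indexes they may cover.
        table_indexes.sort(key=lambda i: len(i["key_columns"]), reverse=True)
    return dict(by_table)


rows = [
    {"table": "MyTable", "name": None, "key_columns": []},
    {"table": "MyTable", "name": "idx1", "key_columns": ["ColumnA", "ColumnB"]},
    {"table": "MyTable", "name": "idx2", "key_columns": ["ColumnA", "ColumnB", "ColumnD"]},
]

grouped = group_and_sort(rows)
print([index["name"] for index in grouped["MyTable"]])  # ['idx2', 'idx1']
```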

4. Identify Unused Indexes. Here, if the usage is below 10 we consider it as an unused index. We have kept this small threshold because it might be possible that an index gets used while running ad-hoc queries.

https://medium.com/media/4817a1c62b8a15cd765615f542ef9d7a/href
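A minimal sketch of the threshold check in Python. The threshold of 10 comes from the step above. Treating “usage” as the sum of seeks, scans and lookups is an assumption made for this example, and the function name `is_unused` is hypothetical.

```python
UNUSED_THRESHOLD = 10  # from the step above: usage below 10 counts as unused


def is_unused(user_seeks, user_scans, user_lookups, threshold=UNUSED_THRESHOLD):
    """Flag an index as unused when its total read count is below the threshold.

    Assumption: "usage" is taken to be seeks + scans + lookups.
    """
    return user_seeks + user_scans + user_lookups < threshold


print(is_unused(2, 1, 0))    # True: a few ad-hoc reads stay under the threshold
print(is_unused(150, 4, 0))  # False: the index is clearly being read
```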

5. Identifying Unique Indexes. Unique indexes are created intentionally on columns to maintain uniqueness, hence we skip classifying these in analysis. We do show unique indexes at table level for better analysis and visibility.

https://medium.com/media/54a612519602a99d2c4ec57d4f39cf48/href

6. Post this, we iterate on each index, and analyse it to find redundant indexes

a. Classify Duplicate Indexes:
Classifying duplicate indexes is straightforward. The indexed columns and include columns need to be the same, and the indexed columns must be in the same order.

https://medium.com/media/9ce9d05ae59cd16d9187b1a3cb26df94/href
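The duplicate check can be sketched in Python as an ordered comparison of key columns plus an unordered comparison of include columns. The `Index` tuple and the function name `is_duplicate` are hypothetical and are not taken from snorql’s source.

```python
from typing import NamedTuple


class Index(NamedTuple):
    name: str
    key_columns: tuple
    include_columns: tuple


def is_duplicate(a, b):
    """Same key columns in the same order, same include columns in any order."""
    return (
        a.key_columns == b.key_columns
        and set(a.include_columns) == set(b.include_columns)
    )


idx1 = Index("idx1", ("ColumnA", "ColumnB"), ("ColumnC", "ColumnD"))
idx2 = Index("idx2", ("ColumnA", "ColumnB"), ("ColumnD", "ColumnC"))
idx3 = Index("idx3", ("ColumnB", "ColumnA"), ("ColumnC", "ColumnD"))

print(is_duplicate(idx1, idx2))  # True: include order does not matter
print(is_duplicate(idx1, idx3))  # False: key order differs
```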

b. Classify Overlapping Indexes:
The child index’s indexed columns should be a left-based subset of the parent index’s indexed columns in the same order, and the include columns should be the same.

https://medium.com/media/ec95c33e5afebb5dbede838595734c26/href
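A possible Python sketch of the overlap check follows. It uses the subset form of the include-column rule from the classification section earlier in the post, of which identical include columns are a special case. The `Index` tuple and the function name `is_overlapping` are hypothetical.

```python
from typing import NamedTuple


class Index(NamedTuple):
    name: str
    key_columns: tuple
    include_columns: tuple


def is_overlapping(child, parent):
    """True when the parent index can serve everything the child index serves."""
    child_keys, parent_keys = child.key_columns, parent.key_columns
    # The child's keys must be a strict leading prefix of the parent's keys.
    is_left_prefix = (
        len(child_keys) < len(parent_keys)
        and parent_keys[: len(child_keys)] == child_keys
    )
    # Every include column of the child must also be included by the parent.
    includes_covered = set(child.include_columns) <= set(parent.include_columns)
    return is_left_prefix and includes_covered


idx1 = Index("idx1", ("ColumnA", "ColumnB"), ("ColumnX", "ColumnY"))
idx2 = Index("idx2", ("ColumnA", "ColumnB", "ColumnD"), ("ColumnX", "ColumnY", "ColumnZ"))

print(is_overlapping(idx1, idx2))  # True: idx2 covers idx1
print(is_overlapping(idx2, idx1))  # False: idx1 is narrower than idx2
```

Requiring a strict prefix keeps this case separate from duplicate and similar indexes, where the key columns are identical.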

c. Classify Similar Indexes:
The indexed columns should be identical, whereas the include columns can be different

https://medium.com/media/f433c555ff3c1c20814e62c13cc71caf/href
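The similar-index check can be sketched in Python as identical key columns combined with differing include columns. As in the other sketches, the `Index` tuple and the function name `is_similar` are hypothetical.

```python
from typing import NamedTuple


class Index(NamedTuple):
    name: str
    key_columns: tuple
    include_columns: tuple


def is_similar(a, b):
    """Identical key columns in the same order, but different include columns."""
    return (
        a.key_columns == b.key_columns
        and set(a.include_columns) != set(b.include_columns)
    )


idx1 = Index("idx1", ("ColumnA", "ColumnB"), ("ColumnX", "ColumnY"))
idx2 = Index("idx2", ("ColumnA", "ColumnB"), ("ColumnP", "ColumnQ"))

print(is_similar(idx1, idx2))  # True: same keys, different include columns
print(is_similar(idx1, idx1))  # False: identical includes make it a duplicate
```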

What did udaan get out of this?

More Savings, Less Stress, and Happy DBAs!

Turns out the Index Redundancy Analysis wasn’t just a treat for the DBAs, but for the whole organization. We’ve got some juicy benefits to brag about:

1. Cha-Ching! We saved some bucks by downsizing our databases without any performance hiccups. Our peak database compute on average dropped by 8% and storage shrank by 10%. For instance, we slimmed down a 32 vCore database to a 24 vCore one, and bam, we saved 22% on costs! Who needs all those extra cores anyway?

2. Our apps are lightning fast now! Queries are zipping through, thanks to the Index Redundancy Analysis. Better performance means happier users, and happier users mean more cash in our pockets. It’s a win-win!

3. The DBAs are over the moon. We’ve even heard rumours of them smooching their monitors with the Index Redundancy Metric open on the screen. Hey, we’re not judging — if that’s what it takes to make them happy, so be it!

Index redundancy analysis on steroids was originally published in engineering-udaan on Medium, where people are continuing the conversation by highlighting and responding to this story.