Plaid Engineering - Medium

Simplifying backward compatibility with session affinity

S Santichaivekin — Tue, 09 Jun 2026 13:37:19 GMT

By routing each session back to the commit where it started, Plaid removed a layer of versioning and made shipping both safer and faster.

By Santi Santichaivekin

When users connect a financial account with Plaid Link, they move through a session that spans several minutes of multi-step requests and responses. This creates a subtle infrastructure problem: if we slowly roll out new code changes, user interactions will hop between old code and new code. The two versions of the code need to be interoperable. This is the problem of backward compatibility.

The challenge of backward compatibility

In most systems, backward compatibility is just about one system being able to operate with different versions of another system. Session-based systems add a layer: they also need to be compatible with their future selves. During a percentage rollout, a session can start on one version of the code and continue on an older or newer one.

Left: a service that interacts with other services must be compatible with different versions of those services. Right: a service that interacts with future versions of itself must be compatible with itself.

For Plaid Link, the compatibility question is even trickier because we model interactions as paths through directed graphs. Developers have to think about backward compatibility through the lens of graph-state-transition changes and not just back and forth responses. In rare cases, we could break things, and we really wanted to eliminate this class of issues to make Link extremely reliable.

The solution space

There are a few approaches when it comes to solving backward compatibility:

1. Allow incompatibility and recover when sessions break

The simplest option is to accept that some in-flight sessions may break during code rollout. This can be viable when sessions are short, incompatible changes are rare, or users can easily retry. For Link, none of these apply, and users often do not retry when they fail to connect their bank accounts.

This approach also creates operational toil: if errors spike during or after a deployment, teams have to investigate whether the issue is a persistent regression or a transient incompatibility caused by the rollout itself.

This approach is simple and is commonly used, but it does not work for Link.

2. Prevent incompatible changes with validation and tests

Another option is to enforce compatibility by validating and testing that all changes are compatible.

Schema-level validation can catch obvious contract breaks, such as removing a required field or changing a response shape. However, incompatibility can occur through how the code semantically interprets and uses the schema, even without structural changes.

Code-level validation goes further by exercising old and new versions together in integration tests. This is expensive but will consistently catch behavioral drift. AI code review tools are also helpful in catching incompatibility in code. Still, the test matrix grows with every supported version, and product development still carries compatibility overhead.

Overall, validation is useful, but it does not eliminate the problem. We still need to ask product developers to reason about compatibility as part of feature work, which slows down development.

3. Introduce schema versioning

In addition to validation and testing, we can introduce explicit versioning. Versioning acts like different swim lanes for backward compatibility. This is a common pattern in the industry, and works especially well when multiple parties need to align and negotiate on a shared standard, such as the TLS protocol version.

This was the solution that we initially adopted when we developed Plaid Link. We started versioning Link graph schemas when we first introduced the Link graphs in 2020. We shipped a few schema versions in each production binary, and production pods negotiated a version that all of them could support.
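The negotiation step can be pictured as picking the newest schema version that appears in every pod's supported set. The following is a minimal sketch under that assumption; the method name and data shape are hypothetical and not taken from Plaid's implementation.

```ruby
# Hypothetical sketch: choose the newest schema version that every pod supports.
# `supported_by_pod` maps a pod name to the schema versions its binary ships with.
def negotiate_version(supported_by_pod)
  common = supported_by_pod.values.reduce { |acc, versions| acc & versions }
  raise ArgumentError, 'no pods given' if common.nil?
  raise 'no schema version is supported by every pod' if common.empty?

  common.max
end

pods = {
  'pod-old' => [3, 4], # binary from the previous release
  'pod-new' => [4, 5]  # binary from the release being rolled out
}

negotiate_version(pods) # => 4
```

Version 5 only becomes usable once every pod ships it, which is one reason a change had to be split across several deployments.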

Still, while versioning makes incompatibility more explicit and easier to reason about, it does not remove the developer burden. Having to create and maintain multiple graph versions confuses product developers and slows development.

4. Use session affinity

The last option that we ended up choosing for this project was to move compatibility handling into the infrastructure layer. Instead of requiring application code to support multiple workflow versions, we changed network routing so each session is routed to a compatible execution environment. This changed the interface for product developers. With versioning, product developers still have to understand and maintain multiple versions. With session affinity, the problem moves to the infrastructure layer. The code contains exactly one schema version, and infrastructure makes sure in-flight sessions keep going to the right place.

On the left, we have versioning, where application code implements features against multiple schemas. On the right, we have session affinity, where every request in a session goes to the same commit.

There are many possible levels of affinity:

Connection affinity: route a session on the same long-lived connection, e.g. WebSocket.
Pod affinity: route a session back to the same Kubernetes pod, e.g. via Kubernetes Session Affinity.
Commit affinity: route a session back to any pod running the commit where the session started.
Version affinity: route a session back to a known compatible version, even though it might run different code.

Connection and pod affinity are often the preferred options because they let systems store session data and caches in the pods themselves, which often results in significant performance, latency, and cost improvements.

However, they couple product behavior to infrastructure lifecycle. Pods restart and scale up or down. We also have infrastructure migrations that rotate pods and nodes. When sessions depend on a pod staying alive, platform operations become hard to perform.

If we choose connection or pod affinity, we will need to implement user-facing recovery paths for cases where the underlying pod or node disappears mid-session. This would also complicate rollout monitoring: when developers roll out new code, it could be tricky to triage whether issues are coming from pods winding down or from the rollout itself.

If we rely on pod-based session affinity instead of commit-based, users will see more “Something went wrong.”

Commit-level affinity provides the necessary properties without complicating infrastructure operations. A session does not need to go to the same pod to maintain compatibility. By routing each session back to the commit where it started instead of specific pods, pods are able to restart, rotate, and scale normally without killing sessions, while also preserving consistency.
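To make the difference from pod affinity concrete, here is a minimal sketch of commit-level routing. The `Pod` struct, commit hashes, and `route` method are illustrative assumptions, not Plaid's routing code.

```ruby
# Hypothetical sketch of commit-level affinity: a session remembers the commit
# it started on, and any live pod running that commit may serve it.
Pod = Struct.new(:name, :commit)

def route(session_commit, live_pods)
  candidates = live_pods.select { |pod| pod.commit == session_commit }
  raise "no live pod runs commit #{session_commit}" if candidates.empty?

  candidates.sample # any pod on the right commit is equally valid
end

pods = [Pod.new('a', 'abc123'), Pod.new('b', 'abc123'), Pod.new('c', 'def456')]

# Pod 'a' restarts or is rotated away; the session keeps working via pod 'b'.
pods_after_rotation = pods.reject { |pod| pod.name == 'a' }
route('abc123', pods_after_rotation).name # => "b"
```

With pod affinity, the same rotation would have ended the session, because the only acceptable target would have been pod 'a'.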

Replacing versioning with session affinity

We chose session affinity with the goal of removing compatibility concerns from the Link development loop. Instead of requiring every workflow change to reason through in-flight rollout states, individual sessions see one coherent version of the system.

When a Link session starts, we associate it with a deployment group based on the current rollout traffic weight. Subsequent requests for that session are routed back to that same deployment group.

Conceptually, this provides commit-level consistency: a session continues on the version of the system it started on. In practice, sessions are routed to compatible deployment groups rather than individual commits. By the time deployment advances, sessions from the previous state have mostly drained.

Routing metadata is captured at session start and propagated through subsequent requests, allowing infrastructure to maintain session consistency throughout the session lifecycle.
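The assignment described above can be sketched as a weighted pick at session start followed by a sticky lookup. This is a simplified illustration: the class name and group names are hypothetical, and the sketch keeps assignments in memory, whereas the post describes routing metadata that travels with each request.

```ruby
# Hypothetical sketch: pick a deployment group once, at session start, using
# the current rollout weights; later requests reuse the recorded group.
class SessionRouter
  # Weights may change as the rollout advances; existing sessions are unaffected.
  attr_writer :weights

  def initialize(weights, random = Random.new)
    @weights = weights # e.g. { 'stable' => 90, 'canary' => 10 }
    @random = random
    @assignments = {}  # session id => deployment group
  end

  def group_for(session_id)
    @assignments[session_id] ||= pick_group
  end

  private

  def pick_group
    point = @random.rand * @weights.values.sum
    @weights.each do |group, weight|
      return group if point < weight

      point -= weight
    end
    @weights.keys.last # guards against floating-point edge cases
  end
end

router = SessionRouter.new({ 'stable' => 90, 'canary' => 10 })
first = router.group_for('session-1')
router.weights = { 'stable' => 0, 'canary' => 100 } # the rollout completes
router.group_for('session-1') == first # => true, the session stays put
```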

After we completed session affinity, we were able to remove all incompatibility validations and versioning code:

We no longer need to split backward incompatible changes into multiple backward compatible deployments.
A Link graph change can ship as one coherent PR, both schema and code together.
We deleted a big part of our codebase: all versioning code and artifacts.
We removed all Link backward-compatibility tests, reducing the company’s total test compute by 30%.
Graph changes reach production 6 times faster now that we have removed the version-negotiation systems.

Supporting rollbacks

Designing rollback behavior was the hardest part of this project.

When a deployment is aborted, in-flight sessions have two choices. They can stay on the version they started on, or they can move back to the previous stable version.

Both choices have tradeoffs. Staying preserves session consistency, but keeps some users on code we no longer trust. Moving back gets users onto safer code, but can switch a session to a different workflow version mid-flight and create the same compatibility issue we were trying to avoid.

At first, we treated this as a user-impact question: how many sessions would we expect to break in each case over a year? Because Link sessions are short and rollbacks are rare, the expected impact was under our error budget and was acceptable in either case.

Eventually, we came to realize that the bigger concern was operator behavior. Rollback should be the safest action during an incident. We did not want engineers to hesitate because a rollback might create a second wave of compatibility issues.

In the end, we chose the rule that made rollback predictable, bounded, and safe under pressure. Session affinity removes compatibility concerns from normal development. The rollback design makes sure the issues do not come back during incidents.

Supporting A/B testing

A/B testing is a cornerstone of Link development. Teams use experiments to validate product changes and understand how users move through Link.

Before removing versioning, developers had two ways to run experiments. They could put one code path in one workflow version, another code path in another version, and control the release percentage between them. Or they could keep one graph and one schema, then branch inside the workflow using feature flags.

After the migration, we removed the first model. There is one deployed graph, and experiment assignment routes users into different branches of the graph.

This did not reduce what we could test. Anything we could model as two graph versions can also be modeled as one graph with two branching paths. In practice, this is often easier to reason about. Monitoring, conversion analysis, and debugging all happen inside one workflow instead of across multiple versions.
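As an illustration of "one graph with two branching paths", the sketch below assigns a session to an experiment arm deterministically and uses the arm to choose the next node. The experiment name, pane names, and hashing scheme are assumptions made for the example; the post does not describe how assignment is implemented.

```ruby
require 'digest'

# Hypothetical sketch: one graph, with an experiment deciding which branch a
# session takes. Hashing the session id keeps the assignment stable across requests.
def experiment_arm(session_id, experiment, treatment_percent)
  bucket = Digest::SHA256.hexdigest("#{experiment}:#{session_id}").to_i(16) % 100
  bucket < treatment_percent ? :treatment : :control
end

# A single graph node with two outgoing edges, chosen by experiment arm.
NEXT_PANE = { control: :credentials_pane, treatment: :new_credentials_pane }.freeze

def next_pane(session_id)
  NEXT_PANE.fetch(experiment_arm(session_id, 'new-credentials', 50))
end
```

Both branches live in the same deployed graph, so monitoring and conversion analysis can compare them without joining data across workflow versions.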

Takeaway

The goal of this project was simple: to make a category of compatibility problems disappear from the Link development loop.

Developers no longer need to think through every old-new version pairing during a rollout. Each active session stays in the deployment lane it started in, and new code can move forward without old sessions needing to understand it. As a result, we have fewer versioning processes, fewer backward-compatibility tests, fewer PRs per graph edit, and one fewer thing for Link developers to worry about.

Session affinity is often also called sticky sessions because each user session “sticks” to a compatible deployment. To celebrate the project, we shipped stickers, sticky notes, tape, and super glue to the stickiest contributors across Plaid offices and remote teams.

If problems like this sound interesting, we’re hiring.

Simplifying backward compatibility with session affinity was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Ruby hides complexity

Plaid Eng — Fri, 05 Jun 2026 17:01:03 GMT

By the Cognito team

Ruby makes it easy to write concise code. This is a benefit of the language and the ecosystem. Matz focuses on “making programs succinct” and Rails boasts that it lets you build “in a matter of days” what used to take months.

Concise code can have a dark side. Convenient interfaces can tuck away complexity and side effects that might surprise you later. Brevity in software comes at the cost of diligence both from developers and reviewers. It is especially important to understand how your abstractions work and the business rules they implicitly handle.

Moving fast

Imagine you are adding a new feature to your Ruby on Rails web application. This feature breaks down into three small tasks:

Integrate with an internal API which provides information about the current user
Use information about the current user in order to add a welcome message to the header of each page
Display a flag alongside the message corresponding to the user’s country field

The current user JSON looks like this:

{
  "status": "success",
  "data": {
    "name": {
      "first": "Edmond",
      "last": "O'Connell"
    },
    "address": {
      "street1": "53236 Camilla Light",
      "street2": null,
      "city": "Pierceville",
      "state": "NJ",
      "country": "United States"
    }
  }
}

To integrate with the API you create three simple classes with ActiveModel::Model:

class User
  include ActiveModel::Model

  attr_accessor :address, :name
end

class Name
  include ActiveModel::Model

  attr_accessor :first, :last
end

class Address
  include ActiveModel::Model

  attr_accessor :street1, :street2, :city, :state, :country
end

To extract the user data you use the new #dig method introduced in Ruby 2.3:

User.new(
  name:    Name.new(response.dig('data', 'name')),
  address: Address.new(response.dig('data', 'address'))
)

Finally, you add a current_country view helper method and create a new view partial:

module UserHelper
  def current_country
    return 'Unknown' unless current_user

    current_user.address.country
  end
end


<% if current_user %>
  Welcome back <%= current_user.name.first %>!
<% end %>

<%= image_tag("/imgs/flags/#{current_country}.png") %>

Breaking things

A few weeks pass and you find out that some pages rendered the message “Welcome back !” and a broken image in place of the flag. The internal API encountered its own error and returned:

{
  "status": "error",
  "message": "Internal server error"
}

Oddly enough this did not break your code:

response = { 'status' => 'error', 'message' => 'Internal server error' }

name    = response.dig('data', 'name')    # => nil
address = response.dig('data', 'address') # => nil

user = User.new(name: Name.new(name), address: Address.new(address))

user.name            # => #<Name>
user.address         # => #<Address>
user.name.first      # => nil
user.address.country # => nil

Feeling a bit embarrassed by the bug you reflect on how you could prevent similar issues in the future:

What if the internal API renames the country field to country_code? That would also silently break the view. Can I only avoid these cryptic bugs by being vigilant about every external dependency?

Reflection

The features in Ruby and Rails which let you write concise code can also let you cut corners. Consider our Name class and how the corresponding response data was originally extracted:

class Name
  include ActiveModel::Model

  attr_accessor :first, :last
end

module ResponseHandler
  def self.extract_name(response)
    Name.new(response.dig('data', 'name'))
  end
end

Let’s rewrite Name without ActiveModel or attr_accessor:

class Name
  # Inlined from Active Model source http://git.io/vuECr
  def initialize(params={})
    params.each do |attr, value|
      self.public_send("#{attr}=", value)
    end if params

    super()
  end

  def first
    @first
  end

  def first=(first)
    @first = first
  end

  def last
    @last
  end

  def last=(last)
    @last = last
  end
end

Imagining our code like this is instructive. Three questions are now immediately obvious:

Should the initializer invoke setter methods for any key passed to the initializer?
Will Name ever be invoked without arguments?
Are these public setter methods necessary or is Name a value object?

Let’s throw out #dig and instead handle each edge case manually.

module ResponseHandler
  def self.extract_name(response)
    return Name.new(nil) unless response.key?('data')
    return Name.new(nil) if     response['data'].empty?

    Name.new(response['data']['name'])
  end
end

Expanding this method highlights three distinct outcomes which are each important to consider. The original code properly handled a valid user object but overlooked two important edge cases:

1. API error handling when response['data'] is nil

return Name.new(nil) unless response.key?('data')

This happened when the internal API encountered an error. This condition should instead result in our application notifying the end user of an error.

2. Alternate behavior when a user is not returned

return Name.new(nil) if response['data'].empty?

This corresponds to the following JSON:

{
  "status": "success",
  "data": {}
}

This might mean that the current user has not yet logged in. It could also be a buggy response.

Depending on how robust you expect the internal API to be, you might want to handle this case independently as well. If this is an invalid state, then the response handler should raise an error. If it is a valid state and you want to handle cases where the user is not logged in, then there should be a separate Guest class independent of the User class.

Both of these options are better than implicitly assuming this condition never happens. Once the code embedding your assumption is deployed it is too easy to forget and unknowingly introduce a silent regression in the future.
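One way to make all three outcomes explicit is sketched below. It treats an empty data object as a valid "not logged in" state, which is only one of the two options discussed; `ApiError` and `Guest` are illustrative names, and `Name` is a plain Struct standing in for the class above.

```ruby
# Hypothetical sketch: make each outcome explicit instead of letting nil flow
# into the view. ApiError and Guest are illustrative, not existing classes.
class ApiError < StandardError; end

Guest = Class.new
Name  = Struct.new(:first, :last) # stand-in for the article's Name class

module ResponseHandler
  def self.extract_name(response)
    unless response['status'] == 'success'
      raise ApiError, response.fetch('message', 'unknown API error')
    end

    data = response.fetch('data')   # KeyError if the contract changes
    return Guest.new if data.empty? # valid state: nobody is logged in

    name = data.fetch('name')
    Name.new(name.fetch('first'), name.fetch('last'))
  end
end
```

Because Hash#fetch raises KeyError on a missing key, a renamed or dropped field fails loudly at the boundary instead of rendering "Welcome back !" in the view.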

Conclusions

Ruby certainly makes it easy to write concise code. The question then is how do you reap these benefits without cutting corners accidentally? We have a few practices which help us write better Ruby.

1. Strict and simple dependencies

Active Model’s initializer is permissive and this led to surprising behavior. Consider the benefit of a strict alternative like anima:

# Test cases
valid_arguments  = { first: 'John', last: 'Doe'                  }
missing_argument = { first: 'John'                               }
extra_argument   = { first: 'John', last: 'Doe', nickname: 'Jim' }

# With Active Model
class Name
  include ActiveModel::Model

  attr_accessor :first, :last
end

Name.new(valid_arguments)  # => #<Name @first="John", @last="Doe">
Name.new(missing_argument) # => #<Name @first="John">
Name.new(extra_argument)   # => NoMethodError: undefined method `nickname='
Name.new(nil)              # => #<Name>
Name.new                   # => #<Name>

# With Anima
class Name
  include Anima.new(:first, :last)
end

Name.new(valid_arguments)  # => #<Name first="John" last="Doe">
Name.new(missing_argument) # => Anima::Error: Name attributes missing: [:last]
Name.new(extra_argument)   # => Anima::Error: Name attributes missing: [], unknown: [:nickname]
Name.new(nil)              # => NoMethodError: undefined method `keys'
Name.new                   # => ArgumentError: wrong number of arguments (given 0, expected 1)

2. Meticulous code review

An inconspicuous line of code like:

Name.new(response.dig('data', 'name'))

can encode multiple important code paths. With Ruby it is especially important to visualize the equivalent “expanded” code.

3. Static analysis

Tools like reek and rubocop are great for learning how to write better code. Reek might point out a design issue before you notice it. Rubocop now goes way beyond style: the next release will include eight new cops for helping you catch bad performing code.

4. Mutation testing

Mutation testing helps me write better Ruby. It sniffs out dead code, helps me find missing tests, and generally helps me think about the assumptions I’ve made.

Originally published at https://cognitohq.com on January 6, 2016.

How Ruby hides complexity was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

5 Pry features every Ruby developer should know

Plaid Eng — Thu, 04 Jun 2026 17:01:04 GMT

By the Cognito team

Pry is a great tool for Ruby. You have probably used it by setting binding.pry in the middle of your code like so:

From: lib/dry/types/hash/schema.rb @ line 58 Dry::Types::Hash::Schema#try:

    40: def try(hash, &block)
    41:   success = true
    42:   output  = {}
    43:
    44:   begin
    45:     result = try_coerce(hash) do |key, member_result|
    46:       success &&= member_result.success?
    47:       output[key] = member_result.input
    48:
    49:       member_result
    50:     end
    51:   rescue ConstraintError, UnknownKeysError, SchemaError => e
    52:     success = false
    53:     result = e
    54:   end
    55:
    56:   binding.pry
    57:
 => 58:   if success
    59:     success(output)
    60:   else
    61:     failure = failure(output, result)
    62:     block ? yield(failure) : failure
    63:   end
    64: end

> (#)

Pry is much more than a tool for setting a breakpoint though. It is a great tool for exploring code interactively.

Discovering available methods

Pry provides a command called ls that lists methods and variables available in the current scope. In the code snippet above, the ls command would print out the following:

> (#) ls
##methods:
  hash
  inspect

Dry::Equalizer::Methods#methods:
  ==
  eql?

Dry::Types::Options#methods:
  meta
  pristine
  with

Dry::Types::Builder#methods:
  constrained
  constrained_type
  constructor
  default
  enum
  optional
  safe
  |

Dry::Types::Definition#methods:
  ===
  default?
  name
  options
  primitive?
  success
  constrained?
  failure
  optional?
  primitive
  result
  valid?

Dry::Types::Hash#methods:
  permissive
  schema
  strict
  strict_with_defaults
  symbolized
  weak

Dry::Types::Hash::Schema#methods:
  []
  call
  member_types

Dry::Types::Hash::Weak#methods:
  try

instance variables:
  @__args__
  @member_types
  @meta
  @options
  @primitive

locals:
  block
  e
  failure
  hash
  output
  result
  success

This is a breakdown of all the methods available in the current scope, grouped by the class or module that owns that method. It also lists the available instance variables and local variables. This is a very powerful tool for quickly understanding the role and responsibility of the code you are debugging.
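For intuition about where this grouping comes from, plain Ruby reflection can produce a rough approximation of it. The sketch below is not how Pry implements ls; the `Greeter` and `Person` definitions are made up for the example.

```ruby
# Rough approximation of Pry's `ls` grouping using plain Ruby reflection.
# Methods owned by Object, Kernel, and BasicObject are skipped to keep it short.
def methods_by_owner(object)
  object.public_methods
        .group_by { |name| object.method(name).owner }
        .reject { |owner, _| [Object, Kernel, BasicObject].include?(owner) }
        .transform_values(&:sort)
end

module Greeter
  def greet
    "hello #{name}"
  end
end

class Person
  include Greeter
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

person = Person.new('Ada')
methods_by_owner(person)  # groups :name under Person and :greet under Greeter
person.instance_variables # => [:@name]
```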

The ls command also lets you drill down into different parts of the current scope. We can use ls --locals (or its short form, ls -l) to view the names of local variables alongside their current values:

> (#) ls -l
result = {
  :name=> #    input=nil
    error=#      @success=false,
      @id=nil,
      @serializer=#>>}
hash = {:name=>nil}
output = {:name=>nil}
success = false
block = nil
e = nil
failure = nil

Learning without documentation

Pry makes it easy to search for methods under a namespace. For example, if we wanted to find methods for handling xpaths with Nokogiri, we can use find-method:

> find-method xpath Nokogiri

Nokogiri::CSS.xpath_for
Nokogiri::CSS::Node
Nokogiri::CSS::Node#to_xpath
Nokogiri::CSS::Parser
Nokogiri::CSS::Parser#xpath_for
Nokogiri::XML::Document
Nokogiri::XML::Document#implied_xpath_contexts
Nokogiri::XML::Node
Nokogiri::XML::Node#implied_xpath_contexts
Nokogiri::XML::NodeSet
Nokogiri::XML::NodeSet#xpath
Nokogiri::XML::NodeSet#implied_xpath_contexts
Nokogiri::XML::Searchable
Nokogiri::XML::Searchable#xpath
Nokogiri::XML::Searchable#at_xpath
Nokogiri::XML::Searchable#xpath_query_from_css_rule

We learn some interesting features from this list:

We can convert CSS selectors into XPaths
We can search XML documents with #xpath and #at_xpath

If we want to learn more about how to precisely use one of these methods we can use the stat command:

> stat Nokogiri::CSS.xpath_for
Method Information:
--
Name: xpath_for
Alias: None.
Owner: #<Class:Nokogiri::CSS>
Visibility: public
Type: Bound
Arity: -2
Method Signature: xpath_for(selector, options=?)
Source Location: /dev/gems/ruby/2.4.1/gems/nokogiri-1.7.2/lib/nokogiri/css.rb:22
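
Most of these fields are also available from Ruby's own Method objects, which helps when reading values such as an arity of -2. The sketch below uses a made-up `Selectors` module with the same signature as xpath_for, so it runs without Nokogiri installed.

```ruby
# Illustrative module (not Nokogiri) with the same signature as xpath_for.
module Selectors
  def self.xpath_for(selector, options = {})
    [selector, options]
  end
end

m = Selectors.method(:xpath_for)
m.arity           # => -2: one required argument, followed by optional ones
m.parameters      # => [[:req, :selector], [:opt, :options]]
m.owner           # the singleton class of Selectors
m.source_location # [file, line] where the method is defined
```

A negative arity of -n - 1 means the method takes n required arguments plus optional ones, so -2 corresponds to one required argument.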

If we wanted to learn how the method works, we can use show-source:

> show-source Nokogiri::CSS.xpath_for

From: /dev/gems/ruby/2.4.1/gems/nokogiri-1.7.2/lib/nokogiri/css.rb @ line 22:
Owner: #<Class:Nokogiri::CSS>
Visibility: public
Number of lines: 3

def xpath_for(selector, options={})
  Parser.new(options[:ns] || {}).xpath_for selector, options
end

This handful of commands is a great daily resource for debugging and exploring new gems. Give them a try!

Originally published at https://cognitohq.com on May 20, 2017.

5 Pry features every Ruby developer should know was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we secure service-to-service communication at Plaid

Brandon Wang — Tue, 02 Jun 2026 15:36:33 GMT

By Jianing Yu and Brandon Wang

Plaid runs hundreds of microservices that communicate continuously with one another. Historically, internal traffic relied on implicit trust within our network: any service running inside our VPC was considered trusted by default. As Plaid scaled and embraced a Zero Trust security model, especially the tenet “never trust, always verify,” we recognized the need to formalize service-to-service authentication and authorization.

We started the design and implementation in Q4 2022 and completed the rollout to all gRPC traffic in Q1 2024. Since then, the system has remained highly stable in production. In this post, we will walk through how we built and safely deployed service-to-service authentication and authorization at scale, and how we built guardrails that helped teams adopt explicit authorization without slowing down development.

Design principles

Before designing the solution, we established a set of guiding principles:

No new runtime dependencies: Avoid additional runtime calls that could impact latency, availability, or overall system reliability.
Support all types of services: Work consistently across Kubernetes services, CI/CD systems, Spark jobs, and Airflow jobs.
Support all major languages: Work seamlessly across the primary languages used at Plaid — Go, Python, and Node.js.
Simple adoption: Require minimal changes from individual service teams, as even small integration costs can scale into significant engineering overhead.
High reliability: Meet a very high reliability bar to avoid widespread incidents, since the system would sit in the critical path of nearly every internal request.

Design choices

mTLS vs token

Mutual TLS (mTLS) is a common approach for service-to-service authentication and is widely used in service mesh solutions such as Istio and Linkerd. We evaluated mTLS as a potential solution. At the time, limitations in Plaid’s infrastructure made adopting an mTLS-based approach impractical. As a result, we chose a token-based model for service identity, which allowed us to achieve strong authentication guarantees while better aligning with our existing infrastructure and operational constraints.

Kubernetes service account tokens vs AWS signed GetCallerIdentity

Because our workloads (Kubernetes services, CI/CD systems, and offline jobs) run on AWS, we initially considered using IAM roles as the service identity and having services prove their identity using a signed GetCallerIdentity request. However, AWS didn’t provide a concrete rate limit for the GetCallerIdentity API and described its limit as dynamic. For a global authentication bootstrap mechanism, especially during deployments where thousands of Kubernetes pods may start simultaneously, this lack of predictability was unacceptable. We could not risk authentication being throttled during critical rollout events.

In the end, we leveraged Kubernetes service account tokens to assert the identity of workloads running in Kubernetes, while non-Kubernetes environments continued to rely on signed GetCallerIdentity requests.

Architecture

We built Plaid Security Token Service (Plaid STS), a token exchange service that performs identity federation. At a high level, it validates the credentials issued by identity providers such as Kubernetes or AWS and issues a Plaid-native service identity token that can be uniformly trusted across our infrastructure.

End-to-end flow

1. Service startup: When a service starts, it presents a trusted credential to Plaid STS:

For Kubernetes services, this is a Kubernetes service account token.
For non-Kubernetes services, this is a signed AWS GetCallerIdentity request.

Plaid STS verifies the credential by calling the Kubernetes API server (for service account tokens) or AWS STS (for signed GetCallerIdentity requests). Once verification succeeds, Plaid STS issues a signed identity token that asserts the service’s identity.

2. Before gRPC requests: When making a gRPC request, the service attaches a Plaid STS-issued identity token and a locally signed, short-lived token. The short-lived token is narrowly scoped so that a captured token cannot be replayed.

3. Authentication: When a service receives a gRPC request, the gRPC server middleware verifies the Plaid STS-issued token and then verifies the caller’s signed token.

4. Authorization: After authentication succeeds, the service evaluates authorization policies to determine whether the caller is allowed to invoke the requested method.
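
Steps 2 and 3 can be sketched with local signing and verification only. Everything specific here is an assumption made for illustration: the RSA key, the claim names (sub, aud, route, exp), the 30-second lifetime, the token encoding, and the premise that the verifier already holds the caller's public key. The post does not specify any of these.

```ruby
require 'openssl'
require 'json'

# Hypothetical sketch: a short-lived token scoped to one audience and route.
def sign_request_token(private_key, subject:, audience:, route:, ttl: 30, now: Time.now)
  claims = { 'sub' => subject, 'aud' => audience, 'route' => route, 'exp' => now.to_i + ttl }
  payload = JSON.generate(claims)
  signature = private_key.sign(OpenSSL::Digest.new('SHA256'), payload)
  [payload, signature].map { |part| [part].pack('m0') }.join('.') # strict base64
end

# Returns the caller identity when the token is valid, otherwise nil.
def verify_request_token(public_key, token, audience:, route:, now: Time.now)
  payload, signature = token.split('.').map { |part| part.unpack1('m0') }
  return nil unless public_key.verify(OpenSSL::Digest.new('SHA256'), signature, payload)

  claims = JSON.parse(payload)
  return nil unless claims['aud'] == audience && claims['route'] == route
  return nil if claims['exp'] <= now.to_i # expired

  claims['sub']
end

key = OpenSSL::PKey::RSA.new(2048)
route = '/service.Service/GrpcMethod1'
token = sign_request_token(key, subject: 'k8s-service-a-production',
                                audience: 'service-b', route: route)

verify_request_token(key.public_key, token, audience: 'service-b', route: route)
# => "k8s-service-a-production"
verify_request_token(key.public_key, token, audience: 'service-c', route: route)
# => nil, a token captured for one audience cannot be replayed against another
```

Both operations are local cryptographic work, which matches the goal of keeping network calls off the request path.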

Alignment with design principles

No new runtime dependencies: Services only call Plaid STS during startup to obtain an identity token and periodically refresh it in the background. Request-time signing and verification are local cryptographic operations in gRPC middleware, with no network calls on the request path.
Support all types of services: Kubernetes and AWS-based workloads can federate identities into a Plaid STS-issued identity and share a consistent authentication model across Plaid’s infrastructure.
Support all major languages: We implemented middleware support in Go, Python, and Node.js, ensuring consistent authentication and authorization behavior across the ecosystem.
Simple adoption: gRPC services at Plaid use a shared gRPC library that includes client and server middleware. Kubernetes services also share a common bootstrap process. By integrating identity bootstrap, token signing, and verification into these shared components, we enabled service-to-service authentication without requiring individual service teams to write custom integration code.
High reliability: Plaid STS is deployed in every Kubernetes cluster to avoid creating a single global failure domain. We rolled it out gradually, cluster by cluster. Since Plaid STS is only used when a service starts or refreshes its long-lived identity token, a transient Plaid STS outage does not immediately impact running services.

Authorization policies

When designing our authorization policy language, our goal was to keep it simple and easy to reason about. The policy consists of a set of policy-wide flags and a list of allowRules. Each allowRule specifies which client services are permitted to call specific gRPC methods, along with a descriptive rule name for clarity. We also support wildcard rules, allowing a service to access all gRPC methods when needed for simplicity. Here’s an example authorization policy:

authorizationPolicy:
  allowRules:
  - name: Authorized Access From Service A
    routes:
    - /service.Service/GrpcMethod1
    - /service.Service/GrpcMethod2
    principals:
    - k8s-service-a-production
  - name: Authorized Access From Service B
    routes:
    - /service.Service/*
    principals:
    - k8s-service-b-production
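
To show how such a policy might be evaluated, here is a minimal sketch that loads the example above and returns the first matching rule, denying by default. The matching logic, including treating a trailing /* as "every method of that service", is an assumption about the semantics and not Plaid's middleware.

```ruby
require 'yaml'

POLICY = YAML.safe_load(<<~POLICY_YAML)
  authorizationPolicy:
    allowRules:
    - name: Authorized Access From Service A
      routes:
      - /service.Service/GrpcMethod1
      - /service.Service/GrpcMethod2
      principals:
      - k8s-service-a-production
    - name: Authorized Access From Service B
      routes:
      - /service.Service/*
      principals:
      - k8s-service-b-production
POLICY_YAML

# Assumed semantics: exact match, or a trailing "/*" covering a whole service.
def route_matches?(pattern, route)
  return route.start_with?(pattern.chomp('*')) if pattern.end_with?('/*')

  pattern == route
end

# Returns the name of the first matching rule, or nil (deny by default).
def matching_rule(policy, principal, route)
  rules = policy.fetch('authorizationPolicy').fetch('allowRules')
  rule = rules.find do |r|
    r.fetch('principals').include?(principal) &&
      r.fetch('routes').any? { |pattern| route_matches?(pattern, route) }
  end
  rule && rule.fetch('name')
end

matching_rule(POLICY, 'k8s-service-a-production', '/service.Service/GrpcMethod1')
# => "Authorized Access From Service A"
matching_rule(POLICY, 'k8s-service-a-production', '/service.Service/GrpcMethod3')
# => nil
```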

Guardrails

Service-to-service authorization fundamentally changed how engineers build at Plaid. gRPC calls are no longer implicitly allowed. Any new call that isn’t explicitly permitted by authorization policy will be denied by default.

During the initial rollout, engineering teams raised concerns that this could easily introduce outages if the authorization policies were not properly tested in lower environments. To mitigate these risks, we introduced multiple layers of guardrails embedded throughout the development lifecycle. These guardrails are designed to catch authorization issues as early as possible.

Pull request linters

We built several linters to proactively catch issues, including:

Detecting newly added gRPC methods in the proto file and prompting engineers to define corresponding authorization rules.
Validating that authorization policy changes are correctly formatted and reference valid gRPC methods.
Flagging risks when a single pull request includes both client and server changes for a new gRPC method — if the client change is deployed before the server’s authorization policy is active, it may trigger authorization failures.
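
The first two checks can be pictured as a comparison between the methods declared in a proto file and the routes named in a policy. The sketch below is an assumption about the shape of such a linter, not the one Plaid runs; the function names and the "/*" wildcard handling are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// covers reports whether a policy route covers a method.
// Assumption: a trailing "/*" matches every method of the service.
func covers(route, method string) bool {
	if strings.HasSuffix(route, "/*") {
		return strings.HasPrefix(method, strings.TrimSuffix(route, "*"))
	}
	return route == method
}

// LintPolicy returns the declared methods that no route covers, and the
// routes that do not match any declared method.
func LintPolicy(methods, routes []string) (uncovered, dangling []string) {
	for _, m := range methods {
		found := false
		for _, r := range routes {
			if covers(r, m) {
				found = true
				break
			}
		}
		if !found {
			uncovered = append(uncovered, m)
		}
	}
	for _, r := range routes {
		found := false
		for _, m := range methods {
			if covers(r, m) {
				found = true
				break
			}
		}
		if !found {
			dangling = append(dangling, r)
		}
	}
	return uncovered, dangling
}

func main() {
	// GrpcMethod3 and GrpcMethod9 are hypothetical names that trigger both findings.
	methods := []string{"/service.Service/GrpcMethod1", "/service.Service/GrpcMethod2", "/service.Service/GrpcMethod3"}
	routes := []string{"/service.Service/GrpcMethod1", "/service.Service/GrpcMethod2", "/service.Service/GrpcMethod9"}
	uncovered, dangling := LintPolicy(methods, routes)
	fmt.Println("methods without a rule:", uncovered)
	fmt.Println("routes without a method:", dangling)
}
```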

Unit test validation

Many engineers requested the ability to test service-to-service authorization in unit tests, since unit tests typically have much higher coverage than integration tests and are the most effective way to catch authorization issues early. But unit tests often mock gRPC calls, which means the authorization middleware doesn’t always run. To address this, we extended our existing mocking framework so that mocked gRPC clients automatically evaluate the corresponding authorization policy on each mocked gRPC call.
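
One way to picture this is a wrapper that runs the policy check before the mock returns its canned response. The sketch below is an illustration under assumptions: the `Invoker` and `PolicyFunc` types and the `WithPolicyCheck` helper are hypothetical names, not part of Plaid's mocking framework or of gRPC.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPermissionDenied stands in for a gRPC PermissionDenied status.
var ErrPermissionDenied = errors.New("permission denied by authorization policy")

// Invoker is the shape of a mocked unary call: method name in, canned response out.
type Invoker func(method string, req interface{}) (interface{}, error)

// PolicyFunc reports whether principal may call method.
type PolicyFunc func(principal, method string) bool

// WithPolicyCheck wraps a mock so the policy is evaluated before the canned
// response is returned, the way server middleware would reject the real call.
func WithPolicyCheck(principal string, allowed PolicyFunc, mock Invoker) Invoker {
	return func(method string, req interface{}) (interface{}, error) {
		if !allowed(principal, method) {
			return nil, fmt.Errorf("%s calling %s: %w", principal, method, ErrPermissionDenied)
		}
		return mock(method, req)
	}
}

// IsDenied reports whether err was caused by the policy check.
func IsDenied(err error) bool {
	return errors.Is(err, ErrPermissionDenied)
}

func main() {
	// Hypothetical policy: this caller may only use GrpcMethod1.
	policy := func(principal, method string) bool {
		return principal == "k8s-service-a-production" && method == "/service.Service/GrpcMethod1"
	}
	mock := func(method string, req interface{}) (interface{}, error) {
		return "canned response", nil
	}
	client := WithPolicyCheck("k8s-service-a-production", policy, mock)

	_, err := client("/service.Service/GrpcMethod2", nil)
	fmt.Println(IsDenied(err)) // true
}
```

With this shape, a unit test that exercises a call missing from the policy fails even though the network call itself is mocked.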

Developer environment validation

At Plaid, the developer environment runs and tests code locally and as part of CI. It provides a consistent environment for running services, interacting with them, and executing integration tests that depend on those services. In this environment, services don’t run in Kubernetes, so they don’t have access to Kubernetes service account tokens, and Plaid STS is not involved. Instead, we provision a lightweight, local service identity token for each service and use it to exercise the same authentication and authorization logic in gRPC middleware. This allows us to validate authorization policies without depending on production infrastructure.

Automated deployment safeguards

Finally, we integrated authorization metrics into our zero-touch deployment process, enabling automatic rollbacks when a new deployment causes an elevated rate of authorization errors. This gives us a final layer of protection if an issue reaches production.
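
The rollback decision can be reduced to a comparison of authorization error rates before and after a deploy. This is a simplified sketch: the `minRequests` and `maxIncrease` parameters are hypothetical tuning knobs, and the post does not describe the actual thresholds or metrics pipeline.

```go
package main

import "fmt"

// errorRate returns errs/total, or 0 when there was no traffic.
func errorRate(errs, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(errs) / float64(total)
}

// ShouldRollback compares authorization error rates before and after a deploy.
// It abstains until the new version has served minRequests calls, then trips
// if its error rate exceeds the baseline rate by more than maxIncrease.
func ShouldRollback(baseErrs, baseTotal, newErrs, newTotal, minRequests int, maxIncrease float64) bool {
	if newTotal < minRequests {
		return false
	}
	return errorRate(newErrs, newTotal)-errorRate(baseErrs, baseTotal) > maxIncrease
}

func main() {
	fmt.Println(ShouldRollback(1, 1000, 50, 1000, 100, 0.01)) // true
	fmt.Println(ShouldRollback(1, 1000, 2, 1000, 100, 0.01))  // false
}
```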

Conclusions

With Plaid STS, we introduced a unified identity model that federates infrastructure credentials into Plaid-native identities, enabling internal gRPC requests to be authenticated and authorized by default. Along the way, we learned that implementing a security feature is only a small part of the effort. Educating engineers, collecting feedback, and addressing concerns were critical to rolling it out safely at scale. Service-to-service authentication and authorization marked an important milestone toward a Zero Trust architecture at Plaid, and the system has proven to be highly reliable over the past two years.

If this kind of work interests you, check out our open roles.

How we secure service-to-service communication at Plaid was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

ConfigDB: from chaos to confidence with our unified app config stack

Plaid Eng — Mon, 01 Jun 2026 17:01:01 GMT

By Tim Ruffles

TL;DR we built a unified config system, watch the demo

Configuring applications is a problem that sneaks up on you. For example, Plaid’s config for connecting to financial institutions started simple. We stored the config as a blob of JSON in Git and deployed it alongside our services. But over time the number of services consuming it grew to the point where deploying them all to propagate changes proved too slow and error-prone. The edit flow became unworkable as the number of people editing it increased, and the config model grew more complex.

The system we’re discussing stores the config for how we display and communicate with financial institutions.

So next we migrated it to a database-backed service. This meant updates propagated at runtime, without deployments, and we could build an edit UI. This approach grew with us from 2017 to date, but over time we observed two pain-points, neither of which ever got to the top of our priority stack. First: a trickle of misconfiguration incidents — and a scary editing experience — because we had nothing like Git’s history and atomic reverts. Second: a frustrating development experience when extending config. Despite its origins as a simple CRUD service, it has become intimidatingly complex. To quote two of the engineers who worked on changing the configuration model:

Developing in the service was stressful because the slightest misconfiguration or bad migration could result in downtime, with no easy way to revert. Years ago, a teammate and I spent a few long months rearchitecting the data models. We’re quite pleased to be doing anything else now.

David Fish, Engineer

The code — weathered with years of changing invariants and business requirements — was impossible to understand.

Joanne Lee, Engineer

If this had been our only config system, this situation might not have warranted investment, but we had other systems for other config datasets with similar issues. All considered, we were doing a lot of work across the company to build and maintain multiple systems, and none of them was the combination we wanted: highly reliable, productive, and safe to use.

So we decided to replace them with a database specifically designed for application configuration: ConfigDB. The ‘DB’ name makes it sound like a huge project; thankfully I’m not here to tell you we built a competitor to Postgres! Instead we composed technologies we already trusted at Plaid — Git, GitHub, protobuf and S3 — into a system that met our application configuration needs across languages.

Watch the demo

https://videos.ctfassets.net/zucqsg1ttqqy/7yCAFTxp9gQl6JeFRz9Cuf/7e89029b6e430650db715d95a5d4233b/ConfigDB_Go_demo_-_Eng_Chat_2023_Feb.mp4

This demo walks you through creating a new configuration type, authoring some data, validating it, and then exposing it over a gRPC service. Below we’ll walk through the system — feel free to watch and read in whichever order you prefer.

Constraints & Desired Features

Our system had to meet these constraints to be acceptable as a platform for all our use-cases:

Availability — must guarantee availability of config in the critical path of our apps
Can handle our read load (our financial institution config has 12,000 reads per second on the client-side)
Runtime propagation of updates within ~1 minute
Handle datasets on the order of 100 MB
Programmatic edits, for services and to enable UIs for non-engineers
Full history, with easy reverts

Beyond that, ideally it would have features we felt we’d benefit from, but had been able to operate without:

Semantic data validation (e.g. mins, maxes, uniqueness, string patterns)
High productivity — it should be possible to add new configuration types, or add fields to existing types, solely by editing the schema
Typed and structured data model: structs, maps and lists, rather than just scalars.
Review tooling to preview, discuss and approve/reject changes
Configurable rollouts — ability to slowly roll out changes to config values

Comparing the various options we had in place in the table below, you can see that none of the existing configuration stacks did a great job against the majority of our requirements. That’s why we decided to build ConfigDB. The big idea was to again make GitHub our config system of record, giving us a full history and a review workflow ‘for free’. But now we’d add systems to provide runtime update propagation, and programmatic edits.

Developer Experience

ConfigDB data is organized into tables with schemas. We define a table’s schema by writing a .proto file:

// config/movies.proto

syntax = "proto3";
package config;
option go_package = "github.plaid.com/plaid/go.git/lib/proto/configpb";

import "configdb/configdb.proto";
import "google/protobuf/duration.proto";

message Movie {
 option (configdb.table) = {
   name: "movies",
   primary_key: "slug",
 };
 string slug = 1;
 string title = 2;
 // data is denormalized in configdb
 repeated Character characters = 3;
 google.protobuf.Duration runtime = 4;
}

message Character {
 string character = 1;
 string actor = 2;
}

Data is authored in YAML, which we parse and map into the protobuf schema (with support for nicer syntax for well-known types like durations and wrappers):

# by default, filename is the primary key, so: movies/big-lebowski.yml
slug: big-lebowski
title: The Big Lebowski

# well-known types like durations and dates have syntax support
runtime: 1h57m

characters:
 - character: The Dude
   actor: Jeff Bridges
 - character: Maude Lebowski
   actor: Julianne Moore
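
As an aside on the duration shorthand: the post does not say how ConfigDB parses it, but the `1h57m` form happens to be accepted by Go's standard `time.ParseDuration`, and a `google.protobuf.Duration` is a pair of `seconds` and `nanos` fields. The sketch below shows that mapping; the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// parseRuntime converts the YAML shorthand into a time.Duration.
func parseRuntime(s string) (time.Duration, error) {
	return time.ParseDuration(s)
}

// toSecondsNanos splits a duration into the two fields that a
// google.protobuf.Duration message carries.
func toSecondsNanos(d time.Duration) (int64, int32) {
	return int64(d / time.Second), int32(d % time.Second)
}

func main() {
	d, err := parseRuntime("1h57m")
	if err != nil {
		panic(err)
	}
	secs, nanos := toSecondsNanos(d)
	fmt.Println(d)           // 1h57m0s
	fmt.Println(secs, nanos) // 7020 0
}
```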

This data is stored in a Git repository hosted on GitHub Enterprise.

We access it via each language’s ConfigDB library. Here’s what that looks like in Go:

package yourgrpcserver

import (
   "context"

   "github.plaid.com/plaid/go.git/lib/configdb"
   "github.plaid.com/plaid/go.git/lib/proto/configpb"
)

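// MovieStats is the response type assumed by this example; only the field used below is shown.
type MovieStats struct {
   CharacterCount int
}

func (s yourserver) getMovie(ctx context.Context, id string) (*MovieStats, error) {
   // Open a read transaction so every read below sees the same config version.
   tx := s.cdb.Tx()

   // Typed getter generated from the Movie schema; it fails only if the row is missing.
   movie, err := configpb.GetMovie(ctx, tx, id)
   if err != nil {
       return nil, err
   }

   stats := MovieStats{}
   stats.CharacterCount = len(movie.GetCharacters())
   return &stats, nil
}

type yourserver struct {
   cdb configdb.DB
}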

The read API is transactional. This ensures that when a runtime update to the configuration arrives we don’t read rows from different versions and end up forming responses based on an inconsistent version of the dataset. The read is from memory, so the only error possible is a missing row.
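
To illustrate why a transaction gives a consistent view, here is a minimal sketch of snapshot-style reads: each update publishes a new immutable version, and a transaction pins whichever version was current when it began. This is an assumption about one way to get the described behavior, not the ConfigDB implementation; every type and method name here is hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is one immutable version of a table, keyed by primary key.
type snapshot struct {
	version int
	rows    map[string]string
}

// DB holds the current snapshot. Updates swap the pointer and never mutate
// rows in place, so readers holding an older snapshot are unaffected.
type DB struct {
	current atomic.Value // always holds a *snapshot
}

// NewDB creates a DB that starts with an initial version loaded.
func NewDB(version int, rows map[string]string) *DB {
	db := &DB{}
	db.Update(version, rows)
	return db
}

// Update publishes a new version for future transactions.
func (db *DB) Update(version int, rows map[string]string) {
	db.current.Store(&snapshot{version: version, rows: rows})
}

// Tx pins the snapshot that is current at the moment it is created.
type Tx struct{ snap *snapshot }

// Tx starts a read transaction.
func (db *DB) Tx() Tx {
	return Tx{snap: db.current.Load().(*snapshot)}
}

// Get reads a row from the pinned snapshot; the only failure is a missing row.
func (tx Tx) Get(key string) (string, error) {
	value, ok := tx.snap.rows[key]
	if !ok {
		return "", fmt.Errorf("row %q not found in version %d", key, tx.snap.version)
	}
	return value, nil
}

func main() {
	db := NewDB(1, map[string]string{"big-lebowski": "title as of version 1"})
	tx := db.Tx()
	db.Update(2, map[string]string{"big-lebowski": "title as of version 2"})

	pinned, _ := tx.Get("big-lebowski")
	fresh, _ := db.Tx().Get("big-lebowski")
	fmt.Println(pinned) // title as of version 1
	fmt.Println(fresh)  // title as of version 2
}
```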

You may have noticed that we have specific typed getters for each table. This gives us type-safe and ergonomic access to datasets, supporting things like composite primary keys. This is implemented via code-generation from the protobuf schema. Other supported languages — Python and TypeScript — get away with less code-gen as they’re dynamic, or have more expressive type-systems.

Architecture

The YAML data is stored in a Git repo hosted on GitHub Enterprise. As new commits pass validation and are merged to the main branch of our config repo, the configpush service pulls them down, converts them to protobuf, and pushes them into S3.

The application services rely only on S3 to get access to configuration. The current version is determined by an object in S3, and all data is read from there. Neither GitHub nor any Plaid service needs to be up for readers to pull config: S3 is the sole read dependency. This is important because operations like upgrading GitHub Enterprise can make it unavailable for 30–60 minutes.

Application nodes poll S3 to pull down new versions and load the rows into memory. Most of the logic lives in a Go binary that we run across languages to reduce duplication. Updates do not block queries: there is no networking in the query path, just a simple read from memory.

Per container architecture: application reads from library which reads from memory. Library reads updates from configpull binary, which reads from S3.

It’s also worth noting here how the design is shaped by the much looser constraints our configuration datasets have when compared against our application datasets:

They’re smaller: ~100 MB at most, easily able to fit into memory and very fast to pull down within AWS
They change much less frequently — at most a few writes a minute
Eventual consistency is far less of a problem — it’s not vital nodes agree (this also enabled us to run database-backed configuration systems with heavy client-side caching in the past)

An important principle in engineering: jump at opportunities to ‘cheat’, and solve an easier problem!

To summarize the important architectural attributes:

S3 is the only dependency for services to get access to an initial version of the data
Once a service has the initial version of the dataset, it never loses access to it. S3 can go down and the services continue to operate
Data is read from memory — there is no network request to fail or to impose unexpected latency
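
The second property can be sketched as a refresh step that only replaces the in-memory dataset after a successful fetch. This is an illustration, not the configpull binary: `Fetcher` stands in for the S3 read, all names are hypothetical, and the cache is single-goroutine for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Fetcher stands in for reading the current version and its rows from the object store.
type Fetcher func() (version string, rows map[string]string, err error)

// Cache keeps the last successfully loaded dataset in memory.
type Cache struct {
	version string
	rows    map[string]string
}

// Refresh performs one poll. If the fetch fails or the version is unchanged,
// the in-memory data is left as it was and reads keep working.
func (c *Cache) Refresh(fetch Fetcher) (bool, error) {
	version, rows, err := fetch()
	if err != nil {
		return false, err // keep serving the previous version
	}
	if version == c.version {
		return false, nil
	}
	c.version, c.rows = version, rows
	return true, nil
}

// Get reads a row from memory; no network call is involved.
func (c *Cache) Get(key string) (string, bool) {
	value, ok := c.rows[key]
	return value, ok
}

// staticFetcher returns a Fetcher that always yields the given version.
func staticFetcher(version string, rows map[string]string) Fetcher {
	return func() (string, map[string]string, error) { return version, rows, nil }
}

// unavailable simulates the object store being down.
func unavailable() (string, map[string]string, error) {
	return "", nil, errors.New("object store unavailable")
}

func main() {
	cache := &Cache{}
	cache.Refresh(staticFetcher("v1", map[string]string{"rate-limit": "100"}))

	_, err := cache.Refresh(unavailable)
	value, ok := cache.Get("rate-limit")
	fmt.Println(err != nil, value, ok) // true 100 true
}
```

A real poller would call `Refresh` on a timer and publish the new rows atomically so that concurrent readers are never blocked.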

Validation & Approvals

We programmatically validate writes to either the schema or data before merging. To support this without redeploying the Go service that performs the validation, we read the schema protobuf dynamically using the jhump/protoreflect package. This schema is transformed into JSON Schema, an implementation we chose so we can ensure validation behavior on the front and backend matches.

The syntax supports declarative validation rules, such as regular expression patterns, uniqueness, or mins and maxes. For instance, let’s say we wanted to ensure the ‘slug’ field above was a good fit for URLs using a pattern constraint:

string slug = 1 [
  (configdb.column) = {
    pattern: {value: "^[a-z]+(-[a-z]+)*$"}
  }
];
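
To see what this pattern accepts, the same regular expression can be tried with Go's standard `regexp` package. This only illustrates the pattern itself; ConfigDB performs the check through JSON Schema, and the `validateSlug` helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// slugPattern is the pattern from the schema annotation above:
// lowercase words separated by single hyphens.
var slugPattern = regexp.MustCompile(`^[a-z]+(-[a-z]+)*$`)

// validateSlug returns an error when the slug does not match the pattern.
func validateSlug(slug string) error {
	if !slugPattern.MatchString(slug) {
		return fmt.Errorf("slug %q does not match %s", slug, slugPattern)
	}
	return nil
}

func main() {
	for _, slug := range []string{"big-lebowski", "Big_Lebowski", "big--lebowski"} {
		fmt.Println(slug, validateSlug(slug) == nil)
	}
}
```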

Storing the data in a GitHub repo allows us to use the same GitHub CODEOWNERS workflow we use for any other source code to enforce blocking reviews of configuration changes where necessary.

Why Protobuf?

We’re big users of protobuf and gRPC at Plaid. Anywhere we expose configuration data over an API we do it over gRPC. This meant our previous config systems required a large amount of code simply to take data from the database/disk representation and map it into protobuf generated types. So instead, it made sense for us to propagate the data as protobuf in ConfigDB. In cases where the data would be passed through RPCs, no conversion would be required, and we were already happy with our protobuf tooling for providing type-safe access to data across languages.

The forwards/backwards guarantees of protobuf are also ideal for our data-transfer protocol for propagating changed configuration values to application containers. Applications that had an older schema would have no issues loading data from a newer one — unknown fields are simply ignored. As elsewhere, our buf linting rules made sure that only safe changes could be made to the schema: not making an incompatible change in the type of a field, for instance.

Finally, investing more heavily in protobuf both benefited from and bolstered network effects: we drew on our engineers’ existing knowledge of the protobuf model and on existing code-generation and linting tools.

Roads Not Taken

An alternative design for this system could use MySQL rather than Git as the system of record. Config could be stored as protobuf blobs in the DB and propagate via S3 in the same way. This would have provided a faster write path (with a better SLA) and a more familiar programming model. On the flip side, we’d lose the clarity of having config files for authoring and referencing, forcing us to rely on the UI even for local development. We’d also have to reimplement the features we wanted from Git’s history model and GitHub’s UI and review workflow.

Where We’re At

We’re using ConfigDB in production for several services at Plaid and are happy with the results. For instance, having the ability to change our API rate-limits at runtime has already proved powerful in mitigating load spikes or urgent requests for more capacity. Operating it has been a lot more straightforward than our existing MySQL-based stack. Even if we take down GitHub or the configpush service, application services continue to have access to their configuration and run unaffected.

Our next investment in ConfigDB will be a generic edit user interface. This will use the JSON Schema generated from the schema to make CRUD edit UIs come “for free”, while allowing for easy customisation in cases where more control is required.

Acknowledgments

Thanks to the rest of the team: Andrew Chen, Andrew Yang, Wil Fisher, John Kim, Ioana Radu and Mike Rowland! Beyond that, thanks to the product teams who highlighted this was a pain-point — Roemer Vlasveld and Aditya Sule in particular; engineers who shared experience with similar systems; the many reviewers of the spec; and our beta-testers.

We’re also grateful to other companies that shared their experience designing Git-backed config systems, e.g. Facebook’s Configurator. While we didn’t end up with a design that resembled them particularly closely, reading about them was invaluable and gave us confidence in the approach.

Originally published at https://plaid.com on June 6, 2023.

ConfigDB: from chaos to confidence with our unified app config stack was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Remote at Plaid

Plaid Eng — Fri, 29 May 2026 14:01:01 GMT

Engineering manager April Goldman shares Plaid Engineering’s approach to remote work during COVID-19

By April Goldman

COVID has forced us to reimagine what “work” means to us. I’ve been working as a fully remote engineer for over five years and want to offer my perspective on what I’ve seen work well. At its heart, remote work isn’t that different from being colocated: after all, we’re all people. Like my teammates who are usually colocated (but are remote for now), I drink too much coffee, wrangle git, and have a love/hate relationship with Jira workflow configurations. But, most importantly, we’re all passionate about growing, teaching and learning from each other, and building impactful tools for our users and customers.

Creating Our Remote Office

Why did we hire remote engineers in the first place?

We saw several opportunities:

Retention: talented team members had to move away from San Francisco for personal reasons, and we wanted to keep them as Plaids.
Culture: we celebrate an autonomous and impact-focused culture and saw that as a natural fit for remote work.
Diversity: there’s amazing talent everywhere — and we wanted to hire amazing engineers all over the world to help us build world-class products.

When we decided to embrace remote engineering teams, we wanted to be intentional in creating a culture that would set our remote team members up for success. Since our remote engineers tackle the same kinds of problems as our colocated teammates, one way to help them feel engaged was to treat the remote team as a fully distributed “office” with its own culture, its own management team, design, product, and support partners, and the power to shape its own ways of working. To that end, we created core surface areas per office/remote team so that they are all invested in the same north star. For our remote team, that surface area is the consumer experience, which focuses on building products like Plaid Link SDKs on web, iOS, and Android, and our consumer portal. We focus on everything that enhances the end-user experience with Plaid-integrated apps.

Because each of our offices is responsible for core aspects of our business, our teams work together to overcome friction across geolocations: this can manifest as more async collaboration, or scheduling meetings mindful of timezones, or remembering co-workers across offices when sourcing technical input. We didn’t want to create a remote island at Plaid, we wanted to create a cohesive remote culture made up of people who support each other, enjoy working together in a different kind of way, and rely on and are relied on by the rest of our engineering team.

About two years into our experience with remote work, we hope sharing our approach will be helpful to others who are starting their own remote work journey. In light of the global impact COVID-19 has had, with many teams transitioning to remote work for the first time, we hope this can be a useful resource to our community.

Creating Our Remote Values

When it comes to describing our remote culture, we’ve identified five pillars that have helped us grow.

Take Advantage of the Good
There are challenges to remote work (we’ll get into that), but there are so many upsides, and we want our remote team to fully enjoy them.

Keep flexible schedules. Our remote team sets its own hours, which means we are online for the hours that work best for us. Parents can start the day early and leave in time to pick up kids from school. Night owls can shift their work to the wee hours. Personally, I love to go on a family trail run around 3pm and catch up on work in the evening. We encourage our team to enjoy flexibility and create the work-life balance that enables them to thrive. Each team designs a team schedule for efficient synchronous and asynchronous work, with a set of committed overlap hours to be online that the team revisits when a new member joins.
Make relocations low-friction. When a member of our team needs to move, say for their partner’s work, or to be closer to family, or simply to start a new adventure, we support them on the next leg of their journey. Our remote team has work options across the US, Canada, and Europe, and our managers and People team are here to help manage logistics around relocation.
Bond over the wfh stuff. I know that my teammate Andy’s cat likes to go outside during our 9am Wednesday meeting, and my teammate Jan is growing a tiny avocado tree in his kitchen. I know who has a Peloton! I know that all of our Designers have beautiful home offices, which is definitely not a coincidence. Remote bonding can go deep, and that’s part of the fun of getting a peek into each other’s homes, and lives, each day.
Celebrate together. We share and celebrate the things that make our work arrangement special. Slow cooking a brisket all day? Getting into homemade bread? Fostering a kitten? We use Slack channels and lightning talks to encourage everyone to share; we think wfh fun is awesome.
Highlight our diverse pool of people. We know that there are many people who can’t move to a tech hub city, and our remote team enables recruiting to reach out to great talent across broader geographies. We believe our diversity makes us stronger.

That’s some of the fun stuff, but there are also real challenges to remote work, and our culture includes the ways we support each other through them.

Tend to Relationships
We don’t see each other nearly as often. Nowhere. Nearly. As often. We spend a lot of time looking at our computer screens, and, if we’re lucky, maybe one of our cats. It can be lonely, so we build in ways to nurture our relationships with each other. This is true to our core principles at Plaid: it’s extremely important that we grow together, and to do that we want to foster a culture of communication and feedback so that we’re always learning.

We have daily morning talk time, which runs 30 minutes, longer than the usual stand up, when we talk through decisions and share updates.
We start almost all our meetings with team “vibrations,” checking in on how everyone is doing before getting down to work.
We have virtual events: online game nights, get-to-know-you questions, virtual lunches, and holiday gift exchanges via the actual mail, to name a few.
We use peer 1–1s to build relationships within teams.
Our in-person off-sites, which we hold twice per year, are one of our favorite ways to bond as a team. We spend five days developing the next evolution of Plaid’s consumer experience, playing board games, and cooking food together. We look forward to resuming our off-sites when gathering together becomes safe again.

There’s no one way to build relationships, and we’re always trying new ways to get to be humans together.

Choose the Right Tools
We can’t grab a whiteboard and talk things out, and we can’t lean over someone’s desk to see what they are pointing at on their screen. We can’t ask someone over string cheese in the micro-kitchen whether they are planning to pick up that bug ticket. There are a lot of great tools out there for remote work, and they help us fill in the communication gaps.

The most important tool of all is just remembering the remote truism: You can’t over-communicate. But please try. That said, we have all kinds of tools we use to help us try to over-communicate.
Tuple is our favorite for remote pair programming.
We prefer the Standuply Slack integration for async stand-ups; handy across timezones.
Even things as simple as leaving daily updates in our Jira comments go a long way for cross-time zone communication.
Another favorite? The phone. We love taking meetings while we go on walks together.

Any tool that makes communication easier and more pleasant, we’ll experiment with in our team toolbox.

Practice Self Care
The worst version of wfh is when life and work blend together. We look at our clock and it’s 7pm, then 8pm, and we’re still working. Our meetings are back to back, and hours have passed since we last stood up. Blood is pooling in our legs. What day is it? This is not what we want for our remote team.

We encourage our team to put bounds on working hours and block off on the calendar the start and end of their days — and their lunches.
We model as leaders and celebrate when people take breaks.
We are serious about our unlimited PTO and we remind people to take time off.
Company-wide meetings are scheduled to be cross-time zone friendly, and we use AMA forms and recordings so that remote team members can contribute to and experience company-wide events without stretching their work hours.
We make sure team members have time zone neighbors so no one has to stay online late to get help.

We expect our team members to be with us for the long haul, and that means creating sustainable practices and having open conversations about how we can continuously improve.

[For Managers] Commit to Growth
A common question we hear from candidates is what the upper limit is for growth on our distributed team. There’s no upper limit to the contributions you can make at Plaid Engineering, no matter where you work.

We have executive buy-in and support all the way from our CEO Zach, our head of engineering Jean-Denis, to all the engineering managers who know our remote team members are crucial to the success of our business.
We know our remote engineers need opportunities to grow just like their in-office peers. Our managers advocate for the avenues by which remote team members can contribute to company-wide work. They connect their people to opportunities and sponsor them for projects.
Some of Plaid’s most senior technical contributors are on our distributed team today, modeling for all remote engineers what leadership and impact can look like as their careers grow.
We hold a twice annual, in-person off-site for senior technical contributors across offices to gather for discussion and direction setting. Because our technical leadership isn’t geographically centralized, this gathering creates a technical vision inclusive of all our offices and product domains. This is another tradition we look forward to resuming when travel and meeting up become safe again.

We bring the Plaid growth spirit to our distributed team and want to see all technologists, regardless of which office they are in, shaping Plaid’s future.

What’s Next?

First, we hope this is a helpful resource to anyone supporting their team through remote work due to COVID-19. We know that going remote is a journey, it takes practice, and defining your own remote culture is a task that’s never really done; you’ll always be learning and improving. Plaid’s remote team is growing, and we’re excited for the new team members who will help shape our remote culture’s next iteration. Apply for one of our open remote roles on our Careers page.

Originally published at https://plaid.com on July 23, 2020.

Remote at Plaid was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Plaid reconciles pending and posted transactions

Plaid Eng — Thu, 28 May 2026 17:01:01 GMT

By Kevin Hu

Plaid’s API helps developers provide financial services to tens of millions of consumers across North America. These services help consumers manage their personal finances, let them transfer funds and make payments, and allow them to access loans and mortgages. Our mission is to improve people’s lives by delivering access to the financial system.

We work toward this mission not only by helping consumers to access their financial data, but also by improving the quality of that data. Enriching data with machine learning is one of the objectives of our Data Science and Infrastructure team. In this post, we’ll talk about one of the ML models our team built, as well as the stepping stones it took to get here.

The pending-posted problem

One way Plaid adds value to transactions data is by identifying when pending transactions from a consumer’s account become posted. A transaction is pending when it is being processed by the bank. While it is pending, the amount is deducted from the account owner’s available funds but not from the account balance. Once the transaction settles, the transaction goes from “pending” to “posted”. A posted transaction is mostly finalized, and the monetary amount has been withdrawn from the account.

When Plaid takes snapshots of accounts, we receive a list of transactions with descriptions, monetary amounts, and whether the transaction is pending or posted. While we know if transactions are pending versus posted, banks often do not tell us which pending transactions from a previous snapshot correspond to the new posted transactions from the current snapshot. This matching is critical to clients. If they send notifications to consumers with each new transaction, it’s important that they don’t duplicate those notifications.

Unfortunately, it’s often not obvious which posted transactions map to which previously pending transactions. A common difficult matching problem is restaurant bills. When a consumer’s credit card is charged for their bill at a restaurant, the restaurant initiates a pending transaction. It does not include service charges and tips. Once the restaurant’s receipts are batched (often at the end of the business day), they finalize the transactions by adding on the gratuity to the transactions. This is when the transactions become posted.

There are other situations in which corresponding pending and posted transactions may look different. Hotels often leave higher pending charges as holds on the account for incidental fees. Bars create single-dollar pending transactions for open tabs, which settle to the actual bill amounts once the transactions post. Merchants, payment processors, and financial institutions each may change the transaction descriptions.

Our high-level approach to this problem is to build a model to predict the likelihood, or match score, that a given pending and posted transaction represent the same underlying transaction. If a pending transaction disappears from one account snapshot to the next, we match it with the “most likely” posted transaction that appeared on the new snapshot. We greedily continue matching while match scores are above a certain threshold.
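
The greedy step can be sketched as follows, assuming the model has already produced a score for every candidate pair. The function and type names are hypothetical, and the real system's tie-breaking and threshold are not described in this post.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is one (pending, posted) pair with its model score.
type candidate struct {
	pending, posted int
	score           float64
}

// GreedyMatch pairs pending and posted transactions in descending score order.
// scores[i][j] is the match score of pending i and posted j. Each transaction
// is used at most once, and pairs scoring at or below threshold are ignored.
func GreedyMatch(scores [][]float64, threshold float64) map[int]int {
	var cands []candidate
	for i, row := range scores {
		for j, s := range row {
			if s > threshold {
				cands = append(cands, candidate{i, j, s})
			}
		}
	}
	sort.SliceStable(cands, func(a, b int) bool { return cands[a].score > cands[b].score })

	usedPosted := map[int]bool{}
	matches := map[int]int{} // pending index -> posted index
	for _, c := range cands {
		if _, taken := matches[c.pending]; taken || usedPosted[c.posted] {
			continue
		}
		matches[c.pending] = c.posted
		usedPosted[c.posted] = true
	}
	return matches
}

func main() {
	// Hypothetical scores for two pending and two posted transactions.
	scores := [][]float64{
		{0.9, 0.2},
		{0.3, 0.7},
	}
	fmt.Println(GreedyMatch(scores, 0.5)) // map[0:0 1:1]
}
```

Greedy matching takes the best remaining pair at each step, so it is simple and fast but does not guarantee the assignment with the highest total score.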

The crux of the problem is choosing a model for determining this match score.

Trees

To solve this problem, we initially thought of rules that would tell us how likely a given pending and posted transaction are to match. Here’s a visual representation of example rules to match pending and posted transactions initiated by restaurants:

This rule-based approach is called a decision tree, which segments the space of independent variables, like information about the transactions, and attempts to find the regions of this space likely to correspond to matching transactions.

While the decision tree in the above visualization outputs boolean predictions, the decision trees usually used in more powerful machine learning, including in our models, output likelihood predictions instead.

Algorithms exist for training decision trees, but in practice, standalone trees are rarely used. This is because they tend to learn the noise behind the training data instead of the underlying relationships within the data. For example, suppose our training dataset included many different transactions whose descriptions were simply the names of ridesharing services. A decision tree might erroneously learn that descriptions don’t matter, since so many pairs of non-matching transactions would have similar descriptions.

This issue is called overfitting.

Overfitting

Excess model complexity results in overfitting as it allows the model to contort to the training data. Overfitting is known as “high variance” because overfit models are strongly dependent on training data, and small changes in the training data will result in large changes in predictions. On the other hand, insufficient variables and insufficient model complexity result in underfitting, in which the model is too inflexible to find meaningful relationships within the training data. Underfitting is known as “high bias” because underfit models have significant systematic prediction errors, or bias.

A fundamental challenge in data science is the bias-variance tradeoff. Carelessly increasing model complexity leads to higher variance and lower bias. If our models optimize purely on bias measurements such as accuracy on the training set, they will tend to overfit.

Bagging

To solve the pending-to-posted matching problem without overfitting, our first model augmented the concept of decision trees using bagging and feature sampling. Let’s first discuss bagging, which refers to bootstrap aggregating.

“Bootstrapping” is the process of training models on random samples of the training data. By limiting the amount of data used in the training process, bootstrapping combats overfitting by providing different noise profiles during training.

“Aggregating” is the process of combining many different bootstrapped models. With bootstrapped trees, the aggregation process typically lets the trees “vote” by computing the average of the likelihoods predicted by the trees. Since the training subsets are randomly sampled, the decision trees still fit the dataset on average, but the voting gives a more robust prediction.

Combining bootstrapping and aggregating results in bagging.
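
To make the two halves of bagging concrete, here is a minimal pure-Python sketch. It assumes a single numeric feature and one-split trees ("stumps"), which is far simpler than a production model; the data is synthetic and all names are hypothetical. Bootstrap samples are drawn with replacement, which is the usual convention.

```python
import random
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, label in {0, 1})
Stump = Tuple[float, float, float]  # (threshold, p_left, p_right)


def fit_stump(data: List[Sample]) -> Stump:
    """Fit a one-split tree that predicts the mean label on each side."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        if not left or not right:
            continue
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right)
        err = sum((y - p_left) ** 2 for y in left)
        err += sum((y - p_right) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, threshold, p_left, p_right)
    if best is None:  # all feature values identical: predict the base rate
        rate = sum(y for _, y in data) / len(data)
        return (float("inf"), rate, rate)
    return best[1:]


def predict_stump(stump: Stump, x: float) -> float:
    threshold, p_left, p_right = stump
    return p_left if x <= threshold else p_right


def fit_bagged(data: List[Sample], n_trees: int, rng: random.Random) -> List[Stump]:
    """Bootstrap: train each stump on a resample drawn with replacement."""
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]


def predict_bagged(stumps: List[Stump], x: float) -> float:
    """Aggregate: average the likelihoods predicted by the individual trees."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


rng = random.Random(0)
# Synthetic data: the label is 1 when the feature exceeds 0.5, with 10% label noise.
data: List[Sample] = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

forest = fit_bagged(data, n_trees=25, rng=rng)
print(predict_bagged(forest, 0.9), predict_bagged(forest, 0.1))
```

Each stump sees a slightly different noise profile, and the averaged likelihood is steadier than any single stump's output.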

The bagged model reduces variance more if the component models are uncorrelated. However, only bootstrapping on different samples of training data often results in trees that have highly correlated predictions, because the most informative branching rules are often similar across sampled training data. For example, because transaction descriptions are a strong indicator of whether or not a pending and posted transaction match, most of our trees will rely heavily on this indicator. In this case, bagging has a limited ability to reduce the variance of our overall model.

This is where feature sampling comes in.

Random Forests

To reduce the trees’ correlation, our model also randomly sampled features in addition to randomly sampling training data, resulting in a random forest. A staple in data scientists’ toolkits, random forests are powerful predictors with low overfitting risk, high performance, and high ease of use.

This was the model that Plaid used for several years to match pending and posted transactions. Over time, this method proved to be effective, but not excellent: we noticed a high false negative rate when we evaluated the model against human-labeled data. We needed to improve the model so it would more reliably find matches.

When Random Forests Fail

Random Forests, and bagging in general, are susceptible to underfitting imbalanced datasets. Our random forest models for pending-to-posted matching suffered from this problem. Since each pending transaction has at most one posted transaction in a training set, most candidate pairs of pending and posted transactions are not a match. This meant our training sets had an imbalance in which a large majority of the data was “not matching”; as a result, our random forest model erred on the side of predicting lower probabilities of matching, resulting in a high false negative rate.

Boosting

To solve this problem, we used boosting. Boosting restricts decision trees to simple forms — for example, trees that aren’t very deep — in order to reduce the bias of the overall model. The boosting algorithm iteratively explores the training data, adding the restricted trees that maximally improve the aggregate model. As with bagging, the trees vote to come up with a final decision.

This process eventually learned that improving performance on the minority case — pairs of transactions that do match — would maximize model improvement. The algorithm dove deep into identifying what conditions predict that case. With well-tuned hyperparameters, we finally saw a major improvement in our false negative rate.
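
The post does not say which boosting algorithm was used, so the following is only a generic sketch of the idea: least-squares boosting with one-split trees on a toy, imbalanced dataset. Every name and number here is an assumption made for illustration. Each round fits a restricted tree to whatever the current ensemble still gets wrong, then adds it with a small learning rate.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, value_left, value_right)
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_residual_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Find the single split that best fits the current residuals."""
    best_err, best = float("inf"), (xs[0], 0.0, 0.0)
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        v_left, v_right = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - v_left) ** 2 for r in left)
        err += sum((r - v_right) ** 2 for r in right)
        if err < best_err:
            best_err, best = err, (t, v_left, v_right)
    return best


def boost(xs: List[float], ys: List[float], n_rounds: int, learning_rate: float) -> Model:
    """Add shallow trees one at a time, each fit to the ensemble's remaining error."""
    base = sum(ys) / len(ys)
    stumps: List[Stump] = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, v_left, v_right = fit_residual_stump(xs, residuals)
        stumps.append((t, v_left, v_right))
        preds = [
            p + learning_rate * (v_left if x <= t else v_right)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model: Model, x: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(vl if x <= t else vr for t, vl, vr in stumps)


def mean_squared_error(model: Model, xs: List[float], ys: List[float]) -> float:
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Imbalanced toy data: only 3 of 20 points are positive ("matching").
xs = [float(i) for i in range(20)]
ys = [1.0 if 8 <= x <= 10 else 0.0 for x in xs]

baseline: Model = (sum(ys) / len(ys), 0.5, [])  # always predicts the base rate
model = boost(xs, ys, n_rounds=50, learning_rate=0.5)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(model, xs, ys))
```

No single stump can isolate the narrow band of positives, but the sum of many stumps can, which is the sense in which boosting keeps adding trees that target the cases the ensemble still misses.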

Another advantage of boosting was the ability to flexibly define the “model improvement” metric during training. By assigning asymmetric penalties to false positives and false negatives, we trained a model more aligned with how those model errors asymmetrically impact consumers.
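
One common way to express asymmetric penalties is a class-weighted log loss. The sketch below is an assumption about what such a metric could look like, not the metric Plaid used; the weights are made-up values chosen only to show the effect.

```python
import math
from typing import Sequence


def asymmetric_log_loss(
    labels: Sequence[int],
    probs: Sequence[float],
    false_negative_weight: float,
    false_positive_weight: float,
) -> float:
    """Log loss where errors on true matches and true non-matches cost different amounts."""
    eps = 1e-12  # clamp probabilities so log() never receives 0
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            # A low probability on a real match is a (soft) false negative.
            total += -false_negative_weight * math.log(p)
        else:
            # A high probability on a non-match is a (soft) false positive.
            total += -false_positive_weight * math.log(1 - p)
    return total / len(labels)


# Two equally confident mistakes, penalized differently (hypothetical weights).
missed_match = asymmetric_log_loss([1], [0.1], false_negative_weight=5.0, false_positive_weight=1.0)
false_alarm = asymmetric_log_loss([0], [0.9], false_negative_weight=5.0, false_positive_weight=1.0)
print(missed_match, false_alarm)
```

With these weights a missed match costs five times as much as an equally confident false alarm, so training is pushed toward recovering the minority class.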

Results

Our new boosting model lowered our false negative rate by 96% compared to the random forest model, ultimately providing higher quality transactions data to our clients and consumers. In addition to internal metric improvements, we also saw a significant reduction in support tickets filed by our customers about pending-to-posted transaction matching.

It is essential to understand how the characteristics of machine learning model archetypes lead to different strengths and weaknesses. While our new model has made significant improvements in the quality of data we provide, it has tradeoffs of its own. Boosting is sensitive to the model improvement metric and to other hyperparameters that restrict how simple the trees must be. In this case, the improved consumer experience was well worth the careful training procedure and meticulous tuning.

There’s much more we have yet to explore. For example, which boosting algorithm is best, given that we work with a large number of categorical variables? Given that we process transactions multiple times daily for one in four Americans with bank accounts, how do we ensure our matching algorithm is fast enough to keep up? Are deep neural networks a worthwhile investment for this problem, given the difficulty in interpreting and explaining the reasoning behind their output?

If you want to help us answer these challenging questions and many others, or if you’re interested in learning more about how we use data science to empower financial services, e-mail me at kevin@plaid.com or check out plaid.com/careers.

Originally published at https://plaid.com on May 31, 2019.

How Plaid reconciles pending and posted transactions was originally published in Plaid Engineering on Medium.

Universal Transaction Categorization: How Plaid unified four ML systems into one

Wen Yao — Wed, 27 May 2026 15:19:22 GMT

By Wen Yao, Melody Zhao, Kevin Supakkul, Christine Zhou, Han Yu, Nick Sundin, Ozgur Seckin, Raghu Chetlapalli

One model change. No product code touched. Accuracy up 13–23% across every downstream product simultaneously.

That’s the payoff of UXC (Universal Transaction Categorization), Plaid’s unified categorization system. Here’s how we built it, and why unification turned out to be as much an organizational win as a technical one.

The problem: Four systems doing the same job

Plaid processes data associated with hundreds of millions of financial transactions daily. A raw description like ctlpquality inn debit hold needs to become something meaningful (in this case, a vending machine purchase at Quality Inn) before it can power a consumer’s budgeting apps, credit decisions, or fraud detection.

To serve the distinct needs of different consumers and customers, Plaid developed four specialized categorization systems:

Personal Finance Categories (PFC): 16 primary / 104 detailed categories, designed for budgeting and financial wellness apps.
Credit Categories (CC): 25 primary / 95 detailed categories, tailored for credit underwriting.
Income categories: 13 income-related categories supporting Credit’s Income Insights.
V1 categories (legacy): 600+ categories powering legacy products, never retrained.

Each maintained its own ML model, rule engine, labeling pipeline, and monitoring infrastructure. This made sense as each product evolved. But as Plaid’s transaction intelligence matured, we saw a bigger opportunity: improvements to the underlying categorization model that could benefit every product at once. That led us to UXC.

Designing the UXC taxonomy

A unified system requires a unified label space, but it also has to work for product teams that already have their own taxonomic languages. Our solution was the shim layer: a deterministic, product-owned mapping from UXC labels to each downstream taxonomy. This meant product teams could keep speaking their own language while the underlying model was upgraded without their involvement.

With that architecture in place, we established four criteria for every UXC category:

Unambiguous definition. Each category must have a clear, singular meaning. For example, a Starbucks transaction is FOOD_AND_DRINK_COFFEE, not FOOD_AND_DRINK_FAST_FOOD or FOOD_AND_DRINK_RESTAURANT. The boundary is explicit, accompanied by a clear description and representative examples. This matters because ambiguous categories produce inconsistent labels, which corrupt training data.
Maximum granularity. UXC operates at a finer level of detail than any downstream taxonomy, enabling lightweight many-to-one mappings. LOAN_DISBURSEMENTS preserves seven subcategories at the UXC level (LOAN_DISBURSEMENTS_AUTO, LOAN_DISBURSEMENTS_CASH_ADVANCES, LOAN_DISBURSEMENTS_EWA, LOAN_DISBURSEMENTS_MORTGAGE, LOAN_DISBURSEMENTS_PERSONAL, LOAN_DISBURSEMENTS_STUDENT, LOAN_DISBURSEMENTS_OTHER) and in the PFC taxonomy, these collapse into a single TRANSFER_IN_CASH_ADVANCES_AND_LOANS category.
MECE (Mutually Exclusive, Collectively Exhaustive). Every transaction maps to exactly one detailed category. If no meaningful category applies, it falls into an explicit “Other” bucket.
Backward compatible by design. Each downstream product implements a shim layer: a deterministic mapping from UXC labels to its own taxonomy. This is where product-specific logic lives.
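
As a rough illustration of a shim layer, the sketch below maps the seven loan-disbursement UXC labels named above to the single PFC category. The label strings come from the text; the dictionary name, function name, and error-handling choice are hypothetical.

```python
from typing import Dict

# Hypothetical PFC shim: many fine-grained UXC labels collapse into one PFC label.
PFC_SHIM: Dict[str, str] = {
    "LOAN_DISBURSEMENTS_AUTO": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
    "LOAN_DISBURSEMENTS_CASH_ADVANCES": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
    "LOAN_DISBURSEMENTS_EWA": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
    "LOAN_DISBURSEMENTS_MORTGAGE": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
    "LOAN_DISBURSEMENTS_PERSONAL": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
    "LOAN_DISBURSEMENTS_STUDENT": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
    "LOAN_DISBURSEMENTS_OTHER": "TRANSFER_IN_CASH_ADVANCES_AND_LOANS",
}


def to_pfc(uxc_label: str) -> str:
    """Deterministically map a UXC label to the PFC taxonomy."""
    try:
        return PFC_SHIM[uxc_label]
    except KeyError:
        # Fail loudly so a new UXC category cannot silently go unmapped.
        raise ValueError(f"no PFC mapping for UXC label {uxc_label!r}") from None


print(to_pfc("LOAN_DISBURSEMENTS_EWA"))
```

Because the mapping is a plain lookup owned by the product team, the model behind the UXC labels can change without the shim changing at all.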

The resulting UXC taxonomy contains ~130 detailed categories, sourced from the union of existing taxonomies where PFC and CC were already 80%+ overlapping.

Bootstrapping labels: AI annotation at scale

Building a new taxonomy creates an immediate cold-start problem: where do the training labels come from? With ~130 categories and transaction descriptions that are often cryptic (AMZN MKTP US*AB123, DIR DEP ACME CORP PAYR, POS DEBIT CHKFILA 333222121 NY NY), even expert human labelers struggle with edge cases.

We built an AI annotation pipeline to solve this. The system takes a transaction as input, including a normalized description, posted date, and amount, and assigns a UXC label through two stages.

First, an LLM scans the transaction description and extracts key descriptors such as merchant name, income source, payment type, and general location. For unfamiliar entities, the system performs targeted web searches to gather context. A cryptic description like VIDRINE AUTO PRT gets resolved to an auto parts store through search results, which then informs the categorization.

Second, a label assignment LLM receives the transaction metadata and any enriched context, along with the full UXC taxonomy definitions, and assigns the most appropriate UXC label.

We validated quality through iterative evaluation rounds against a human-labeled holdout set: running annotation, identifying disagreements, analyzing error patterns, and refining prompts until AI and human labels agreed more than 90% of the time.

With that quality bar met, we generated ~1 million labeled transactions. To avoid over-representing common merchants, we used embedding-based stratified sampling: embed a large transaction sample, cluster by semantic similarity, sample proportionally from each cluster, and supplement with high-volume transactions. This balanced head-of-distribution coverage with long-tail diversity.
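
The proportional sampling step might look like the sketch below. It assumes the embedding and clustering have already happened and that each transaction arrives tagged with a cluster ID; the "at least one per cluster" floor is an added assumption to keep long-tail clusters represented, and all names and data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_proportionally(
    items: Sequence[Tuple[str, int]],  # (transaction description, cluster id)
    total: int,
    rng: random.Random,
) -> List[str]:
    """Sample from each cluster in proportion to its size, with a floor of one."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for description, cluster_id in items:
        clusters[cluster_id].append(description)

    sampled: List[str] = []
    for cluster_id in sorted(clusters):
        members = clusters[cluster_id]
        # Proportional quota; the floor of 1 is this sketch's own assumption.
        quota = max(1, round(total * len(members) / len(items)))
        sampled.extend(rng.sample(members, min(quota, len(members))))
    return sampled


# Synthetic clusters: one very common merchant group and two rarer ones.
items = (
    [(f"c0_txn_{i}", 0) for i in range(90)]
    + [(f"c1_txn_{i}", 1) for i in range(9)]
    + [("c2_txn_0", 2)]
)
sampled = sample_proportionally(items, total=10, rng=random.Random(0))
print(len(sampled), sorted({s[:2] for s in sampled}))
```

Because of the floor, the result can slightly exceed the requested total; here the rarest cluster still contributes its single transaction.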

Model architecture: From BERT to a domain-specific foundation model

With the taxonomy and training data in place, we took a deliberate two-phase approach to the ML model: ship a reliable V1 quickly, then invest in a more powerful V2.

UXC V1: A BERT-based classifier

The first model fine-tuned a BERT encoder on our AI-annotated training data for multi-class classification, following the same approach described in our earlier posts on transaction categorization. It takes transaction descriptions and metadata as input and outputs a probability distribution over UXC labels.

V1 validated the core thesis: the unified taxonomy worked, downstream shims mapped cleanly, and accuracy already improved over the fragmented systems it replaced. It shipped to production within weeks.

UXC V2: Fine-tuning a transaction foundation model with CLERT

With the taxonomy validated, we turned to improving the model’s representations. The V1 BERT base encoder had no prior understanding of the cryptic, domain-specific language of bank transactions. UXC V2 replaces that generic encoder with CLERT (Contrastive Learning-enhanced Encoder Representations of Transactions), a domain-specific foundation model built by Plaid’s Data Foundations and AI team.

How CLERT works

A standard language model treats DIR DEP ACME CORP PAYR as a meaningless string of tokens, because it has never encountered truncated merchant names, abbreviations, and formatting conventions found in bank transaction descriptions. CLERT solves this by learning transaction-specific representations through contrastive learning before being fine-tuned on UXC labels:

1. Transaction interpretation. An agentic interpretation pipeline translates raw transactions into plain-English explanations. For each transaction, the system generates two correct “positive” interpretations and one plausible but incorrect “hard negative”.

2. Contrastive pretraining. Using a Multilingual-E5-Large encoder as its backbone (chosen for its strong performance on semantic similarity benchmarks and cross-task generalization), CLERT is trained on ~1M transaction-interpretation pairs. The result is a model that maps cryptic transaction strings and their plain-English meanings into a shared embedding space, so that semantically similar transactions end up close together, regardless of how they’re formatted or abbreviated. This helps CLERT learn the language of financial transactions, not just their surface tokens. The figure below illustrates this with “positive” examples as correct interpretations and “hard negatives” as plausible but incorrect.

Overview of the CL training objective

A qualitative view of the learned embedding space is shown in the figure below. Before contrastive learning, the E5 encoder keeps transaction descriptions (blue) and their interpretations (orange) in separate regions, with the orange points forming tight clusters far from the blue cloud. After contrastive learning, the tighter alignment confirms that the encoder has learned to recognize the semantic equivalence between a cryptic transaction string and its plain-English explanation.

3. Fine-tuning for UXC. A single linear classification layer is attached to the pretrained CLERT encoder and fine-tuned on the labeled UXC transactions as mentioned in the previous section. The pretrained representations give the model a massive head start on understanding transaction semantics.
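
The post does not name the exact contrastive loss used in step 2, so the sketch below uses an InfoNCE-style objective as a stand-in. The two-dimensional vectors are toy embeddings invented for illustration, and the temperature value is arbitrary. The loss is small when a transaction embedding is closer to its correct interpretation than to a hard negative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(
    anchor: Vector,
    positive: Vector,
    negatives: List[Vector],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss: negative log-softmax of the positive among all candidates."""
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - pos_logit


transaction = [1.0, 0.0]            # embedding of the raw transaction string
good_interpretation = [0.9, 0.1]    # embedding of a correct interpretation
hard_negative = [0.6, 0.8]          # embedding of a plausible but wrong one

aligned = contrastive_loss(transaction, good_interpretation, [hard_negative])
misaligned = contrastive_loss(transaction, hard_negative, [good_interpretation])
print(aligned, misaligned)
```

Minimizing a loss of this shape pulls each transaction toward its correct interpretations and pushes it away from hard negatives, which is the behavior the embedding-space description above relies on.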

Why a foundation model matters

CLERT’s pretrained representations unlock three practical benefits:

Data-efficient fine-tuning. Because CLERT already understands the language of transactions, downstream tasks require a fraction of the labeled data that a generic encoder would need to reach the same performance.
Fast adaptation to new domains. The same data efficiency makes expansion to new markets practical. Adapting to a different country’s transaction formats only requires a small set of local examples because the model already understands the structure of financial transactions; it just needs to learn local merchants and conventions.
A foundation for multiple tasks. Categorization is just one application. The same pretrained CLERT encoder can be fine-tuned for entirely different tasks like merchant name extraction with minimal additional data. Invest once in pretraining, then adapt cheaply to many downstream problems.

Results

UXC V2 delivers up to 13% higher accuracy on primary categories and 23% higher accuracy on detailed subcategories, with F1 gains of 10–30% on key categories like credit card payments, wages, and loan payments.

The V1-to-V2 upgrade required no changes to shim layers or downstream products. The model swap was fully contained within UXC: one fix, one deployment, universal impact.

Serving a 560M-parameter model at Plaid’s scale required care on the infrastructure side. FP16 quantization via ONNX Runtime produced no accuracy degradation versus FP32 and halved our GPU hosting footprint, keeping cost and latency within budget despite a ~5x parameter increase over V1.
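
A back-of-the-envelope calculation shows why FP16 halves the footprint. This sketch counts model weights only, which is an assumption; activations, runtime buffers, and batching overhead are ignored, so real GPU memory use will differ.

```python
def weight_memory_gb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Approximate memory needed to hold the model weights, in gigabytes."""
    return num_parameters * bytes_per_parameter / 1e9


PARAMETERS = 560_000_000  # parameter count quoted in the text

fp32_gb = weight_memory_gb(PARAMETERS, 4)  # FP32 stores 4 bytes per parameter
fp16_gb = weight_memory_gb(PARAMETERS, 2)  # FP16 stores 2 bytes per parameter
print(f"FP32: {fp32_gb:.2f} GB, FP16: {fp16_gb:.2f} GB")
```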

The real payoff: Shared vocabulary and faster iteration

The most immediate benefit of unification is the ability to improve the underlying model without requiring changes from downstream product teams. We proved this with the V1 → V2 upgrade: CLERT replaced the generic BERT encoder, every downstream taxonomy saw accuracy improvements, and no product team changed a line of code. One model upgrade, one deployment, universal impact.

But the technical gains were only part of the story. Aligning taxonomies across PFC, Credit, and Income forced teams to debate definitions, reconcile edge cases, and agree on what transaction labels should actually mean. The result was not just a shared model architecture but also a shared vocabulary.

What’s next

Merchant name parsing and normalization. We’re applying the same pretrained CLERT encoder to merchant name normalization: parsing and normalizing merchant names from raw transaction descriptions. CLERT’s understanding of transaction structure means this requires significantly less labeled data than a standalone Named Entity Recognition model would.

Sequential foundation models. CLERT understands individual transactions. The next frontier is understanding sequences over time, like regular paychecks followed by rent payments, or rapid fund cycling that may indicate fraud. This behavioral understanding will power the next generation of Plaid’s risk and insights products.

Conclusion

UXC taught us that the hardest part of unifying systems isn’t the model — it’s getting teams to agree on what words mean. In the end, the organizational alignment turned out to be as durable as the technical architecture.

If you’re interested in working on problems like this, we’re hiring!

Universal Transaction Categorization: How Plaid unified four ML systems into one was originally published in Plaid Engineering on Medium.

The Plaid internal MCP server

Plaid Eng — Tue, 26 May 2026 17:01:03 GMT

Maximizing the leverage of internal AI applications

By Zach Keller

The Internal MCP server and this blog post were created thanks to the hard work of many individuals at Plaid including: Jainil Ajmera, Allen Chen, Peter David, Evan Fuller, Zach Keller, Seyoung Kim, Charles Shinaver, Nathan Tindall, Roy Xu and many others.

At Plaid, we have supercharged our AI efforts by building a foundational system to give AI applications the best possible context. Built on top of the Model Context Protocol (MCP), developed by Anthropic, this system allows a variety of AI clients to access the data they need from our systems, and provides a consolidated, secure platform for working with AI at Plaid.

Now engineers are building agentic workflows that seamlessly integrate data used by employees in their day-to-day workflows like JIRA, application logs, and internal debugging interfaces. The platform has enabled us to build agents to triage bugs to improve support ticket resolution, look up data schemas to help data scientists write queries faster, and more! Let us show you how we did it.

AI context problems

It’s no secret that the better context you give to AI systems, the better they will perform. If you are trying to debug issues with a production system, out-of-the-box LLMs can only take you so far — they need the specific information from and about your system, like your Prometheus metrics or recent server logs.

Retrieval Augmented Generation (RAG) systems were one of the early attempts to solve this problem, and they are certainly useful. However, mental models of context as document-only, or even document-first, have not been able to keep pace with the problem solving capabilities of the latest LLMs.

A conceptual diagram of a simple RAG system

As the models advanced, and as our understanding of them grew, practitioners have gravitated towards more agentic systems that rely on tool use and other meta-primitives like Prompts and Queries. All of these concepts are represented in the MCP specification.

A conceptual example of MCP integration in a typical MCP client system

All of these ideas, from MCP servers generally to MCP primitives to ordinary RAG systems, still target the same underlying goal: delivering the right context to the underlying LLM at the right time. Most of the focus on context availability has been externally oriented: helping users access data that exists in managed services or services outside of the user’s owned systems, such as data from GitHub, Glean, or JIRA.

What’s missing

We felt this picture was still incomplete. Claude Code and Cursor, AI tools used by over 80% of Plaid engineers, have robust interfaces for connecting MCP servers already, but a few key problems limited their full efficacy in terms of velocity and security:

Managing one-to-many arrangements of MCP servers: There is a lot of variation in stability and quality when every engineer is managing their own arrangement of MCP servers in their local development setup. This introduces additional overhead and setup time, and ultimately limited how many people we saw experimenting with MCP.
Availability of MCP servers with service providers: Any integration with a third-party tool requires a new MCP server to be set up in order to be made accessible to the dev. So, for any individual tool we’d need to hope our vendor offers an MCP server, test it out, provide feedback to the vendor, and then harden the integration. This didn’t allow us to move fast enough — and it wouldn’t enable internal data use cases anyway.
MCP authentication and authorization: Authentication and Authorization with MCP servers is still fairly immature; not all of them support OAuth, and even if they did, our SSO integration or our enterprise self-hosted integration might not be directly supported.
Enabling access to internal data remains a challenge: Third party MCP servers don’t help us access data from our own internal systems.

Of these, the fragmented local setups and internal data access were the most acute blockers. Engineers were spending more time wrestling with server configs than with code. The other issues compounded that drag, creating manual workarounds and interdependencies that further eroded our speed. Altogether, these frictions forced engineers to divert development time into chores, or to abandon AI tools entirely, directly undermining our mandate to accelerate development.

Problem space

When thinking about how to solve this problem, we started by outlining the resources that we had, and the things that constrained us.

We realized Plaid already had existing security infrastructure to scalably manage user-based access to specific production resources and internal tools. For example, our internal permissioning system controls which gRPC methods a given user is allowed to call within our global gRPC debugging UX. Similarly, a constellation of existing services already managed the authorization token generation and verification portion of enabling that access at the user level.

Internally, Plaid runs an identity aware proxy that protects internal tools. A centralized authorization server for employees checks the access that an individual has to production resources. Since Plaid primarily runs gRPC services, we allow service owners to selectively enable employee debugging access by gRPC method. Our identity aware proxy validates that the employee is running a Plaid managed device and has authenticated through our identity provider before allowing the employee through to the gRPC service.

A signed identity token is then validated and parsed to get the employee’s identity. Finally, our centralized authorization server validates whether the employee has access to the gRPC method. We support CLI-based authentication using the Device Authorization Grant flow with DPoP and short-lived bearer tokens. Bearer tokens are signed with a private key pair initially generated and validated during the device flow. The CLI-based auth is used in the Internal MCP server to provide auth through a locally running proxy.
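
The layered checks described above can be summarized in a small sketch. It is deliberately simplified: token signing and verification are reduced to boolean fields, and the employee names, gRPC method names, and `GRANTS` table are hypothetical stand-ins for the real proxy and authorization server.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Request:
    employee: str          # identity parsed from the signed identity token
    managed_device: bool   # is the request coming from a company-managed device?
    authenticated: bool    # has the employee passed the identity provider?
    grpc_method: str       # method the employee is trying to call


# Hypothetical per-method grants that service owners have enabled.
GRANTS: Dict[str, Set[str]] = {
    "alice": {"/accounts.Debug/GetAccount"},
    "bob": set(),
}


def authorize(request: Request, grants: Dict[str, Set[str]]) -> bool:
    # Layer 1: the identity-aware proxy rejects unmanaged or unauthenticated callers.
    if not (request.managed_device and request.authenticated):
        return False
    # Layer 2: the authorization server checks access to this specific gRPC method.
    return request.grpc_method in grants.get(request.employee, set())


allowed = authorize(Request("alice", True, True, "/accounts.Debug/GetAccount"), GRANTS)
wrong_method = authorize(Request("alice", True, True, "/accounts.Debug/DeleteAccount"), GRANTS)
unmanaged = authorize(Request("alice", False, True, "/accounts.Debug/GetAccount"), GRANTS)
print(allowed, wrong_method, unmanaged)
```

Keeping the device check and the method-level check separate mirrors the split between the proxy and the centralized authorization server.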

On the constraint side, Plaid has a robust LLM data access policy that dictates what sorts of data can go to which kinds of LLM and when. Respecting this data access policy is a core constraint of any LLM system at Plaid, and we needed to ensure granular control over data access.

So, to enable us to move quickly and securely, we needed an approach that let us directly control all of the third-party integrations and document context and data from production services. But we didn’t want engineers to have to manage all of these connections themselves.

What we did

Given our needs for velocity and security, we implemented one central internal MCP server: a server that would connect engineers and their AI tools to the data they needed.

Our philosophy was to leverage the existing components, along with our LLM data access policy, to safely and securely abstract away all of the overhead that comes with adding context to user-driven internal LLM applications, and to do this in a way that minimizes dev-local tool management.

The diagram below illustrates an example of how the internal MCP server can be implemented for a typical use case.

Internal MCP diagram

This design has a few interesting points:

The internal MCP server is separated from the existing LLM gateway, which can be a separate component that lets internal users create specialized AI agents, but have them share the same tool library. This approach feels somewhat unnatural at first; the internal LLM gateway could alternatively connect to the internal MCP server as an MCP client. However, this approach has some advantages:
The tools that should be allowed for the internal MCP server and for the LLM gateway form an intersecting set, but not a subset in either direction. That is, there are tools that should be available in the internal MCP but not to agents, and vice versa. Enforcing this restriction at the service level would involve difficult-to-parse logic that is likely to become a footgun.
By pushing the tool definitions out to a library, data usage restrictions can be implemented at the tool level that address the root of where the allowed tool differences stem from.
The existence and importance of an LLM data usage policy — common at many firms — imposes certain constraints that make a centralized system of access control and verification, caller identity inspection, and audit logging more attractive than decentralized alternatives.
This system is built on top of the existing security framework for user-level permissions. While not universal, our view is that this sort of access control framework likely does exist at many companies today, and can be naturally replicated.
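
One way to picture the shared tool library is a registry in which each tool declares which callers may use it, with the check enforced when the tool is invoked. The sketch below is an assumption about the shape of such a library, not Plaid's code; the tool names, the `Surface` enum, and the handlers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, FrozenSet, List


class Surface(Enum):
    MCP_SERVER = "mcp_server"
    LLM_GATEWAY = "llm_gateway"


@dataclass(frozen=True)
class Tool:
    name: str
    allowed_surfaces: FrozenSet[Surface]  # restriction lives on the tool itself
    handler: Callable[[str], str]


class ToolLibrary:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def tools_for(self, surface: Surface) -> List[str]:
        return sorted(t.name for t in self._tools.values() if surface in t.allowed_surfaces)

    def call(self, surface: Surface, name: str, argument: str) -> str:
        tool = self._tools[name]
        # Enforced at call time, so every caller goes through the same check.
        if surface not in tool.allowed_surfaces:
            raise PermissionError(f"{name} is not available to {surface.value}")
        return tool.handler(argument)


library = ToolLibrary()
library.register(Tool("search_docs", frozenset({Surface.MCP_SERVER, Surface.LLM_GATEWAY}), lambda q: f"docs for {q}"))
library.register(Tool("read_server_logs", frozenset({Surface.MCP_SERVER}), lambda q: f"logs for {q}"))
library.register(Tool("summarize_ticket", frozenset({Surface.LLM_GATEWAY}), lambda q: f"summary of {q}"))

print(library.tools_for(Surface.MCP_SERVER))
print(library.tools_for(Surface.LLM_GATEWAY))
```

The two tool lists overlap without either containing the other, which is the intersecting-set relationship described above, and neither service needs its own allow-list logic.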

At a high level, this consolidated, one-to-one-to-many approach maximizes future extensibility. Accessing context sources directly, rather than waiting for plugins to become available, gives us fine-grained control over how those integrations work and frees us from availability constraints, such as which tools publish plugins.

Another supporting consideration is the LLM data usage policy. This design enables enforcement of these restrictions at the tool level at call time, ensuring policy compliance. By connecting to data sources directly via API and feeding that data straight into our internal LLM interface, we can move at full speed with zero external barriers.

All told, we have integrated more than 20 tools, a half-dozen internal services, our documentation, and more into the internal MCP server. Plaids have made thousands of tool calls and created dozens of agents across the engineering, product, and support spectrum that rely on the internal MCP server.

What’s next

You might have noticed some comments about an internal LLM gateway. The next step for our internal MCP server is to continue to build agent-building components on top of it that allow Plaids to create their own agents right out of the box, using the same tools that are in the internal MCP.

We’ve already built a UX service for creating and interacting with these agents, and expect to continue adding capabilities for agent creation, interaction, and integration into live services — all powered by the data access of the internal MCP server.

As with everything AI, the future is not assured. It is difficult to see how these quickly-evolving systems will develop. However, the need for immediate, task-specific, high quality context seems likely to remain important well into the future. And for that, there’s the internal MCP.

Originally published at https://plaid.com on July 29, 2025.

The Plaid internal MCP server was originally published in Plaid Engineering on Medium.

AWS SSO in a DevOps first world

Plaid Eng — Fri, 22 May 2026 17:01:02 GMT

By Ashish Kurmi

At Plaid, we believe in baking in security best practices at every step of the DevOps workflow. We have an automated CI/CD pipeline to manage AWS & Kubernetes resources and the production platform runs on it. This means in many cases, engineers at Plaid do not need to interact directly with AWS resources for daily production management and updates. However, at times, Plaid engineers use their AWS user identities for accessing AWS resources and Kubernetes clusters for development, support, and troubleshooting.

Before we enabled AWS SSO at Plaid, we used another third-party solution to federate our corporate user identity with AWS via Okta’s SAML federation. However, it did not provide good support for temporary CLI/API access as it did not provide an official CLI tool. Additionally, it was blocking us from utilizing some important protection controls for advanced MFA that were functionally incompatible with the older solution. For these reasons, in 2021, we planned to replace our solution with AWS SSO.

Due to some constraints described below, we took an unconventional approach (which is not uncommon in the industry) compared to a standard AWS SSO deployment. In this post, we’ll talk about how we built end-to-end automated solutions for our DevOps scenarios and our key learnings so far.

An unconventional approach to AWS SSO

In the older solution, each user role was defined as an AWS IAM role. Okta allowed us to map such IAM roles to specific Okta groups. Our initial approach was to convert all such IAM roles to SSO permission sets. However, we quickly realized that this approach would not work for a couple of reasons.

At Plaid, we author the least privileged IAM policy document for a given task. As AWS-managed policies grant broad access, we have multiple reusable snippets of custom IAM policies that allow users to achieve specific goals. For example, we have a custom IAM policy for granting read-only access to our AWS billing data. A typical user role has access to several custom policies. In 2021 H2, AWS SSO did not support customer-managed policies for permission sets. Furthermore, it only allows one custom inline policy with a maximum of 10 KB of policy content. These constraints made it difficult to migrate several of our existing team IAM roles to permission sets as highly restricted policies may include numerous resource restrictions or complex conditionals to ensure the least privileges are granted. AWS recently included support for customer-managed policies in AWS SSO, which alleviates some of these pain points.
For development and troubleshooting purposes, a few special user roles allowed certain teams to select their service IAM roles at login. As SSO creates dedicated IAM roles for user access, it won’t allow these teams to log into their service roles without performing additional manual steps.

Requirements

We have a high bar for optimizing the developer experience at Plaid. Engineers work collaboratively to reduce friction and maintain high development velocity. Wherever reasonable, we create easy-to-use self-service scenarios so engineers can complete engineering tasks and operations without relying on others. We also simplify our tooling wherever possible so that even engineers without the relevant domain knowledge can accomplish their everyday tasks. For these reasons, solutions that required manual steps or specific IAM knowledge were eliminated from consideration early on.

Solution

We knew the new federation system would need to eventually assume the existing team IAM roles until the constraints mentioned above are mitigated. To enable this scenario, we took the following approach for creating SSO permission sets.

We use Terraform for managing our AWS infrastructure including our AWS SSO deployment. We authored two internal AWS SSO Terraform modules to help us manage our AWS SSO Terraform templates with ease.
For every existing team IAM role, we created a new empty SSO permission set named {Team IAM Role Name}-Proxy. These proxy SSO permission sets don’t grant any privileges themselves. We mapped these proxy SSO permission sets to the relevant Okta groups.

// SSO Proxy role for Plaid-Security-Ops-Team IAM role (../../iam/security_ops_team.tf)
// Do not add any policy to this permission set. This should be an empty permission set.
module "security_ops_proxy_permset" {
  source                          = "git@"
  name                            = "Plaid-Security-Ops-Team-Proxy"
  proxy_destination_iam_role_name = "Plaid-Security-Ops-Team"
}

# Account Assignment
module "securityops_account_assignments" {
  source         = "git@"
  principal_type = "GROUP"
  principal_name = "SecurityOps Team"
  account_assignments = [
    {
      account_id          = local.awsacct_west,
      permission_set_arn  = module.security_ops_proxy_permset.arn,
      permission_set_name = module.security_ops_proxy_permset.name
    },
    ...
  ]
}

Once the above Terraform change is deployed, our CI/CD pipeline creates an SSO permission set. In addition, inside every AWS account that this permission set is assigned to (e.g., the awsacct_west account, as shown in the screenshot), AWS creates an IAM role that represents the proxy SSO permission set.

We then update the trust policy of the existing team IAM role so that it can be assumed by this newly created IAM role. For development and troubleshooting, we also created these proxy roles for a few service IAM roles. Essentially, the sole purpose of these proxy SSO permission sets is to assume the correct team IAM roles. The existing team IAM roles already have all the access policies defined on them, so once a user assumes one, their permissions are the same as before they logged in via AWS SSO.

data "aws_iam_policy_document" "plaid_security_ops_team_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type = "AWS"
      # This is the IAM role ARN for the Plaid-Security-Ops-Team-Proxy SSO permission set defined at ../global/sso/securityops.tf
      identifiers = ["arn:aws:iam::account_id:role/aws-reserved/sso.amazonaws.com/AWSReservedSSO_Plaid-Security-Ops-Team-Proxy_177ef47fe4c34086"]
    }
  }
}

# Security Operations Team IAM Role
resource "aws_iam_role" "plaid_security_ops_team" {
  name                 = "Plaid-Security-Ops-Team"
  max_session_duration = 28800
  assume_role_policy   = data.aws_iam_policy_document.plaid_security_ops_team_assume_role_policy.json
  tags = {
    Terraform = "Managed by Terraform"
  }
}
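For readers less familiar with Terraform, the trust relationship above boils down to a small JSON policy document. The following is a minimal Python sketch of that document, not Plaid's tooling; the function name `trust_policy` and the account ID in the example ARN are placeholders chosen for illustration.

```python
import json


def trust_policy(proxy_role_arn: str) -> dict:
    """Build a trust policy that lets one IAM role call sts:AssumeRole.

    This mirrors the shape of the Terraform policy document above.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Principal": {"AWS": proxy_role_arn},
            }
        ],
    }


# Placeholder ARN: the account ID is illustrative only.
example_arn = (
    "arn:aws:iam::123456789012:role/aws-reserved/sso.amazonaws.com/"
    "AWSReservedSSO_Plaid-Security-Ops-Team-Proxy_177ef47fe4c34086"
)
print(json.dumps(trust_policy(example_arn), indent=2))
```

Because the only trusted principal is the role that AWS creates for the proxy permission set, the team role can be reached only by users who were assigned that permission set.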

To access AWS resources, an AWS user would log into the proxy SSO permission set first. They would then assume the correct team IAM role before performing any operations.

AWS doesn’t have the functionality to automate the last assume role step in the login workflow described above. Asking Plaid engineers to manually perform the assume role operation would have resulted in user friction and dissatisfaction. To complete the entire login workflow automatically, we employed the following strategy.

CLI

The Plaid Infra team offers an internal CLI utility named megabin that allows engineers to perform common infra tasks with ease, such as bootstrapping a new backend service or accessing an RDS instance for troubleshooting. Plaid developers were already using megabin to create AWS CLI sessions with the previous solution. We extended it to allow engineers to set up their local AWS CLI environment using the proxy roles. When users set up their AWS CLI environment via megabin, the utility performs the following tasks:

Make sure that AWS Vault is installed and configured for securely storing and accessing CLI auth tokens in the keychain.
Initialize the AWS credential and configuration files.
Execute the AWS CLI command that walks the user through setting up the AWS configuration file.
Once this is done, adjust the configuration file to assume the correct team IAM role if required.

Users only need to complete this workflow once for a given role. When this is done, AWS CLI & SDKs can automatically renew expired sessions by launching a browser renewal workflow.
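To make the last step concrete, here is a rough Python sketch of what an adjusted AWS configuration file could look like. It is an assumption about the general shape, not megabin's actual output: one profile signs in to the proxy permission set through SSO, and a second profile chains to the team IAM role with `source_profile` and `role_arn`. The function name, profile names, and all values in the usage example are hypothetical.

```python
import configparser
import io


def build_aws_config(proxy_profile, team_profile, sso_start_url, sso_region,
                     account_id, proxy_permission_set, team_role_name):
    """Render two AWS CLI profiles: an SSO proxy profile and a chained team profile."""
    # Interpolation is disabled so that values are written exactly as given.
    config = configparser.ConfigParser(interpolation=None)

    # Profile that logs in to the (empty) proxy permission set via SSO.
    config[f"profile {proxy_profile}"] = {
        "sso_start_url": sso_start_url,
        "sso_region": sso_region,
        "sso_account_id": account_id,
        "sso_role_name": proxy_permission_set,
    }

    # Profile that assumes the team IAM role using the proxy profile's credentials.
    config[f"profile {team_profile}"] = {
        "source_profile": proxy_profile,
        "role_arn": f"arn:aws:iam::{account_id}:role/{team_role_name}",
    }

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical values for illustration only.
print(build_aws_config(
    proxy_profile="security-ops-proxy",
    team_profile="security-ops",
    sso_start_url="https://example.awsapps.com/start",
    sso_region="us-west-2",
    account_id="123456789012",
    proxy_permission_set="Plaid-Security-Ops-Team-Proxy",
    team_role_name="Plaid-Security-Ops-Team",
))
```

With a layout like this, commands run against the team profile would pick up the proxy session first and then assume the team role, so the user never performs the assume-role step by hand.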

Web Console

To assume the correct team IAM role when using AWS’s web console, we created an internal Google Chrome extension. The extension is internally published on the Google Chrome marketplace and is installed on all Plaid-owned user machines by default. The extension gets activated for AWS web console URLs. It extracts the account ID, role name, and user name from the page using screen scraping techniques. It then checks if the user is logged in as a proxy SSO permission set. If yes, then it assumes the correct team IAM role. These steps are completed transparently without needing any input from the user.
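The decision logic described above can be sketched in a few lines. The real extension runs in the browser and its implementation is not shown in this post, so the Python below is only an illustration under stated assumptions: the role name has already been scraped from the page, SSO-created roles are named `AWSReservedSSO_<permission set>_<16 hex characters>` (as in the ARN shown earlier), and the role switch is expressed with the AWS console's switch-role URL, which may not be the mechanism the extension uses. The function names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Role names that AWS SSO creates in each account, e.g.
# AWSReservedSSO_Plaid-Security-Ops-Team-Proxy_177ef47fe4c34086
RESERVED_SSO_ROLE = re.compile(
    r"^AWSReservedSSO_(?P<permission_set>.+)_(?P<suffix>[0-9a-f]{16})$"
)
PROXY_SUFFIX = "-Proxy"


def team_role_for(sso_role_name):
    """Return the team IAM role behind a proxy SSO role, or None if not a proxy."""
    match = RESERVED_SSO_ROLE.match(sso_role_name)
    if match is None:
        return None  # Not an SSO-created role at all.
    permission_set = match.group("permission_set")
    if not permission_set.endswith(PROXY_SUFFIX):
        return None  # A regular permission set; nothing to assume.
    return permission_set[: -len(PROXY_SUFFIX)]


def switch_role_url(account_id, role_name):
    """Build an AWS console switch-role link for the given account and role."""
    query = urlencode({
        "account": account_id,
        "roleName": role_name,
        "displayName": role_name,
    })
    return f"https://signin.aws.amazon.com/switchrole?{query}"


role = team_role_for("AWSReservedSSO_Plaid-Security-Ops-Team-Proxy_177ef47fe4c34086")
if role is not None:
    print(switch_role_url("123456789012", role))  # account ID is a placeholder
```

Relying on the `-Proxy` naming convention keeps the mapping stateless: the permission set name alone is enough to find the team role.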

We have published internal documentation so that users can request IAM changes by submitting PRs for user access.

Since the migration, we prefer to define user access policies directly in SSO permission sets for new SSO roles instead of creating IAM roles. Once all the constraints have been remediated via AWS SSO service updates, we will migrate all access policies to AWS SSO permission sets.

Lessons Learned

Integrate with existing DevOps tools

Because we have integrated the AWS authentication workflow into megabin, we can deliver a rich developer experience. For example, to perform certain operations in our Kubernetes environment, the user needs to authenticate with AWS first. When the user initiates such an activity via megabin, megabin can implicitly create an AWS SSO session as part of that activity if one is required.
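As an illustration of this kind of implicit session handling, the sketch below checks whether a profile currently has valid credentials and starts an SSO login only if it does not. It is a hypothetical stand-in rather than megabin's code: the function name `ensure_sso_session` and the injectable `run` parameter are assumptions, while `aws sts get-caller-identity` and `aws sso login` are standard AWS CLI commands.

```python
import subprocess


def ensure_sso_session(profile, run=subprocess.run):
    """Start an SSO login for `profile` only when no valid session exists.

    Returns True if a login was triggered, False if the session was already valid.
    `run` is injectable so the logic can be exercised without the AWS CLI.
    """
    # A cheap authenticated call: it fails when credentials are missing or expired.
    check = run(
        ["aws", "sts", "get-caller-identity", "--profile", profile],
        capture_output=True,
    )
    if check.returncode == 0:
        return False

    # Opens the browser-based SSO login flow for this profile.
    run(["aws", "sso", "login", "--profile", profile], check=True)
    return True
```

A tool could call a helper like this at the start of any task that needs AWS access, so users are prompted to log in only when their session has actually lapsed.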

Add troubleshooting and support scenarios in your automation

Our new AWS access model is substantially different from the previous one. When we first rolled it out, we received many user questions that had straightforward troubleshooting and remediation steps. We later extended our AWS SSO tooling to take care of most of these scenarios. For example, we added an option in our Chrome extension that lets users define custom mappings from proxy SSO permission sets to team IAM roles to handle corner cases. We added a reset option in megabin to allow users to start from scratch. All megabin CLI scenarios include self-help checks that run alongside the command, verify that all requirements are met, and suggest solutions for common configuration problems.
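A custom mapping option like the one described above could be modeled as a lookup table that takes precedence over the naming convention. This is a minimal sketch under that assumption; the function name `resolve_team_role`, the `overrides` parameter, and the example role names are hypothetical.

```python
def resolve_team_role(permission_set, overrides=None, proxy_suffix="-Proxy"):
    """Pick the team IAM role for a proxy permission set.

    User-defined overrides win; otherwise fall back to stripping the proxy suffix.
    Returns None when the permission set is not a proxy and has no override.
    """
    overrides = overrides or {}
    if permission_set in overrides:
        return overrides[permission_set]
    if permission_set.endswith(proxy_suffix):
        return permission_set[: -len(proxy_suffix)]
    return None


# Hypothetical corner case: a proxy whose target role does not follow the convention.
custom = {"Legacy-Data-Proxy": "Data-Platform-Team"}
print(resolve_team_role("Legacy-Data-Proxy", custom))
print(resolve_team_role("Plaid-Security-Ops-Team-Proxy", custom))
```

Checking overrides first means a user can correct a single unusual mapping without affecting any role that follows the default convention.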

Have backup options

The way our Chrome extension extracts user details and assumes the correct team IAM role is not officially supported by AWS. As a backup, we published detailed documentation so users can follow the official steps manually if required. Even though our extension simplifies the way Plaid engineers access the AWS management console daily, AWS web console changes can have unintended consequences. We have had to go into firefighting mode a couple of times because AWS pushed out web console updates that changed the underlying DOM. These documents were useful for our users when our extension was down. We also added a feature to the Chrome extension that colors the assumed roles, so users can quickly tell whether they were successfully escalated. It also helps them use the recently assumed role list in AWS’s console to pick the correct role quickly if they’ve done it at least once via the extension.

In our experience, building custom tools on top of the AWS CLI and management portal has been largely beneficial, increasing developer velocity and improving security. Consider this approach if you use AWS SSO in your environment and want to build custom user authentication scenarios.

We are hiring for several security roles. Join us!

Originally published at https://plaid.com on July 25, 2022.

AWS SSO in a DevOps first world was originally published in Plaid Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.