<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Nike Engineering - Medium]]></title>
        <description><![CDATA[Nike’s software engineers create the future of sport. They innovate retail experiences, connect athletes to the brand and create powerful moments of distinction through the Nike Digital ecosystem. - Medium]]></description>
        <link>https://medium.com/nikeengineering?source=rss----6d870c2faf5c---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Nike Engineering - Medium</title>
            <link>https://medium.com/nikeengineering?source=rss----6d870c2faf5c---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 03 Apr 2026 22:23:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/nikeengineering" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Moving Faster With AWS by Creating an Event Stream Database]]></title>
            <link>https://medium.com/nikeengineering/moving-faster-with-aws-by-creating-an-event-stream-database-dedec8ca3eeb?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/dedec8ca3eeb</guid>
            <category><![CDATA[kinesis]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[glue]]></category>
            <category><![CDATA[nike]]></category>
            <category><![CDATA[athena]]></category>
            <dc:creator><![CDATA[Adam Farrell]]></dc:creator>
            <pubDate>Thu, 06 Jun 2019 21:01:01 GMT</pubDate>
            <atom:updated>2019-06-06T21:01:00.895Z</atom:updated>
            <content:encoded><![CDATA[<p>by Adam Farrell</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A2LhVoDvkVSX3eAVtLW7yw.png" /></figure><p>As engineers at Nike, we are constantly being asked to deliver features faster, more reliably and at scale. With these pressures, it’s easy to get overwhelmed by the complexity of modern microservice stacks. Through the functionality provided by AWS-managed services, our team has found ways to ease the burden of development. In the following paragraphs we’ll outline how we learned and leveraged several AWS services to deliver the Foundation Invite Event Stream to Audiences (FIESTA), a service for sending triggered notifications to millions of Nike’s users, under budget and ahead of schedule.</p><p>Changing requirements are not new to our team. In fact, we were continuing to evolve our platform when we heard about a feature gap from the business that required us to pivot quickly. A reasonable response to this sudden change in direction would be to reach for trusted tools in our tool belt to address the gap and reduce the risk of shipping late. Instead, our team took an uncommon approach: Rather than forging ahead with what we knew, we chose to work with AWS tools that were unfamiliar to us. This is because, unlike anything we had previously worked with, these tools would enable us to directly address the missing pieces we identified in our architecture.</p><h3>The Feature Gap</h3><p>The team took responsibility for handling offers that the Membership team planned to send to Nike’s mobile users. These offers come in the form of push notifications and/or inbox messages in Nike’s mobile apps. Users are informed that they have qualified for a special product offer — or “unlocks” as we call them. These offers go out to large audiences at specific times, and redemption state needs to be tracked.
Missing from our architecture was the ability to orchestrate timing and state for each user’s offer.</p><p>This new feature posed some unique challenges for our team. We had two traffic patterns we needed to adjust for: the development of offers by our stakeholders at WHQ during normal business hours and the actual notification send for users. During business hours, we needed to support a much higher volume of traffic, while usage considerably dropped in the off-hours. It’s like our service needed to wake up from a nap and line up to run the 100-meter dash at a moment’s notice. We needed to find an AWS solution for our data store that would accommodate this usage pattern.</p><h3>Would Traditional Databases Fit?</h3><p>We first looked at <a href="https://aws.amazon.com/rds/">Amazon RDS</a> to solve our use case. RDS offers two methods for scaling: instance sizing and serverless. Scaling by instance size, however, doesn’t let the service “nap.” Instead, it’s ready to race at any time with our provisioned capacity. We would likely greatly under-utilize these instances, leading to wasted capacity and dollars. Alternatively, we could use serverless to allow the database to scale down, giving the service a chance to “nap” and scale up to “sprint” for offers. Since auto-scaling for serverless Aurora would only trigger every few minutes with incremental increases in capacity, we would likely need to orchestrate a pre-scale on the database with code inside our orchestration service. Coordinating the timing of scheduled offers with the scheduling of database scaling could easily become a demanding DevOps task for the team.</p><p>We then explored <a href="https://aws.amazon.com/dynamodb/">DynamoDB</a> to see if it would be a better fit. Dynamo’s read/write scaling pattern allows us to adjust scaling on the fly. But, just like RDS, an orchestrator is needed to pre-scale our database, driving additional operations cost for our team.
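</p><p>As a rough illustration of the operational burden this implies, a pre-scaling orchestrator would have to translate each offer’s audience size and send window into provisioned capacity and apply it ahead of time. The sketch below is hypothetical (the table name, item size and headroom factor are assumptions, not FIESTA code):</p>

```python
import math

# Hypothetical helper: estimate the provisioned write capacity a DynamoDB
# table would need to ingest an offer's audience before its send time.
# One write capacity unit (WCU) covers one 1 KB write per second.
def required_wcu(audience_size: int, seconds_until_send: int,
                 item_kb: float = 1.0, headroom: float = 1.2) -> int:
    """Writes per second needed to load the audience in time, padded with
    headroom and rounded up to a whole capacity unit."""
    writes_per_second = audience_size / seconds_until_send
    return math.ceil(writes_per_second * math.ceil(item_kb) * headroom)

# The orchestrator would then apply this ahead of each send window, e.g.:
#
#   import boto3
#   boto3.client("dynamodb").update_table(
#       TableName="offers",  # hypothetical table name
#       ProvisionedThroughput={
#           "ReadCapacityUnits": 10_000,
#           "WriteCapacityUnits": required_wcu(5_000_000, 3600),
#       },
#   )

print(required_wcu(5_000_000, 3600))  # five million invites in one hour
```

<p>Scheduling these capacity changes to land before every offer, and scaling back down afterwards, is exactly the recurring DevOps chore we wanted to avoid.</p><p>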
We were also concerned with higher-level questions around the service like, “How do you pull a large record set for a single key without getting throttled requests for an individual partition?” Finally, at 10,000 read units, it would take about eight minutes to pull five million records from the table, placing Dynamo just outside the bounds of our performance requirements.</p><h3>Event Streams as a Solution</h3><p>At this point, the team began to think that a traditional database may not be the right approach for our problem space. So, we asked ourselves: “What if we treat our data as an event stream instead?” Our database would need to serve as a log of what happened in our application. Instead of trying to keep track of the state of each individual offer, what if a service could query the event stream to find out the state of an offer? Event streams create some unique advantages. One is that ingesting data needs minimal compute resources, as each invite is an event that is added to the stream with no need to calculate state. Another is that, since each event is recorded on the stream, we can explore the history of how the data got into its current state, dramatically increasing the observability of our solution.</p><p>Luckily, Amazon offers a few solutions in the event stream space that our team looked into. The service that best fit our use case was the Amazon Kinesis Data Streams platform. Specifically, we turned to the Kinesis Data Firehose service, which looked like it would fit the event stream need nicely. Firehose is a service that handles loading streaming data and pushing it into data stores and analytics tools. Firehose supports writing data it captures to Splunk, Amazon S3, Amazon Redshift or Amazon Elasticsearch. Choosing Redshift or Elasticsearch would have introduced performance concerns similar to those of our traditional database options, so S3 became the logical data sink.
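</p><p>To make the event-stream idea concrete, here is a hypothetical sketch of what appending an invite event might look like; the field names, stream name and use of newline-delimited JSON are illustrative assumptions, not our actual schema:</p>

```python
import json
from datetime import datetime, timezone

# Hypothetical invite event: every state change is appended to the stream
# as an immutable record instead of updating a row in place.
def invite_event(user_id: str, offer_id: str, event_type: str) -> bytes:
    record = {
        "user_id": user_id,
        "offer_id": offer_id,
        "event_type": event_type,  # e.g. INVITED, UNLOCKED, REDEEMED
        "occurred_at": datetime.now(timezone.utc).isoformat(),
    }
    # Firehose delivers raw bytes; a trailing newline keeps records
    # separable once they are batched together in S3.
    return (json.dumps(record) + "\n").encode("utf-8")

# Publishing to a delivery stream would look roughly like this with boto3
# (the stream name is an assumption):
#
#   import boto3
#   boto3.client("firehose").put_record(
#       DeliveryStreamName="fiesta-invite-events",
#       Record={"Data": invite_event("user-123", "offer-9", "INVITED")},
#   )

print(json.loads(invite_event("user-123", "offer-9", "INVITED"))["event_type"])
```

<p>Deriving an offer’s current state then becomes a query over these records rather than a read of a mutable row.</p><p>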
Firehose writes to S3 in a format of &lt;S3_Bucket&gt;/&lt;prefix&gt;/&lt;Year&gt;/&lt;Month&gt;/&lt;Day&gt;/&lt;Hour&gt;. With minimal setup, we now had an event stream that output data to S3 in a predictable location. To keep data files compact, we leveraged Firehose’s ability to transform data from the stream to the <a href="https://parquet.apache.org/">Apache Parquet</a> format.</p><p>With our infrastructure in place, we now needed a way to query our data. Amazon provides a service called <a href="https://aws.amazon.com/athena/">Athena</a> that gives us the ability to perform SQL queries over partitioned data. To enable queries in Athena we needed to provide table metadata and partitions via <a href="https://aws.amazon.com/glue/">AWS Glue</a>, as Athena would not discover this information for itself. With our Firehose stream constantly sending new data to S3, it was critical that we have an automated solution to enable Athena to see new partitions in our data set. This can be solved with a feature in AWS Glue called <a href="https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html">“the crawler.”</a> The crawler traverses an S3 location and can update the table schema to discover new columns as well as partitions in your data. With a crawler scheduled to run every few minutes, data ingested through Firehose, delivered to S3 and discovered by the crawler becomes available as a single queryable event stream.</p><p>Putting all of this together, our architecture looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-p-x3lYFCJ6c7yK9JTs8og.png" /></figure><p>So how does our solution compare to more traditional architectures using RDS or Dynamo? Being able to ingest data and scale automatically via Firehose means our team doesn’t need to write or maintain pre-scaling code. What about data durability?
AWS S3 has eleven 9s of durability, which the S3 FAQ explains:</p><blockquote><p>Amazon S3 Standard, S3 Standard–IA, S3 One Zone-IA, and S3 Glacier are all designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000,000 objects with Amazon S3, you can, on average, expect to incur a loss of a single object once every 10,000 years.</p></blockquote><p>Data storage costs on <a href="https://aws.amazon.com/s3/pricing/">S3</a> ($0.023 per GB-month) are lower when compared to <a href="https://aws.amazon.com/dynamodb/pricing/provisioned/">DynamoDB</a> ($0.25 per GB-month) and <a href="https://aws.amazon.com/rds/aurora/pricing/">Aurora</a> ($0.10 per GB-month). As a managed service hosted by AWS, Athena is scaled to handle SQL queries for an entire AWS region, charging by query data scanned instead of units or instances, and is ready to run a query at full speed whenever requested. In a sample test, Athena delivered 5 million records in seconds, which we found difficult to achieve with DynamoDB. Connecting all of these AWS services together enables the service to go from napping to sprinting without intervention from the team on either the ingest or query sides.</p><h3>Challenges</h3><p>That said, this AWS-heavy architecture has its limitations. One limitation is that Firehose batches out data in windows of either data size or a time limit. This introduces a delay between when the data is ingested and when the data is discoverable by Athena. An additional delay is introduced when a new partition is created by Firehose. If the Glue crawler has not added this partition to the meta store, Athena won’t see it. Queries to Athena are charged by the amount of data scanned, and if we scan the entire event stream frequently, we could rack up serious costs in our AWS bill. For our use case, however, we’ve determined these limitations are acceptable.
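</p><p>One practical mitigation for scan costs is to have every query filter on the partition columns that the crawler registers from Firehose’s Year/Month/Day/Hour prefixes, so Athena reads only the relevant slice of the stream. A hypothetical sketch (the table, column and bucket names are assumptions, not our production code):</p>

```python
from datetime import datetime

# Hypothetical query builder: restricting the WHERE clause to partition
# columns (year/month/day) prunes the S3 objects Athena scans, which
# directly lowers the per-query cost.
def offer_state_query(table: str, offer_id: str, day: datetime) -> str:
    return (
        f"SELECT user_id, event_type, occurred_at FROM {table} "
        f"WHERE offer_id = '{offer_id}' "
        f"AND year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}'"
    )

# Running it would use Athena's asynchronous API, roughly:
#
#   import boto3
#   boto3.client("athena").start_query_execution(
#       QueryString=offer_state_query("invite_events", "offer-9",
#                                     datetime(2019, 6, 6)),
#       ResultConfiguration={"OutputLocation": "s3://my-results-bucket/"},
#   )

print(offer_state_query("invite_events", "offer-9", datetime(2019, 6, 6)))
```

<p>In production the literal interpolation above would need proper escaping or parameterized queries; it is shown inline only to keep the sketch short.</p><p>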
Invitations to offers will be received hours ahead of their scheduled delivery time, making Firehose buffering windows acceptable. Queries to the event stream will be minimal, as the event stream will only be requested when offers are sent to a group of users. The team decided on this approach, which allowed us to give the service the fun, acronym-based name of FIESTA.</p><h3>On Time and Ahead of Schedule</h3><p>Using the above data store style, the team was able to complete the FIESTA project ahead of the scheduled integration date. While integration was under way, the team used that time to further harden the service and improve its observability.</p><p>Using new technologies can, at first, seem like you’re adding risk to your project. With managed AWS services, your team can evaluate the use case and shift as much work as possible to AWS. This allows your team to move faster and focus on making premium experiences for your users.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dedec8ca3eeb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/moving-faster-with-aws-by-creating-an-event-stream-database-dedec8ca3eeb">Moving Faster With AWS by Creating an Event Stream Database</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[We Code Hackathon Empowers Portland Women]]></title>
            <link>https://medium.com/nikeengineering/we-code-hackathon-empowers-portland-women-cfe06c272a43?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/cfe06c272a43</guid>
            <category><![CDATA[empowerment]]></category>
            <category><![CDATA[community]]></category>
            <category><![CDATA[hackathons]]></category>
            <category><![CDATA[nike]]></category>
            <category><![CDATA[diversity]]></category>
            <dc:creator><![CDATA[Nike Engineering Staff]]></dc:creator>
            <pubDate>Fri, 03 May 2019 16:06:00 GMT</pubDate>
            <atom:updated>2019-05-03T16:06:00.754Z</atom:updated>
            <content:encoded><![CDATA[<p>by Arthy Ferguson, Lars Fjelstad and <a href="https://www.linkedin.com/in/williamhough/">Will Hough</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KWrxpeJ4yWhYbDGMJpzlJQ.jpeg" /></figure><p>In 2015, Nike Software Architect Amber Milavec set her sights on empowering women, encouraging diversity and promoting inclusivity in the tech industry. With these ideals in mind, she founded the We Code Hackathon, an annual event in Portland, Oregon, that is now a co-production of Beaverton-based Nike and Portland software company Puppet. We Code is intended for participants of all levels of expertise to grow as engineers and shape the field of technology in Portland. In 2017, We Code added the Non-Profit Challenge, inviting local community organizations to present their challenges in the technology space and asking hackathon participants to brainstorm novel solutions.</p><p>Last month, eighteen teams of more than one hundred software engineers and designers came together for the fourth annual We Code Hackathon for Women and Friends. This year, hackers were asked to come up with technical solutions for two local non-profit organizations: <a href="https://irco.org/">Immigrant and Refugee Community Organization (IRCO)</a> and <a href="https://www.growing-gardens.org/">Growing Gardens</a>.</p><p>IRCO’s mission is to promote the integration of refugees and immigrants into the community, while Growing Gardens uses the experience of growing food to cultivate healthy, equitable communities. The mission of hackathon participants was to use public and open source technology to build a digital experience that improves the organizations’ abilities to better serve their communities.</p><p>After powerful opening presentations from Portland tech leaders — including Puppet’s senior director of product operations, Padmashree Koneti, and Nike’s VP of Platform Engineering, Courtney Kissler — We Code hackers got to work.
They were buoyed by the opening performance of Nike software engineer Snigdha Roy, who wrote and performed an awesome, inspiring rap called “<a href="https://twitter.com/esniggy/status/1085004953905557504?s=21">Imposter Syndrome</a>.”</p><p>Over the course of two days, teams worked to address various technical challenges that the two organizations face. For Growing Gardens, they tackled problems related to website accessibility and improving the digital experience that connects food growers to consumers. For IRCO, the focus was to improve the website’s user interface and accessibility. Teams also looked to enhance the websites’ user engagement through notifications.</p><p>At the end of two fast and furious days of work, each team got three minutes to present their project. Judging criteria included creativity and innovation, benefits and impact, and consumer experience.</p><h3>The Winners</h3><p>First-place winners were team Cookie Monsters, whose <a href="https://devpost.com/software/growing-gardens-donate-portal">“Growing Gardens Donate Portal”</a> was developed using the Ember.js framework. It offers tools to streamline the donation process, with features to manage in-kind donations, enable Spanish localization, increase accessibility and improve the user experience. The Cookie Monsters team was also excited about plans to move to the cloud and use Express as the web application framework.</p><p>The <a href="https://devpost.com/software/class-notifier">“IRCO Class Notifier”</a> project nabbed second place. This team created an easy-to-use web application using the JavaScript framework React. It enables users to sign up, with just their phone numbers, to be notified about classes available through the <a href="https://multco.us/sun/sun-community-schools">SUN after-school programs</a>. Using text messages makes it accessible from any type of mobile phone, not just smartphones.
Looking forward, the team said IRCO could easily expand this beyond SUN to include their other programs and volunteer opportunities. This could greatly increase access and exposure for the many resources offered by IRCO.</p><p>The people’s choice award went to <a href="https://devpost.com/software/growing-gardens-together">“Growing Gardens Together”</a>, which brings producers and consumers together through shared interests and proximity. Consumers can locate nearby producers and tap on a producer icon to send a notification to them. The producer icon also displays the produce that the producer has in store. Built on a LAMP stack and prototyped with Adobe XD, the demo was a working web application with an innovative integration with Google Maps. Future steps might involve creating a mobile app, while expanding exchangeable goods to gardening supplies, equipment and recipes.</p><p>All eighteen projects that came out of the hackathon are licensed under either the Apache or MIT open source licenses, so that the non-profits can continue to benefit from them. Check them out on We Code’s DevPost <a href="https://wecode2018.devpost.com/submissions">submissions page</a>.</p><h3>Join Us</h3><p>Join the <a href="https://twitter.com/hashtag/wecodeforgood">#wecodeforgood</a> conversation on Twitter to get involved. We look forward to seeing you there next year!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cfe06c272a43" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/we-code-hackathon-empowers-portland-women-cfe06c272a43">We Code Hackathon Empowers Portland Women</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nike’s Cloud Journey at AWS re:Invent]]></title>
            <link>https://medium.com/nikeengineering/nikes-cloud-journey-at-aws-re-invent-aa2e6eaefa55?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/aa2e6eaefa55</guid>
            <category><![CDATA[recaps]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[observability]]></category>
            <dc:creator><![CDATA[Murali Narahari]]></dc:creator>
            <pubDate>Thu, 18 Apr 2019 16:16:00 GMT</pubDate>
            <atom:updated>2019-04-18T16:16:00.606Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ecnCCKcO9kUdMdDvCkgdhg.jpeg" /></figure><p>Nike has been aggressively marching towards cloud-native, micro-service architecture to enable speed, scale, and stability across the enterprise. Over the last five years, we have re-imagined our entire technology stack using observability, security, reliability, availability, and performance as core principles for software development. With a focus on Agile, CI/CD and DevOps practices, we built autonomous teams organized by domain for speed and agility. The challenges that result from this kind of fundamental shift in our organization have been met head-on by building a strong sense of community to learn and share the latest concepts and techniques in software development.</p><p>Nike realized early that technology is a strategic priority — not an afterthought. We carefully examined many areas of our engineering organization where off-the-shelf vendor software was routinely used, and we found that, in many cases, the software did not meet our strategic needs for functionality, security or scale. Our engineering teams quickly pivoted to develop cloud-based software, resulting in cutting-edge applications and platforms that serve at a global scale. We turned to open source solutions in order to move at the speed of changes in technology and to continually innovate. As part of our journey to the cloud, we started utilizing NoSQL, serverless, containers, AI/ML, GraphQL and others. Patterns like CQRS were leveraged to enable scale, while multi-region architectures became a necessity for providing unsurpassed consumer experiences. Considering the size and breadth of the company, several cloud acceleration teams were created to help build a common set of tools and standards for any team to leverage. 
We quickly built many capabilities in-house for speed and strategic edge, resulting in increased activity and contributions back to the OSS community. You can find our contributions at <a href="https://engineering.nike.com/">Nike OSS</a>.</p><p>As part of our journey to the cloud, we took an active role in creating a culture of sharing what we learn, both internally and externally. Creating best practices and tools for all to leverage is important for us, given the size of our technology team. For example, we created charters for coming up with API principles to enable API-first thinking. We encouraged “community of practice (COP)” groups to share ideas, and we promoted concepts of <a href="https://www.oreilly.com/library/view/introduction-to-innersource/9781492041504/">inner sourcing</a> to spread internal knowledge and inspire developers. We hold monthly engineering forums and <a href="https://medium.com/nikeengineering/tagged/nike-tech-talks">Tech Talks</a>, where engineers across teams (internally and externally) share their learnings. We have innovation days and hackathons within and across teams to collaborate and concoct new ideas through lighthearted competition. All of this work really propelled teams as they moved to the cloud. It has also helped to attract amazing talent and build best-in-class services.</p><p>At re:Invent this year, we were able to share many of our learnings with the wider community. Teams across Nike were proud of what they’d accomplished and excited to share their stories. Below, you can read about and watch videos of the five Nike teams that presented.
I hope you enjoy watching them and get some insights into what we’ve done to power our engineering teams at Nike.</p><p><strong>How DynamoDB Serves Nike at Scale — Zach Owens and Adam Farrell</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F6A1tOFqvgek%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D6A1tOFqvgek&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F6A1tOFqvgek%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/5034eacb779ff77c55e738fb6a90687c/href">https://medium.com/media/5034eacb779ff77c55e738fb6a90687c/href</a></iframe><p>In this session, Zach Owens and Adam Farrell discuss how Nike Digital migrated its large clusters of Cassandra and Couchbase to fully-managed DynamoDB tables to reap the benefits of cloud native architecture. They share how Cassandra and Couchbase proved to be operationally challenging for engineering teams and failed to meet the scaling needs of our high-traffic product launches. They also discuss how DynamoDB’s flexible data model allows Nike to focus on innovating for our consumer experiences without managing database clusters. 
They share best practices learned for how to effectively use DynamoDB’s TTL, Auto Scaling, on-demand backups, point-in-time recovery, and adaptive capacity for applications that require scale, performance, and reliability to meet Nike’s business requirements.</p><p><strong>Nike’s Journey to Real-Time Monitoring of its Digital Business — Adam Nutt &amp; Demond Jackson</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.slideshare.net%2Fslideshow%2Fembed_code%2Fkey%2FConpVOi6auegoA&amp;url=https%3A%2F%2Fwww.slideshare.net%2FMuralidharNarahari1%2Fret301-nike-aws-reinvent-realtime-monitoring&amp;image=https%3A%2F%2Fcdn.slidesharecdn.com%2Fss_thumbnails%2Fret301awsreinventfinalreduced-190405003308-thumbnail-4.jpg%3Fcb%3D1554424448&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=slideshare" width="600" height="500" frameborder="0" scrolling="no"><a href="https://medium.com/media/8240666d9832b13fe8410071a36184d6/href">https://medium.com/media/8240666d9832b13fe8410071a36184d6/href</a></iframe><p>Nike recognized the need to accelerate its digital transformation and strengthen its connection to consumers through individually-tailored content. The company launched several new digital platforms, including the SNKRS app and NikePlus, which is expected to double in the next three years. Nike relies on AWS to provide personalized apps, better features, up-to-date content and responsive shopping experiences for its customers. In this session, Adam and Demond discuss the DevOps culture that empowered the team to measure what matters. They go through the crawl-walk-run journey in the space of observability around digital commerce.
Check out Adam’s blog on observability <a href="https://medium.com/nikeengineering/how-nikes-digital-transformation-is-monitored-3c0799b3e443">here</a>.</p><p><strong>Search at Nike with Amazon Elasticsearch — Andrew Mossbarger</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fu_7xMcZl3D0%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Du_7xMcZl3D0&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fu_7xMcZl3D0%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/1e50ad3c17701b3fa8da8ad18a3823f6/href">https://medium.com/media/1e50ad3c17701b3fa8da8ad18a3823f6/href</a></iframe><p>In this presentation, Andrew Mossbarger, Director of Search at Nike Digital Engineering, discusses how Nike Digital has leveraged Elasticsearch to enable mission-critical search capabilities for Nike’s online store. This enabled Nike engineers to focus on the business capabilities around search and not worry about the scalability of the underlying platform. Leveraging AWS capabilities such as “infrastructure as code,” the team was able to move from one deploy every two months to more than two deploys a day.
Andrew discusses the solution options that were in front of the team, the reason for choosing Elasticsearch and the architecture upon which Nike’s Search as a Service (SaaS) is built.</p><p><strong>Building a Social Graph at Nike with Amazon Neptune — Marc Wangenheim and Aditya Soni</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Ff7FSpT7jrX4%3Ffeature%3Doembed&amp;url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Df7FSpT7jrX4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Ff7FSpT7jrX4%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c40e802511dce742b9a6ef0c6229146c/href">https://medium.com/media/c40e802511dce742b9a6ef0c6229146c/href</a></iframe><p>In this session, Marc and Aditya discuss how Nike stepped up its game on Nike’s own social network. Using the Amazon Neptune graph database, Nike is unlocking the possibility for world-class athletes and millions of their followers to have unique Nike experiences. 
Marc and Aditya showcase Nike’s journey of migrating more than 100 million users from Cassandra to Amazon Neptune, while remaining up and available 24/7.</p><p><strong>Nike Retail’s Journey to the Cloud — Barend Kuperus and Murali Narahari</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.slideshare.net%2Fslideshow%2Fembed_code%2Fkey%2F8U832FuJpHw2WU&amp;url=https%3A%2F%2Fwww.slideshare.net%2Fsecret%2F8U832FuJpHw2WU&amp;image=https%3A%2F%2Fpublic.slidesharecdn.com%2Fimages%2Fprivate_thumbnail.gif&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=slideshare" width="342" height="291" frameborder="0" scrolling="no"><a href="https://medium.com/media/1838f5fd6e22c2014572031695ea4530/href">https://medium.com/media/1838f5fd6e22c2014572031695ea4530/href</a></iframe><p>The retail industry is going through an incredible transformation. Customers are more informed than ever and desire to engage with connected retail experiences online and through mobile devices, connected cars, connected homes, offices and more. In this session, hear Barend and Murali talk about how Nike Retail has leveraged AWS to provide seamless, channel-agnostic consumer journeys that deliver speed, stability, and scale to the business. These capabilities power Nike’s most innovative retail concepts, like the House of Innovation, where consumers can use the Nike app on their own devices for self-checkout and more.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=aa2e6eaefa55" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/nikes-cloud-journey-at-aws-re-invent-aa2e6eaefa55">Nike’s Cloud Journey at AWS re:Invent</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Reducing DynamoDB Costs in AWS]]></title>
            <link>https://medium.com/nikeengineering/reducing-dynamodb-costs-in-aws-5047cbf726c9?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/5047cbf726c9</guid>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[architecture]]></category>
            <category><![CDATA[nike]]></category>
            <dc:creator><![CDATA[Nike Engineering Staff]]></dc:creator>
            <pubDate>Thu, 14 Feb 2019 16:56:00 GMT</pubDate>
            <atom:updated>2019-02-14T17:12:51.983Z</atom:updated>
            <content:encoded><![CDATA[<p>by Ken Smith</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XwxGO7Iq4HdtEpLio-oQLA.jpeg" /></figure><p>SIM is Nike’s in-house store inventory management solution. It is the<br>source of truth for stock on hand in Nike stores — among its other<br>responsibilities. SIM calculates stock on hand based on inventory<br>movement events in the stores. These events include product receiving,<br>transferring, sales, returns, etc. Recently, our team was given better<br>visibility into the costs of our services by tagging our resources and<br>leveraging our in-house cost tools. We observed many<br>higher-than-expected costs, but the most expensive Amazon Web Services<br>(AWS) resources were our DynamoDB tables, even though they made up a<br>very small percentage of our total resources: they accounted for well<br>over half of our total AWS cost. For this post, we’ll focus on one of the<br>SIM tables; however, these issues exist with many of the tables in SIM.<br>The majority of the services that use these tables were stood up<br>following a cookie-cutter pattern adopted by the team using examples<br>influenced by other practices across Nike Digital Engineering.</p><p>Nike has two store inventory activities that cause significant traffic<br>spikes — they’ll be referred to in this post as Physical Inventory<br>and Cutover. Physical Inventory is the process of counting all products<br>in a store and adjusting the stock on hand to align with the verified<br>count. This is done to account for drift that happens naturally in the<br>stores due to theft, receiving shipments with inaccurate content lists,<br>etc. Cutover is the process of loading all stock data from the legacy<br>system of record into SIM.
The loads generated in the SIM system by<br>these activities aren’t massive in comparison to what some other<br>systems take but are enough to cause issues if not handled properly.</p><p>The four main issues with our DynamoDB configurations:</p><ul><li>Write capacity units (WCU) were set too high in anticipation of spikes</li><li>Read capacity units (RCU) were set too high in anticipation of spikes</li><li>Heavy use of Global Secondary Indexes (GSIs)</li><li>Poorly composed partition keys</li></ul><p>The supply-stock-events table belongs to the service in our system that<br>persists all of the events that impact inventory in our stores. It is<br>part of a <a href="https://martinfowler.com/bliki/CQRS.html">Command Query Responsibility Segregation</a> style micro-service<br>that is responsible for the stock on hand in our stores. This was our<br>most expensive table and seemed like the most impactful place to begin<br>cost-reduction efforts.</p><h3>Write Provisioning</h3><p>When looking at the traffic patterns of our supply-stock-events table in<br>the us-east-1 region of AWS for efficiency opportunities, the most<br>obvious issue was over-provisioned WCU on the table.</p><p>The over-provisioning of the WCU was in place for a few reasons:</p><ul><li>During the Cutover process, moving from the old inventory system to SIM causes a spike of write load to this table.</li><li>During the Physical Inventory process, the data causes a spike of write load to this table.</li></ul><p>SIM is a global platform, so the Cutover and Physical Inventory<br>processes happen across many time zones for many stores in a short<br>window of time. 
Cutover and Physical Inventory processes are planned<br>activities, and the tables have been scaled manually in the past.<br>However, we have had cases where the activities were not communicated<br>clearly to our team, so scaling wasn’t coordinated, and problems in<br>production occurred.</p><p>An obvious solution to this problem might be to leverage DynamoDB<br>autoscaling, but the team had been reluctant to do so because analyses<br>by other teams suggested that scale-up times were prohibitively long.<br>We tested DynamoDB autoscaling in our test environment under loads similar to what<br>we experience during activities that cause spikes, like the Cutover and<br>Physical Inventory processes. The burst capacity we consumed and the<br>time to scale up during these activities were good enough to make<br>autoscaling a great solution for the variable loads this table needs to<br>support.</p><p>Autoscaling is probably not a solution for all cases, but it worked<br>quite nicely for this one. The minimum capacities needed to support<br>production were upped a bit from what was found to be acceptable in the<br>test environment, but not by much compared to what the capacity<br>was set to before moving to autoscaling. Using autoscaling, this table<br>now runs in production with an average provisioned WCU well under 10<br>percent of what was provisioned before.</p><p>Another factor considered for optimizing this table was the number of<br>GSIs. The provisioned throughput for the four GSIs on this table was<br>configured individually, which was quite expensive.
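DynamoDB autoscaling is implemented on top of Application Auto Scaling target tracking. As a rough sketch of what enabling it on the write side looks like, here is a hypothetical CloudFormation fragment; the table name follows this post, but the capacity bounds, role, and target utilization are illustrative values rather than our production settings:

```yaml
# Illustrative only: capacity bounds, role, and target value are not
# our actual production numbers.
WriteScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: dynamodb
    ResourceId: table/supply-stock-events
    ScalableDimension: dynamodb:table:WriteCapacityUnits
    MinCapacity: 25          # floor that covers steady-state traffic
    MaxCapacity: 3000        # ceiling for Cutover / Physical Inventory spikes
    RoleARN: !GetAtt ScalingRole.Arn

WriteScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: supply-stock-events-wcu
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref WriteScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: DynamoDBWriteCapacityUtilization
      TargetValue: 70.0      # scale out when consumption passes 70% of provisioned
```

A matching pair of resources using the dynamodb:index:WriteCapacityUnits dimension (with a ResourceId of the form table/&lt;table&gt;/index/&lt;index&gt;) is needed per GSI, and the read side mirrors this with the RCU metric.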
Fortunately, autoscaling was applied to the GSIs as well.</p><h3>Read Provisioning</h3><p>The over-provisioning on the RCU was in place for a few reasons:</p><ul><li>Cutover process as discussed above</li><li>Physical Inventory process as discussed above</li><li>Data Pipeline backups</li></ul><p>Autoscaling clearly solves two out of the three problems and was put in<br>place along with the write autoscaling configuration. The complexity<br>here lies in the last reason: Data Pipeline backups for disaster<br>recovery. This is a job scheduled to run hourly that pulls all of the<br>table’s data and puts it in S3. This poses a problem because it<br>requires a lot of RCU in order for the job to complete before it kicks<br>off again. There is also no TTL on the data in this table, so it is<br>currently growing without bound. This means the job is going to take<br>longer and require more capacity as the table grows.</p><p>Why not just use native DynamoDB backup and turn those jobs off? To add<br>a little complexity to the Data Pipeline job, a few of our data<br>aggregation jobs for reporting run on AWS EMR and require the data these Data Pipeline jobs handle to land in S3. To address the disaster recovery concern of the Data Pipeline job, we leveraged the On-Demand Backup service running in our AWS account, which only requires a couple of tags on the table: one to specify the cadence of backup and another to<br>specify retention time of the backups. To address the EMR job concern,<br>the schedule for the EMR jobs was identified, which turned out to be once<br>per day, and the Data Pipeline job will be adjusted to meet that need<br>soon. This will reduce the Data Pipeline execution count per day from 24<br>to 1. The Data Pipeline spikes RCU on this table to around 8,500. The<br>average RCU consumed, excluding Data Pipeline capacity, is less than 25.<br>Reducing these jobs by 24x will make quite a difference.
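A back-of-the-envelope calculation makes the backup change concrete. The capacity figures below come from the numbers above; the one-hour scan duration is an assumed value for illustration:

```python
# Rough comparison of daily provisioned-read demand for hourly vs. daily
# Data Pipeline backup scans. Capacity figures are from the post; the
# one-hour scan duration is an illustrative assumption.

BASELINE_RCU = 25        # average consumed RCU, excluding backups
BACKUP_SPIKE_RCU = 8500  # RCU consumed while a backup scan runs

def daily_rcu_hours(backups_per_day: int, scan_hours: float = 1.0) -> float:
    """Total RCU-hours consumed per day: steady baseline plus backup scans."""
    return BASELINE_RCU * 24 + BACKUP_SPIKE_RCU * backups_per_day * scan_hours

hourly = daily_rcu_hours(backups_per_day=24)  # old schedule: scan every hour
daily = daily_rcu_hours(backups_per_day=1)    # new schedule: scan once per day

print(f"hourly backups: {hourly:,.0f} RCU-hours/day")
print(f"daily backup:   {daily:,.0f} RCU-hours/day")
print(f"reduction:      {1 - daily / hourly:.0%}")
```

Under these assumptions, the read demand driven by backups drops by well over 90 percent, which is where most of the projected savings come from.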
Another way we<br>can create efficiency is by eliminating the Data Pipeline jobs<br>completely and switching to ETL jobs that access the data from the<br>tables directly. But this is a larger investment of time and a more<br>complex effort.</p><h3>Global Secondary Indexes (GSIs)</h3><p>To optimize even further, we can look at migrating our<br>tables that have a lot of actively used GSIs to a relational datastore.<br>AWS Aurora is an option as a relational replacement for our<br>DynamoDB that supports automatic backups and drops cost even more based<br>on some back-of-the-napkin math. However, the migration option will take<br>careful consideration, since migrating and refactoring data sources is a<br>fairly large and complex effort.</p><h3>Partition Keys</h3><p>Partition keys on some of our tables have caused provisioning<br>challenges that impact our cost as well. As mentioned above, we have<br>“GSI bloat” on our tables. Some of those GSIs could have been avoided<br>by not using Universally Unique Identifiers (UUIDs) (e.g.<br>02f44553-37fe-3070-bc04-59d65433968a) as partition keys. Granted, UUIDs<br>theoretically give a good distribution over DynamoDB partitions, but<br>they also make it difficult to query tables as efficiently as when there<br>are more meaningful key values to query.</p><p>On the flip side of the UUID key strategy, we have tables that use<br>legacy store identifiers for the partition key value, which causes a<br>different issue altogether: partition hotspots. Hotspots occur when<br>the partition key values land data in a small subset of a table’s<br>partitions far more often than in the others. RCUs and<br>WCUs are distributed evenly across the number of partitions supporting<br>the table.
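A toy calculation shows why that even split makes hotspots expensive; the partition count and traffic numbers here are hypothetical, not measurements from SIM:

```python
# Why a hot partition forces table-wide over-provisioning: DynamoDB
# splits provisioned throughput evenly across partitions, so the table
# must be provisioned to (hottest partition's load) x (partition count).

def required_table_wcu(partition_loads: list) -> int:
    """Minimum table-level WCU so that no single partition throttles."""
    return max(partition_loads) * len(partition_loads)

# Well-distributed keys: 1,000 WCU of traffic spread over 10 partitions.
even = required_table_wcu([100] * 10)       # 1000 WCU provisioned

# Hotspot: the same 1,000 WCU of traffic, but one busy legacy store id
# sends 550 of it to a single partition.
hot = required_table_wcu([550] + [50] * 9)  # 5500 WCU provisioned

print(f"even: {even} WCU, hotspotted: {hot} WCU")
```

DynamoDB’s burst and adaptive capacity soften this in practice, but the cost pressure is the same: the provisioning floor is set by the hottest partition.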
If the data being written is bunched into a single partition,<br>capacity on the table can be increased drastically without seeing a<br>significant gain in available capacity on the partition that is<br>causing the throttling. In order to prevent throttling, the table needs<br>to be provisioned high enough that the instances supporting each<br>partition can support the hotspot. This can be enormously expensive and<br>wasteful. An example of a table doing this in SIM is the<br>supply-stock-journal (called “supply-stock-ledger”), which is an<br>aggregate view of the store stock events.</p><p>Neither of these partition key strategies is recommended, and the<br>outcomes of using them are well-documented in AWS blogs. To solve these<br>issues, we’d have to redesign the partition keys on the tables, which<br>would be a significant and complex refactor. As called out above in the<br>GSI section, putting this kind of effort into migrating to a relational<br>datastore would prove to be a better investment.</p><h3>Savings</h3><p>Costs on the supply-stock-events table were quite high<br>when we started, and with proper autoscaling in place we were able to<br>drop DynamoDB costs by 85%.
Once the Data Pipeline jobs are<br>only running once per day, the cost is projected to drop to nearly 5% of<br>the original cost.</p><h3>Learnings</h3><p>To recap what we learned in our efforts to optimize SIM tables:</p><ul><li>Over-provisioning tables to support heavy traffic spikes can be difficult to coordinate and requires manual processes to support.</li><li>A lot of GSIs on tables to support many access patterns is not optimal.</li><li>Poor partition key construction causes hotspots and over-provisioning to support load on those hotspots.</li><li>It is not optimal to use a batch job strategy to copy data and use the single source for both disaster recovery backups and as data sources for EMR jobs.</li></ul><p>The choice of datastore is the major consideration that I think would have<br>prevented a lot of the over-spending seen here. Looking at the load<br>requirements, the access patterns, and the data recovery needs, relational<br>databases would have been a better fit for a lot of the SIM use cases at<br>the time. When DynamoDB was chosen for SIM, the AWS RDS product<br>supported configurable snapshot backups. Those backups would have resolved<br>the data recovery concern, and being a relational database, it would have<br>addressed the key composition and GSI issues as well. Autoscaling<br>and On-Demand Backups were not available when the decision to use<br>DynamoDB was made, but they have proven valuable in reducing the<br>impacts we’ve seen here. If there had been an architectural requirement to<br>use DynamoDB when it was chosen for SIM, I think a less read-intensive<br>backup solution would have been worth considering, as well as better<br>composition of partition keys.</p><p>The last point to cover is the cookie-cutter aspect of our services and<br>how using the rinse-and-repeat model replicated these issues throughout<br>the SIM system.
To prevent spreading impact throughout a system like<br>this, it is important that teams reusing patterns know why they are<br>using the patterns and what problems the patterns are solving.</p><p>SIM is still under active feature development in production and has been a great<br>success for Nike. As we move forward in building features and<br>maintaining our system, we have put careful thought into which<br>technologies we choose to implement solutions for our consumers and to<br>refactor our older services. The learnings covered here have helped us<br>understand the value of upfront exploratory investment.</p><p>Interested in joining the Nike Team? Check out <a href="https://jobs.nike.com/category/digital-and-technology-jobs/15675/37648/1">career opportunities here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5047cbf726c9" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/reducing-dynamodb-costs-in-aws-5047cbf726c9">Reducing DynamoDB Costs in AWS</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Moirai: A Feature Flag Library for the JVM]]></title>
            <link>https://medium.com/nikeengineering/moirai-a-feature-flag-library-for-the-jvm-98974c4e1d1b?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/98974c4e1d1b</guid>
            <category><![CDATA[libraries]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[api]]></category>
            <category><![CDATA[feature-flags]]></category>
            <dc:creator><![CDATA[Stephen Duncan Jr]]></dc:creator>
            <pubDate>Mon, 14 Jan 2019 16:56:00 GMT</pubDate>
            <atom:updated>2019-01-14T16:56:00.866Z</atom:updated>
            <content:encoded><![CDATA[<p>by Stephen Duncan</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IiQrziGsZmcpqVoy3b8_1g.jpeg" /></figure><p>Software is really complicated. A lot can go wrong, especially when you’re working with cloud-based, distributed systems. You accidentally introduce a bug into your logic while refactoring. You update your code and inadvertently create performance problems in your service. You could make a small change that performs well in your service but creates performance problems in another service. Perhaps you make an update that seems compatible, but actually causes problems in clients of your service.</p><p>Here I will discuss some of the techniques we have at our disposal to limit the risk of negative impact to our consumers. Then I will introduce you to a library I created in order for my team (and others) to implement feature flagging: Moirai.</p><p><strong>Minimizing Risk</strong></p><p>We can write tests to run against our code in isolation, like unit tests. But these tests don’t cover a lot of the integration problems we see in practice.</p><p>We can deploy our service to a test environment and run tests against that system. But it’s expensive and difficult to replicate a realistic environment and — even when we try — it’s never a perfect match for production. Even if we do get close, there are still a lot of problems that we can’t predict with our tests.</p><p>To limit the scope of impact for a change, we deploy new versions to take a fraction of the traffic before taking 100% of the traffic. This is good for catching widespread problems that will show up in our monitoring tools, but it’s not great for problems that show up more subtly.</p><p><strong>Feature Flagging</strong></p><p>With feature flagging we make our change and enable it by some condition (or flag), so that we can control how the new code is exposed. 
Also called “feature toggles,” feature flags can be used for many purposes and in many ways: Release Toggles, Experiment Toggles, Ops Toggles or Permission Toggles are all categories described in this <a href="https://martinfowler.com/articles/feature-toggles.html">Feature Toggles</a> article. Of particular interest are Ops Toggles, which the article describes as:</p><blockquote>These flags are used to control operational aspects of our system’s behavior. We might introduce an Ops Toggle when rolling out a new feature which has unclear performance implications so that system operators can disable or degrade that feature quickly in production if needed.</blockquote><p>We use feature flags to mitigate the risk of deploying changes. By placing conditions on exposing changes to our code — instead of doing so via deployment mechanisms — we gain more fine-grained control over who is affected and by which changes. For example, we can set the condition that our feature is only used by internal alpha testers on our team, or we can set the condition that allows our feature to be used by a wider beta testing group. Essentially, we can roll out our feature to a percentage of our users at a rate we feel comfortable with in order to gauge the performance impact.</p><p><strong>Moirai</strong></p><p>When my team wanted to start using feature flags as part of our development and deployment process, I briefly looked for existing open-source solutions for the JVM (we primarily use Scala for our services as well as some Java). Most of what I found seemed heavy-weight, with a lot of assumptions that didn’t fit our needs. For instance, I found solutions that are Java Servlet-based or database-backed, and some that rely on thread-locals that interact poorly with reactive code patterns where logic runs on many different threads.
I couldn’t find anything that quite fit our needs.</p><p>As a result, I created my own library — one meant to be light-weight (usable with minimal dependencies for any JVM project) and flexible (functionality composable to meet a wide variety of usage patterns). I call it “Moirai” (the Fates from Greek mythology). Moirai has been used in dozens of services by multiple teams and is now part of Nike’s open source contributions.</p><p>Moirai consists of a few main features. The first is a ResourceReloader for periodically fetching updated configuration from some source. This allows us to adjust the settings for our feature flags within a minute, instead of requiring a whole new deployment. The use of the ResourceReloader is optional; you may prefer to stick with making changes only through deployments.</p><p>Second, Moirai contains modules for configuration sources and formats. Currently, it has Amazon S3 as a configuration source, and it supports HOCON via <a href="https://github.com/lightbend/config">Typesafe Config</a> as a format. We deploy our config file through a Jenkins job that uploads the file to S3 after verifying that the config file can be parsed. Contributions for other sources or formats are welcome.</p><p>Finally, Moirai supports some common patterns for deciding if a feature flag should be enabled. Currently, it supports an explicit list of user identifiers that should have the feature enabled, as well as a proportion of users that should have the feature enabled using a modulo of the hash of the user identifier. These are expressed as predicates, which you can then combine to fit your requirements. We typically combine them with or to make them additive (as shown in the Moirai <a href="https://github.com/Nike-Inc/moirai">README</a>). It’s also easy to add custom logic.
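The two built-in patterns, combined additively, can be sketched in a few lines. This is a Python illustration of the idea only; Moirai itself is a JVM library, and its actual API and hash function differ (see the README):

```python
import zlib

def enabled_for_user(user_id: str,
                     enabled_user_ids: set,
                     enabled_proportion: float) -> bool:
    """Explicit allow-list OR'd with a proportion-of-users rollout."""
    if user_id in enabled_user_ids:
        return True
    # Hash the user id into a stable bucket 0..99; a user is enabled
    # when their bucket falls below the configured proportion.
    bucket = zlib.crc32(user_id.encode()) % 100
    return bucket < enabled_proportion * 100

# Alpha testers always enabled; everyone else in a 1% rollout.
print(enabled_for_user("8675309", {"8675309", "1234"}, 0.01))  # True
```

Because the bucket is a deterministic function of the user id, a given user’s result is stable between checks, and raising the proportion only ever adds users to the enabled set.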
For example, it’s simple to toggle a feature on or off completely rather than based on users, or to use some other aspect of your data (for instance, a particular entity being requested instead of the user making the request).</p><p>Putting it all together, a typical configuration for your project might look like this:</p><pre>moirai {<br>  data.useNewService {<br>    enabledUserIds = [<br>      8675309<br>      1234<br>    ]<br>    enabledProportion = 0.01<br>  }<br>}</pre><p>This lets you test in production with some specific users and then start rolling out to a percentage of users over time, so you can monitor impact on performance and stability while minimizing the risk of a major negative impact.</p><p>In your code, checking for the feature flag would look like:</p><pre>public int getData(String userIdentity) {<br>  if (featureFlagChecker.isFeatureEnabled(&quot;data.useNewService&quot;, FeatureCheckInput.forUser(userIdentity))) {<br>    return dataFromNewService();<br>  } else {<br>    return dataFromOldService();<br>  }<br>}</pre><p>Moirai is designed to have minimal dependencies. The core module only depends on SLF4J, and the other modules add in the dependencies for the particular technology they use. It’s also designed to let you flexibly combine different sources, formats and decision patterns. See <a href="https://github.com/Nike-Inc/moirai">the README</a> for all the details. You can even use the ResourceReloader to reload any kind of configuration or data that you need, not just feature flags.</p><p><strong>Final Thoughts</strong></p><p>We have found Moirai and feature flags in general to be a very useful addition to our toolbox for mitigating risk. We’ve used them to roll out new implementations of features for performance, to control the load from back-calculating data for our whole user-base, and even to add whole new services into our data-processing flow.
Multiple times, this has let us find problems and fix them with little or no impact to our user base.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=98974c4e1d1b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/moirai-a-feature-flag-library-for-the-jvm-98974c4e1d1b">Moirai: A Feature Flag Library for the JVM</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Continuous Learning — Part II]]></title>
            <link>https://medium.com/nikeengineering/continuous-learning-part-ii-41789e5e32fd?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/41789e5e32fd</guid>
            <category><![CDATA[continuous-learning]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[burnout]]></category>
            <category><![CDATA[leadership]]></category>
            <dc:creator><![CDATA[Courtney Kissler]]></dc:creator>
            <pubDate>Mon, 07 Jan 2019 23:51:57 GMT</pubDate>
            <atom:updated>2019-01-09T19:08:21.786Z</atom:updated>
            <content:encoded><![CDATA[<h3>Continuous Learning — Part II</h3><p>I strive to be a lifelong learner. I continuously seek out new information, so I can grow, acquire knowledge and expand my understanding, with the goal of always learning. I am proud that technology leaders at Nike put an emphasis on <em>continuous learning </em>and believe it’s a critical component to our success and evolution as innovators in the digital space. Leaders and employees alike are constantly pushing the boundaries of what’s possible, expanding their competencies with a diversity of thought — and technology — in order to get after new opportunities in the marketplace. At Nike, this means leading retail’s digital transformation and creating the future of sport for athletes* globally.</p><p>Events like the DevOps Enterprise Summit provide an incredible opportunity for learning. I have been attending the summit since it began in 2014. That year, I spoke about Nordstrom, my previous employer, and our journey to transition its software development practice from one that optimized for cost to one that optimized for speed. That was also the year that I met people from Target, Disney, Nationwide, CSG and Capital One, among many others, all of whom have served as inspiration and sounding boards for my own continuous learning, as we all work to drive transformation at our respective companies.</p><p>This year’s <a href="https://events.itrevolution.com/eur/">DevOps Enterprise Summit</a> in Las Vegas did not disappoint. Below, I’ve summarized several key learnings and their importance to me, personally, and to Nike’s digital engineering organization.</p><p><strong>DAY 1</strong></p><p>SHARING AND LEARNING</p><p>Nike sent a whole team to this event — all of whom were first-time attendees. Our strategy was to split up and cover as many of the talks and workshops as we could, using a Slack channel to share our learnings. 
After we returned to the office in Oregon, team members started practicing things they had learned. I’ve already received feedback about how excited our teams are to see leaders demonstrating their commitment to creating a learning organization.</p><p>Suggested content: <a href="https://www.youtube.com/watch?v=NBbaeTm2He0&amp;list=PLvk9Yh_MWYuyUFYs-DqU4mj9OuQ-s0I6c">DevOps Enterprise Summit founder, Gene Kim’s, kick-off</a></p><p>BUSINESS BUY-IN AND PARTNERSHIP IS CRITICAL FOR TRANSFORMATION</p><p>Anyone can simply start practicing DevOps techniques within their organization, but doing so without wider buy-in will only get you so far. Having a strong business partnership is a critical component; it accelerates results — and sustains them. At Nike, our engineering leaders are focused on building close ties with our business stakeholders, like Anne Bradley, Nike’s Chief Privacy Officer and Legal Counsel. She found great value in attending the event, so much so that she even started a new hashtag: #bringyourlawyertoatechconference. Below is a photo of Anne and me with members of the Capital One team (Topo Pal, Distinguished Fellow and speaker at the summit, and Jamie Specter, legal counsel) <a href="https://twitter.com/chawklady/status/1054533642901278720?s=21">https://twitter.com/chawklady/status/1054533642901278720?s=21</a></p><p>Topo closed his presentation sharing his #DevOpsHashtags, including a new one — #MVC (Minimum Viable Compliance) — which came from our Nike presentation.</p><p><strong>Suggested Content: </strong><a href="https://www.youtube.com/watch?v=tgf_D2DUlJ0">CSG</a> <a href="https://www.youtube.com/watch?v=CTDx627FRVg">Capital One</a></p><p><strong>INDUSTRY BURNOUT</strong></p><p>Throughout my career, I have seen the industry reward heroics. Pulling all-nighters to deploy releases has been considered a badge of honor. 
I believe this leads to burnout in employees, and I’m working hard to change this culture within our engineering organization at Nike. My goal is to create an environment that does <em>not </em>encourage this behavior and, instead, focuses on healthier and more sustainable practices. Exhaustion, cynicism and professional inefficacy are all symptoms of burnout. Exhaustion is individual stress (“can’t take it anymore”). Cynicism is a negative response to the job (“socially toxic workplace”). Professional inefficacy is negative self-evaluation (“no future”). One small but critical step I have taken is to do my best to avoid sending emails outside of work hours. This can be really difficult to do, especially in organizations with meeting cultures, where the only time you have to send emails is often after hours. Another tactic I have used is to create a daily stand-up with my leaders. It gives us the opportunity to check in, align on the plan for the day, balance workloads, if needed, and minimize emails and other communication mechanisms.</p><p>Not many people will raise their hands and say they are burned out. So, rather than focus on signals and indicators, my preference is to focus on making sure my teams are getting time to recharge, eliminating toxic behavior and leading with intent and purpose. In this way, teams are connected to the work they are doing, to the organization’s mission (in Nike’s case, “to bring inspiration and innovation to every athlete* in the world”) and to the consumers they are serving. In January, I will be starting daily stand-ups with my direct reports. I’m really excited to try it at Nike because I have seen it help a lot in past roles. I’m hopeful we will see the same outcomes. And, if not, we will learn and adjust. #alwayslearning</p><p><strong>Suggested Content: </strong><a href="https://www.youtube.com/watch?v=gRPBkCW0R5E">Dr.
Christina Maslach</a> <a href="https://www.youtube.com/watch?v=PMg-I3K6bs0">Industry Panel</a> <a href="https://www.youtube.com/watch?v=6rzLebuaNoY">Josh Atwell</a></p><p><strong>DAY 2</strong></p><p><strong>MAKING LIFE EASIER FOR ENGINEERING TEAMS</strong></p><p>A lot of enterprises have silos and continue to deliver technology in a traditional way — development teams develop, test teams test, release teams release, operations teams operate, security teams secure, architecture teams architect, project managers manage projects — you get my point. I believe that all roles in organizations need to evolve towards a DevOps model, and it was refreshing to see some of the ways big companies are doing this with a focus on optimizing for engineers, speed and moving to a product model.</p><p>Technical architects, along with a product manager, from Target talked about moving away from governance and toward guidance. This involves eliminating governing bodies, like ARBs (Architecture Review Boards), and having a crowd-sourced recommended technologies approach with a focus on accessibility, transparency, flexibility and culture. This resonated with me because, at Nike, we are focusing a lot on creating a strong internal community of practice, focused on inner sourcing and making architecture everyone’s job.</p><p>Operations is another area requiring evolution. Damon Edwards gave a great talk about the forces undermining operations. This also resonated with me because I started in operations, and I know that we still have opportunities to build trust between teams, improve feedback loops and break down silos. At Nike, we are focusing on aligning outcomes between teams. For example, my engineering teams have OKRs around MTTR, MTTD and change failure rates, and we are sharing outcomes with our operations teams, including deployment frequency and cycle time. This will help our teams focus on outcomes and find ways to improve. 
It’s important for us to use data because every team is different; some are further along, and some still require investment in capabilities.</p><p>Moving from project to product is a very popular topic, as organizations transform, and it was great to hear from Rob England and Dr. Cherry Vu about how “project management was the worst thing that ever happened to IT.” I’ve experienced this in my career, as I’ve seen three big companies try to move away from projects to products. Often, attributes of the project approach are challenging to change. Projects require you to plan, plan, plan, plan, know everything up front, provide estimates for the length of the project (often 12–24 months) and deliver in a big bang approach. It does not promote a dynamic, learning organization and makes it really challenging to adjust or change course. Traditional project management is an anti-pattern for practicing DevOps.</p><p><strong>Suggested Content: </strong><a href="https://www.youtube.com/watch?v=4XYIs-6ZMjk">Target</a> <a href="https://www.youtube.com/watch?v=1zUtBLZ4Lus">Damon Edwards</a> <a href="https://www.rundeck.com/self-service">Self-service guidance doc</a> <a href="https://www.youtube.com/watch?v=xexEDofXz4c">Rob England and Dr. Cherry Vu</a></p><p><strong>LEARNING FAST AND LEADERSHIP EVOLUTION</strong></p><p>At Nike, our technology teams are focused on creating a dynamic, learning organization — one that can out-learn the competition. Creating a culture of continuous learning is strategic and creates competitive advantage. In this context, I don’t like using the phrase “fail fast.” It’s more important that we “learn fast.” When issues arise, I try to shift the conversation away from “who” and, instead, focus on <em>what </em>in the system needs to be improved. It’s important to move the focus away from finding root cause to finding the contributing factors in order to understand them and move forward with a problem-solving mindset.
Finally, I’m working to eliminate “human error” from our vocabulary when discussing root cause. Human error is never the root cause, and there is never a single root cause. It’s more important to focus on the contributing factors that led to a breakdown in the system.</p><p><strong>Suggested Content: </strong><a href="https://www.youtube.com/watch?v=EzLiRqiRNLk">Dr. Spears</a> <a href="https://www.youtube.com/watch?v=Y5f_kCc9k2I">Scott Nasello</a></p><p><strong>BE PERSISTENT, CELEBRATE THE WINS AND BECOME STORYTELLERS</strong></p><p>The fireside chat with Chris O’Malley, CEO of Compuware, was one of my favorite parts of the conference. He talked about creating value and not making excuses. The DevOps community is too focused on talking about traditional metrics (MTTD, MTTR, deployment frequency, etc…), he said. Instead, we should be focused on the ideas that our teams bring to life and share those stories with product management. We need to become storytellers, he said, and we need to celebrate the people who take risks. O’Malley shared that, every two weeks, he holds a town-hall to showcase the ideas and products being delivered. Teams share their blockers and constraints and put problems front and center and, in doing so, move the organization forward.</p><p>He closed by telling all of us to stay persistent and continue to drive change. At any given point, 2/3 of the organization will either be passively or actively resisting change. Stay courageous. 
This made me think of a phrase I use a lot, “Keep fighting the good fight!”</p><p><strong>Suggested Content: </strong><a href="https://www.youtube.com/watch?v=r3H1E2lY_ig">Chris O’Malley</a></p><p>All videos from the 2018 DevOps Enterprise Summit in Las Vegas can be found <a href="https://www.youtube.com/channel/UCkAQCw5_sIZmj2IkSrNy00A">here</a>.</p><ul><li>If you have a body, you are an athlete.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d-z4KqFcIUey4l6dqcXCaQ.jpeg" /></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=41789e5e32fd" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/continuous-learning-part-ii-41789e5e32fd">Continuous Learning — Part II</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerate: Transforming Nike Digital APIs with GraphQL]]></title>
            <link>https://medium.com/nikeengineering/accelerate-transforming-nike-digital-apis-with-graphql-a541aebe4d5e?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/a541aebe4d5e</guid>
            <category><![CDATA[api]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[front-end-development]]></category>
            <category><![CDATA[graphql]]></category>
            <dc:creator><![CDATA[Austin Corso]]></dc:creator>
            <pubDate>Fri, 21 Dec 2018 16:56:00 GMT</pubDate>
            <atom:updated>2018-12-21T16:56:00.784Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><em>“Four weeks of engineering saved.”</em></blockquote><blockquote><em>“7,500 lines of client code and tests eliminated.”</em></blockquote><blockquote><em>“16x reduction in data-over-wire.”</em></blockquote><blockquote><em>“Quicker mobile releases.”</em></blockquote><p>These are just a few of the exciting accomplishments teams are achieving with GraphQL at Nike. Officially released by Facebook in 2015, <a href="https://graphql.org/">GraphQL</a> is an open source specification that provides a more efficient, powerful and flexible alternative to REST for front-end web and mobile experiences. GraphQL enables two paradigm shifts for client teams.</p><p>First, it empowers a flip in the client-server REST model. Rather than procedurally piecing together a multitude of endpoints and schemas themselves, client teams simply define their exact data schema requirements across services in a declarative query. This reduces the domain-model learning curve, the number of expensive client-server network requests and the over- and under-fetching of data.</p><p>Second, GraphQL’s declarative interface, along with our growing API graph, facilitates reusability across client teams. Teams leveraging the same shared resolvers and API schemas no longer need to stand up their own aggregation layers, greatly speeding up time to market. This all but eliminates custom, one-off aggregation code that would otherwise need to be developed, tested and maintained redundantly by each client platform and experience team.</p><p>As Nike continues to innovate and explore GraphQL, teams have started a cross-org working group called “GRAND”, which stands for <strong>Gr</strong>aphQL <strong>a</strong>t <strong>N</strong>ike <strong>D</strong>igital. GRAND aims to provide a common set of stateless aggregation gateways on top of Nike’s many hundreds of microservice APIs. 
Further, GRAND’s objectives are to:</p><ul><li>Improve time-to-market through thinner clients with reduced network calls and data orchestration; no more overhead to build and support one-off aggregation layers</li><li>Provide shared functionality through reusable GraphQL schemas and resolvers</li><li>Improve client performance by reducing the number of network calls needed for client applications</li></ul><p>Teams at Nike innovating with GraphQL include Checkout, Cart, Wishlist, Nike App, CMS and Nike By You (NikeID), to name only a few. Here is a brief account of the teams’ experiences with the open-source API query language.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p0rUQbYVuYNpmK5OVD2ABQ.png" /></figure><h3><strong>Nike.com Checkout</strong></h3><p>The Checkout team began their GraphQL journey in late 2017 to deliver the best Nike.com cloud checkout experience for consumers. The team needed to aggregate data across a multitude of microservices. An early GraphQL proof of concept quickly demonstrated the value they were hoping for, and by early 2018, they had limited releases to production.</p><p>When the Cart and Wishlist <a href="http://nike.com/">Nike.com</a> experiences also needed aggregation, it was clear that the time had come to evolve the GraphQL POC into a common stateless gateway, reusable across experiences. Rather than develop and maintain custom code and infrastructure for each individual experience, a reusable GraphQL implementation was chosen.</p><p>Adapting their learnings from the open-source community as well as from Nike’s inner-source GraphQL community, the GraphQL working group and Checkout team started GRAND. Much of the early work on GRAND came from abstracting the successes of Checkout using GraphQL. The goal for GRAND is to provide a set of horizontally-scalable stateless GraphQL aggregation gateways.
Once a Nike microservice is integrated by a single team, it becomes part of Nike’s larger graph, reusable by all Nike web and mobile experiences.</p><p>With GRAND, the Cart team was able to deliver on tight deadlines for the 2018 holiday season — something that would have been tough with a built-from-scratch experience aggregation service. Having used the same underlying schemas, the Cart and Checkout teams were able to share data via web storage to allow a user to skip redundant API calls when navigating between web applications. This vastly boosted performance for users. This initial work on GRAND set the stage for the production GraphQL gateway, which was leveraged by several additional cloud experiences in late 2018.</p><p>GraphQL and GRAND provide a number of benefits to the Checkout team, including:</p><ol><li>Reusability: Rather than building custom, one-off aggregation layers for every new experience, GraphQL API integrations are reused across experiences. This significantly speeds up time to market, serving consumers more quickly.</li><li>Experience-First APIs: The Checkout team is able to declaratively define the precise data they need across microservices. This increases client performance as less data is transferred between client and server with fewer round trips.</li><li>Lighter client: As more services have become stateless, clients have become more stateful. This leads to thicker clients with more code being pushed to the front-end. GraphQL simplifies Checkout’s front-end state management, accelerating their team and making continuous updates easier.</li><li>Observability: Planned GraphQL tooling allows for a level of in-depth analysis and visualization. The team can easily observe how underlying services are performing down to the individual data fields being requested by clients. 
Service teams, too, will have a clearer understanding of how they are being called and by whom.</li><li>Automation: The open source community tooling allows for automated generation of mocks and tests based on GraphQL schemas. The plan is to also automate the generation of GraphQL schema from our microservice OpenAPI 3 Swagger docs. (The robots are coming!)</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oS1IpI9996EHylec6v3ViQ.jpeg" /></figure><h3><strong>Nike App at Retail</strong></h3><p>The Nike App team leveraged the same shared solution in GRAND to quickly integrate with downstream APIs as they brought the Nike App to retail doors. The goal of the Nike App and its localized product recommendations is to provide personalized, contextual recommendations near consumers’ real-time locations.</p><p>Network calls are particularly expensive and a drain on battery life for mobile apps. The best practice is to deliver the desired user experience with the fewest network requests possible. In the world of microservices, this is a challenge.</p><p>The solution was to use GRAND and GraphQL to provide a single network call for the Nike App, retrieving data across three nested microservices and reusing API integrations provided by other teams that use this stateless gateway.</p><p>Using GraphQL and GRAND amounted to a triple win for the Nike App:</p><ol><li>Code reduction: The single GraphQL API call reduced hundreds of lines of client code needed to orchestrate calls across many microservices.</li><li>Reusability and consistency: Both iOS and Android Nike apps are able to reuse the same gateway for data aggregation.</li><li>Performance gains: Expensive client network requests are reduced. Further, GraphQL eliminates the over-fetching of data by tailoring the response to their mobile experience requirements.
Nike App reduced their payload from 500KB to 11KB!</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*uqTR-LC0k1jEfLNfrBOI1A.jpeg" /></figure><h3><strong>CMS</strong></h3><p>The CMS team began using GraphQL when migrating their platform from Aurelia to <a href="https://reactjs.org/">React</a> (another popular Facebook OSS library). One difficulty with the previous platform was the multiple API calls needed due to Nike’s microservice architecture. The team knew an aggregation layer would reduce network calls and greatly simplify the front-end development process. When they decided to migrate to React, they, too, implemented GraphQL as their aggregation layer.</p><p>Today, there are more than 17 microservices that the CMS team aggregates with GraphQL to deliver a premium content authoring experience. Previously, in a monolithic architecture with fewer services, the CMS team would manage the asynchronous nature of multiple API calls on the front-end, calling each service and dealing with responses one at a time, as they resolved in a promise chain. With GraphQL, the CMS front-end just needs to make one request to retrieve a data model optimized for the front-end experience.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/810/1*VN14ZSCKjUTemat7k44KBA.jpeg" /></figure><h3><strong>Nike By You (NikeID)</strong></h3><p>Nike’s Customization team began their GraphQL journey in 2017 to simplify how the Customization Lab, an internal operations web application, communicates with a complex network of back-end microservices.</p><p>Similar to the CMS team, the NikeID team adopted GraphQL alongside a front-end re-architecture. Transitioning from a large Redux store containing many modules made up of hundreds of lines of asynchronous data orchestration to a declarative, query-based model was a breath of fresh air. 
Using Apollo Client, the NikeID team mapped GraphQL queries to their respective front-end components and fetched data only as needed.</p><p>On the server, composing dozens of REST endpoints into a clean set of GraphQL queries and resolvers felt natural and intuitive compared to the often-contrived custom aggregation services of the past. Past custom aggregation layers were essentially tech debt before they were ever deployed, quickly becoming maintenance nightmares. Downstream service contracts would change, new business requirements would arise, and there would be limited bandwidth to maintain them. With GraphQL, the team has a flexible, reusable interface to satisfy their data requirements.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*nJE-mpNhskhGrcq8DeqYZQ.jpeg" /></figure><h3><strong>Order Futures Buy, GameDay, and API+</strong></h3><p>Nike’s Commercial Content team, in conjunction with the Nike Technology Order Futures Buy, GameDay and API+ teams, identified a recurring issue in how product data was managed. For years, these teams created their own data lakes of cached product data. Time and again, they ran into data inconsistencies, resulting in negative consumer impact.</p><p>The team made a unanimous decision to use a single, flexible product API using GraphQL. After creating a number of proofs of concept for GraphQL frameworks, including a Spring Boot/Java implementation and a Scala/Akka implementation, the team landed on using the same <a href="https://github.com/apollographql/apollo-server">NodeJS Apollo Server</a> implementation being used by the teams above. Since aggregating microservices often introduces an I/O problem for clients, where they must call many APIs asynchronously, NodeJS and JavaScript became a clear choice for them.</p><p>The team uses GraphQL to provide a better user experience. GraphQL allows them to return smaller payloads by specifying exactly the data schema needed. 
Their UI no longer needs to maintain a set of APIs to fetch various data points. Their GraphQL implementation aggregates five REST microservices: their Image Library service, Copy Service, Product Attribute Service, Video Library and the Tech Sheet Service. The team is excited for its Q4 release and is scaling to meet the needs of its various teams.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/864/1*Qex9c0l4NPK3x7GqmtYK1w.jpeg" /></figure><h3><strong>Conclusion</strong></h3><p>At Nike, we obsess over optimizing the consumer journey — aiming to deliver premium experiences throughout our digital ecosystem. We constantly evaluate new technologies that may improve these experiences. As the software industry produces new technologies daily, it’s important to carefully weigh the benefits of any new technology against the cost of adoption. GraphQL is proving to be a winner for both Nike and our consumers.</p><p>A cross-org working group continues to utilize a shared graph of Nike APIs to facilitate reusability across experiences. As domain microservice APIs are integrated, any experience will be able to effortlessly retrieve data without any additional custom aggregation code. Stay tuned, as teams at Nike continue to explore leveraging GraphQL to redefine the Nike shopping experience at global scale.</p><p>Interested in joining the Nike Team?
Check out <a href="https://jobs.nike.com/category/digital-and-technology-jobs/15675/37648/1">career opportunities here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a541aebe4d5e" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/accelerate-transforming-nike-digital-apis-with-graphql-a541aebe4d5e">Accelerate: Transforming Nike Digital APIs with GraphQL</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Engineers Should Be Invested In Their Product]]></title>
            <link>https://medium.com/nikeengineering/why-engineers-should-be-invested-in-their-product-3d1ff76275a2?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/3d1ff76275a2</guid>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[product-management]]></category>
            <category><![CDATA[process]]></category>
            <category><![CDATA[nike]]></category>
            <dc:creator><![CDATA[Sobhagya Jose]]></dc:creator>
            <pubDate>Fri, 07 Dec 2018 16:56:00 GMT</pubDate>
            <atom:updated>2018-12-07T16:56:00.819Z</atom:updated>
            <content:encoded><![CDATA[<p>by Sobhagya Jose</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6XjJLc8wcySK6RuAuYA0hQ.jpeg" /><figcaption>Nike Run Club (NRC) on the Apple Watch</figcaption></figure><p>Are you an engineer? When was the last time you had a conversation with your Product Manager about the product you’re building? If it has been longer than two weeks, I challenge you to rethink your motivation for writing code.</p><p>Engineers have a responsibility to create great consumer experiences, and engineers who are invested in their product’s vision can help a good product become a great product. Engineers can embrace this responsibility by influencing product direction and building the right things — things that may not necessarily be identified by a product manager working alone. In this blog post, I will share the fundamentals of product management and how engineers can participate in the product management cycle while keeping these fundamentals in mind.</p><p>A lot of backend engineers will tell you that they are proud of writing great code, from their code base to the architecture of their systems. For example, as part of the Motivation team supporting gamification for Nike Run Club (NRC) and Nike Training Club (NTC), we write code that is clear and functional, using new open-source libraries such as <a href="https://github.com/akka/akka">Akka</a> and <a href="https://github.com/typelevel/cats">cats</a>. Additionally, using a mix of internal and open-source frameworks, we make sure our architecture is scalable, available, robust, and easy to operationalize and maintain. However, good product direction is just as important as good technology, if not more so. Without good product direction, a large fraction of your users may not even make it down the feature funnel of a new feature you just shipped, resulting in wasted talent, effort, and opportunity.
One way to help avoid this is for engineers building a product to be mindful of what makes a great product, in addition to great code. To do that, channel the mindset of a product manager. Here’s how.</p><h3><strong>Pillars of product management</strong></h3><p>First, a good product manager will consider two key pillars while guiding the development of a product: 1) vision (heart), and 2) data (science). To be a good shepherd of the product you are building, consider keeping these in mind throughout development.</p><h4>THE POWER OF VISION</h4><p>To start, the heart of any product or feature is a vision. A vision outlines what the product wants to be in the future, not only for the benefit of the customers, but also the people building the product. Think about questions such as: Who are you building this for? When are they going to use it? Are you solving a problem or delighting the user?</p><p>Take Nike’s mission statement for example: “<strong><em>to bring inspiration and innovation to every athlete* in the world.</em></strong>” That little asterisk indicates that if you have a body, then you’re an athlete. This motto, this vision, is the North Star for Nike employees regardless of the role they play, from directors to designers to engineers to program managers to product managers.</p><p>Having a vision is crucial to everyday decision making. For example, you can use your vision to resolve a common source of conflict in prioritization of engineering work: tech backlog versus new features. Consider the Nike Run Club (NRC) app redesign in 2016, a major overhaul of its platform architecture motivated by the desire to better support our users at scale and increase the ability to innovate within the app faster. Despite some initial pushback from long-time users, in the two years since the release, NRC has steadily climbed the charts with an app rating of 4.8 stars as of Sept 17, 2018, compared to 4.43 stars (July 1, 2016) before the redesign.
NRC has also released (and re-released) an array of great features such as Achievements, Audio Guided Runs, Challenges, Cheers, and Shoe Tagging, all built on the new cloud-enabled infrastructure. Decisions like these — balancing scalability and agility — serve the vision of the product to bring inspiration and innovation to every athlete in the world.</p><p>As engineers look to embody the principles of a product manager, ask yourself or your core project team what the vision is for the product or feature set you are creating. Every decision should clearly come back to your vision. It’s this clear connection that will keep your end-product pure and the end-user experience premium.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*xZ0Mos5MSISNb1D3Wt1Lwg.png" /><figcaption><em>Dig. 1: It is possible to scale while remaining agile by implementing systems that incentivize innovation and gaining alignment around a clear vision. Source: Kevin Wilkins @ trepwise in Are you Netflix or Blockbuster? How a Clear Vision is Vital to Innovation.</em></figcaption></figure><h4>LEVERAGE THE DATA</h4><p>The second pillar of product or feature development is data: quantifying the value provided by the product or feature to the users in a metric that business stakeholders understand. These consumer-focused measurements, such as metrics and Key Performance Indicators (KPIs), give both the engineering and business sides of the product a standard way to talk about the success of the product. Choosing the right metric to drive will serve the customers as much as the business, and it’s important to balance out both needs. A good product manager uses metrics to steer the development of a product as well as gain broad business buy-in, both skills that engineers can use to champion their work.</p><p>This balance is necessary: a good product can only grow so much without funding, and a bad product with a ton of money (marketing, etc.)
thrown at it will never sustain growth. For example, the Lifetime Value (LTV) of app users is a meaningful high-level metric for evaluating the profitability of an app. For a revenue-focused app, LTV can be boiled down to a dollar amount per user, whereas for a non-revenue-focused app, you can instead use user engagement, or a combination of both, as a metric to drive. That’s not to say that you should shoot down “silly” or “fun” ideas that you don’t expect to significantly move the metrics. Be aware of who your audience is, and always do a post-launch analysis; there is always scope for it to turn into a great idea.</p><p>Take the example of the Personalization team at Nike Digital working on the <a href="https://medium.com/nikeengineering/serving-athletes-with-personalized-workout-recommendations-285491eabc3d">Workout recommender</a> for the Nike Training Club (NTC) app. The team, which is made up of engineers, data scientists, and business/product, collectively decided that the key KPI would be driving engagement by encouraging more users to start their first workout. The engineering team leveraged a wide variety of data and iterated on strategy before moving forward with their technical solution. With the support of their product team, they utilized A/B testing and goal-driven development to build the right product so as to Serve the Athlete*, while being scalable and robust in order to be more maintainable. This means open communication between engineers and product teams, inside and outside meetings, as equal stakeholders. The team demonstrated perfectly how to leverage data: “A/B testing results showed that workouts from the Picks for You section were started five percent more often when powered by the Neural Network model than Collaborative Filtering. This way we were able to gain five percent more engagement on top of the 57 percent engagement gain from the initial model!”</p><p>Think of your KPI as your anchor.
Keep coming back to it throughout the development to see if what you are creating will drive that KPI. If it doesn’t, ask yourself “Why are we building it?”</p><h3><strong>Participating in the Product cycle as engineers</strong></h3><p>Now that we know the main pillars of good product management, how do you, as an engineer, start embodying the role of a product manager while creating your product? Here are a few pointers to incorporate into your everyday work mindset:</p><ul><li><strong>Remember the vision.</strong> Think big-picture, even while executing the details. Understand how the feature that you’re working on serves the consumer, and how it fits into the business goals of the organization.</li><li><strong>Participate in the process.</strong> This can be done as early as the ideation stage, by way of internal hackathons. For example, the delightful Custom Cheers feature in the NRC app was originally the winner of an internal hackathon! You can also participate in the product cycle at a later stage, such as product design/engineering design, by participating in feature review meetings, which are often organized by product owners.</li><li><strong>Know the priorities.</strong> Product owners are paid to ensure the team, including engineering (and design, quality testing, etc.), is working on the most important problems that provide the most value. Get coffee with your product owner to talk about what your team’s priorities are and why.</li><li><strong>Win as a Team.</strong> While executing the details of your feature, brainstorm and iterate over execution by talking with the other disciplines working on the same feature, such as backend developers, mobile developers, and designers. Doing so can avoid wasted effort by having everyone on the same page as much and as often as possible.
Don’t hesitate to have an exchange of ideas with members of different disciplines, as long as you are respectful of each other’s role and domain expertise in the team.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/468/1*opYDuZ3WAF8W5zw646NY9w.png" /><figcaption><em>Dig. 2: An example of a Product-Engineering relationship in the context of a product lifecycle. Source: Sherman Leung @ PatientPing in How to Avoid Over-Engineering the Product Role.</em></figcaption></figure><h4>PRODUCT MANAGEMENT WITH ENGINEERING IN ACTION</h4><p>A good example of a healthy PM/Engineering relationship is Asana. Asana is a web and mobile application designed to help teams organize, track, and manage their work. In her article titled <a href="https://medium.com/@jackiebo/how-we-develop-great-pm-engineering-relationships-at-asana-e83b3aa9eb04">How we develop great PM / Engineering relationships at Asana</a>, Jackie Bavaro, Head of Product Management at Asana, describes the key components of a successful working environment for software development: “At Asana, engineers are involved and can lead product direction from the beginning of the product lifecycle. [1] We gather input from across the company in planning our roadmap, and [2] include all the engineers on a team in research and design brainstorming at the beginning of each project.”</p><p>The product launched commercially in April 2012 and was most recently valued at $900M. Besides being a successful company, Asana also received a rare perfect rating on Glassdoor and a place on Glassdoor’s Top 10 Best Places to Work in 2017. It is not surprising, then, that the company claims to be a team of peers on a bold mission. The engineering team has a set of guidelines for making decisions driven by the company vision, and a culture of goal setting, where engineering and product are equally invested in the final product while contributing their respective domain expertise.
At Asana, “Engineers appreciate understanding the reasoning behind product decisions. PM’s love when engineers come up with valuable ideas enabled by creative technical solutions.”</p><p>Creating an environment of inclusivity between disciplines, driven by mutual respect and a common goal, can result in both a great product <em>and</em> a great work environment. Boom.</p><h3><strong>Summary</strong></h3><p>Engineers have the power to affect the community for the better and, as a consequence, the responsibility to question the motivation and value of what they are building. Engineers working as part of a larger organization have an opportunity that is not necessarily found at the same scale when working on, say, individual side projects. In a large organization, there is common leadership, and in an ideal world that means everyone is driven by a common goal. Also, for a software development team to be content, productive and delivering results, inclusive decision making and mutual respect for other disciplines are not nice-to-haves but crucial.</p><p>References:</p><ol><li><a href="https://medium.com/nikeengineering/serving-athletes-with-personalized-workout-recommendations-285491eabc3d">Serving Athletes* with Personalized Workout Recommendations</a></li><li><a href="https://medium.com/@jackiebo/how-we-develop-great-pm-engineering-relationships-at-asana-e83b3aa9eb04">How we develop great PM / Engineering relationships at Asana</a></li><li><a href="https://medium.com/pathtoproduct/how-to-avoid-over-engineering-the-product-role-e72922053772">How to Avoid Over-Engineering the Product Role</a></li><li><a href="http://trepwise.com/blog/treptip-4-are-you-netflix-or-blockbuster-how-a-clear-vision-is-vital-to-innovation/">Are you Netflix or Blockbuster?
How a Clear Vision is Vital to Innovation</a></li></ol><p>To learn more about jobs at Nike visit <a href="http://jobs.nike.com">jobs.nike.com</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3d1ff76275a2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/why-engineers-should-be-invested-in-their-product-3d1ff76275a2">Why Engineers Should Be Invested In Their Product</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Nike’s Digital Transformation Is Monitored]]></title>
            <link>https://medium.com/nikeengineering/how-nikes-digital-transformation-is-monitored-3c0799b3e443?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/3c0799b3e443</guid>
            <category><![CDATA[microservices]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[infrastructure]]></category>
            <category><![CDATA[serverless]]></category>
            <dc:creator><![CDATA[Adam Nutt]]></dc:creator>
            <pubDate>Wed, 28 Nov 2018 18:15:08 GMT</pubDate>
            <atom:updated>2018-11-28T18:15:07.763Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JpCNAx4HCDCJzg1lfk2hWw.png" /></figure><p>Nike is in the middle of one of the most significant digital transformations since it began doing business in 1972. The modern consumer wants products and content tailored to their individual needs. It is no longer sufficient to provide a curated set of offerings that are available to anyone; instead, we must create a personal experience, tailored for the specific ways our consumers engage with the brand. This can include things from <a href="https://news.nike.com/news/nikeplus-membership-benefits">offering early access to products</a> to <a href="https://medium.com/nikeengineering/serving-athletes-with-personalized-workout-recommendations-285491eabc3d">tailoring individual workout recommendations</a>.</p><p>Offering these various experiences also requires a shift in the way Nike thinks about its digital platforms. Five years ago, almost all of the platforms were commercial, off-the-shelf products purchased from various vendors. Teams were responsible for integrating these products together to enable the various capabilities desired. Now, many of our capabilities are developed in-house, giving us increased speed-to-market and flexibility. This increased speed and homegrown approach has also meant rethinking what it takes to monitor these systems.</p><p>When my team started to build the next generation of Nike services, we needed to determine what to monitor and what platforms best met our needs for monitoring. The set of monitoring systems we use spans the three pillars of observability: metrics, traces and logs. 
(Cindy Sridharan has an excellent primer on this topic titled <a href="https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38">Logs and Metrics</a>).</p><p>One of the main reasons to build this new set of services is to continue to support <a href="https://mashable.com/2017/08/06/nike-snkrs-app-drops/#igtRW0V2Gqql">shoe drops</a>. Our architecture needs to handle thousands of individual service instances appearing and disappearing in minutes, and our monitoring tools need to support this. Over time, we have learned that tools like <a href="https://github.nike.com/www/splunk.com">Splunk</a> are great for post-incident analysis, but with their current capabilities, they cannot tell us what is happening in real time. We needed a metrics platform that would allow for faster aggregation of the information, so we could see the activity much closer to real time.</p><h3>Measure What Matters</h3><p>In the summer of 2017, Nike started to use the platform that was originally developed for high-demand launches for its everyday online sales. This delivered on a promise that a single set of services developed for the cloud can be used for multiple experiences. As the team implemented several key features and headed into the holiday season, we needed to know that the services were working. Nike chose to implement a distributed microservices architecture, mainly for scaling and recovery concerns. But one of the challenges we faced with this architecture was how to monitor it.</p><p>We went through several standard monitoring frameworks before determining that custom metrics were needed to truly monitor the key performance indicators (KPIs). One of the key challenges in our search for the right metrics platform was finding one that could ingest large amounts of data quickly. This is because, during these launch events, we emit metrics at an extremely high rate. 
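</p><p>To make “custom metrics” concrete: a metric is just a named value with dimensions and a timestamp. Here is a minimal, vendor-neutral sketch of turning one checkout event into a counter datapoint (the field and metric names are illustrative, not our actual schema):</p>

```python
import time

def checkout_datapoint(region, payment_method, items_sold):
    """Turn one checkout event into a counter datapoint.

    The {metric, value, dimensions, timestamp} shape is the generic
    form most metrics platforms ingest; all names are illustrative.
    """
    return {
        "metric": "checkout.completed",
        "value": items_sold,
        "dimensions": {"region": region, "payment_method": payment_method},
        "timestamp_ms": int(time.time() * 1000),
    }
```

<p>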
We evaluated several vendors before determining that <a href="https://signalfx.com/">SignalFx</a> best fit our needs. The first version of these custom dashboards was implemented in late 2017. They brought an immediate measure of confidence to the teams responsible for the services; if something was wrong, folks would know about it quickly.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t9NZpRVLtKxOyXu7-AmD-g.png" /><figcaption>Checkout Business Metrics dashboard</figcaption></figure><p>The Checkout Business Metrics dashboard was the first dashboard we created. It immediately gave our team confidence in our ability to determine whether our platform was selling product without interruption. After using this dashboard for a while, we realized that it didn’t go far enough as a monitoring tool. While it significantly boosts visibility into our service, it does not reveal whether consumers are having issues while purchasing product on our site. For example, our inventory service returns a 4xx error when an item is out of stock. We always expect some level of 4xx errors to be happening, but there is cause for concern if that error rate becomes too high or too low. While we still use this dashboard, we evolved the concept with subsequent tools in order to provide clearer visibility into consumer errors.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*H99YhnVrVq9SVSFlQxeo6Q.png" /><figcaption>Shipping Options dashboard</figcaption></figure><p>During the holiday season last year, we had a couple of big public failures that could be attributed to just a few services. While the image above shows only a small portion of the entire dashboard, it is easy to determine the health of the service at a glance. You can see the number of requests from the internet as well as how many requests have been received by the service.
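</p><p>The out-of-stock behavior described above calls for a band check rather than a one-sided threshold: alert when the 4xx rate leaves its expected range in either direction. A toy sketch (the band boundaries are made up, not production values):</p>

```python
def fourxx_alert(total_requests, fourxx_count, low=0.01, high=0.20):
    """Return an alert reason if the 4xx rate leaves its expected band.

    Some 4xx traffic is normal (e.g. out-of-stock responses), so a rate
    that is too LOW can be as suspicious as one that is too high.
    The band boundaries here are illustrative only.
    """
    if total_requests == 0:
        return "no traffic"  # silence is itself a signal
    rate = fourxx_count / total_requests
    if rate > high:
        return f"4xx rate too high: {rate:.1%}"
    if rate < low:
        return f"4xx rate too low: {rate:.1%}"
    return None  # within the expected band
```

<p>A metrics platform expresses the same idea as upper and lower bounds (or an anomaly detector) on the error-rate signal.</p><p>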
The team responsible for this service has also exposed the latency of the services downstream, so they can quickly tell where a spike in latency originates.</p><p>The Checkout Business Metrics and Shipping Options dashboards represent different concerns to Nike. The first shows consumer-level KPIs: number of checkouts, average duration of checkout, etc. The second dashboard shows platform-level KPIs, including the number of requests and latency. With custom metrics, we can also enable a third type of KPI dashboard. This is strictly a business-level KPI dashboard and is used during key events to show our business partners what the platform is capable of.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/623/1*mTc5H7UvBClPWos0fhDCbA.png" /><figcaption>Shoes* Per Second chart</figcaption></figure><p>The chart above shows the number of items sold every 10 seconds during one of our high-demand events. There are several other business-facing KPI charts like this one, collected on a dashboard, that help leaders make key decisions in the moment about how well a particular event is performing. We have just started to enable many of these business-level charts and are currently implementing others.</p><h3>Infrastructure As Code</h3><p>One of the biggest challenges faced by Nike was that, once teams started to implement custom metrics, three different patterns emerged. In one pattern, a team wholly embraced custom metrics and took ownership of them. In another, teams didn’t see value in the metrics and didn’t prioritize the work to implement them. In the third, a team saw the value but lacked the skills or time to implement them.
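</p><p>The Shoes* Per Second chart above is, at heart, a windowed sum: bucket each sale event into a fixed 10-second window and total the items in each bucket. A stdlib-only sketch (the event format here is hypothetical):</p>

```python
from collections import defaultdict

def items_per_window(events, window_secs=10):
    """Aggregate (epoch_seconds, items_sold) events into fixed windows.

    Returns {window_start_second: total_items}, i.e. the series behind
    a chart of items sold every 10 seconds.
    """
    totals = defaultdict(int)
    for ts, items in events:
        window_start = (ts // window_secs) * window_secs
        totals[window_start] += items
    return dict(totals)

sales = [(100, 1), (104, 2), (109, 1), (112, 3)]
# seconds 100-109 fall in window 100; second 112 falls in window 110
```

<p>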
For the latter two patterns, one of our teams worked closely with our metrics provider to develop <a href="https://medium.com/nikeengineering/introducing-signal-analog-the-troposphere-like-library-for-automating-monitoring-resources-c99eb8c2dca7">signal_analog</a>.</p><p>signal_analog enables teams to define, version and deploy monitoring resources. Additionally, it gives us a standardized library for common metrics, so every team that reports latency does so using the same naming conventions. This feature greatly simplifies things for our Core SRE team. We use this in conjunction with <a href="https://github.com/Nike-inc/wingtips">wingtips</a>, which gives us distributed tracing based on the Google Dapper paper. The combination of these two tools means that teams that were reluctant or unable to implement monitoring now have a straightforward path with a low bar for implementing metrics for their services.</p><p>These two internally-developed, open source libraries (signal_analog and wingtips) offer the ability for any team to get up to speed quickly with tracing and metrics. They also offer a standardized dashboard to ensure consistency of reporting across teams and organizations.</p><h3>Conclusion</h3><p>Nike quickly realized that monitoring CPU load, free memory and other traditional infrastructure metrics did not answer the questions we ask ourselves, like, “Are our services contributing to a good experience for the consumer?” and “Are we continuing to sell product, or has there been an interruption?” Teams at Nike have now implemented custom metrics across many hundreds of microservices, including those using serverless architecture. This has allowed monitoring of customer, business and platform KPIs. Two years ago, it took teams almost 20 minutes to determine the source of an incident during times of high site traffic. Now, teams can see these problems within seconds and immediately begin to triage the errors. 
Enabling custom metrics has also given us the confidence to release code faster. The <a href="https://devops-research.com/">State of DevOps Report</a> shows that high-performing organizations move faster and build more resilient systems. After analyzing our own internal data about release frequency, we estimate that teams at Nike that utilize SignalFx release five to eight times faster than teams that do not.</p><p>Want to join the Nike Digital Team? Check out the available jobs <a href="https://jobs.nike.com/category/digital-and-technology-jobs/15675/37648/1">here</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3c0799b3e443" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/how-nikes-digital-transformation-is-monitored-3c0799b3e443">How Nike’s Digital Transformation Is Monitored</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Safely Deploy AWS Lambdas with CloudFormation]]></title>
            <link>https://medium.com/nikeengineering/safely-deploy-aws-lambdas-with-cloudformation-d0ee829d8351?source=rss----6d870c2faf5c---4</link>
            <guid isPermaLink="false">https://medium.com/p/d0ee829d8351</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[lambda]]></category>
            <category><![CDATA[continuous-delivery]]></category>
            <category><![CDATA[cloudformation]]></category>
            <dc:creator><![CDATA[Kendra Elford]]></dc:creator>
            <pubDate>Mon, 12 Nov 2018 19:36:01 GMT</pubDate>
            <atom:updated>2021-06-08T20:56:45.214Z</atom:updated>
            <content:encoded><![CDATA[<p>by Kendra Elford</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/650/1*ImppkkVAMRelvVd6LnrLYA.jpeg" /></figure><h3>First, A Little Bit of Background</h3><p>For some time now, my team has been well versed in building JVM-based services that run in EC2. We adopted continuous delivery years ago and are now comfortable deploying our software in a canary style, which has saved our bacon on several occasions. Recently, we’ve begun to dip our toes into newer AWS technologies, including Lambda and Amazon SQS FIFO <em>(First-In-First-Out) </em>queues. We have a long history of using SNS (Simple Notification Service) to deliver events to our various pipelines, but Amazon SNS lacks the ability for a subscription to set a “group id” on messages bound for FIFO queues. We felt this was a good place for us to start experimenting with Lambdas. We ended up building a re-usable AWS Lambda component that takes care of these subscriptions for us and can be deployed using our existing deployment infrastructure.</p><p>One piece of the puzzle was still missing for us: how were we going to address canary deployments? Here, I’ll share how we solved this.</p><h3>Canaries for Lambda Functions?</h3><p>If you’ve been practicing continuous delivery for your AWS EC2-based services, you’re probably already <a href="https://martinfowler.com/bliki/CanaryRelease.html">deploying changes as canaries</a>. In order to limit potential impact to your customers, you likely send a portion of your traffic to new code. This same process can be applied to Lambda Functions.</p><p>Lambda canaries are based on two abstractions over the Function: the Version and the Alias. As presented in the AWS console, these can be a little confusing. Both appear under the “Qualifiers” menu.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/300/1*m-hzWZBVD1mcRi0AIaanlw.png" /></figure><p>A Version is exactly what it sounds like: a historical record of a code artifact. 
By default, whenever you upload new code to a function, it’s applied to a special Version called “$LATEST”. A new Version can be created by selecting the $LATEST version from Qualifiers, then selecting “Publish new version” from the “Actions” menu.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/525/1*3tC6l_Dg6YeTeoBSDOZmAA.png" /></figure><p>The Alias abstraction allows you to create a name that refers to one <strong>or two</strong> Versions. There is a default Alias called “<em>Unqualified</em>” that refers to the $LATEST Version. When you select two Versions for an Alias to use, you can also specify how much traffic the second one will receive from its event source (e.g., Amazon SNS, Amazon Kinesis or API Gateway).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/958/1*OUd9dcvmVu224BFvJopBqw.png" /></figure><p>An important aspect to understand about Versions and Aliases is that any of them can be configured to receive events, and they all operate independently from each other. For our use case, we create a single Alias for our production canary and configure only it to receive events. Make sure that none of your Versions, nor the <em>Unqualified</em> Alias, are configured to receive events.</p><p>Once you understand how Versions and Aliases work, setting up a canary in the console is pretty self-evident:</p><ol><li>Select your production canary Alias.</li><li>Pick the stable Version in the first box.</li><li>Pick the canary Version in the second box.</li><li>Select the traffic distribution.</li><li>Click Save.</li><li>Check it out in the logs! You’ll see the Version in the log stream name, in the square brackets.</li></ol><h3>Automate this Business!</h3><p>On our team, the AWS automation tool of choice is AWS CloudFormation. 
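</p><p>The six console steps above amount to a single UpdateAlias call against the Alias. As a rough sketch, the AWS CLI equivalent (the function name, alias name and version numbers here are hypothetical) looks something like this:</p>

```shell
# Send 95% of traffic to stable Version 1 and 5% to canary Version 2.
# Function and alias names are illustrative.
aws lambda update-alias \
  --function-name my-subscription-function \
  --name production-canary \
  --function-version 1 \
  --routing-config 'AdditionalVersionWeights={"2"=0.05}'
```

<p>CloudFormation lets us capture the same change declaratively. 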
This allows you to declare resources and their configurations and lets AWS figure out how to make that happen.</p><p>To make our usage of CloudFormation repeatable, we use <a href="https://www.python.org/">Python</a> and <a href="https://github.com/cloudtools/troposphere">Troposphere</a> with our deployment framework and reusable recipes. We have a re-usable Lambda Function for connecting an SNS topic to an SQS FIFO queue and a recipe for deploying, configuring and canarying it.</p><p>CloudFormation has three types of resources for deploying Lambda Functions in a canary style:</p><ul><li><a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html">Function</a> — This is where the “code” for your function is maintained. What “code” means in this case depends on the runtime type for your function. In the case of Java, the “code” is a reference to a .jar file in S3 that contains all of the assembled .class files and resources to execute in the JVM. When this resource is created for the first time, or updated with new code any subsequent time, the “$LATEST” Version is automatically updated, just as in the console.</li><li><a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-version.html">Version</a> — A Version can be created or removed, but not updated. Whenever it’s created, it becomes a “copy” of whatever “$LATEST” is currently. CloudFormation is smart enough to know that an update to the Function comes first. The Version also allows you to specify a base64-encoded SHA-256 hash of the expected function code, so you can’t accidentally create a Version of something you didn’t expect (e.g., creating a new Version without updating the Function, or uploading the wrong artifact to the Function). 
CloudFormation will automatically create incrementing Version numbers for you.</li><li><a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-alias.html">Alias</a> — The Alias is an updatable Resource with a name, a Version and a <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-lambda-alias-aliasroutingconfiguration.html">RoutingConfiguration,</a> which may contain a <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-lambda-alias-versionweight.html">VersionWeight</a> that describes how much traffic is routed to your canary.</li></ul><p>Here’s a diagram that describes how all of these Resources interact with one another:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y16J0Grq8G_rylbHMsoS8A.png" /></figure><h3>Putting This All Together, In Practice</h3><p>So, now that you understand what CloudFormation Resources to use and how they will behave, what does the process look like for working with Lambdas?</p><h3>Deploy the Initial Version</h3><p>So, you have your code ready to go.</p><ol><li>Upload your code artifact to S3 so that it can be referenced by your Function in your CloudFormation template. Make sure you capture the SHA256!</li><li>Create your CloudFormation template with a Function, Version and Alias. Make sure to name your Alias something obvious, like “production-canary.” The Alias is only going to reference the single Version that you’ve created.</li><li>Create your CloudFormation stack with that template, and wait for all of the resources to be created.</li><li>Use your method of choice to subscribe your Lambda Function’s “canary” Alias to your event source. We typically use the console for this step to “flip the switch” and connect our code to the ecosystem.</li></ol><p>You’re done! Check out the logs, and notice all the things you messed up!</p><h3>Deploy an Update</h3><p>Ok, so it wasn’t perfect. But now it is! . . . 
Right? Probably not. You’ll want to limit the impact to customers by deploying this as a canary.</p><ol><li>Upload your new code artifact to a new location in S3.</li><li>Update your CloudFormation template. Change the reference to the code in your Function. This triggers CloudFormation to update the $LATEST Version. Add a new Version, leaving the existing Version alone. Reference the new Version in the existing Alias’s routing configuration, with an appropriately small weight set on the canary Version.</li><li>Update your existing stack with the new template. You should <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html">create a change set</a> and review it before executing the change.</li><li>Observe the log streams for your Lambda. You’ll see that some of them are for your canary.</li></ol><h3>All Done?</h3><p>Oh no! It’s still not right! Don’t panic!</p><ol><li>Remove the canary Version from your template along with its canary weight.</li><li>Re-deploy.</li><li>After the stack has completed updating (it should be fast), everything should be as it was before you deployed the canary.</li></ol><p>Alright! 
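</p><p>For reference, the canary state assembled in the “Deploy an Update” steps corresponds to template resources roughly like the following. This is a sketch only: the logical names, S3 location and hash placeholder are hypothetical, and the stable Version resource is assumed to already exist from the previous deploy.</p>

```yaml
# Sketch of the canary state (names and version numbers illustrative).
SubscriptionFunction:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: java8
    Handler: com.example.Handler::handleRequest
    Role: !GetAtt FunctionRole.Arn             # role defined elsewhere
    Code:
      S3Bucket: my-artifact-bucket
      S3Key: releases/subscription-1.1.0.jar   # the new artifact

StableVersion:    # created by the previous deploy; left alone
  Type: AWS::Lambda::Version
  Properties:
    FunctionName: !Ref SubscriptionFunction

CanaryVersion:    # new Version, copied from $LATEST on creation
  Type: AWS::Lambda::Version
  Properties:
    FunctionName: !Ref SubscriptionFunction
    CodeSha256: <base64-encoded SHA-256 of the new artifact>

ProductionCanary:
  Type: AWS::Lambda::Alias
  Properties:
    FunctionName: !Ref SubscriptionFunction
    Name: production-canary
    FunctionVersion: !GetAtt StableVersion.Version   # stable traffic
    RoutingConfig:
      AdditionalVersionWeights:
        - FunctionVersion: !GetAtt CanaryVersion.Version
          FunctionWeight: 0.05                       # 5% to the canary
```

<p>Promoting the canary then amounts to pointing FunctionVersion at the new Version and dropping RoutingConfig entirely.</p><p>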
Everything is fixed up!</p><ol><li>Change the Version reference in the Alias to the new Version in your template.</li><li>The previous Version should no longer be referenced anywhere in your template, so you can remove it.</li><li>Remove the RoutingConfiguration from your Alias, so that all traffic now goes to the new Version.</li><li>Update your stack.</li></ol><p>Once you understand how the resources interact with one another, deploying a Lambda Function in a canary style with CloudFormation is a fairly simple matter.</p><p>Here is <a href="https://gist.github.com/kender/c9e34adde0cd36d27219200b105534a0">an example template</a> that illustrates a complete canary stack.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d0ee829d8351" width="1" height="1" alt=""><hr><p><a href="https://medium.com/nikeengineering/safely-deploy-aws-lambdas-with-cloudformation-d0ee829d8351">Safely Deploy AWS Lambdas with CloudFormation</a> was originally published in <a href="https://medium.com/nikeengineering">Nike Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>