riselab - Medium

So you want to build an open source tool/library as a grad student

Devin Petersohn — Thu, 12 Aug 2021 15:25:21 GMT

This is a collection of experiences and recommendations for building an open source community as a grad student

Many grad students and professors have asked me for suggestions on how to build a functioning and thriving open source community while in grad school. This blog post appears as a chapter in my thesis, but ultimately I decided to extract those contents and put them here for easier retrieval and consumption.

This blog post is part history, part lessons, part advice. I don’t know everything, and the moderate success of my work does not mean that my advice is automatically good. I think there is value in hearing other people’s opinions, but not taking them as truth. I recommend that method of consumption here. Everyone has opinions, every situation is different, judge for yourself.

My History Building a Successful Open Source Project

During my grad school career, I built Modin (https://github.com/modin-project/modin), a full dataframe implementation that, as of writing, has over 6,000 GitHub stars. This effort has been supported by many people over the last few years, and definitely would not be as far as it is without that support. Berkeley is well known for creating some of the most used and most impactful software on the planet, so I did have an unfair advantage in terms of brand.

My approach toward promoting the work has been fairly successful; however, I attribute that largely to luck. The first blog post I published (2018) got a lot of input and feedback from others working on the project at the time. It ended up getting shared on Twitter and HackerNews by many people (I had accounts with neither at the time) and generated a lot of interest. At the time, pandas on Ray (which would become Modin) was a one-month hack I put together with help from several undergraduate students at Berkeley. It honestly wasn’t ready for the overwhelming interest it received, and yet it has continued to be developed and has grown into something that I couldn’t have imagined at the time.

My role has transitioned away from being the main generator of code to more of a project management role: coordinating many disparate institutions and their contributions, and making it easy to contribute. I spend a lot of time reviewing code and telling other people what to do rather than writing all of the code myself.

Lessons and Advice

This section is likely to be long and difficult to parse, so I’m going to make my advice section headers so it’s easier to skim to find the points you’d like to better understand. The points are not in any specific order.

[1] Make your system understandable to your target user, and don’t worry about anyone else

This point is something I think we’ve gotten right from the beginning with Modin. From the start, we have abstracted away complex details from the user, including in how we present the system. This has, of course, led many highly technical people to discount the complexity of abstracting away these details, but that has never bothered me. I don’t care if someone thinks Modin is or isn’t technically interesting; I care that it solves a problem. Because of how I talk about Modin, most people have a simple understanding of the system. That is by design. While working on Modin, we have formalized a new data model, created a new dataframe algebra, and created a truly unique data layout and metadata management system. The people who could understand enough about the underlying system to appreciate it likely wouldn’t use Modin in the first place, because Modin is targeting a less-technical group of users. I think this is really important because when people talk about Modin, they generally focus on the problems it solves rather than the technically interesting parts of the implementation. I am okay with that, but make sure that you are. Do you want people to use your work or do you want them to think you’re really smart? Sometimes you can have both, but often not.

[2] Be prepared to defer your graduation and publications

This point is less applicable if you have a large team managing the open source, but in my case I was working mostly alone on the open source side. On the research side, we were able to bring together some of the best in databases and machine learning, but many deadlines were missed because of things that came up in the open source. I prioritized the open source community and development over my own graduation and publications. This is a decision you’ll have to make for your own situation. I’m not completely convinced now that you need to do this, but at the time I felt like it was necessary to keep the open source community alive and growing. There’s little overlap between open source community development and grad school requirements. You are going to have to respond to questions and issues, and promote your work.

[3] The fun parts of open source are front-loaded

At the beginning of any open source project you’re going to be able to move fast. There’s no technical debt, no new issues, and a lot of energy and excitement. As time goes on, your time will shift from developing new features and building things to answering issues and emails. If you’re fortunate enough to have a lot of external contributors like Modin does, you’ll end up spending a lot of time reviewing code. These days, I spend maybe 20% of my time writing code and debugging. If you want to get into open source, be prepared to spend a large chunk of time on user issues and support after your project hits a critical mass. If you are mostly doing it alone, the project momentum can easily screech to a halt and feel like it’s not moving anywhere for weeks. My recommendations here are to (1) avoid romanticizing the idea of managing a highly visible open source project and (2) learn to bin your time. At first, it’s easy to answer issues as they come in, but eventually you will not get anything done if you always answer issues immediately.

[4] Promotion is important

If you want people to use your project, you have to tell them about it. Part of the difficulty in deciding how to promote is around deciding when. Promotion takes away from development in a small team, and so there needs to be some good reason to promote. Early on, I had planned on curating a series of blog posts that could archive the journey. That was quickly thrown out after the reception from the first blog post. My main advice is to be careful not to overpromote: each update should be substantial. This is mostly personal taste, but I don’t like reading a lot of fluffy blogs that have hardly any new content. Honestly, most people aren’t going to read the blog anyway, they will skim the headers or scroll to the bottom to read the conclusion. In terms of promotion, getting multiple friendly people to tweet about your blog is probably the best way to promote. HackerNews is not what I would consider a good distribution channel, rather a good place for discussion about topics surrounding a blog’s title and content. Podcasts are another way to spread the word, and they are a common way that people hear about new projects.

[5] Make your work easy to install and use

The biggest hurdle to using something is getting started, and the lower the barrier the more people will try it. This might seem obvious but easy means different things to different people so I’ll try to be concrete here. Do you require users to pull your Docker container? Do you require users to build from source? Does installation change the user’s environment? Does installation take more than a couple of steps? If you answered yes to any of these questions you’re going to have a hard time getting people to do more than look at your README. Making something easy to use is important to getting a community off the ground, if people don’t use it they won’t tell their friends about it (probably).

The second component to easy use is examples: you need really good useful examples, not toys. Users want to see what they can do with your tool, and showing them how to do a trivial map over a list of integers is not going to give users a good idea of what they can do. Your examples should show off a variety of use cases and capabilities on actual workloads. Examples overlap a bit with the next point on documentation.

[6] Write documentation

Everyone says this, nobody does it.

[7] Primarily use communication channels that are Googleable

Slack is not internet searchable, and Gitter search results are terrible. When people have a problem they will often go to Google first to see if others have solved the problem. In Modin, we use GitHub issues and Discourse boards for discussions to make sure that people looking for solutions can find them. This also has the nice side effect of being able to point to those pages when someone asks the same question.

[8] Give talks at small venues and meetups, not just the big ones

Promoting work via talks at big venues does give you more visibility, but ultimately the smaller, more intimate venues are where you’re more likely to make good connections with people who will actually use your project. It’s tempting at the beginning to try to go straight for giving talks at the big international conferences, but I’ve found that small meetups are both more focused (people are more likely to have the problem you’re solving) and more willing to talk. I gave talks to meetup groups as small as 6 people, and in those meetups I had more engaging conversations than in the larger venues. You need these relationships with your users early on to build momentum. Otherwise, once you do talk at these large venues, people will ask “Who is using your project?”. In the large conferences people claim to be looking for the next big technology to adopt, but really they just want to use what everyone else is using. Having users will get you more users, but you need the early adopters, and often you will meet them in small venues or meetups.

[9] Make it easy to reach you

This point somewhat contradicts the point about communication channels being Googleable, but in general you need to be reachable by people who run into problems. Don’t make the communication overhead of reporting a bug a barrier to discovering the bug exists. In Modin, for example, we set up a couple of email addresses, and if something went wrong internally the error message asked users to email us a bug report. It has been pretty successful, and there are several bugs we found that weren’t discoverable otherwise. We rarely get these today, which is a good indication that things are getting more stable.

Generally, to solve the issue of search indexability, I will ask people who email to open an issue if I triage it to be serious and new. I’ve found people are easier to go back/forth with over email, so you can get the simple stuff out of the way quickly as well (e.g. user environment issues or user errors).
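To make the bug-report-by-email idea concrete, here is a minimal sketch of how an internal failure could be wrapped so the error message itself asks for a report. This is not Modin's actual code: the `report_internal_errors` decorator, the `InternalError` class, the `repartition` helper, and the email address are all hypothetical names chosen for illustration.

```python
import functools

# Hypothetical address, used only for illustration.
BUG_REPORT_EMAIL = "bugs@example.org"


class InternalError(Exception):
    """Raised when the library itself (not the user) is probably at fault."""


def report_internal_errors(func):
    """Wrap an internal function so unexpected failures ask for an emailed report."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except InternalError:
            raise  # already wrapped by a nested call
        except Exception as exc:
            message = (
                f"Internal error in {func.__name__}: {exc!r}. "
                f"This is likely a bug on our side. Please email {BUG_REPORT_EMAIL} "
                "with this message and a short script that reproduces it."
            )
            # Chain the original exception so the traceback is preserved.
            raise InternalError(message) from exc

    return wrapper


@report_internal_errors
def repartition(rows, n_partitions):
    # Deliberately fragile: raises ZeroDivisionError when n_partitions == 0.
    size = len(rows) // n_partitions
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```

Calling `repartition([1, 2, 3], 0)` raises `InternalError` with the contact address in the message, while the original `ZeroDivisionError` stays attached as the cause for whoever triages the report.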

[10] Scale your efforts later, build a community first

A lot of what I propose here doesn’t scale, and it will get worse as the number of users grows. This is by design. Even after you have a critical mass, it’s not likely you’ll have a large group of contributors outside of your organization (it took roughly 2 years for Modin to get serious outside contributor groups). Building a community is largely a social effort, and broadcasting on Twitter is not going to be enough to get the ball rolling. Everyone is doing that. If you want to actually build a community, you need to do things like talk to individuals and answer individual emails. The personal connections are more important than trying to make noise in a very noisy world, and they will get you farther than clicks or views on your tweets and blog posts.

[11] Make it easy to contribute and ask for contributions

Making your project easy to contribute to is a good way to help build a community. There are always people who are interested in working on projects on the side, and getting these people involved is important. Often, projects are too difficult to jump into years later. It is difficult to build a community when you’re the only person who knows how to do anything. You’re going to need people who can help with issues so you can take some time off every now and then to recharge. This is obvious, but it takes good design and a lot of DevOps work, which doesn’t necessarily equate to more code being output. In fact, helping others will often reduce your own productivity and the overall code velocity of the project. This code velocity cost (on an individual basis) may actually never be recouped, but I argue that there are intangible benefits to working with other people:

You need to justify your designs
Producing code is not a good metric of productivity — there are more important things than new features
Excited contributors often also become evangelists

Adding contributors will not always yield more code or more bugfixes, but it helps build a community.

Concluding thoughts

I hope this has been helpful. It’s a lot of work to keep a community going, and the work is mostly social. There’s a lot of engineering in building something, but actually getting the word out and keeping in contact with users takes significant effort.

This list is by no means complete, but I hope it’s enough to help you get off the ground. Grad school is a great time to explore a bunch of different things, including open source community building and project management. I hope you can be as successful as I was (or more!) and that this list can help you plan how to execute on your great ideas. Please don’t hesitate to reach out to me directly if you have any questions (https://www.linkedin.com/in/devinpetersohn/). I don’t consider myself an expert on open source community building (having only done it once!), but I will do my best to answer any questions you might have. Good luck!

So you want to build an open source tool/library as a grad student was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

Feature Stores: The Data Side of ML Pipelines

Sarah Wooders — Tue, 06 Apr 2021 18:56:04 GMT

We need a principled way of managing state in real-time ML pipelines.

Written by Sarah Wooders, Peter Schafhalter, and Joey Gonzalez

The RISE of Feature Stores

As more models are deployed in real-world pipelines, the recurring lesson is that data and data featurization matter above all else. The last generation of big data systems scaled ML to real-world datasets, and now feature stores are quickly emerging as a new frontier for connecting models to real-time data [1].

Keeping features up-to-date is critical for model accuracy, but expensive and hard to scale.

Feature stores, as the name implies, store features derived from raw data and serve them to downstream models for training and inference. For example, a feature store might store the last few pages a user browsed (i.e., a sliding window over the clickstream) as well as the current predicted user demographics (i.e., a model prediction) both of which would be high-value features for an ad targeting model.
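As a rough illustration of the "last few pages a user browsed" feature, the sliding window over a clickstream can be sketched with a bounded deque per user. The class name `RecentPagesFeature` and its methods are hypothetical and not taken from any particular feature store.

```python
from collections import defaultdict, deque


class RecentPagesFeature:
    """Keeps the last `window` pages each user visited (a sliding window)."""

    def __init__(self, window=3):
        self.window = window
        # One bounded deque per user; old pages fall off automatically.
        self._pages = defaultdict(lambda: deque(maxlen=window))

    def update(self, user_id, page):
        """Called for every clickstream event."""
        self._pages[user_id].append(page)

    def get(self, user_id):
        """Called at inference time; does not create entries for unknown users."""
        pages = self._pages.get(user_id)
        return list(pages) if pages is not None else []


feature = RecentPagesFeature(window=3)
for page in ["/home", "/shoes", "/cart", "/checkout"]:
    feature.update("user-1", page)
```

After the four events above, `feature.get("user-1")` holds only the three most recent pages, which is the value a downstream model would read.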

Unfortunately, many feature stores being built today are Frankensteinian amalgamations of batch, streaming, caching, and storage systems.

In this post, we (1) define what feature stores are and how they are used today, (2) highlight some of the design limitations of the current generation of feature stores, and (3) describe how innovation in feature store designs could transform production machine learning by managing state across training and inference pipelines in a more principled way.

Background

Why feature stores?

A simple ML pipeline trains a model from a static dataset, then serves the model to respond to user inference requests.

Predictions are generated from model parameters and request data.

However, in order to adapt to a continuously changing world, modern ML pipelines need to make decisions which depend on real-time data [2]. For example, a model predicting ETA might use features like the recent order fulfillment times of a restaurant, or a content recommendation model could consider a user’s most recent clicks. Model training and inference therefore rely on real-time features derived from joining, transforming, and aggregating over incoming streams of data. Because the featurization step can be expensive, features need to be pre-computed and cached to avoid redundant computation and to meet tight prediction latency requirements.

Predictions also rely on features derived from live streams of data.

What are feature stores?

Feature stores are used to store and serve features across multiple branches of the pipeline, allowing for shared computation and optimizations. While different feature stores vary in their functionality, they typically manage the following:

Serving features to meet varying query latency requirements — Features are usually placed in both a fast “online store” (to query during inference) and durable “offline store” (to query during training).
Making features composable and extensible — Once a feature is defined, it should be easy to connect it to downstream models, derive additional features from it, or redefine the feature’s schema or featurization function.
Maintaining features derived from real-time data — Maintaining features is resource intensive, but stale features can negatively affect prediction performance.

Certain features (e.g. a 1 minute time window aggregate) are very sensitive to staleness and need to be ultra-fresh, while others (e.g. 30 day windows) may only need periodic batch updates. As the system that interfaces with both updates to and requests for features, feature stores are well positioned to optimize the tradeoffs between freshness, latency, and cost.
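The split between an online and an offline store, together with a per-feature freshness requirement, can be sketched in a few lines. This is a toy model under simplifying assumptions (a single process, integer timestamps, in-memory data); `TinyFeatureStore` and its method names are hypothetical.

```python
class TinyFeatureStore:
    """Toy store: latest values online, full history offline."""

    def __init__(self):
        self.online = {}    # (entity, feature) -> (value, timestamp), latest only
        self.offline = []   # append-only history of (entity, feature, value, timestamp)

    def write(self, entity, feature, value, ts):
        self.online[(entity, feature)] = (value, ts)
        self.offline.append((entity, feature, value, ts))

    def read_online(self, entity, feature, now, max_staleness):
        """Inference path. Returns (value, is_fresh); value is None if never written."""
        entry = self.online.get((entity, feature))
        if entry is None:
            return None, False
        value, ts = entry
        return value, (now - ts) <= max_staleness

    def read_offline(self, entity, feature, as_of):
        """Training path: latest value written at or before `as_of`."""
        candidates = [
            (ts, value)
            for e, f, value, ts in self.offline
            if e == entity and f == feature and ts <= as_of
        ]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None
```

A 1-minute aggregate would be read with a small `max_staleness` (for example 60 seconds), while a 30-day window could tolerate a value that is hours old; the same stored value can therefore be fresh for one feature definition and stale for another.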

Feature Stores Today: Challenges & Limitations

Many companies today have implemented feature stores internally to make features accessible to models deployed in production.

Feature stores today are built atop existing streaming, batch, caching, and storage systems. While each of these systems solves challenging problems in isolation, their constraints are problematic for feature stores.

Batch processing systems like Spark enable complex queries over static datasets, but introduce excessive latency when serving features and trigger total recomputations when backfilling data.
Streaming systems such as Flink and Spark Streaming enable low-latency pipelines, but fall short when asked to maintain large amounts of state. Lambda architectures combine both batch and streaming systems, but result in costly duplicate computation and complex maintenance of both streaming and batch codebases.
Streaming databases with materialized views can offer advantages of both rapid computation and storage, but these are difficult to adapt to arbitrary featurization operations. Their query latencies may also be too high for prediction serving.
In-memory key-value stores like Redis provide a fast way to access features, but these are typically difficult to update in a consistent manner and expensive to scale.

Many of the requirements for feature stores can be met with a combination of these systems. However, the resulting pipeline is rigid and hard to optimize end-to-end. For example, prioritizing featurization tasks based on their impact on overall prediction accuracy would require coordination between the data store receiving queries, the streaming system pushing live updates, and the batch processing system for processing historical data. Rather than awkwardly combining multiple compute engines with multiple databases to meet multiple latency targets, feature stores should take advantage of their access to incoming events and query patterns to optimize latency, compute cost, and prediction accuracy in a centralized way.

The Future of Feature Stores

We believe feature stores can offer centralized state management for ML pipelines, and have exciting potential for:

Lineage Management: Feature stores open the door to a new, data-centric abstraction for developing and tuning machine learning pipelines. The complexity of existing machine learning pipelines often makes it difficult to ensure basic reproducibility, apply pipeline changes, or perform optimizations across the pipeline. While meticulous versioning and synchronization can solve these problems to a certain extent, applying these techniques to constantly evolving datasets and pipelines is simply hard to think about. A data-centric view on pipelines (for example, treating data pipelines as materialized views) has the potential to introduce new abstractions that simplify the process of propagating data and operator changes.
End-to-End Optimization: Feature stores are well positioned to enable new end-to-end optimizations across ML data pipelines. Current systems restrict computation to running in either an event-based or request-based manner, making it difficult to schedule tasks in a way that optimizes common metrics like prediction performance and cost. Practitioners should be able to configure their pipelines to optimize for cost saving (lazy computation/updates, approximate results), inference latency (eager computation), or overall prediction performance (update features with most impact).
Scalable State Management: Feature stores indicate the need to scalably maintain and persist state within ML pipelines. Real-time, production ML pipelines often need to maintain tens of millions of features derived from multiple, dense incoming streams of data. Feature sets may be too large to persist in memory or update with every incoming stream event as a stream processing system would by default, but also need to be updated more frequently than a batch processing system allows.
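The trade-off between eager and lazy computation mentioned under End-to-End Optimization can be illustrated with a small sketch. It is an assumption-laden toy, not a description of any existing system: `WindowSumFeature` is a hypothetical class, and the `recomputations` counter stands in for compute cost.

```python
class WindowSumFeature:
    """Sum of the last `window` events, maintained eagerly or lazily."""

    def __init__(self, window, eager):
        self.window = window
        self.eager = eager
        self.events = []
        self.cached = 0
        self.dirty = False
        self.recomputations = 0   # proxy for compute cost

    def _recompute(self):
        self.cached = sum(self.events[-self.window:])
        self.dirty = False
        self.recomputations += 1

    def on_event(self, value):
        self.events.append(value)
        if self.eager:
            self._recompute()     # pay on every update; reads stay cheap
        else:
            self.dirty = True     # defer the work until a read arrives

    def read(self):
        if self.dirty:
            self._recompute()     # pay on read, which adds to inference latency
        return self.cached


eager = WindowSumFeature(window=3, eager=True)
lazy = WindowSumFeature(window=3, eager=False)
for value in [1, 2, 3, 4, 5]:
    eager.on_event(value)
    lazy.on_event(value)
```

Both variants return the same value on `read()`, but with five events and one read the eager variant recomputes five times and the lazy one only once. Which policy is cheaper overall depends on the ratio of events to reads, which is information a feature store sees on both sides.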

Conclusion

We’re actively studying the design of feature store systems, so let us know if you’re interested in staying up-to-date or collaborating!

If you’d like to get involved with our research, feel free to reach out to wooders@berkeley.edu.

Notes

[1] By “real-time data”, we are referring to data that needs to be processed in real time, both in the context of online prediction serving and maintaining data freshness for features.

[2] Updates for “real-time” data typically need to be on the order of seconds, but can vary between workloads.

Acknowledgments

Thank you to Manmeet Gujral, Gaetan Castelein, and Kevin Stumpf from Tecton, as well as Joe Hellerstein, Natacha Crooks, Simon Mo, Richard Liaw, and other members of the RISELab for providing feedback on this post.

Feature Stores: The Data Side of ML Pipelines was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

AI and Memory Wall

Amir Gholami — Mon, 29 Mar 2021 15:30:16 GMT

Update: An extended version of this blogpost is published in IEEE Micro Journal and is available online here.

(This blogpost has been written in collaboration with Zhewei Yao, Sehoon Kim, Michael W. Mahoney, and Kurt Keutzer. The data used for this study is available online.)

Figure 1: The amount of compute, measured in Peta FLOPs, needed to train SOTA models, for different CV, NLP, and Speech models, along with the different scaling of Transformer models (750x/2yrs)*¹

The amount of compute needed to train SOTA Transformer models has been growing at a rate of 750x/2yrs. This exponential trend has been the main driver for AI accelerators that focus on increasing the peak compute power of hardware, often at the expense of simplifying and/or removing other parts such as memory hierarchy.

However, these trends miss an emerging challenge with training and serving these models: memory and communication bottlenecks. In fact, several AI applications are becoming bottlenecked by intra/inter-chip communication and by communication across/to AI accelerators, rather than by compute. In particular, the sizes of flagship LLMs have been increasing at a rate of 410x every 2 years (see Figure 2). Similarly, large recommendation system models have reached O(10) TB of parameters. Contrast this with accelerator DRAM memory, which has only scaled at a rate of 2x every 2 years.

Figure 2: The evolution of the number of parameters of SOTA models over the years, along with the AI accelerator memory capacity (green dots). The number of parameters in large Transformer models has been exponentially increasing with a factor of 410x every two years*², while the single GPU memory has only been scaled at a rate of 2x every 2 years.*³
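To get a feel for how quickly these two rates diverge, the gap can be compounded directly. The sketch below uses only the two factors quoted above (410x and 2x per 2 years) and assumes, as a simplification, that both trends are smooth exponentials; the helper name `growth_over` is ours.

```python
def growth_over(years, factor_per_2yrs):
    """Total growth after `years`, given a multiplicative factor per 2 years."""
    return factor_per_2yrs ** (years / 2)


MODEL_SIZE_FACTOR = 410   # model parameters, per 2 years (figure quoted in the post)
DRAM_FACTOR = 2           # accelerator DRAM capacity, per 2 years (figure quoted in the post)

for years in (2, 4, 6):
    gap = growth_over(years, MODEL_SIZE_FACTOR) / growth_over(years, DRAM_FACTOR)
    print(f"after {years} years the model/memory gap has widened by about {gap:,.0f}x")
```

Every 2 years the ratio between model size and single-accelerator memory grows by a factor of 410 / 2 = 205, so after 4 years it is 205² = 42,025.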

It is important to note that the memory requirements to train AI models are typically several times larger than the number of parameters. This is because training requires storing intermediate activations, and this typically adds 3–4x more memory than the number of parameters (excluding embeddings). This is illustrated in Figure 3, where the total training memory footprint is shown for training different flagship AI models throughout the years. We can clearly see how the design of SOTA Neural Network (NN) models has been implicitly influenced by the DRAM capacity of the accelerators in different years.

These challenges are commonly referred to as the memory wall problem, a term originally coined by William Wulf and Sally McKee in 1995 [25]. The memory wall problem involves both the limited capacity and the limited bandwidth of memory transfer. This entails different levels of memory data transfer: for example, data transfer between compute logic and on-chip memory, between compute logic and DRAM memory, or across different processors on different sockets. In all these cases, the capacity and the speed of data transfer have been significantly lagging behind hardware (HW) compute capabilities.

Figure 3: The amount of memory required to train different NN models. Here, the optimizer used for CV models is SGD+Momentum, and for NLP models is ADAM. There is an interesting trend in discovering/designing new models, based on the available GPU memory size. Every time the GPU memory capacity is increased, data scientists have designed newer models. As such, breaking this so-called GPU memory wall could further allow new innovations. See [2] for more details on checkpointing.

One might hope that we can use distributed-memory parallelism by scaling-out the training to multiple accelerators to avoid the single HW’s limited memory capacity and bandwidth. However, distributing the work over multiple processes also faces the memory wall problem: the communication bottleneck of moving data between NN accelerators, which is even slower and less efficient than on-chip data movement. Similar to the single system memory case, we have not been able to overcome the technological challenges to scale the network bandwidth. This can be seen from Figure 4, where we show how the peak compute has increased by 60,000x over the past 20 years, as opposed to 100x for DRAM or 30x for interconnect bandwidth. Unfortunately, it has been very difficult to overcome the fundamental challenges of increasing DRAM/Interconnect bandwidth [1]. As such, scale-out only works for highly compute-bound problems with very little communication and data transfer.

Figure 4: The scaling of the bandwidth of different generations of interconnections & Memory, as well as the Peak FLOPS. As can be seen, the bandwidth is increasing very slowly.*⁴
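The 20-year totals quoted above (60,000x for peak compute, 100x for DRAM, 30x for interconnect bandwidth) can be converted into per-2-year rates to make them comparable with the other figures in this post. The sketch below is simple arithmetic under the assumption of constant exponential growth; `per_period_factor` is a name chosen for illustration.

```python
def per_period_factor(total_factor, total_years, period_years=2):
    """Constant growth factor per period that compounds to `total_factor`."""
    return total_factor ** (period_years / total_years)


# 20-year totals as quoted in the text.
TWENTY_YEAR_GROWTH = {
    "peak compute": 60_000,
    "DRAM": 100,
    "interconnect": 30,
}

rates = {
    name: per_period_factor(total, total_years=20)
    for name, total in TWENTY_YEAR_GROWTH.items()
}
for name, rate in rates.items():
    print(f"{name}: ~{rate:.1f}x every 2 years")
```

Under that assumption, peak compute comes out at roughly 3.0x every 2 years, against roughly 1.6x for DRAM and 1.4x for interconnect bandwidth, which is why the gap keeps widening.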

Promising Solutions for Breaking the Wall:

“No exponential can continue forever,” and sustaining an exponential scaling at the rate of 410x/2yrs is not going to be feasible for long, even for large hyperscaler companies. This, coupled with the increasing gap between compute and bandwidth capability, will soon make it very challenging to train larger models, as the cost will be exponentially higher.

To continue the innovations and break the memory wall, we need to rethink the design of AI models. There are several issues here. First, the current methods for designing AI models are mostly ad-hoc, and/or involve very simple scaling rules. For instance, recent large Transformer models are mostly just a scaled version of almost the same base architecture proposed in the original BERT model [22]. Second, we need to design more data-efficient methods for training AI models. Current NNs require a huge amount of training data and hundreds of thousands of iterations to learn, which is very inefficient. Some might note that it is also different from how human brains learn, which often requires only very few examples per concept/class. Third, the current optimization and training methods need a lot of hyperparameter tuning (such as learning rate, momentum, etc.), which often results in hundreds of trial-and-error sweeps to find the right setting to train a model successfully. As such, the training cost reported in Figure 1 is only a lower bound of the actual overhead, and the true cost is typically much higher. Fourth, the prohibitive size of the SOTA NN models makes their deployment for inference very challenging. This is not just restricted to models such as GPT-3. In fact, deploying large recommendation systems (which are similar to Transformers but which have much larger embeddings and very few MLP layers afterwards [23]) that are used by hyperscaler companies is a major challenge. Finally, the design of hardware accelerators has been mainly focused on increasing peak compute, with relatively less attention on improving memory-bound workloads. This has made it difficult both to train large models and to explore alternative models, such as Graph NNs, which are often bandwidth-bound and cannot efficiently utilize current accelerators.

All of these issues are fundamental problems in machine learning. Here, we briefly discuss recent research (including some of our own) that has targeted the last three items.

Efficient Training Algorithms

One of the main challenges with training NN models is the need for brute-force hyperparameter tuning. This includes finding the learning rate, its annealing schedule, the number of iterations needed to converge, etc. This adds (much) more overhead for training SOTA models. Many of these problems arise from the first-order SGD methods used for training. While SGD variants are easy to implement, they are not robust to hyperparameter tuning, and are very hard to tune for new models for which the right set of hyperparameters is unknown. One promising approach to address this is to use second-order stochastic optimization methods, such as our recently developed ADAHESSIAN method [4]. These methods are typically more robust to hyperparameter tuning, and they can achieve SOTA. However, current methods have a 3–4x higher memory footprint, which needs to be addressed. A promising line of work for that is the ZeRO paper from Microsoft, which showed how one can train 8x bigger models with the same memory capacity by removing/sharding redundant optimization state variables [21, 3]. If the overhead of these higher-order methods could be addressed, then they could significantly reduce the total cost of training large models.

Another promising approach includes reducing the memory footprint and increasing the data locality of optimization algorithms, at the expense of performing more computations. One simple example is to only store/checkpoint a subset of activations during the forward pass, instead of saving all activations, to reduce the feature map’s memory footprint shown in Figure 3. The rest of the activations could then be recomputed when needed. Even though this will increase compute, one can significantly reduce the memory footprint by up to 5x [2] with just 20% more compute.
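The checkpoint-and-recompute idea can be shown on a toy model. The sketch below is a pure-Python illustration, not the method of [2]: it assumes a chain of scalar layers `a_next = tanh(w * a)` with the final activation as the loss, keeps only every `every`-th activation during the forward pass, and recomputes the rest segment by segment during the backward pass. All function names are hypothetical.

```python
import math


def layer(w, a):
    return math.tanh(w * a)


def backward_segment(weights, acts, grad_out):
    """Backprop through a run of layers; acts[i] is the input of layer i."""
    grads = [0.0] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        local = 1.0 - acts[i + 1] ** 2       # derivative of tanh at layer i's output
        grads[i] = g * local * acts[i]       # gradient w.r.t. weight i
        g = g * local * weights[i]           # gradient w.r.t. layer i's input
    return grads, g


def grad_full(weights, x):
    """Baseline: store every activation."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grads, _ = backward_segment(weights, acts, 1.0)
    return grads, len(acts)                  # number of activations held at once


def grad_checkpointed(weights, x, every):
    """Store only the input of each segment of `every` layers, recompute the rest."""
    checkpoints = {}
    a = x
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = a
        a = layer(w, a)
    grads = [0.0] * len(weights)
    g, peak, extra_forward = 1.0, 0, 0
    for start in sorted(checkpoints, reverse=True):
        seg = weights[start:start + every]
        acts = [checkpoints[start]]
        for w in seg:                        # recompute this segment's activations
            acts.append(layer(w, acts[-1]))
            extra_forward += 1
        # All checkpoints plus the recomputed activations of one segment.
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        seg_grads, g = backward_segment(seg, acts, g)
        grads[start:start + every] = seg_grads
    return grads, peak, extra_forward
```

For 16 layers and `every=4`, the checkpointed version holds at most 8 activations at once instead of 17 and produces the same gradients, at the price of 16 extra layer evaluations. The exact memory and compute ratios in real networks depend on the architecture and checkpoint placement, so these toy numbers should not be read as reproducing the 5x and 20% figures above.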

Another important solution is to design optimization algorithms that are robust to low precision training. In fact, one of the major breakthroughs in AI accelerators has been the use of half-precision (FP16) arithmetic, instead of single precision [5,6]. This has enabled more than 10x increase in hardware compute capability. However, it has been challenging to further reduce the precision from half-precision to INT8 without accuracy degradation with current optimization methods.

Efficient Deployment

Deploying recent SOTA models such as GPT-3 or large recommendation systems is quite challenging, as they require distributed-memory deployment for inference. One promising solution to address this is to compress these models for inference, by reducing the precision (i.e., quantization) or removing (i.e., pruning) their redundant parameters.

The first approach is quantization, a method that can be applied at the training and/or inference steps. While it has been very challenging to reduce the training precision much below FP16, it is possible to use ultra-low precision for inference. With current methods, it is relatively easy to quantize inference down to INT4 precision, with minimal impact on accuracy. This results in up to 8x reduction in model footprint and latency [7,8,19,20]. However, inference with sub-INT4 precision is more challenging and is currently a very active area of research.
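As a rough illustration of what quantizing to INT4 means, the sketch below applies symmetric uniform quantization with a single per-tensor scale to a handful of made-up weights. The function names and values are assumptions for illustration, and the scheme is deliberately simpler than the methods in [7,8,19,20]. The 8x figure in the last line assumes an FP32 baseline.

```python
def quantize_int4(values):
    """Symmetric uniform quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # map the largest magnitude to 7
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the 4-bit integers back to approximate real values."""
    return [qi * scale for qi in q]

weights = [0.42, -1.30, 0.07, 0.88, -0.55]  # made-up example values
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage shrinks from 32 bits to 4 bits per value.
compression = 32 / 4

print(q, scale, max_error, compression)
```

With round-to-nearest, the per-value error is bounded by half the scale, which is why a well-chosen scale (or one scale per channel or group) matters so much at very low precision.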

Another possibility is to completely remove/prune redundant parameters in the model. With current methods, it is possible to prune up to 30% of neurons with structured sparsity, and up to 80% with unstructured sparsity, with minimal impact on accuracy [9,10]. Pushing beyond this limit, however, is very challenging, and it often results in fatal accuracy degradation. Resolving this is an open problem.
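The difference between the two kinds of sparsity can be sketched in a few lines. The example below uses simple magnitude-based criteria on a small made-up weight matrix, treating each row as one neuron; the helper names, the matrix, and the pruning criteria are illustrative assumptions, not the specific methods surveyed in [9,10].

```python
def prune_unstructured(weights, sparsity):
    """Zero the `sparsity` fraction of individual entries with smallest magnitude."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    # Note: ties at the threshold would prune slightly more than k entries.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def prune_structured(weights, fraction):
    """Remove the `fraction` of whole rows (neurons) with smallest squared L2 norm."""
    norms = [sum(w * w for w in row) for row in weights]
    k = int(len(weights) * fraction)
    drop = set(sorted(range(len(weights)), key=norms.__getitem__)[:k])
    return [row[:] for i, row in enumerate(weights) if i not in drop]

weights = [
    [0.05, -0.10, 0.15, -0.20],
    [0.90, -0.25, 0.30, -0.35],
    [0.40, -0.95, 0.45, -0.50],
    [0.55, -0.60, 1.00, -0.65],
    [0.70, -0.75, 0.80, -0.85],
]

unstructured = prune_unstructured(weights, 0.8)  # same shape, mostly zeros
structured = prune_structured(weights, 0.3)      # fewer rows, still dense

zeros = sum(1 for row in unstructured for w in row if w == 0.0)
print(zeros, len(structured))
```

Unstructured pruning keeps the matrix shape and needs sparse kernels to turn the zeros into speedups, while structured pruning yields a smaller dense matrix that runs faster on ordinary hardware.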

Rethinking the Design of AI Accelerators

There are fundamental challenges in increasing both the memory bandwidth and the peak compute capability of a chip at the same time [1]. However, it is possible to sacrifice peak compute to achieve better compute/bandwidth trade-offs. This is not an impossible task, and in fact, the CPU architecture already incorporates a well-optimized cache hierarchy. This is why CPUs have much better performance than GPUs for bandwidth-bound problems. Such problems include large recommendation problems. However, the main challenge with today’s CPUs is that their peak compute capability (i.e., FLOPS) is about an order of magnitude less than AI accelerators such as GPUs or TPUs. One reason for this is that AI accelerators have mainly been designed to achieve maximum peak compute. This often requires removing components such as cache hierarchy in favor of adding more compute logic. One could imagine an alternative architecture in between these two extremes, preferably with more efficient caching, and importantly with higher capacity DRAM (possibly a hierarchy of DRAMs with different bandwidths). The latter could be very helpful in mitigating the distributed-memory communication bottlenecks [18].

Conclusion

The computational cost of training recent SOTA Transformer models in NLP has been scaling at a rate of 750x/2yrs, and the model parameter size has been scaling at 400x/2yrs. In contrast, the peak hardware FLOPS is scaling at a rate of 3x/2yrs, while both the DRAM and interconnect bandwidth have been increasingly falling behind, with a scaling rate of 1.6x/2yrs and 1.4x/2yrs, respectively. To put these numbers into perspective, peak hardware FLOPS has increased by 60,000x over the past 20 years, while DRAM/interconnect bandwidth has only scaled by a factor of 100x/30x over the same time period. With these trends, memory (in particular, intra/inter-chip memory transfer) will soon become the main limiting factor in training large AI models. As such, we need to rethink the training, deployment, and design of AI models as well as how we design AI hardware to deal with this increasingly challenging memory wall.
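As a quick consistency check on these figures, compounding each two-year rate over ten periods (20 years) roughly reproduces the totals quoted above. This is only a sketch: the helper name is an assumption, and the published rates come from regression fits, so the match is approximate.

```python
def compound_growth(rate_per_period, years, period_years=2):
    """Total growth factor after `years`, given a growth rate per period."""
    return rate_per_period ** (years / period_years)

flops_20y = compound_growth(3.0, 20)         # peak hardware FLOPS
dram_20y = compound_growth(1.6, 20)          # DRAM bandwidth
interconnect_20y = compound_growth(1.4, 20)  # interconnect bandwidth

# How much faster compute has grown than DRAM bandwidth over the same period.
compute_vs_dram_gap = flops_20y / dram_20y

print(round(flops_20y), round(dram_20y), round(interconnect_20y), round(compute_vs_dram_gap))
```

The ratio in the last line is the widening gap between compute and memory bandwidth that the term "memory wall" refers to.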

We would like to thank Suresh Krishna and Aniruddha Nrusimha for their valuable feedback.

[Update]: This article was updated on Sep 2, 2023 with newer hardware and model data.

*¹ We are specifically not including the cost of training Reinforcement Learning models in this graph, as the training cost is mostly related to the simulation environment and there is currently no consensus on a standard simulation environment. Also note that we report the PFLOPs required to train each model to avoid using any approximation for hardware deployment utilization, as the latter depends on the specific library and the hardware used. Finally, all the rates in this document have been computed by solving a linear regression to fit the data shown in each graph.

*² The growth rate shown in Figure 2 is calculated by only considering the Transformer based models (blue circles), and not the recommendation systems.

*³ The GPU memory is plotted by dividing the corresponding memory size by 6 as an approximate upper bound for the largest model that can be trained with the corresponding capacity.

*⁴ We are normalizing hardware peak FLOPS with the R10000 system, as it was used to report the cost of training LeNet-5 in the seminal work of [24].

REFERENCES:

[1] Patterson DA. Latency lags bandwidth. Communications of the ACM. 2004 Oct 1;47(10):71–5.

[2] Jain P, Jain A, Nrusimha A, Gholami A, Abbeel P, Keutzer K, Stoica I, Gonzalez JE. Checkmate: Breaking the memory wall with optimal tensor rematerialization. arXiv preprint arXiv:1910.02653. 2019 Oct 7.

[3] Rajbhandari S, Rasley J, Ruwase O, He Y. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020 Nov 9 (pp. 1–16). IEEE.

[4] Yao Z, Gholami A, Shen S, Keutzer K, Mahoney MW. ADAHESSIAN: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719. 2020 Jun 1.

[5] Ginsburg B, Nikolaev S, Kiswani A, Wu H, Gholaminejad A, Kierat S, Houston M, Fit-Florea A, inventors; Nvidia Corp, assignee. Tensor processing using low precision format. United States patent application US 15/624,577. 2017 Dec 28.

[6] Micikevicius P, Narang S, Alben J, Diamos G, Elsen E, Garcia D, Ginsburg B, Houston M, Kuchaiev O, Venkatesh G, Wu H. Mixed precision training. arXiv preprint arXiv:1710.03740. 2017 Oct 10.

[7] Yao Z, Dong Z, Zheng Z, Gholami A, Yu J, Tan E, Wang L, Huang Q, Wang Y, Mahoney MW, Keutzer K. HAWQV3: Dyadic Neural Network Quantization. arXiv preprint arXiv:2011.10680. 2020 Nov 20.

[8] Gholami A, Kim S, Yao Z, Dong Z, Mahoney MW, Keutzer K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630. 2021.

[9] Gale T, Elsen E, Hooker S. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. 2019 Feb 25.

[10] Hoefler T, Alistarh D, Ben-Nun T, Dryden N, Peste A. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554. 2021 Jan 31.

[11] Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. 2016 Feb 24.

[12] Gholami A, Kwon K, Wu B, Tai Z, Yue X, Jin P, Zhao S, Keutzer K. Squeezenext: Hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2018 (pp. 1638–1647).

[13] Wu B, Iandola F, Jin PH, Keutzer K. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2017 (pp. 129–137).

[14] Shaw A, Hunter D, Iandola F, Sidhu S. SqueezeNAS: Fast neural architecture search for faster semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops 2019.

[15] Wu B, Wan A, Yue X, Keutzer K. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA) 2018 May 21 (pp. 1887–1893). IEEE.

[16] Iandola FN, Shaw AE, Krishna R, Keutzer KW. SqueezeBERT: What can computer vision teach NLP about efficient neural networks?. arXiv preprint arXiv:2006.11316. 2020 Jun 19.

[17] Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. 2017 Apr 17.

[18] Krishna S, Krishna R. Accelerating Recommender Systems via Hardware “scale-in”. arXiv preprint arXiv:2009.05230. 2020 Sep 11.

[19] Kim S, Gholami A, Yao Z, Mahoney MW, Keutzer K. I-BERT: Integer-only BERT Quantization. arXiv preprint arXiv:2101.01321. 2021 Jan.

[20] Patrick Judd, Senior Deep Learning Architect, Integer Quantization for DNN Acceleration, Nvidia, GTC 2020.

[21] Bottou L, Curtis FE, Nocedal J. Optimization methods for large-scale machine learning. Siam Review. 2018;60(2):223–311.

[22] Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.

[23] Naumov M, Mudigere D, Shi HJ, Huang J, Sundaraman N, Park J, Wang X, Gupta U, Wu CJ, Azzolini AG, Dzhulgakov D. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091. 2019 May 31.

[24] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998 Nov;86(11):2278–324.

[25] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” ACM SIGARCH computer architecture news, vol. 23, no. 1, pp. 20–24, 1995.

AI and Memory Wall was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why every data scientist using pandas needs Modin — Bringing SQL to Dataframes

Jorge Torres — Mon, 22 Mar 2021 16:35:49 GMT

Bringing SQL to Dataframes — Why every data scientist using pandas needs Modin

Written by Jorge Torres and Devin Petersohn

In a recent conversation, a data scientist friend from the RISELab in Berkeley, who primarily works in the pandas API using Modin, mentioned that she was trying to solve a problem for a client who required her to write a query in SQL that she would normally write in pandas. As she was looking on StackOverflow, she noticed that there is a whole world of questions around “What is the SQL equivalent of this pandas query?” and vice-versa.

She also noted that sharing code and notebooks with other data scientists in her company was difficult when they were not comfortable with the pandas API. If her colleague had a follow-up question, they would either need to go back and forth with questions and answers or schedule a meeting to figure it out in-person with one person at the keyboard. Her colleagues could not just run queries in their preferred language without rewriting the entire notebook.

Data scientists are increasingly required to do and learn more, but tools have largely lagged in supporting all of these new requirements. Moving between languages and computing environments is expensive and costs data scientists hours of productivity every week.

To improve data science productivity, MindsDB has teamed up with Modin to bring in-memory SQL to distributed Modin Dataframes. Now you can run SQL alongside the pandas API without copying or going through your disk. What this means is that you can now have a SQL solution that you can seamlessly scale horizontally and vertically, by leveraging the incredible power of Ray.

Presenting modin.sql

Here is a summary by example:

Imagine you have data about reviews for apps in the Google store. This information is in two tables, one for the store apps and another for the reviews, which you can join by the app column. Let’s start with the apps table:

https://medium.com/media/f4d2fb18cf4484e503d2e23a9079088e/href

Imagine that you want to quickly select from ‘gstore_apps_df’ the columns App, Category, and Rating, where Price is ‘0’.

In SQL, this looks like this:

"SELECT App,Category,Rating FROM gstore_apps WHERE Price = '0' "

However, for many of us, the solution to many of these problems often starts like this.

In the end, you get something like this:

https://medium.com/media/fb5b8d6cce6a846b5b9c5c56647ded90/href

That makes sense if pandas is your way of doing these tasks. But for those of you who know some SQL, we want to introduce an in-memory SQL engine that operates on Dataframes, so you have more options when it comes to using the incredible power of Modin’s distributed dataframes.

The function to access that engine is called “modin.experimental.sql.query”

https://medium.com/media/d9fb11c5a41c02378e9ff74de8c4cdd2/href

The in-memory SQL engine for dataframes allows you to run complex queries. In a single, very explicit SQL statement, you can perform operations such as joins and aggregations.

Let’s bring the other table (reviews) to illustrate the powers of SQL on Dataframes:

https://medium.com/media/11b8f02b5c7b9b228cffa004e9d31b03/href

Imagine that you want to know what are the best-reviewed app categories where there is little subjectivity: Get the top 10 app categories ranked by best average ‘sentiment_polarity’ where the average ‘sentiment_subjectivity’ is less than 0.5.

Since ‘Category’ is on the gstore_apps_df and sentiment_polarity is on gstore_reviews_df this requires that we join the two tables, and operate averages on that join.

You can solve this by doing it all in one single SQL query:

https://medium.com/media/bd1c24a9f4379737873983d6e1acfb4d/href
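For readers without access to the embedded snippet, here is a rough sketch of the kind of query described above. It deliberately uses Python’s built-in sqlite3 module on a few invented rows rather than the Modin SQL engine, so it only illustrates the shape of the SQL, not how Modin executes it. The table names, the join on the App column, and all of the data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gstore_apps (App TEXT, Category TEXT);
CREATE TABLE gstore_reviews (
    App TEXT,
    sentiment_polarity REAL,
    sentiment_subjectivity REAL
);
""")

# Invented toy rows, purely for illustration.
conn.executemany("INSERT INTO gstore_apps VALUES (?, ?)", [
    ("NoteTaker", "PRODUCTIVITY"),
    ("TaskList", "PRODUCTIVITY"),
    ("PuzzleFun", "GAME"),
    ("FitTrack", "HEALTH"),
])
conn.executemany("INSERT INTO gstore_reviews VALUES (?, ?, ?)", [
    ("NoteTaker", 0.8, 0.3),
    ("NoteTaker", 0.6, 0.4),
    ("TaskList", 0.4, 0.2),
    ("PuzzleFun", 0.9, 0.8),
    ("PuzzleFun", 0.7, 0.9),
    ("FitTrack", 0.5, 0.4),
    ("FitTrack", 0.3, 0.2),
])

# Join the two tables, average per category, keep the low-subjectivity
# categories, and rank them by average polarity.
TOP_CATEGORIES_SQL = """
SELECT a.Category,
       AVG(r.sentiment_polarity) AS avg_polarity,
       AVG(r.sentiment_subjectivity) AS avg_subjectivity
FROM gstore_apps AS a
JOIN gstore_reviews AS r ON a.App = r.App
GROUP BY a.Category
HAVING AVG(r.sentiment_subjectivity) < 0.5
ORDER BY avg_polarity DESC
LIMIT 10
"""

rows = conn.execute(TOP_CATEGORIES_SQL).fetchall()
for category, avg_polarity, avg_subjectivity in rows:
    print(category, round(avg_polarity, 2), round(avg_subjectivity, 2))
```

The HAVING clause is what applies the subjectivity filter to the per-category averages rather than to individual reviews.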

Or you can take advantage of Python and do it in parts (it’s up to you). Either way, we believe this gives you more power than doing it in, say, Redshift.

https://medium.com/media/ad070c58bbe1fbc2a6b8fcc71f49d6fa/href

The crazy thing here is that if you have a cluster or even a computer with more than one core, you can write SQL and Modin will run those queries in a distributed and optimized way. You can think of Modin + SQL as an Open-source alternative to Snowflake.

In our next article, we would like to present some benchmarks of running SQL on distributed Modin Dataframes vs. some SQL databases and data lakes, thanks to our friends at Intel, who have provided us with some fancy computers where we can run Modin on many cores with lots of memory.

In the meantime, you can check out our notebook with more examples and ideas: https://github.com/mindsdb/dfsql/blob/stable/testdrive.ipynb

Special thanks to the Modin and MindsDB team, Boris Tseitlin, the Rise Lab UC Berkeley, and Intel.

Why every data scientist using pandas needs Modin — Bringing SQL to Dataframes was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to ensure a data scientist is never productive

Devin Petersohn — Wed, 07 Oct 2020 15:13:34 GMT

Photo by Andrea Piacquadio (pexels.com)

We need to start placing a higher value on data scientists’ time than we do on machine time

While data science tools are being optimized to perform well on microbenchmarks, they are becoming more and more difficult to use. Is the benchmark performance worth the human time cost it takes to get there? (Spoiler: it would take up to 200 years to recoup the upfront cost of learning a new tool, even if the new tool performs 10x faster.)

Time to recoup the cost of learning a new tool, see below for detailed calculation

Modin (https://github.com/modin-project/modin) is designed and optimized for Data Scientist time, enabling performance without code changes.

Pushing complexity onto the data scientist

Let’s design a system. If we want to ensure data scientists are not productive, the first thing we probably want to do is force them to learn a lot of new and unnecessary concepts for tuning performance, like partitioning and resource management. To further reduce data scientist productivity, let’s also introduce a completely new API. This has the nice side-effect of system lock-in, making it harder to leave once adopted. In any case, trading human time for machine time is the most effective way to ensure that data scientists are not productive.

I want to do a thought experiment to see exactly how much the overheads of learning an entirely new ecosystem and new distributed computing expertise actually cost. Then we can model how much computation a new system would need to save to begin to make returns on the time cost. This way we can see how much productivity we actually cost the user.

Modeling the cost of learning a new tool (that does the same thing)

To model the user, we will first simulate “proficiency” with a linear relationship to time. To make things simple, let’s say it takes an average of 2 years to be as proficient in a new tool as they are with an existing tool. This 2 years includes gaining an understanding of the new requirements of the system, like distributed computing, partitioning, etc. Let’s also say that proficiency and productivity are 1:1 correlated, so proficiency is a proxy for productivity.

Because of the linear relationship we are using, the total productivity loss is 1/2 of the 2 years it took to become proficient, i.e. one full year of output. According to Glassdoor, the average yearly salary of a data scientist in the United States is $113,000 USD as of this writing. So by our back-of-the-envelope calculation, we have an estimated total productivity cost of $113,000 per data scientist. The productivity loss for a team of 5 exceeds $500,000 USD.

How long will it take to recoup the $113,000 investment on compute?

For simplicity, let’s use the per-hour cost of the AWS m4.4xlarge, which as of this writing is $0.80 per hour. The m4.4xlarge has 16 CPU cores and 64GB RAM. To recoup the $113,000 productivity cost of the one year lost, you would need to save, in aggregate, over 16 years’ worth of compute time on this instance ($113,000 at $0.80 per hour is about 141,000 hours). To get the number of CPU years, we just multiply 16 years x 16 CPU cores = 256 CPU years.

How much compute time does the average data scientist use in a given day? If we assume a single CPU is running 50% of work hours (which it isn’t), we get 4 hours per workday. Spread over a full week, including nights and weekends, that is roughly 12.5% of all hours. Extrapolating to the entire year, about 12.5% of the year is spent running compute with these numbers, so it takes 8 real years to accumulate one CPU year of productive compute. Remember this number, it will be important shortly.

If we need to save 256 CPU years and the new system is 10x faster (or handles 10x more data in the same time), it will take about 25 CPU years in the new tool to make up for that time compared to the old tool. But wait, it takes 8 real years to accumulate one CPU year. At a 10x improvement, it would take 200 years (25 CPU years x 8 real years per CPU year) to recoup the upfront cost of losing 1 year of productivity!
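The back-of-the-envelope calculation above can be written out as a short script, which makes it easy to plug in your own numbers. The function and parameter names are assumptions for illustration; the defaults are the figures used in this post, and the model follows the same simplification of dividing the CPU years to save by the speedup.

```python
HOURS_PER_YEAR = 24 * 365

def years_to_recoup(
    salary_usd=113_000,          # average yearly salary used in this post
    years_to_proficiency=2.0,    # time to become proficient in the new tool
    instance_usd_per_hour=0.80,  # m4.4xlarge price used in this post
    cores=16,                    # CPU cores on that instance
    utilization=0.125,           # share of real time a CPU is actually busy
    speedup=10.0,                # how much faster the new tool is
):
    """Return the intermediate and final numbers of the thought experiment."""
    # Linear ramp-up: on average half of the learning period is lost.
    productivity_cost = salary_usd * years_to_proficiency / 2
    instance_years = productivity_cost / instance_usd_per_hour / HOURS_PER_YEAR
    cpu_years_to_save = instance_years * cores
    cpu_years_in_new_tool = cpu_years_to_save / speedup
    real_years = cpu_years_in_new_tool / utilization
    return {
        "productivity_cost_usd": productivity_cost,
        "instance_years": instance_years,
        "cpu_years_to_save": cpu_years_to_save,
        "cpu_years_in_new_tool": cpu_years_in_new_tool,
        "real_years": real_years,
    }

result = years_to_recoup()
for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

Without the rounding used in the text, the result comes out slightly above 200 years, which does not change the conclusion. Even a 100x speedup (speedup=100.0) still leaves a payback period of roughly two decades under these assumptions.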

This simple calculation cannot possibly reflect all of the details of every data scientist’s reality, but the goal is not to perfectly model reality. Instead, its purpose is to demonstrate that the human time cost to come up to speed on a new ecosystem is so much higher than any compute cost saved that talking purely about benchmarking performance pales in comparison.

Improved performance does not equate 1:1 to improved productivity. The benchmarks presented in blogs and conferences always hide upfront costs.

Do you have to learn a new API to do something you can already do?
Do you have to change file formats to get performance?
Do you have to tune performance to avoid being punished by a new tool?
Do you need to provision resources or request workers for the new tool?
How much human time does all of this cost?

Modin: Putting the focus back on the Data Scientist

Modin (https://github.com/modin-project/modin) is a data science platform designed around empowering data scientists without adding complexity and new requirements. It exposes the pandas API, with many other APIs and modes of interaction in the pipeline.

# import pandas as pd
import modin.pandas as pd # a drop-in replacement!

Suddenly, our typical data science setup goes from this:

To a workflow without costly conversion between ecosystems:

Modin is disrupting the data science tooling space by prioritizing the data scientist’s time over hardware time. To this end, Modin has:

No upfront cost to learning a new API
Integration with the Python ecosystem
Integration with Ray/Dask clusters (Run on/with what you have!)
Scalability and performance with no changes to existing pandas code

Modin performance scales as the number of nodes increases (with no changes to existing pandas code). Maximum time to start up the cluster was 3 minutes in each case; data from NYC Taxi. No performance tuning was performed. A pandas baseline was not possible at this data scale.

Remember, the goal of data scientists is not to execute individual queries as fast as possible, it is to extract as much value as possible from their data. Tools should work for the data scientist, data scientists shouldn’t have to work for their tools.

How to ensure a data scientist is never productive was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

The State of the Serverless Art

Joe Hellerstein — Mon, 24 Aug 2020 19:54:55 GMT

Serverless computing is beginning to deliver on the vision of allowing developers to program the cloud.

Over the past 2 years, the Hydro team I lead at Berkeley’s RISELab has been running hard and fast in the area of serverless computing. We designed and built a stateful Functions-as-a-Service platform called Cloudburst. I like to say “Cloudburst puts the state into the state-of-the-art” in serverless computing.

In addition to the software prototype, papers on Cloudburst and serverless computing have been rolling out at quite a clip:

Cloudburst system architecture in VLDB 2020. (overview below)
Transactional causally-consistent caching in SIGMOD 2020. (overview below)
Atomic Fault Tolerance (AFT) in Eurosys 2020. (overview below)
Optimized Serverless ML Prediction Serving using CloudFlow over Cloudburst at arXiv. (overview below)
A critique of the (prior) state of the art serverless systems in CIDR 2019. (overview below)
All this on top of the two award-winning papers on the Anna serverless KVS at ICDE18 and VLDB19. (overview below)

In this post I go into the background of the serverless computing space, and how we got to where we are today. For better or worse, this is a long post. If you want to skip the background, you can jump straight to descriptions of the new results.

Programming for The Biggest Computer Ever Built

I got interested in serverless computing because of my ongoing obsession with techniques for programming the cloud. Why obsess about that, you might ask?

To put a fine point on it, the public cloud is the most powerful general-purpose computer ever assembled. Forget the race for the “biggest supercomputer”, secreted away in a government lab. The cloud is orders of magnitude bigger, and growing every day. Even better, it’s not locked up in a lab — anyone can rent out its power at any scale, on demand. Everyone in computer science should be excited about this, as it’s arguably the biggest game-changer in computing access since the rise of the PDP-11 and UNIX in the early 1970s.

Unfortunately, raw computing power does not translate directly to useful computation. The cloud is not just massive, it is also a data-rich distributed system, which raises notoriously difficult computer science problems, including parallel programming, mid-flight failures of participating nodes, and data inconsistencies across distributed machines. For general-purpose cloud programming, developers today are forced to (a) write sequential programs to run on each machine they want to use, (b) ensure that code works together in concert to achieve desired outcomes in the face of the core CS problems described above, and (c) figure out how to deploy and manage all that complexity. As a result, it is very difficult today for developers to harness the power of the cloud at scale. To continue the analogy, the cloud is like the PDP-11 without UNIX and C — we’re programming it with the distributed systems equivalent of assembly code (though honestly it’s far harder than that from a correctness perspective).

Background: Where Did a Decade Go?

One of the reasons we’re moving so fast in my Hydro team recently is that my students and I have been beavering away in this space for over a decade at Berkeley. Ten years ago this fall, at ACM PODS 2010, I issued a call to arms in a keynote talk:

It is now within the budget of individual developers to rent massive resources in the worlds’ largest computing centers. But … this computing potential will go untapped unless those developers can write programs that harness parallelism, while managing the heterogeneity and component failures endemic to very large clusters of distributed computers.

Given that imperative, I assembled the BOOM project team back in 2010 to explore and demonstrate new ways to write programs. We started designing programming languages like Dedalus and Bloom that use the data in the cloud to drive computation, rather than worrying about which computer is doing what and when. Our early message was not lost on the tech press, which covered the ideas and flagged the promise of our work quite a bit.

But the agenda of general-purpose cloud programming got surprisingly little uptake in the ensuing decade, either in practice or research.

In retrospect, the likely distraction was easier money. Amazon Web Services spent the better part of the ‘teens demonstrating that well-capitalized firms could disrupt the enterprise software market without third-party developers or radical new software. Forget cultivating an iPhone-style “app for that” developer community! It was easier to go after aging giants like Oracle and IBM, and offer traditional software to traditional use cases, exploiting the radical new platform solely to lower administrative overheads.

And so a decade went by, and we wrote a bunch of papers, built some prototypes, and graduated some new PhDs. We felt pretty excited about the work, and we got plenty of academic recognition. But as the old joke goes, “if you’re so smart, why ain’t you rich”? I have to admit that Jeff Bezos made more money on AWS in the last decade than I did at Berkeley doing research. So to be clear, I’m not arguing that the hundreds of billions of dollars of “boring” cloud revenue was a bad play for businesses.

Nonetheless, the deeper technical revolution in cloud programming still awaits. Now that the cloud market has real competition, and the on-premises software market is back on its heels, we’re entering a new era where enabling the new stuff is going to matter.

Commercial Serverless: FaaS

As part of that new era, the cloud vendors have finally made some moves to empower developers outside their walls. The moniker they’ve chosen? Serverless computing. It’s not my favorite term, but it will have to do for now.

In its first incarnation, the idea of serverless computing has been embodied with an API called Functions as a Service (FaaS). As expected, Amazon was first with their AWS Lambda offering, but Microsoft Azure Functions and Google Cloud Functions followed quickly. The idea is simple: a developer writes a function in their favorite traditional programming language. They then upload the function to the cloud, and are given APIs to invoke the function remotely at will. Whenever data arrives at the function input, computation spins up in the cloud, and the result is passed to the output. The developer spends zero time configuring servers. The cloud resources auto-scale up and down dynamically according to usage, and the developer pays as they go, according to that usage.

To be clear, FaaS is only a first step in cloud programming. It is targeted at launching single-threaded sequential code in traditional languages, i.e. the “assembly language of distributed programming” I mention above. Still, while programming may be rudimentary, at least I don’t need to be a cloud devops wizard as well! And I only pay for what I use. That is, without question, progress.

In late 2018, a bunch of us in the RISELab at Berkeley started looking at serverless computing. The systems folks in the lab began a writing-by-committee effort to describe the movement of this bandwagon in one of their “Berkeley View” assessment papers. Having already spent a decade thinking about the future of cloud programming, I had stronger opinions. As a counterpoint to the committee effort, my team laid out our frank assessment of the basic pros and cons of first-generation FaaS in a paper entitled Serverless Computing: One Step Forward, Two Steps Back. In a nutshell:

Forward: Autoscaling. Third-party software is automatically scaled up and down according to usage patterns, in a pay-as-you-go manner.
Back: Slow Data Access. Serverless functions see embarrassingly high-latency and costly access to stored data.
Back: No Distributed Computing. Functions are not allowed to communicate with one another except through high-latency storage, making most distributed computing techniques impossible.

Some folks, especially at the orange website, cast the article as a hit job from clueless academics. But the Morning Paper, which has followed our work since the beginning, got the spirit of it:

[this is] an appeal from the heart to not stop where we are today, but to continue to pursue infrastructure and programming models truly designed for cloud platforms

Also I like to think we’re not totally clueless (nor totally academic). While writing that paper we were already moving forward, getting past the challenges that the first-gen serverless offerings had dodged. In the papers and prototypes we’ve released since then, we are demonstrating what’s possible.

Stateful Serverless Infrastructure 1: Storage

In the early days of the RISELab, we wanted to demonstrate that the lessons of the BOOM project, notably avoiding coordination in the style of the CALM Theorem, could be realized in a high-performance system. So Chenggang Wu set out to build a key-value storage (KVS) database called Anna that embraced and extended those lessons.

The first goal of Anna—and the name of the original paper—was to perform well at any scale. What did we mean by that? Well, conventional wisdom said that systems have to be rearchitected every time they expand 10x beyond plan. Anna was designed to demonstrate that the lessons of coordination-freeness could result in a system that offered world-beating performance at the small scale on a single multicore box, and at massive scale on machines distributed across the globe.

The Anna story is richer than just the any-scale story. Anna is the subject of two earlier posts of mine (here and here) and two award-winning research papers (ICDE18 and VLDB19), and given the length of this post I’ll be brief here, focusing on technical headlines:

Anna is crazy fast. In simple workloads Anna is as fast as anything around at any scale. Under contention, Anna is orders of magnitude faster than the fastest KVSes out there, including Redis, Masstree, and Intel’s TBB hashtable. This is because Anna never coordinates (no locks, no atomics, no consensus protocols!), whereas those systems spend 90+% of their time coordinating under contention.
Anna offers flexible autoscaling. This is the hallmark of a good serverless infrastructure: scales up when you use it hard, scales down to save money and power when you don’t. Again, coordination-freeness is key: there’s no need to maintain distributed membership information, so the cost to add or remove nodes remains low at every scale.
Anna provides rich data consistency. Even under parallel and distributed execution, Anna can offer various consistency guarantees that allow programmers to reason about data across machines, including powerful classical notions such as causal consistency and repeatable read transactional isolation.
Anna provides unified caching/tiering. Many KVS systems today are designed for one level of storage: either disks, or RAM. In contrast, you can deploy Anna as a caching tier in memory, as a database on disk, or as a multitiered system with a smaller cache on top of a larger database. Anna moves data up and down the tiers, and provides uniform consistency guarantees across both.
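
To make the "never coordinates" idea a little more concrete, here is a minimal sketch of lattice-style merging, the general technique that lets replicas converge without locks or consensus. The class names and the example data are hypothetical and are not Anna's actual API; the only point is that a merge function which is associative, commutative, and idempotent gives the same result no matter in which order, or how often, updates arrive.

```python
from dataclasses import dataclass, field


@dataclass
class MaxIntLattice:
    """Lattice over integers whose merge is max()."""
    value: int = 0

    def merge(self, other: "MaxIntLattice") -> "MaxIntLattice":
        return MaxIntLattice(max(self.value, other.value))


@dataclass
class SetLattice:
    """Lattice over sets whose merge is set union."""
    items: frozenset = field(default_factory=frozenset)

    def merge(self, other: "SetLattice") -> "SetLattice":
        return SetLattice(self.items | other.items)


@dataclass
class MapLattice:
    """Map from keys to lattices; merging two maps merges values key by key."""
    entries: dict = field(default_factory=dict)

    def merge(self, other: "MapLattice") -> "MapLattice":
        merged = dict(self.entries)
        for key, value in other.entries.items():
            merged[key] = merged[key].merge(value) if key in merged else value
        return MapLattice(merged)


# Two replicas accept writes independently, with no locks or consensus.
replica_a = MapLattice({"likes": MaxIntLattice(3), "tags": SetLattice(frozenset({"x"}))})
replica_b = MapLattice({"likes": MaxIntLattice(5), "tags": SetLattice(frozenset({"y"}))})

# Merging in either order, or merging the same update twice, gives the same state.
ab = replica_a.merge(replica_b)
ba = replica_b.merge(replica_a)
print(ab == ba, ab.merge(replica_b) == ab)
```

Because the merged state does not depend on delivery order or duplication, replicas in this toy model can exchange updates lazily and still agree.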

There is no storage offering from any cloud vendor today that compares with what Chenggang has done with Anna. I believe Anna identifies and can fill a significant hole in the current cloud architectures.

Stateful Serverless Infrastructure 2: Stateful Compute

As Anna was maturing, we were ready to move up the stack and contemplate programming. As our first phase, we decided to try and build a FaaS system that tackles the “two steps backward” that plague the commercial FaaS services. This means two things:

Allow cloud functions to communicate with each other over the network. Commercial FaaS systems prevent 2 functions from communicating directly; they have to share any information via some slow distributed storage system. This is true even for simple stuff like passing the results of g(x) to another function f so you can compute f(g(x)). Beyond the basics, fast point-to-point communication is absolutely essential if you hope to do any non-trivial distributed computing other than batch jobs. The potential problem here is that serverless functions come and go pretty often, so their IP addresses aren’t reliable endpoints. This is solved with a classic level of indirection: a lookup service, implemented as some kind of lightweight distributed database. DNS is arguably too heavyweight to deploy for this setting, which is perhaps why the cloud vendors refuse to support networking for FaaS. Fortunately we have Anna—a lightweight autoscaling database. So functions can look each other up by “name” in Anna, and get a current IP address for that name. In a direct sense, Anna serves both as a database and as a Distributed Hash Table overlay network, a duality we explored years ago.
Provide cloud functions with low-latency data access (LDPC). All the interesting challenges in distributed computing begin with data, or as some people like to say, the state of a program. Commercial FaaS vendors are targeted at stateless programs that simply map inputs to outputs with no “side effects” like data updates. But most applications of note these days manage data (state), often in complex ways. Adding to the complexity here is the trend towards disaggregation of storage from compute. In a big cloud environment, you don’t know when and how you need to scale out or upgrade your storage tier or your compute tier, so it’s best to keep them separate. The challenge is that storage services like DynamoDB or ElastiCache become very “far away” in latency terms. To get good latency, we still want some physical colocation of storage near our functions, even if the two tiers are managed and scaled separately. This is what we call Logical Disaggregation with Physical Colocation (LDPC). On this front we needed to innovate, and colocate a data cache on the same machines as the cloud functions, while still providing consistency in concert with Anna.
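
The "level of indirection" for function-to-function communication can be sketched as a tiny name registry. This is an illustration only: `NameRegistry`, its methods, and the addresses are hypothetical stand-ins for a key-value store used as a lookup service, not the Cloudburst or Anna interface.

```python
class NameRegistry:
    """Stand-in for a KVS used as a lookup service (hypothetical API)."""

    def __init__(self):
        self._store = {}  # name -> (address, version)

    def register(self, name, address):
        # Each re-registration bumps a version so the newest address wins.
        _, version = self._store.get(name, (None, 0))
        self._store[name] = (address, version + 1)

    def lookup(self, name):
        entry = self._store.get(name)
        return entry[0] if entry else None


registry = NameRegistry()
registry.register("g", "10.0.0.7:5000")
registry.register("f", "10.0.0.9:5000")

# The function "g" is rescheduled onto a new machine and re-registers itself.
registry.register("g", "10.0.1.3:5000")

# A caller that wants to pass g(x) on to f resolves the name at call time,
# so it never depends on a stale IP address.
target = registry.lookup("f")
print(target, registry.lookup("g"))
```

Callers hold on to names rather than addresses, which is what makes short-lived functions reachable in this sketch.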

This is where a lot of our energy has been spent in the last year. I’ve learned a lot along the way — while the programming problem remains, the system infrastructure space was interesting in its own right, and I think we got a good handle on the big issues. Here is a rundown of the recent results:

Cloudburst System Architecture: The big ideas, overall architecture and some of the details are spelled out in our VLDB 20 paper on Cloudburst. We argue for the LDPC principle and describe the resulting architecture. Then the paper goes into detail on how we automatically encapsulate a developer’s mutable Python state in coordination-avoiding, composable lattice structures so arbitrary Python objects can be integrated into the coordination-free consistency model of Anna. We also describe how we achieve a simple version of causal consistency through these caches. Microbenchmarks show that we can outperform commercial serverless platforms by 1–2 orders of magnitude, and compete with hand-managed serverful distributed frameworks like Dask. We also show end-to-end numbers for two applications: ML prediction serving, and the Retwis twitter clone. Although we did nothing special to tune for ML prediction serving, we outperform AWS Sagemaker, a system specially designed for the task. (We also outperform AWS Lambda by quite a bit more.)
Hydrocache and TCC: The Hydrocache paper in SIGMOD 2020 delves deeper into the ways we keep caches and the database consistent, while still providing low latency. We set the consistency bar even higher in this paper, with the goal of offering transactional causal consistency (TCC). You do not get this level of consistency from the typical distributed caches or KVS systems (looking at you, Redis Cluster!) Yet we show it can be done with very low latency. There’s no question that this paper is quite technical, though. Enjoy :-)
Atomic Fault Tolerance (AFT): The question of fault tolerance should be on your mind when reading about any distributed system. The FaaS vendors are quite naive about it right now — they tell developers that any function may fail and have to be retried, so it’s up to the developer to ensure that their code is idempotent, meaning it has the same effect whether run once or more than once. That’s not very nice, nor is it very likely to be guaranteed. (OK pop quiz time. Stop what you’re doing. Did you write any code this week? Cool. Is it idempotent? How do you know? Is it reasonable to expect you to worry about that? I thought not!) But it gets worse. If your function modifies stored state (say by issuing a call to a database), and it fails a fraction of the way through execution, it will have visibly run a fractional number of times. That is, the partial execution of your function is now exposed in storage and may be read by other code. This paper points out that what’s needed for FaaS fault tolerance is Atomicity, i.e. the “A” from the ACID guarantees. All your function’s external effects should occur, or none should. Idempotence then becomes easy — just include a unique ID for the request, and regardless of how messy it is, we can run it 0 or at most 1 times. That’s how idempotence is supposed to be exposed. This paper leans on our prior work on Read Atomic isolation, and provides a surprisingly simple implementation as a “shim” layer that works in any FaaS architecture. We have it running in the Cloudburst/Anna stack, but the paper shows how to deploy it in the AWS Lambda/S3 stack.
Model Serving. Our first foray into model serving in the VLDB 20 Cloudburst architecture paper whetted our appetite to do better. A few years back, when my co-conspirator Joey Gonzalez was leading the Clipper model serving project, I needled him by saying “hey I think all these optimizations you’re exploring (cascades and ensembles and whatnot) could be written as simple Bloom programs”. And I proceeded to sketch them as dataflows on a whiteboard. Well, with the Cloudburst infrastructure under his belt, Vikram Sreekanti took up that idea and made it real. He implemented a simple dataflow language called Cloudflow, and deployed it over Cloudburst. Then he proceeded to explore optimization opportunities exposed by the combination of explicit dataflow and stateful serverless computing, including things like (a) placing code on the right HW resources (e.g. GPUs) or colocated with the right data (e.g. in a Hydrocache), (b) autoscaling different stages of an ML pipeline differently, (c) fusing operators so they run colocated with each other, and (d) running competing versions of operators in parallel to let the fastest execution win. What’s really nice here is that the ML code remains a black box, so this is compatible with your favorite ML libraries (Tensorflow, PyTorch, MXNet, Scikit-Learn, etc.). Joey and I feel like Vikram really made the case that this is the right way to architect a model serving system.
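
The atomicity argument in the fault-tolerance item above can be illustrated with a toy write-buffering layer. This is a single-threaded sketch under simplifying assumptions, not the AFT implementation: `AtomicShim`, `transfer`, and the request IDs are hypothetical, and a real system would need the write set and the request ID to be committed together durably.

```python
class AtomicShim:
    """Toy write-buffering layer in front of a key-value store (hypothetical)."""

    def __init__(self):
        self.storage = {}        # state visible to other readers
        self.committed = set()   # request IDs whose effects are already applied

    def run(self, request_id, func):
        # A retry of an already-committed request is a no-op.
        if request_id in self.committed:
            return False
        buffer = {}                        # writes stay private until func finishes
        func(dict(self.storage), buffer)   # may raise part-way through
        # Reached only if func completed: install all writes together.
        self.storage.update(buffer)
        self.committed.add(request_id)
        return True


def transfer(snapshot, writes):
    writes["alice"] = snapshot["alice"] - 10
    writes["bob"] = snapshot["bob"] + 10


def flaky_transfer(snapshot, writes):
    writes["alice"] = snapshot["alice"] - 10
    raise RuntimeError("crashed half-way")


shim = AtomicShim()
shim.storage.update({"alice": 100, "bob": 0})

try:
    shim.run("req-1", flaky_transfer)   # fails: no partial write is exposed
except RuntimeError:
    pass
after_failure = dict(shim.storage)

shim.run("req-1", transfer)             # the retry succeeds and commits
shim.run("req-1", transfer)             # a duplicate retry is ignored
print(after_failure, shim.storage)
```

In this sketch the function body is not idempotent at all, yet retries are safe because either all of its writes become visible or none do, and a committed request ID is never applied twice.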

In sum, Cloudburst is our answer to the critiques of FaaS we raised 2 years ago. Cloudburst shows that FaaS can provide 3 steps forward, and provide an underpinning for general-purpose cloud programming. Most programming tasks that can benefit from the world’s biggest computer absolutely require efficient and consistent management of program state, and that’s where much of the hard computer science lies in this space.

Summing Up

Obviously all this work was done by a team. The lion’s share was done by the lead PhD students, Vikram Sreekanti and Chenggang Wu, who are truly a dynamic duo. Joey Gonzalez was my co-conspirator as faculty advisor. Other contributors include Saurav Chhatrapati, Charles Lin, Yihan Lin, and Hari Subbaraj, with wise input from Jose Faleiro, Johann Schleier-Smith, and Alexey Tumanov.

Our ability to slay some dragons in this space in recent years is also thanks to a long line of research from an even bigger group of collaborators from BOOM and P2 days. There’s more to come from our end, and I expect to see more good stuff from the community at large. Programming the cloud is one of the biggest challenges and opportunities in computer science, and we’ll continue pushing forward.

In addition to NSF CISE Expeditions Award CCF-1730628, this research is supported by gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk and VMware.

The State of the Serverless Art was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

Estimating the fatality rate is difficult but doable with better data

Anastasios Angelopoulos — Tue, 28 Jul 2020 16:33:52 GMT

A. N. Angelopoulos, R. Pathak, R. Varma, M. I. Jordan. On Identifying and Mitigating Bias in the Estimation of the COVID-19 Case Fatality Rate. Harvard Data Science Review Special Issue 1 — COVID-19: Unprecedented Challenges and Chances. 2020.

Summary

The case fatality rate quantifies how dangerous COVID-19 is, and how risk of death varies with strata like geography, age, and race. Current estimates of the COVID-19 case fatality rate (CFR) are biased for dozens of reasons, from under-testing of asymptomatic cases to government misreporting. We provide a careful and comprehensive overview of these biases and show how statistical thinking and modeling can combat such problems. Most importantly, data quality is key to unbiased CFR estimation. We show that a relatively small dataset collected via careful contact tracing would enable simple and potentially more accurate CFR estimation.

§1 What is the case fatality rate, and why do we need to estimate it?

The case fatality rate (CFR) is the proportion of fatal COVID-19 cases. The term is ambiguous, since its value depends on the definition of a ‘case.’ No perfect definition of the case fatality rate exists, but in this article, I define it as the proportion of deaths among all COVID-19-infected individuals.

The CFR is a measure of disease severity. Furthermore, the relative CFR (the ratio of CFRs between two subpopulations) is a useful target for data-informed resource-allocation protocols because it measures relative risk. In other words, the CFR tells us how drastic our response needs to be; the relative CFR helps us allocate scarce resources to populations that have a higher risk of death.

Although the CFR is defined as the proportion of fatal infections, we cannot expect that dividing the number of deaths by the number of cases will give us a good estimate of the CFR. The problem is that both the numerator (#deaths) and the denominator (#infections) of this fraction are uncertain for systematic reasons due to the way data is collected. For this reason, we call that estimator “the naive estimator”, or simply deaths/cases.

§2 Why are (all) CFR estimators biased?

Fig. 1 Dozens of biases (§2) can corrupt the estimation of the CFR. Surveillance data gives partial information within the ‘sampling frame’ (light blue rectangle). Edges on the graph correspond roughly to conditional probabilities; e.g., the edge from D to DF is the probability a person dies if they are diagnosed with COVID-19.

In short, because the data is biased, we are losing at least 99.8% of our sample efficiency. There’s a well-known “butterfly effect” in statistics: a tiny correlation between your sampling method and the quantity you’re seeking can have huge, destructive effects on your estimator. Even assuming a tiny 0.005 correlation between the population we test and the population infected, testing 10,000 people for SARS-CoV-2 is equivalent to testing 20 individuals randomly. For estimating the fatality rate, the situation is even worse, since we have many reasons to believe that severe cases are preferentially diagnosed and reported. In the words of Xiao-Li Meng, “compensating for [data] quality with quantity is a doomed game.” In our HDSR article, we show that in order for the naive estimator to converge to the correct CFR, there must be no correlation between fatality and being tested, but severe cases are much more likely to be tested. Governments and health organizations have been explicitly reserving tests for severe cases due to shortages, and severe cases are likely to go to the hospital and get tested, while asymptomatic ones are not.

The primary source of COVID-19 data is population surveillance: county-level aggregate statistics reported by medical providers who diagnose patients on-site. Usually, somebody feels sick and goes to a hospital, where they get tested and diagnosed. The hospital reports the number of cases, deaths, and sometimes recoveries to local authorities, who release the data usually on a weekly basis. Of course, this is an idealized model, and in reality, there are many differences in data collection between nations, local governments, and even hospitals.

Dozens of biases are induced by this method of surveillance, falling into roughly five categories: under-ascertainment of mild cases, time lags, interventions, group characteristics (e.g. age, sex, race), and imperfect reporting and attribution. An extensive (but not exhaustive) discussion of the magnitude and direction of these biases is in our article. Without mincing words, this data is extremely low quality. The vast majority of people who get COVID-19 go undiagnosed, there are misattributions of symptoms and deaths, data reported by governments is often (and perhaps purposefully) incorrect, cases are defined inconsistently across countries, and there are many time-lags (for example, cases are counted as ‘diagnosed’ before they are ‘fatal’, leading to a downward bias in the CFR if the number of cases is growing over time). Figure 1 has a graphical model describing these many relationships; look to the paper for a very detailed explanation of what biases occur across each edge.

Correcting for biases is sometimes possible using outside data sources, but can result in a worse estimator overall due to partial bias cancellation. This is easier to see through example than it is to explain. Assume the true CFR is some value p in the range 0 to 1 (i.e., deaths/infections is equal to p). Then, assume that because of under-ascertainment of mild cases, there are too many fatal cases being reported, which means deaths/cases converges to bp>p (in other words, it is higher than it should be by a factor of b>1). But at the same time, assume that the time-lag between diagnosis and death causes the proportion of deaths to diagnoses to be too low by the same factor, b. Then, deaths/cases converges to b(p/b)=p, the correct value. So, even though it might seem to be an objectively good idea to correct for time-lag between diagnosis and death, it would actually result in a worse estimator in this case, since time-lag is helping us out by cancelling out under-ascertainment.
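
The cancellation argument can be checked with a few lines of arithmetic. The values of p and b below are made up purely for illustration; they are not estimates from the article.

```python
p = 0.01   # assumed true CFR (illustrative value)
b = 2.0    # assumed size of each bias factor (illustrative value)

# Under-ascertainment inflates by b while the time-lag deflates by b.
both_biases = b * (p / b)

# "Correcting" only the time-lag leaves the inflation in place.
only_underascertainment = b * p

error_before = abs(both_biases - p)
error_after = abs(only_underascertainment - p)
print(both_biases, only_underascertainment, error_before, error_after)
```

With these numbers the doubly biased estimator lands on p, while removing just one of the two biases doubles the estimate.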

The mathematical form of the naive estimator allows us to see easily what we need to do to make it unbiased. With p being the true CFR, q being the reporting rate, r being the covariance between death and diagnosis, and N being the number of samples, the mean of deaths/cases is:

μ = (p + r/q)(1 − (1 − q)^N)

This equation is pretty easy to understand. We wanted μ to be equal to p. Instead, we got an expression that depends on r, q, and N. The r/q term is the price we pay if people who are diagnosed are more likely to eventually die. We want r/q=0, but in practice, r/q is probably much larger than p. (Actually, if we assume the CFR is around 0.5% and the measured CFR is 5.2% on June 22, 2020, then r/q≥0.047>>0.005.) In other words, r/q is the bias, and it can be large. The term p is, of course, the true CFR, which we want. And the factor (1−(1−q)^N) is what we pay because of non-response; however, it’s not a big deal, because it disappears quite fast as the number of samples N grows. So really, our primary concern should be achieving r=0, because, and I cannot stress this enough, r/q does not decrease with more samples; it only decreases with higher quality samples.
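
A short numerical sketch shows how the two terms behave. It reads the description above as μ = (p + r/q)(1 − (1 − q)^N), which is a reconstruction from the text rather than a quotation of the paper, and the reporting rate q = 0.1 is a hypothetical value picked only so that r/q = 0.047 matches the number quoted above.

```python
def naive_estimator_mean(p, q, r, n):
    """Mean of deaths/cases under the assumed form (p + r/q)(1 - (1 - q)^n)."""
    return (p + r / q) * (1.0 - (1.0 - q) ** n)


P_TRUE = 0.005   # true CFR used in the text's example
Q = 0.1          # hypothetical reporting rate (assumption)
R = 0.047 * Q    # covariance chosen so that r/q = 0.047

# The non-response factor vanishes as n grows, but the r/q bias does not.
means = {n: naive_estimator_mean(P_TRUE, Q, R, n) for n in (10, 100, 10_000)}

# With r = 0 the same estimator converges to the true CFR.
unbiased = naive_estimator_mean(P_TRUE, Q, 0.0, 10_000)
print(means, unbiased)
```

Under these assumptions, growing n only moves the mean towards p + r/q, which is why more data of the same quality does not help.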

§3 What are strategies for fixing the bias?

In our article, we outline a testing procedure that helps fix some of the above dataset biases. If we collect data properly, we think even the naive estimator can be a good estimator of the CFR within a particular population. In particular, we suggest a procedure like the following:

1. Diagnose person P with COVID-19 by any means, like at a hospital.
2. Reach out to contacts of P. If a contact has no symptoms, ask them to commit to getting a COVID-19 test.
3. Test committed contacts after the virus has incubated.
4. Keep data with maximum granularity while respecting ethics/law.
5. Follow up after a few weeks to ascertain the severity of symptoms.
6. For committed contacts who didn’t get tested, call and note if they are asymptomatic.

This protocol is meant to decrease the covariance between fatality and diagnosis. If patients commit to testing before they develop symptoms, there cannot be a covariance between disease severity and diagnosis. However, there may still be issues with people dropping out of the study; if this is a problem in practice, it can be mitigated by a combination of incentives (payments) and post-stratification.

Fig. 2 Assuming data collection induces no correlation between disease severity and diagnosis, as the true CFR decreases, it requires more samples to estimate. The variable p is the true CFR, and q is the response rate. Each histogram represents the probability the naive estimator will take on a certain value, given N samples of data (different colors correspond to different values of N). The three stacked plots correspond to different values of p; the smaller p is, the harder it is to estimate, since death becomes an extremely rare event.

Figure 2 represents an idealized version of this study. In the best case scenario, there is no covariance between death and diagnosis. In that case, we only need N=66 samples for our estimator of the CFR to be approximately unbiased, even if p=0.001 (1/1000 cases die). Problems remain in the case that p is small; namely, death is so rare that we need a very large number of samples to decrease the variance of our estimator. But even if no deaths are observed, that gives us lots of information about p; specifically, if N=1000 and we have not observed a single death, then we can confidently say that p<0.01 within the population we are sampling. This is simply because in the second panel of Figure 2, there is nearly zero mass in the N=1000 histogram at deaths/cases=0. With this in mind, we could find the largest possible p that is consistent with our data; this would be a conservative upper bound on p, but it would be much closer to the true value than we can get with current data.
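
The "largest p consistent with zero deaths" idea can be sketched with a simple binomial calculation. This is a simplification, not the article's exact procedure: it assumes every one of the n sampled cases is diagnosed with a known outcome, that outcomes are independent, and that a 5% threshold counts as "consistent"; the function names are hypothetical.

```python
def prob_zero_deaths(p, n):
    """Probability of observing no deaths among n independent cases with CFR p."""
    return (1.0 - p) ** n


def zero_death_upper_bound(n, alpha=0.05):
    """Largest p for which seeing zero deaths in n cases has probability >= alpha."""
    return 1.0 - alpha ** (1.0 / n)


# If p were 0.01, a run of 1000 cases with no deaths would be extremely unlikely.
unlikely = prob_zero_deaths(0.01, 1000)

# Conservative upper bound on p after 1000 cases with no deaths.
bound = zero_death_upper_bound(1000)
print(unlikely, bound)
```

Under these assumptions the bound comes out at roughly 0.003, comfortably below the p<0.01 statement in the text.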

This strategy mostly resolves what we believe is the largest set of biases in CFR estimation — under-ascertainment of mild cases and time-lags. However, there will still be lots of room for improvement, like understanding the dependency of CFR on age, sex, and race. (In other words, the CFR is a random quantity itself, depending on the population being sampled.) Distinctions between CFRs of these strata may be quite small, requiring a lot of high-quality data to analyze. If p is extremely low, like 0.001, and we take a purely frequentist approach as in Figure 2, this may require collecting N=100,000 or N=1,000,000 samples per group. Perhaps there are ways to lower that number with Bayesian hierarchical modeling. Even though making correct inferences will require careful thought (as always), this data collection strategy will make it much simpler.

I’d like to re-emphasize a point here: collecting data as above will make the naive estimator unbiased for the sampled population. But the sampled population may not be the population we care about. However, there is a set of statistical techniques collectively called ‘post-stratification’ that can help deal with this problem effectively — though not perfectly.

In our academic article, we provide some thoughts on how to use time-series data and outside information to correct time-lags and relative reporting rates. Our work was very heavily based on one of Nick Reich’s papers. However, as I claimed earlier, even fancy estimators cannot overcome fundamental problems with data collection. I’ll defer discussion of that estimator, and the results we got from it, to the article. It’s best parsed by experts looking for a perspective on how to perform these estimations honestly. I’d love to hear your thoughts.

CFR estimation is clearly a difficult problem — but with proper data collection and estimation guided by data scientists, I still believe that we can get a useful CFR estimate. This will help guide public policy decisions about this urgent and ongoing pandemic.

A. N. A. was partially supported by the National Science Foundation Graduate Research Fellowship Program. R. P. was partially supported by a UC Berkeley University Fellowship via the ARCS Foundation. A. N. A. and R. P. are RISELab/BAIR members and M. I. J. is a core faculty member in both groups.

Secure Collaborative XGBoost on Encrypted Data

Rishabh Poddar — Thu, 16 Jul 2020 20:01:33 GMT

A library for multi-party training and inference of XGBoost models using secure enclaves

Photo by Markus Spiske on Unsplash (modified).

We recently released Secure XGBoost, a library that enables collaborative XGBoost training and inference on encrypted data. Secure XGBoost is part of the umbrella MC² project, under which we are working on a variety of tools for privacy-preserving machine learning.

In particular, Secure XGBoost facilitates secure collaborative learning — where mutually distrustful data owners can jointly train a model on their data, but without revealing their data to each other. Secure collaborative learning is a powerful paradigm that could be the key to unlocking more resilient and robust models.

We’ve been partnering with some teams in industry, including Scotiabank and Ant Financial, to deploy Secure XGBoost for efforts towards anti-money laundering and fraud detection.

For more information, please read our detailed writeup on Secure XGBoost here. The source code for all MC² projects is available on Github.

Acknowledgments

This work was supported in part by the NSF CISE Expeditions Award CCF-1730628, and gifts from the Sloan Foundation, Bakar Program, Alibaba, Amazon Web Services, Ant Financial, Capital One, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk, and VMware.

Context-Aware Fast Food Recommendation at Burger King with RayOnSpark

Jason Dai — Wed, 08 Jul 2020 15:36:56 GMT

Authors: Luyang Wang (lwang1@rbi.com), Kai Huang (kai.huang@intel.com), Jiao Wang (jiao.wang@intel.com), Shengsheng Huang (shengsheng.huang@intel.com), Jason Dai (jason.dai@intel.com)

Deep learning based recommendation models have been widely used in real world recommendation systems. Common methods concatenate user and item embedding vectors, then feed them into an MLP (multilayer perceptron) to generate final predictions. However, these methods fail to capture real-time user behavior signals and do not take important context features (such as time and location) into consideration; as a result, the final recommendations do not reflect users’ real-time preferences well. User behavior sequences and context features become even more important for fast food recommendation because:

Users are not likely to purchase another soft drink when they already have soft drinks added in the cart.
User purchase preference can drastically change given location, time, and current weather conditions. For example, people almost never buy kids meals at midnight and are very unlikely to buy frozen drinks on a cold rainy day.

In this blog post, we present our Transformer Cross Transformer (TxT) model that exploits the sequence of each order as well as the context information to infer a user’s preference at the moment. The key advantage of our model is that we apply Transformer encoders to capture both user order behavior sequence and complicated context features and combine both transformers through latent cross to generate recommendations.

In addition, we have leveraged RayOnSpark in Analytics Zoo to build an end-to-end recommendation system using Ray*, Apache Spark* and Apache MXNet*. It integrates data processing (with Spark) and distributed training (with MXNet and Ray) into a unified analysis and AI pipeline, which runs on the same cluster where our big data is stored and processed. We have successfully deployed the recommendation system at Burger King, and our solution achieves superior results in the production environment.

TxT Model for Recommendation

We propose the Transformer Cross Transformer model (TxT), which uses a Sequence Transformer to encode guest order behavior, a Context Transformer to encode context features (such as weather, time and location), and then uses an element-wise product to combine them (the “cross” part) to produce the final output, as shown in Figure 1. We implement our model using the MXNet API.

Figure 1: TxT Model architecture.

Sequence Transformer

We construct a Sequence Transformer, based on the Transformer architecture, to learn the sequence embedding vector of each item in the guest order basket, as shown in the lower left part of Figure 1. To ensure that the item position information can be considered in its original add-to-cart sequence, we perform positional embedding on input items in addition to the item feature embedding. The embedding outputs are then added together and fed into a multi-head self-attention network.

To extract a vector representation of the whole guest order basket from the hidden vectors of each item, we apply mean pooling and max pooling separately to the final Sequence Transformer output and concatenate the results. In this way, the pooled output can consider all products contained in the product sequence while focusing on a small number of key products and their salient features.
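
As a rough illustration of the pooling step, the following plain-Python sketch concatenates the mean-pooled and max-pooled versions of a basket's per-item hidden vectors. It is not the Analytics Zoo or MXNet code; the function name, the ordering of the two halves, and the toy numbers are assumptions.

```python
def pool_basket(hidden_vectors):
    """Concatenate mean pooling and max pooling over per-item hidden vectors."""
    columns = list(zip(*hidden_vectors))   # one tuple per hidden dimension
    mean_pool = [sum(col) / len(col) for col in columns]
    max_pool = [max(col) for col in columns]
    return mean_pool + max_pool


# Two items in the basket, each with a 3-dimensional hidden vector.
basket = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
basket_vector = pool_basket(basket)
print(basket_vector)
```

The mean half summarizes every item in the order, while the max half keeps the strongest signal in each dimension, so the result has twice the hidden size.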

Sequence Transformer can be constructed using the API in Analytics Zoo below:

https://medium.com/media/e869bbbbcffd1015a73d13af8f8467eb/href

Context Transformer

A common way to incorporate context features is to directly concatenate them with sequential inputs. But it is less meaningful to simply concatenate non-sequence features with sequence features. Some previous solutions use element-wise sum to deal with multiple context features. However, sum can only represent how context features aggregately contribute to the output, but most of the time these context features do not contribute equally to a user’s final decision.

Therefore, we use a Context Transformer to encode the contextual information, as shown in the bottom right part of Figure 1. Using Transformer’s multi-head self-attention, we can capture not only the individual effect of each context feature, but also the internal relationship and complicated interactions across different context features.

Context Transformer can be constructed using the API in Analytics Zoo below:

https://medium.com/media/5eadd327e1051d9a87ec084399a9adf9/href

Transformer Cross Transformer

To jointly train Sequence Transformer and Context Transformer, we perform an element-wise product between these two transformer outputs. Through this cross Transformer training, we are able to optimize all the parameters such as item embeddings, context feature embeddings and their interactions at the same time. Finally, we apply ReLU as the activation function followed by a softmax layer to predict the probabilities of each candidate item.
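
The "cross" step and the output layer can be sketched in a few lines of plain Python. This is a toy illustration, not the model code: the vectors, the weight matrix, and the assumption that the softmax layer is a linear projection followed by softmax are all hypothetical.

```python
import math


def latent_cross(seq_vec, ctx_vec):
    """Element-wise product of the two transformer outputs."""
    return [s * c for s, c in zip(seq_vec, ctx_vec)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def dense(vec, weights):
    """Linear projection; weights holds one row per candidate item."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def softmax(logits):
    shift = max(logits)   # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


seq_out = [1.0, -2.0, 0.5]   # toy Sequence Transformer output
ctx_out = [2.0, 1.0, -1.0]   # toy Context Transformer output
item_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two candidate items

activated = relu(latent_cross(seq_out, ctx_out))
probs = softmax(dense(activated, item_weights))
print(activated, probs)
```

Because the two outputs are multiplied rather than added, a context dimension near zero can suppress the matching sequence dimension entirely, which a sum cannot do.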

TxT, which consists of Sequence Transformer and Context Transformer, can directly be constructed using the API in Analytics Zoo below:

https://medium.com/media/cf981f8c5ec002922a157c0ecc30cd6d/href

End-to-End System Architecture

Conventional approaches to building a recommendation pipeline would set up two separate clusters, one for big data processing and the other dedicated to deep learning (e.g., a GPU cluster). This not only introduces a lot of data transfer overhead, but also requires additional effort to manage separate workflows and systems in production. To address these challenges, we have built the recommendation system on top of RayOnSpark in Analytics Zoo, which integrates Spark data processing and distributed MXNet training (using Ray) into a unified pipeline that runs on a single Xeon cluster.

Figure 2 illustrates the overall architecture of our system. In the Spark program, a SparkContext object is created on the driver node and it is responsible for launching multiple Spark executors to run Spark tasks. RayOnSpark additionally creates a RayContext object on the Spark driver, which will automatically launch Ray processes alongside each Spark executor and create a RayManager inside each Spark executor to manage Ray processes (e.g., automatically shutting down the processes when the program exits).

Figure 2: Overview of the recommendation system based on RayOnSpark.

In our recommendation system, we first launch Spark tasks to extract our restaurant transaction data stored on distributed file systems, followed by data cleaning, ETL and preprocessing steps using Spark. After the Spark tasks complete, the processed in-memory Spark RDDs are directly fed into the Ray cluster through Plasma for distributed training.

Inspired by the design of RaySGD, we have implemented an MXNet Estimator that provides a lightweight shim layer to automatically deploy distributed MXNet training on Ray. Both MXNet workers and parameter servers run as Ray actors, and they communicate with each other via the distributed key-value store provided by MXNet; each MXNet worker takes its local data partition in Plasma to train the model. As a result, the user can seamlessly scale the MXNet training code from a single node to production clusters through Ray, using a simple scikit-learn style API below:

https://medium.com/media/f79b51326fbafb46a71fa01d7c10571e/href

Such a unified design architecture integrates Spark data processing and Ray-based distributed MXNet training into an end-to-end, in-memory pipeline, which runs on exactly the same cluster where our big data is stored. Consequently, we only need to maintain a single cluster for the entire AI pipeline, with no extra data transfer across different clusters and no extra cluster maintenance efforts. This achieves the full utilization of the cluster resources and significantly improves the end-to-end performance of the whole system.

Model Evaluation

We conducted offline experiments using the customer transaction records of Burger King in the past 12 months. The historical data of the first 11 months is used as training data and the last month is used for validation. The models are trained on this data to predict the next best product for the guest to purchase. Table 1 shows the superiority of TxT over the baseline models (including Association Rule Learning and GRU4Rec). When comparing TxT and GRU4Rec, we can see that incorporating various context features greatly improves the Top1 and Top3 accuracy (by approximately 5.65% and 7.32% respectively).
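For readers unfamiliar with the Top1 and Top3 metrics, here is a minimal sketch of how top-k accuracy can be computed. The function name, menu items, and orders are hypothetical and are not taken from our evaluation code or data.

```python
def top_k_accuracy(ranked_predictions, true_items, k):
    """Fraction of cases whose true item appears among the first k ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_items)
        if truth in ranked[:k]
    )
    return hits / len(true_items)


# Hypothetical ranked recommendations for four orders (best guess first).
ranked = [
    ["fries", "cola", "sundae"],
    ["cola", "fries", "sundae"],
    ["sundae", "fries", "cola"],
    ["fries", "sundae", "cola"],
]
# Hypothetical item each guest actually purchased next.
bought = ["fries", "fries", "cola", "cola"]

top1 = top_k_accuracy(ranked, bought, k=1)
top3 = top_k_accuracy(ranked, bought, k=3)
print(top1, top3)
```

Top1 only rewards the single best recommendation, while Top3 rewards any hit among the first three, so Top3 is never lower than Top1.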

Table 1: Offline training results of different recommendation models.

To evaluate the effectiveness of our TxT model in a real-world production environment, we ran our recommendation system in Burger King’s mobile application side by side with Google Recommendation AI*, a state-of-the-art recommendation service provided by Google Cloud Platform (GCP)*. We evaluated online performance on two metrics: recommendation conversion rate and add-on sales. We ran A/B testing for 4 weeks. For the control group, we randomly selected 20% of users and presented them with the previous rule-based recommendation system. As shown in Table 2, TxT improved recommendation conversion on the checkout page by 264% and add-on sales by 137% compared to the control group. This corresponds to a +100% conversion gain and a +73% add-on sales gain compared to the other test groups running the GCP Recommendation AI service.

Table 2: Online results of different recommendation solutions.

Conclusion

This blog post describes how we build and productionize an end-to-end recommendation pipeline at Burger King. The pipeline captures user order behaviors and complex context features through the Transformer Cross Transformer (TxT) model, and implements unified data processing (with Spark) and DL training (with Ray) using RayOnSpark. Both the TxT model and RayOnSpark have been open sourced in the Analytics Zoo project.

*Other names and brands may be claimed as the property of others

Context-Aware Fast Food Recommendation at Burger King with RayOnSpark was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.

Making Decision Trees Accurate Again: Explaining what Explainable AI did not

Alvin Wan — Fri, 17 Apr 2020 20:47:24 GMT

Combining neural networks and decision trees for accurate and interpretable computer vision models (and how our method works).

This is an extended version with an expanded methods description of the Towards Data Science article “What Explainable AI fails to explain (and how we fix that)”.

The interpretability of neural networks is becoming increasingly necessary, as deep learning is being adopted in settings where accurate and justifiable predictions are required. These applications range from finance to medical imaging. However, deep neural networks are notorious for a lack of justification. Explainable AI (XAI) attempts to bridge this divide between accuracy and interpretability, but as we explain below, XAI justifies decisions without interpreting the model directly.

What is “Interpretable”?

Defining explainability or interpretability for computer vision is challenging: What does it even mean to explain a classification for high-dimensional inputs like images? As we discuss below, two popular definitions involve saliency maps and decision trees, but both approaches have their weaknesses.

What Explainable AI Doesn’t Explain

Saliency Maps¹

Many XAI methods produce saliency maps, but saliency maps focus on the input and neglect to explain how the model makes decisions. For more on saliency maps, see these saliency tutorials and GitHub repositories.

Picturing the original image (left), saliency map using a method called Grad-CAM (middle), and another using Guided Backpropagation (right). The picture above is the canonical example for “class-discrimination”. The above saliency maps are taken from https://github.com/kazuto1011/grad-cam-pytorch.

What Saliency Maps Fail to Explain

To illustrate why saliency maps do not fully explain how the model predicts, here is an example: Below, the saliency maps are identical, but the predictions differ. Why? Even though both saliency maps highlight the correct object, one prediction is incorrect. How? Answering this could help us improve the model, but as shown below, saliency maps fail to explain the model’s decision process.

(Left) The model predicts Eared Grebe. (Right) The model predicts Horned Grebe. These are Grad-CAM results for a ResNet18 model trained on Caltech-UCSD Birds-200–2011, or CUB 2011 for short. Although the saliency maps look extremely similar, the model predictions differ. As a result, saliency maps do not explain how the model reached its final prediction.

Decision Trees

Another approach is to replace neural networks with interpretable models. Before deep learning, decision trees were the gold standard for accuracy and interpretability. Below, we illustrate the interpretability of decision trees.

Instead of only predicting “Super Burger” or “Waffle fries”, the above decision tree will output a sequence of decisions that lead up to a final prediction. These intermediate decisions can then be verified or challenged separately. As a result, classic machine learning calls this model “interpretable”.
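To make the idea of "a sequence of decisions" concrete, here is a minimal sketch of a hand-written decision tree that returns its intermediate decisions together with the final prediction. The tree, the feature names, and the "Hot Dog" leaf are hypothetical and exist only for illustration.

```python
# Toy tree: each inner node asks a yes/no question about the item.
# Leaves are plain strings holding the final prediction.
TREE = {
    "question": "has_bun",
    "yes": {
        "question": "has_sausage",
        "yes": "Hot Dog",
        "no": "Super Burger",
    },
    "no": "Waffle fries",
}


def predict_with_decisions(node, item):
    """Walk the tree and return (prediction, list of intermediate decisions)."""
    decisions = []
    while isinstance(node, dict):
        answer = "yes" if item[node["question"]] else "no"
        decisions.append((node["question"], answer))
        node = node[answer]
    return node, decisions


prediction, decisions = predict_with_decisions(
    TREE, {"has_bun": True, "has_sausage": False}
)
print(prediction, decisions)
```

Each entry in the returned list can be checked on its own, which is what makes the model's reasoning open to verification or challenge.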

For accuracy, however, decision trees lag behind neural networks by up to 40% accuracy on image classification datasets². Neural-network-and-decision-tree hybrids also underperform, failing to match neural networks on even the dataset CIFAR10, which features tiny 32x32 images like the one below.

Example to show just how tiny 32x32 is. This is a sample from the CIFAR10 dataset.

As we show in our paper (Sec 5.2), this accuracy gap damages interpretability: high-accuracy, interpretable models are needed to explain high-accuracy neural networks.

Enter Neural-Backed Decision Trees

We challenge this false dichotomy by building models that are both interpretable and accurate. Our key insight is to combine neural networks with decision trees, preserving high-level interpretability while using neural networks for low-level decisions, as shown below. We call these models Neural-Backed Decision Trees (NBDTs) and show they can match neural network accuracy while preserving the interpretability of a decision tree.

In this figure, each node contains a neural network. The figure only highlights one such node and the neural network inside. In a neural-backed decision tree, predictions are made via a decision tree, preserving high-level interpretability. However, each node in the decision tree is a neural network making low-level decisions. The “low-level” decision made by the neural network above is “Has sausage” or “no sausage”.

NBDTs are as interpretable as decision trees. Unlike neural networks today, NBDTs can output intermediate decisions for a prediction. For example, given an image, a neural network may output Dog. However, an NBDT can output both Dog and Animal, Chordate, Carnivore (below).

NBDTs achieve neural network accuracy. Unlike any other decision-tree-based method, NBDTs match neural network accuracy (< 1% difference) on CIFAR10, CIFAR100, and TinyImageNet200. NBDTs also achieve accuracy within 2% of neural networks on ImageNet, setting a new state-of-the-art accuracy for interpretable models. The NBDT’s ImageNet accuracy of 75.30% outperforms the best competing decision-tree-based method by a whole ~14%.

How and what Neural-Backed Decision Trees Explain

Justifications for Individual Predictions

The most insightful justifications are for objects the model has never seen before. For example, consider an NBDT (below), and run inference on a Zebra. Although this model has never seen a Zebra, the intermediate decisions shown below are correct: Zebras are both Animals and Ungulates (hoofed animals). The ability to see justifications for individual predictions is essential for unseen objects.

NBDTs make accurate intermediate decisions even for unseen objects. Here, the model was trained on CIFAR10 and has never seen zebras before. Despite that, the NBDT correctly identifies the Zebra as both an Animal and an Ungulate (hoofed animal). The photos above are taken from pexels.com, under the Pexels License.

Justifications for Model Behavior

Furthermore, we find that with NBDTs, interpretability improves with accuracy. This is contrary to the dichotomy in the introduction: NBDTs not only have both accuracy and interpretability; they also make both accuracy and interpretability the same objective.

The ResNet10 hierarchy (left) makes less sense than the WideResNet hierarchy (right). In the ResNet10 hierarchy, Cat, Frog, and Airplane are placed under the same subtree. By contrast, the WideResNet hierarchy cleanly splits Animals and Vehicles, one on each side of the hierarchy. The pictures above are taken directly from the CIFAR10 dataset.

For example, ResNet10 achieves 4% lower accuracy than WideResNet28x10 on CIFAR10. Correspondingly, the lower-accuracy ResNet10 hierarchy (left) makes less sense, grouping Frog, Cat, and Airplane together. This is “less sensible,” as it is difficult to find an obvious visual feature shared by all three classes. By contrast, the higher-accuracy WideResNet hierarchy (right) makes more sense, cleanly separating Animal from Vehicle. Thus, the higher the accuracy, the more interpretable the NBDT.

Understanding Decision Rules

With low-dimensional tabular data, decision rules in a decision tree are simple to interpret, e.g., if the dish contains a bun, then pick the right child, as shown below. However, decision rules are not as straightforward for high-dimensional inputs like images.

As we qualitatively find in the paper (Sec 5.3), the model’s decision rules are based not only on object type but also on context, shape, and color.

This example demonstrates how decision rules are easy to interpret with low-dimensional, tabular data. To the right is example tabular data for several items. To the left is a decision tree we trained on this data. In this case, the decision rule (blue) is “Has bun or not?” All items with a bun (orange) are sent to the top child, and all items without a bun (green) are sent to the bottom child.

To interpret decision rules quantitatively, we leverage an existing hierarchy of nouns called WordNet³; with this hierarchy, we can find the most specific shared meaning between classes. For example, given the classes Cat and Dog, WordNet would provide Mammal. In our paper (Sec 5.2) and pictured below, we quantitatively verify these WordNet hypotheses.
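The lookup of the "most specific shared meaning" can be sketched as a lowest-common-ancestor search. The snippet below uses a tiny hand-written child-to-parent map as a stand-in for WordNet; the map is hypothetical and far simpler than the real hierarchy, and the function names are ours.

```python
# Toy child -> parent map standing in for WordNet (not real WordNet data).
PARENT = {
    "cat": "feline",
    "dog": "canine",
    "feline": "mammal",
    "canine": "mammal",
    "frog": "amphibian",
    "mammal": "animal",
    "amphibian": "animal",
    "truck": "vehicle",
    "animal": "entity",
    "vehicle": "entity",
}


def ancestors(name):
    """Return [name, parent, grandparent, ...] up to the root."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def most_specific_shared_meaning(classes):
    """Return the deepest node that is an ancestor of (or equal to) every class."""
    shared = set(ancestors(classes[0]))
    for name in classes[1:]:
        shared &= set(ancestors(name))
    # Walking up from the first class, the first shared node is the deepest one.
    for node in ancestors(classes[0]):
        if node in shared:
            return node
    return None


print(most_specific_shared_meaning(["cat", "dog"]))
print(most_specific_shared_meaning(["cat", "dog", "frog"]))
```

Applied to the leaves under a node of the hierarchy, this kind of lookup gives a candidate label (a hypothesis) for that node.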

The WordNet hypothesis for the left subtree (red arrow) is Vehicle. The WordNet hypothesis for the right (blue arrow) is Animal. To validate these meanings qualitatively, we tested the NBDT against unseen classes of objects: 1. Find images that were not seen during training. 2. Given the hypothesis, determine which child each image belongs to. For example, we know that Elephant is an Animal, so it is supposed to go to the right subtree. 3. We can now evaluate the hypothesis by checking how many images are passed to the correct child. For example, check how many Elephant images are sent to the Animal subtree. These per-class accuracies are shown to the right, with unseen Animals (blue) and unseen Vehicles (red) both showing high accuracies.

Note that in small datasets with 10 classes, such as CIFAR10, we can find WordNet hypotheses for all nodes. However, in large datasets with 1000 classes, such as ImageNet, we can only find WordNet hypotheses for a subset of nodes.

How it Works

The training and inference process for a Neural-Backed Decision Tree can be broken down into four steps.

Training an NBDT occurs in two phases: First, construct the hierarchy for the decision tree. Second, train the neural network with a special loss term. To run inference, pass the sample through the neural network backbone. Finally, run the last fully-connected layer as a sequence of decision rules.

1. Construct a hierarchy for the decision tree, called the Induced Hierarchy.
2. Train the neural network with the loss function that this hierarchy yields, which we call the Tree Supervision Loss.
3. Start inference by passing the sample through the neural network backbone. The backbone is all neural network layers before the final fully-connected layer.
4. Finish inference by running the final fully-connected layer as a sequence of decision rules, which we call Embedded Decision Rules. These decisions culminate in the final prediction.

Running Embedded Decision Rules

We first discuss inference. As explained above, our NBDT approach featurizes each sample using the neural network backbone. To understand what happens next, we will first construct a degenerate decision tree that is equivalent to a fully-connected layer.

Fully-Connected Layer: Running inference with a featurized sample is a matrix-vector product, as shown below.

This matrix-vector product yields a vector of inner products, which we denote with y-hat. The index of the largest inner product is our class prediction.
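Written out, and assuming a layer with k classes, a d-dimensional featurized sample x, and no bias term (the symbols k, d, and w_i are our notation for this sketch), the computation is:

```latex
\hat{y} = W x =
\begin{bmatrix}
  \langle w_1, x \rangle \\
  \vdots \\
  \langle w_k, x \rangle
\end{bmatrix},
\qquad
\text{prediction} = \operatorname*{arg\,max}_{i \in \{1, \dots, k\}} \hat{y}_i
```

Here w_i denotes the i-th row of the weight matrix W, so each entry of y-hat is the inner product between one class's row vector and the sample.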

Naive Decision Tree: We construct a basic decision tree with one root node and a leaf for each class. This is pictured by “B — Naive” in the figure above. Each leaf is directly connected to the root and has a representative vector, namely a row vector from W (Eqn. 1 above).

Also pictured above, running inference with a featurized sample x means taking inner products between x and each child node’s representative vector. Like the fully-connected layer, the index of the largest inner product is our class prediction.
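The equivalence can be checked with a few lines of plain Python. This is a minimal sketch with a made-up 3-class weight matrix and sample; the function names are ours and are not part of the nbdt package.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical weight matrix W (one row per class) and featurized sample x.
W = [
    [0.2, -0.5, 1.0],
    [1.5, 0.3, -0.2],
    [-0.7, 0.8, 0.1],
]
x = [1.0, 2.0, 0.5]


def fully_connected_predict(W, x):
    """Matrix-vector product followed by an argmax over the outputs."""
    y_hat = [inner(row, x) for row in W]
    return max(range(len(y_hat)), key=y_hat.__getitem__)


def naive_tree_predict(W, x):
    """One root, one leaf per class; leaf i holds row i of W as its vector."""
    leaves = [{"class_index": i, "vector": row} for i, row in enumerate(W)]
    # The root's decision rule: go to the child with the largest inner product.
    best = max(leaves, key=lambda leaf: inner(leaf["vector"], x))
    return best["class_index"]


print(fully_connected_predict(W, x), naive_tree_predict(W, x))
```

Both functions compute the same inner products and pick the same index; the tree version simply attaches each row of W to a leaf.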

The direct equivalence between a fully-connected layer and a naive decision tree motivates our particular inference method, using an inner-product decision tree. In our work, we then extend this naive tree to deeper trees. However, that discussion is beyond the scope of this article. Our paper (Sec. 3.1) discusses how this works, in detail.

Building Induced Hierarchies

This hierarchy determines which sets of classes the NBDT must decide between. We refer to this hierarchy as an Induced Hierarchy because we build the hierarchy using a pretrained neural network’s weights.

In particular, we view each row vector in the fully-connected layer’s weight matrix W as a point in d-dimensional space. This is illustrated by “Step B — Set Leaf Vectors”. We then perform hierarchical agglomerative clustering on these points. The successive clustering then determines the hierarchy, as illustrated above. Our paper (Sec. 3.2) discusses this in more detail.
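As a rough sketch of this clustering step, the snippet below repeatedly merges the two closest clusters until a single tree remains. It is a simplification under stated assumptions: the weight rows are made up, distances are measured between the averaged leaf vectors of each cluster, and the linkage criterion used in our actual implementation may differ.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_induced_hierarchy(leaf_vectors):
    """leaf_vectors maps class name -> weight row; returns a nested-tuple tree."""
    # Each cluster is (subtree, list of leaf vectors under that subtree).
    clusters = [(name, [vector]) for name, vector in leaf_vectors.items()]
    while len(clusters) > 1:
        # Find the pair of clusters whose averaged vectors are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = math.dist(
                    mean_vector(clusters[i][1]), mean_vector(clusters[j][1])
                )
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical 2-dimensional rows of W, one per class.
rows = {
    "cat": [1.0, 0.9],
    "dog": [0.9, 1.0],
    "truck": [-1.0, -0.8],
    "ship": [-0.9, -1.1],
}
tree = build_induced_hierarchy(rows)
print(tree)
```

Classes whose weight vectors lie close together end up as siblings, which is why a more accurate backbone tends to produce a more sensible hierarchy.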

Training with Tree Supervision Loss

Consider “A — Hard” in the figure above. Say the green node corresponds to the Horse class. This is just one class. However, it is also an Animal (orange). As a result, we know that a sample arriving at the root node (blue) should go to the right, to Animal. The sample arriving at the node Animal also should go to the right again, towards Horse. We train each node to predict the correct child node. We call the loss that enforces this the Tree Supervision Loss, which is effectively a cross entropy loss for each node.
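The per-node cross entropy can be sketched as follows. This is a minimal illustration, not our training code: the tree, the node vectors, and the sample are made up, each node's logits are assumed to be inner products with its children's vectors, and the per-node losses are simply summed along the path to the true class.

```python
import math


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical tree: non-root nodes carry a representative vector.
TREE = {
    "root": {"children": ["vehicle", "animal"]},
    "vehicle": {"vector": [-1.0, 0.0], "children": []},
    "animal": {"vector": [1.0, 0.0], "children": ["cat", "horse"]},
    "cat": {"vector": [1.0, -1.0], "children": []},
    "horse": {"vector": [1.0, 1.0], "children": []},
}


def node_cross_entropy(x, node, correct_child):
    """Cross entropy of the softmax over a node's children for the correct child."""
    children = TREE[node]["children"]
    logits = [inner(TREE[child]["vector"], x) for child in children]
    log_normalizer = math.log(sum(math.exp(z) for z in logits))
    return log_normalizer - logits[children.index(correct_child)]


def tree_supervision_loss(x, path):
    """Sum the per-node cross entropies along a root-to-leaf path."""
    return sum(
        node_cross_entropy(x, parent, child)
        for parent, child in zip(path, path[1:])
    )


x = [2.0, 1.0]  # hypothetical featurized sample of a horse
loss_horse = tree_supervision_loss(x, ["root", "animal", "horse"])
loss_cat = tree_supervision_loss(x, ["root", "animal", "cat"])
print(loss_horse, loss_cat)
```

The loss is small when every node on the path sends the sample toward the correct child, and it grows as soon as any single decision on the path is wrong.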

Our paper (Sec. 3.3) discusses this in more detail and further explains “B — Soft”.

Trying NBDTs in under a minute

Interested in trying out an NBDT now? Without installing anything, you can view more example outputs online and even try out our web demo. Alternatively, use our command-line utility to run inference (install with pip install nbdt). Below, we run inference on a picture of a cat.

nbdt https://images.pexels.com/photos/126407/pexels-photo-126407.jpeg?auto=compress&cs=tinysrgb&dpr=2&w=32  # this can also be a path to local image

This outputs both the class prediction and all the intermediate decisions.

Prediction: cat // Decisions: animal (99.47%), chordate (99.20%), carnivore (99.42%), cat (99.86%)

You can load a pretrained NBDT in just a few lines of Python as well. Use the following to get started. We support several models, including WideResNet28x10 and ResNet18, for CIFAR10, CIFAR100, and TinyImageNet200.

from nbdt.model import HardNBDT
from nbdt.models import wrn28_10_cifar10

# Build the WideResNet28x10 backbone, then wrap it as a pretrained NBDT.
model = wrn28_10_cifar10()
model = HardNBDT(
    pretrained=True,
    dataset='CIFAR10',
    arch='wrn28_10_cifar10',
    model=model)

For reference, see the script for the command-line tool we ran above; only ~20 lines are directly involved in transforming the input and running inference. For more instructions on getting started and examples, see our GitHub repository.

Conclusion

Explainable AI does not fully explain how the neural network reaches a prediction: Existing methods explain the image’s impact on model predictions but do not explain the decision process. Decision trees address this, but unfortunately, images⁴ are kryptonite for decision tree accuracy.

We thus combine neural networks and decision trees. Unlike predecessors that arrived at the same hybrid design, our neural-backed decision trees (NBDTs) simultaneously address the failures (1) of neural networks to provide justification and (2) of decision trees to attain high accuracy. This primes a new category of accurate, interpretable NBDTs for applications like medicine and finance. To get started, see the project page.

By Alvin Wan, *Lisa Dunlap, *Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, Joseph E. Gonzalez

where * denotes equal contribution

[0] Designed by author Alvin Wan. Footnote exists to clarify we have rights to use this graphic.

[1] There are two types of saliency maps. One is white-box, where the method has access to the model and its parameters. One popular white-box method is Grad-CAM, which uses both gradients and class activation maps to visualize attention. You can learn more from the paper, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” http://openaccess.thecvf.com/content_ICCV_2017/papers/Selvaraju_Grad-CAM_Visual_Explanations_ICCV_2017_paper.pdf. The other type of saliency map is black-box, where the method does not have access to the model parameters. RISE is one such saliency method. RISE masks random portions of the input image and passes this image through the model; the mask that damages accuracy the most is the most “important” portion. You can learn more from the paper “RISE: Randomized Input Sampling for Explanation of Black-box Models”, http://bmvc2018.org/contents/papers/1064.pdf.

[2] This 40% gap between decision tree and neural network accuracy shows up on TinyImageNet200.

[3] WordNet is a lexical hierarchy of various words. A large majority of words are nouns, but other parts of speech are included as well. For more information, see the official website.

[4] In general, decision trees perform best with low-dimensional data. Images are the antithesis of this best-case scenario, being extremely high-dimensional.

Making Decision Trees Accurate Again: Explaining what Explainable AI did not was originally published in riselab on Medium, where people are continuing the conversation by highlighting and responding to this story.