<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Scott Haines on Medium]]></title>
        <description><![CDATA[Stories by Scott Haines on Medium]]></description>
        <link>https://medium.com/@newfrontcreative?source=rss-3b4cab6af83e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*yxJ4a0E87cbQoNwXoPV47w@2x.jpeg</url>
            <title>Stories by Scott Haines on Medium</title>
            <link>https://medium.com/@newfrontcreative?source=rss-3b4cab6af83e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 02:09:01 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@newfrontcreative/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Beyond the Data Abyss]]></title>
            <link>https://newfrontcreative.medium.com/beyond-the-data-abyss-6bf2d1e6e34a?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6bf2d1e6e34a</guid>
            <category><![CDATA[life-lessons]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[kafka]]></category>
            <category><![CDATA[enterprise-architecture]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Thu, 17 Jul 2025 23:40:12 GMT</pubDate>
            <atom:updated>2025-08-08T19:23:52.725Z</atom:updated>
<content:encoded><![CDATA[<h4>How We Learned to Fall in Love with our Streaming Data Again.</h4><figure><img alt="A colorful rendering of what Space could look like abound with reds and yellows, blues and blacks. Not the classic void of space but something more happy and inviting." src="https://cdn-images-1.medium.com/max/1024/1*voiikhPgCxFK5fP_XpfQXQ.jpeg" /><figcaption>Imagine life at the brink of creation. Your data is in a similar primordial state until you gain control and learn to harness it. <a href="https://www.midjourney.com/jobs/d34f8835-ed7f-44e8-9a41-ad6163574cf3?index=0">Image by Author</a> via MidJourney.</figcaption></figure><p>In part one of this series “<a href="https://newfrontcreative.medium.com/escaping-the-void-of-the-data-abyss-337770a39fbc">Escaping the Void of the Data Abyss</a>”, we put our heads together and learned about the countless horrors that will sneak up on you in the night (or over the years) with respect to streaming data at massive scale — through the lens of what didn’t work at Nike. We then learned about the <em>strategy taken at Nike</em> to right the sins of the past, and how <a href="https://buf.build/">Buf</a> came to our rescue and saved the day. Lastly, I left us with a cliffhanger at the end of the first entry saying <em>I’d talk about some of the problems (and solutions) that came about as we started to scale up our </em><strong><em>protobuf-first strategy</em></strong>. Alas, here we are. Ready. Set. Let’s Go.</p><h3>Aligning on a Protobuf-First Data Strategy</h3><p>This can be a hard sell —<em> at first</em>. Most people, when given the choice between <em>a) object-flexibility, or mutability</em> and <em>b) strict type-safety, object-level governance, or message-level immutability</em>, will tend <em>to do what is easiest, which falls in the mutability camp. </em>Mutability ultimately equals extra work later, since you’ll be paying the piper on ingest as things change, which in many cases means “after emission into a Kafka-like service”.</p><p>This is a problem since streaming JSON data <em>is still mutable</em> and on its own is missing an understanding of schema governance. This just means it has a comedic way of surfacing corrupt records at the worst times. This can be annoying for web services, but it is crippling for event streaming since there is no way to rewind and replay corrupt events — <em>simply put, it will lead to data loss and is an outage!</em></p><blockquote>Most people, when given the choice between <em>a) object-flexibility, or mutability</em> and <em>b) strict type-safety, object-level governance, or message-level immutability</em>, will tend <em>to do what is easiest, which falls in the mutability camp</em></blockquote><p>So to recap. JSON is great for HTTP services. <strong>JSON is bad for reliable streaming services</strong>. Protobuf makes JSON services more manageable as well (through type-safety and object-level invariants) — so it is a win-win in both respects, and an easier sell is simply to use protobuf as a backbone for traditional web services and as a go-to for event streaming. For us at Nike, we needed it for both.</p><h4><strong>Sidebar: The Streaming Data Problem</strong></h4><p>The problem I mentioned earlier is fairly ubiquitous across most enterprises — in a nutshell, non-binary-serializable structured data + streaming causes problems.
The trouble is that problems sneak up on us over time with slowly changing data (row or columnar), and this can eventually corrupt historically captured data — to the point of no longer being able to read records.</p><blockquote>There is likely a graph somewhere that shows the length of time a data product has been in existence (vs) the number of changes to the underlying data structure(s) as a function of increasingly invalid records within said dataset across the lifetime of that data product. <em>If not, then imagine a chart showing something getting slowly worse as a function of time</em>.</blockquote><p><em>Not breaking backwards compatibility is one reason why protovalidate is so important (</em><a href="https://newfrontcreative.medium.com/escaping-the-void-of-the-data-abyss-337770a39fbc"><em>see part 1 for more details</em></a><em>) and one of the easier ways of selling the protobuf-first strategy. It’s simply *impossible to force through a breaking change.</em></p><p>Back to aligning on the strategy.</p><h4><strong>Moving Past the Hard Sell</strong></h4><p>Going back to the hard sell (<em>which was “hey, we’re going to go all in on protobuf”</em>), and making it less of a hard sell takes some careful planning. Remember, people are most often afraid of what they don’t know, and unless people really trust you, they’ll need to see “proof” (which we had through the clickstream project), and they’ll also be on the lookout for an “easy-button” unless they are convinced they can continue to deliver while undergoing a complete re-architecture.</p><p>One of the myths we found floating around the rumor-mill was that “Protobuf is too rigid to work with due to its immutability”. Luckily, we could easily squash that myth since while protobuf is immutable, <em>it is not immutable in the way we think</em> of sealed or final objects (in software); in fact, protobuf definitions can and will change over time. <strong>The big difference is that the specification accounts for how changes can be made while ensuring backwards compatibility with prior versions of said message</strong>.</p><p>Knowing that we could achieve backwards-compatibility and gain compile-time guarantees, along with the time-saving benefits of code-generation, proved to be the lightbulb moment most engineers were waiting for. Most of them had been dealing with broken data promises and their associated outages (botched deployments, 400s, 500s, broken data pipelines) for years and were happy to swim to shore. <strong>An easy sell for engineers is always “you’ll do less work”.</strong></p><blockquote><strong><em>Regarding message and service immutability<br></em></strong>A protobuf definition is immutable “at the point in time” that it is compiled. It is easier to think about “this point in time” in terms of a version. We simply say “for a given version of a protobuf message, it is immutable” — <strong>but how do we account for a given version?</strong></blockquote><h4>Lessons Learned: Versioning Strategy for our Protobuf</h4><p><em>Meanwhile, back in Nike-land</em>. We ran into the problem of versioning our protobuf early on in the project. What’s the best pattern, especially when we were compiling many <strong>“artifacts”</strong> for various platforms and languages? We asked ourselves:</p><ul><li>Do we cut a new release on every change?</li><li>What if changes occur in places we don’t care about?</li><li>How do we version and release specific sub-sets of messages?</li><li>What about releases?
Do we wait to release a new “complete” snapshot using git-tags following our <a href="https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases">standard release versioning</a>?</li><li>Is there a better way to share common messages so we don’t fall into old habits (like denormalizing our messages vs normalizing via composition)?</li></ul><p><em>As you can tell there was a lot on our minds.</em></p><p>Initially, we followed the data-domain pattern (a play on the domain-driven modeling approach) within a large monorepo. Since we were using a <a href="https://www.atlassian.com/git/tutorials/monorepos">monorepo</a>, we could create shared resources (using local references), and then our domain-specific resources (our protobuf messages and connect services) could still benefit from sandboxing (within a given domain) — while still sharing the same parent directories. In essence, we could take <em>the one ring to rule them all approach</em> and hope for the best.</p><pre>proto/<br>  common/<br>    product/<br>      item.proto<br>      sku.proto<br>    membership/<br>      user.proto<br>      ...<br>    order/<br>      order.proto<br>      return.proto<br>    iso/<br>      country.proto<br>      language.proto<br>    ...<br>  domain_a/<br>  domain_b/<br>  ...<br>  domain_n/</pre><p><strong>The Problem</strong>: Monorepos tend to start off with the best of intentions, but depending on the size and scale of the enterprise, they can become unruly. When we crossed the threshold of around <strong>1000 protobuf types </strong>— types being messages, enums, and RPCs—we realized we were headed for a dead end.</p><p>Additionally, the problem of the monorepo isn’t just tech related; scaling the number of collaborators (around 20–30 people), who all have their own needs and deadlines, is a chore in and of itself. <em>We needed a way to share versioned artifacts without requiring everything to live under the same roof.</em></p><p>We had to find an easier way. We needed decentralized modules. Really, we simply needed a solution that understood the human-scalability concerns.</p><p><strong>The Solution:</strong> <a href="https://buf.build/product/bsr">The Buf Schema Registry</a> (BSR).</p><blockquote>BSR is a single pane of glass for all things within the protobuf ecosystem. It proved to be a wonderful tool for coordination and collaboration as well. This means that engineers across our API services, data domains, and analytic domains could come together in a central place — while still operating with autonomy.</blockquote><h4>The Swiss-Army Knife that is the Buf Schema Registry</h4><p>To fix the issues of scaling out our protobuf-first strategy, across both humans and organizational domains, we (Nike) ended up solidifying a longer-term partnership with Buf.</p><p>It was a no-brainer to move forward because of the Buf Schema Registry (BSR), but around the same time that we were going through procurement, there was a new product offering just on the horizon — <a href="https://buf.build/product/bufstream">Bufstream</a> (<em>we’ll get to that at the end of this post</em>), and that was of great interest to our plan for protobuf-based domination.</p><p>BSR provided a way for us to create versioned <strong>remote modules</strong>, rather than continuing to maintain a centralized monorepo. Adoption of BSR helped solve the problem of too many cooks in the mono-kitchen almost overnight.
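</p><p>To make this concrete, here is a minimal sketch of what a per-domain buf.yaml (using Buf’s v2 config format) can look like when it names its own module and declares a dependency on a shared module in the BSR. The module and organization names below are illustrative assumptions, not our actual setup.</p><pre># Sketch of a per-domain buf.yaml (v2); module/org names are illustrative<br>version: v2<br>modules:<br>  - path: proto<br>    name: buf.build/acme/domain-a<br>deps:<br>  # shared definitions (product, ISO codes, etc.) pulled from the BSR<br>  - buf.build/acme/common<br>lint:<br>  use:<br>    - STANDARD<br>breaking:<br>  use:<br>    - FILE</pre><p>The point of the sketch is the split itself: each domain owns its own module, while shared definitions live behind a versioned dependency instead of a copied directory.</p><p>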
Now teams could operate within organizations independently (similar to how you would utilize organizations within GitHub), and we could all still share common modules through a “common” organization within specific collections (repos) — <em>for example, if you think about common IETF standards like ISO codes, these can be shared as a collection of enums vs “strings”, which means no more typos.</em></p><p>All of the tricky dependency management is handled by Buf automatically through the <a href="https://buf.build/product/cli">Buf CLI</a> with assistance from the <a href="https://buf.build/docs/configuration/v2/buf-yaml/#deps">buf.yaml</a> (which updates your buf.lock). When it comes time to create “versioned” resources, this is done through “labels” (in a similar way to using git tags for releases) — bound to a specific git commit (at least how we’d recommend doing it).<em> See Figure 1–1 which showcases versioning in action.</em></p><figure><img alt="Shows a User Interface (UI) showcasing the documentation for Buf’s protovalidate. The tab on the web page is open to the “docs” tab, and the README file of the project is rendered beautifully." src="https://cdn-images-1.medium.com/max/1024/1*oMwKxT38rUaMti_gTezj4w.png" /><figcaption>Figure 1–1: A view of the “docs” tab within the public Buf Schema Registry. <a href="https://buf.build/explore">Explore Here</a>.</figcaption></figure><p>With the ability to version our protobuf modules in a standardized way, including the beautiful addition of remote dependency tracking — all in a decentralized way — we were finally cooking with fire.</p><p><em>But we still needed to figure out the best pattern for releasing our artifacts</em>. Luckily, this was already a feature baked into the Buf Schema Registry.</p><h4>Light Bulb Moment: Server-Side Artifact Generation</h4><p>One of the more amazing features of the BSR is the ability to lazily generate artifacts. In Figure 1–2, you’ll see the SDK tab highlighted. The view provides you (the engineer) with simple directions to fetch the specific “versioned” resources you need.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FsYFvXhkvCRllP0F34l_sw.png" /><figcaption>Figure 1–2: Utilizing Lazy SDK Generation hints via the SDK Tab.</figcaption></figure><p>This feature really cuts down on the level of effort for working with globally distributed teams, and we could now do more work asynchronously. Teams no longer needed to reach out in Slack to understand which version was safe to deploy for a given artifact. Gone were the useless meetings to coordinate releases across our API services. We now had enterprise-wide invariants — <em>we could finally trust-first, ask questions later!</em></p><h4>Lesson Learned: GitHub Actions Rule</h4><p>Utilizing protobuf and building gRPC services (via <a href="https://connectrpc.com/">connect</a>) was a huge win. The icing on the cake was the power of BSR’s <a href="https://buf.build/docs/bsr/admin/instance/bot-users/">bot-users</a> and Buf’s official GitHub Action (<a href="https://github.com/marketplace/actions/buf-action">buf-action</a>).</p><blockquote><strong><em>What are Bot Users?</em></strong><br>If you are unfamiliar with bot-users, they are also referred to as “headless” users, or service principals. In general, this is a user that will never need to “see” the UI, and they are typically used for CI/CD or for other automations.</blockquote><p>We realized we could create a new bot user for each onboarded organization within our enterprise BSR.
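</p><p>As a rough sketch of what that enables (Buf’s official buf-action is real, but the trigger, job layout, and secret name below are assumptions rather than our actual pipeline), a per-repo workflow can hand the bot-user token to the action and let it handle linting, breaking-change checks, and pushes to the registry:</p><pre># Illustrative GitHub Actions workflow; secret and trigger names are assumptions<br>name: buf<br>on:<br>  push:<br>    branches: [main]<br>  pull_request:<br>jobs:<br>  buf:<br>    runs-on: ubuntu-latest<br>    steps:<br>      - uses: actions/checkout@v4<br>      # the organization&#39;s bot-user token, stored as a repository secret<br>      - uses: bufbuild/buf-action@v1<br>        with:<br>          token: ${{ secrets.BUF_BOT_TOKEN }}</pre><p>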
We could then utilize GitHub Actions (shown in Figure 1–3) to enforce “stage&gt;pr&gt;push” patterns in a unified way.</p><p>Our custom GitHub workflow included a step that would fetch the bot-user token, and afterwards all commits to the BSR would be signed by the bot-user of a given organization. While it can be nice to “test” things out locally and push experiments (as a human), in the long run we turned off “write” access for non-owners within the BSR to prevent garbage from littering the environment. This strategy may not be the correct way of working for your company, but for us the additional safeguard meant we could <em>enforce our ways of working </em>and sleep well at night.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uNDM5lLnwa2bRLkDBMYIDQ.png" /><figcaption>Figure 1–3. Buf’s official GitHub Action.</figcaption></figure><p>Now that we had a strategy for how we’d roll out protobuf across the company, we could start to build additional layers onto this solid foundation. This was a good time for us to step back and consider why things were going well.</p><h4>Project Retrospective</h4><p>We started small. Proved that our ideas would work and scale. Often it feels like you are moving at snail-speed getting everything required into place, but this is more of a time dilation. The momentum of a new project brings with it so many novel ideas, but we had to remember to stick to milestones and finish what we started. Then we could add additional layers and complexity with the trust provided by our a priori actions and successes.</p><p>The other thing that we did was <a href="https://engineering.atspotify.com/2020/08/how-we-use-golden-paths-to-solve-fragmentation-in-our-software-ecosystem">pave a golden path</a> for all teams that would be following in our footsteps. This meant we had already paid down the system-wide complexity and figured out ways of working that evolved through real trial and error, and we did so with mission-critical focus and concentration.</p><h3>Semantic Streaming with Bufstream</h3><p>In part one of this series (and highlighted here in Figure 1–4), I talked about our ingestion architecture and how we built our gateway ingestion services to receive clickstream events, validate these event streams, and either a) emit error messages back to the client, or b) write to Kafka to be consumed within Databricks.</p><figure><img alt="Shows a data ingestion architectural diagram with client SDKs on the left. Event data is emitted via a Gateway API Service and each record is written to Kafka for consumption downstream via Databricks." src="https://cdn-images-1.medium.com/max/1024/1*OdPW23rE7HdnY293Skx6xg.png" /><figcaption>Figure 1–4. Our Original Ingestion Architecture</figcaption></figure><p>This architecture was our starting point. Given the goal of having a protobuf-first data strategy, we realized fairly quickly that we were going to run into additional points of contention even with our generic “protobuf-aware” PySpark applications. The overhead alone of running full-time streaming applications, or coordinating scheduled jobs using structured streaming and trigger(availableNow=True), meant that we’d need to bring best-in-class automation to the table (literally).</p><p>Luckily for us, we had the <strong>Bufstream ace-up-our-sleeve</strong>. While we are currently working to scale out this new architecture, the early results are looking really promising.
For us, reducing the overhead required for ingesting from N Kafka topics and writing all records to N lakehouse tables means we’ll reduce our ingestion complexity by N. This is huge, even if it doesn’t seem that way.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s8Tm2nFFUYnVnb4VgXW5hw.png" /><figcaption>Figure 1–5. Future-Facing Ingestion Architecture</figcaption></figure><p>This is all made possible given that <a href="https://buf.build/docs/bufstream/iceberg/reference/">Bufstream can do zero-copy writes</a> into the Iceberg Lakehouse format. Given the acquisition of <a href="https://www.tabular.io/">Tabular</a> by Databricks, Apache Iceberg was now a first-class citizen within Unity Catalog for our Databricks workspace(s), which means we wouldn’t need to do anything complex other than write our analytical data to a Bufstream topic. Now that <a href="https://buf.build/blog/buf-announces-support-for-databricks-managed-iceberg-tables">Buf provides support for Unity Catalog managed Iceberg tables</a>, this is a simple win — <em>take the win!</em> The hard work is now done behind the scenes for us, all within our own VPC.</p><p>I’ll be writing more in depth about Bufstream, so look out for future posts.</p><h3>The Future is Bright (Nike x Buf)</h3><p>Now that Nike has partnered with Buf, and after the process of integrating BSR, nailing down the golden paths for managing our versioned protobuf, figuring out the right set of GitHub Actions to simplify deployment of our Connect RPC services, and beginning the journey towards simplifying our Kafka footprint using Bufstream — I can say the future is looking extremely bright.</p><p>The one last thing for this post is a quote from one of my close friends at Nike — he said “before using Buf’s tooling we (Nike) were looking at months to do basic changes to our existing APIs, and now that we’ve nailed down the right patterns, we are looking at DAYS to do the same work.”</p><blockquote>before using Buf’s tooling we (Nike) were looking at months to do basic changes to our existing APIs, and now that we’ve nailed down the right patterns, we are looking at DAYS to do the same work.</blockquote><p>Now, when asked if you prefer <em>a) object-flexibility, or mutability</em> or <em>b) strict type-safety, object-level governance, or message-level immutability — </em>the answer is clear. Choose protobuf and the message-level invariants that you will grow to trust.</p><p>If you missed Part 1 of this series, check it out below.</p><p><a href="https://newfrontcreative.medium.com/escaping-the-void-of-the-data-abyss-337770a39fbc">Escaping the Void of the Data Abyss</a></p><p>* impossible — you can force a breaking change by turning off protovalidate; luckily, with <a href="https://buf.build/docs/bsr/checks/">BSR you can enforce breaking-change detection</a> at the registry across the entire enterprise.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6bf2d1e6e34a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Escaping the Void of the Data Abyss]]></title>
            <link>https://newfrontcreative.medium.com/escaping-the-void-of-the-data-abyss-337770a39fbc?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/337770a39fbc</guid>
            <category><![CDATA[protobuf]]></category>
            <category><![CDATA[data-modeling]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[grpc]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Mon, 30 Jun 2025 21:31:30 GMT</pubDate>
            <atom:updated>2025-08-26T21:33:36.251Z</atom:updated>
            <content:encoded><![CDATA[<h4>Leaning on the Guiding Lights of Structured Data and Protocol-Level Invariants to Avoid the Evil Clutches of Bad Data and Rewrite the Sins of the Past.</h4><figure><img alt="A monster in the clouds appears to be chasing a classic car against a background of storm clouds in red and gray" src="https://cdn-images-1.medium.com/max/1024/1*iHRVD1H4t-D1mjRZyFPhPg.png" /><figcaption>Does your Data Scare You? Image By Author via <a href="https://www.midjourney.com/jobs/cedeedb3-e66b-43b1-9ac6-ab68f22b27cb?index=0">Midjourney</a>.</figcaption></figure><p>If you’re anything like me you don’t frighten easily. I love horror movies. I truly appreciate the clever use of foreshadowing, especially when we all know what is clearly coming next. Being a member of this silent audience — sitting there on the edge of our seats—provides us with the excitement of premonition (from the safety of our seats) while the movie takes us right around <em>the corner</em>. <em>What comes next is literally why we’re here!</em></p><h3>Does Your Data Scare You?</h3><p>Jump scares and coming face to face with monsters isn’t what we crave though when it comes to our data. It just happens to be one of the realities of working at large enterprises. Comically bad data can make you feel just as helpless as <em>the damsel in distress</em>, and unfortunately leaves you paralyzed simply being dragged along for the ride just like the audience in the movie theater — <em>even though we already understand what is probably right around the corner and in this case it is broken data promises and sadness</em>.</p><p><strong>So why does our data scare us?</strong></p><p>It doesn’t have to. In fact, just like the unsung hero in most horror movies, it is up to us to flip the script and <em>change the outcome of the story</em>. What comes next is in your hands (dear audience) but as a lifeline, I’ll tell you a recent story from Nike in hopes that you too can escape the event horizon of the data abyss!</p><h3>Steering around Data Swamps and Preventing the Impending Datapocalypse</h3><p>I joined <a href="https://www.nike.com/">Nike</a> in 2022 and my mission was fairly “<em>simple”</em> — I was being hired to <strong>change the data culture</strong> and find creative ways to fix the monstrous problems we had with <em>“all things enterprise data and analytics”</em>. Simple. Right? Nope.</p><p>I had unwittingly just stumbled into a horror movie set in a scary world of hopes and *poorly executed data pipeline dreams — luckily this wasn’t my first time <a href="https://medium.com/97-things/preventing-the-data-lake-abyss-9a44e1f9f628">escaping the abyss</a>. Just like <a href="https://en.wikipedia.org/wiki/Van_Helsing_(film)">Van Helsing</a>, I too had tools at my disposal and prior success to help me on my way. What I needed first was to assemble a crew, and materialize a mission that could set off a series of events to rid the world of the data swamp monsters lurking in the data deep—once and for all.</p><p>Oh. But where to get started?</p><h4>Focusing on Mission Criticality and Solving the Hard Problems</h4><p>It’s easy to do work in the shadows. Alone. 
It is much harder to get work accomplished with the blinding light of <a href="https://en.wikipedia.org/wiki/Sauron">Sauron’s eye</a> watching your every move — all the while building trust (at a new company) and assembling a core team as an IC (<a href="https://www.linkedin.com/in/scotthaines/">distinguished software engineer</a>).</p><p>When you’re <em>working to disrupt the status quo</em> (highest probability of why we were here), there are people who want you to succeed — and then there is everyone else. Luckily for me, I had the support of my boss (head of ED&amp;AI) on my side and found a crew of folk who wanted to work hard and solve complex problems (across multiple organizations in a matrix company… scary!).</p><p>In order for us to <strong><em>solve our data problems</em></strong> (which I’ll go into next), we were going to need to roll back the clocks to <em>a time before the worst of the bad decisions</em> were made, and then rewrite the playbook—and, as we also found out later, <em>the entire analytics and data serving stack</em>.</p><p>We were going to once again need to lean heavily on my best friend <a href="https://protobuf.dev/">Protobuf</a>, but this time we were going to pick up some new advanced tools along the way <em>(foreshadowing!)</em>.</p><blockquote>If you’ve ever heard me talk before, you’ve probably heard me <a href="https://www.slideshare.net/slideshow/the-happy-marriage-of-redis-and-protobuf-by-scott-haines-of-twilio-redis-day-seattle-2020/225949350">talk about my love of all things protobuf</a>. If not, check out the video from <a href="https://youtu.be/HBtScr7MQxU?si=oj6VM4XXsnLUx6A8">RedisConf 2020</a>.</blockquote><p>So about those data problems!</p><h3>The Hard Problem: Nike’s Clickstream</h3><p>After a few months at Nike, something became apparent (seemingly bubbling to the surface on its own). There was something wrong with the clickstream. This wasn’t like a simple “oops”, this was so much more than “ooo-oo-oooops”. This was sinister and insidious like a good haunted house story.</p><p>Something was systemically wrong and oh, this is it (I thought to myself), this is why I’m here. I found my mission. I figured out where I could lean in and create meaningful change. But what was so wrong, you might be asking — because you don’t know the story like I do.</p><figure><img alt="Creepy abandoned house in a barren field with an oak tree and face hiding in the background. There is some ominous dread here. Scary" src="https://cdn-images-1.medium.com/max/1024/1*6DeyshzKTuMHfdIva64HIQ.png" /><figcaption>Not the Data Lakehouse you were Expecting! Image by Author via <a href="https://www.midjourney.com/jobs/a2f8930d-e4b9-4b53-a973-1c871873de80?index=0">Midjourney</a>.</figcaption></figure><p>Oftentimes what started out with the best of intentions — essentially all of our clickstream events, our hand-tuned SDKs, and even our end-to-end data delivery strategy — was beginning to crack, crumble, and fall apart. There were even early warning signs of complete system-wide collapse (things had grown too big for the architecture) — much like the abandoned “maybe haunted house” in the picture above—things were not looking too good. We had to make a call. Do we Fix It (and can we)? Or do we abandon ship (and essentially burn things to the ground)?
This is a difficult question that deserves some truly objective reasoning.</p><h4>Retrofit or Rebuild?</h4><p>There are pros and cons when using JSON for data-intensive applications — mostly cons though when it comes to reliable streaming of mission-critical data. In this case, the mutability of JSON data was causing system-wide issues on the clickstream. So the question was still there though — do we retrofit or rebuild? <em>And can we fix the sins of the past?</em></p><blockquote>If you have some time on your hands, you could dig further into a longer post on analytical stream processing. It goes into the why behind “why not JSON”.</blockquote><p><a href="https://medium.com/data-science/a-modest-introduction-to-analytical-stream-processing-db58b3694263">A Modest Introduction to Analytical Stream Processing</a></p><p>Initially, the idea was that we could just retrofit the system <em>since after all it was built with the best of intentions</em>. So the team and I dug in, and we uncovered some more ugly truths.</p><h4>The Sins of the Past</h4><p>Back in the day (sometime around 2014-ish), a consulting company (one of the big ones with gigantic market cap) sold Nike on the idea of using “TSV” (yep, tab-separated values) files to encapsulate individual events for the clickstream. This solution was created to solve the problem of long lead times for changes on the clickstream (events specifically). The main problem identified by “said consulting company” was the delta between modeling an event, and then instrumenting said event on our desktop and mobile sites (and applications) with proper testing — it was simply taking “longer than expected”.</p><p><em>So people nodded along, and “said consulting company” was paid to implement the solution, train as many teams as possible across Nike to use their complex solution, and then they left their solution in “our very capable hands”.</em></p><p>The solution was that any product manager (PM) could edit a (tsv) file, and then “it” would be parsed and compiled into <a href="https://json-schema.org/">JSON-SCHEMA</a>. This was the solution (plus a read-only UI) so that teams across Nike could then utilize the event schemas (after implementing their own interpretation of what each event “required”) for any given event. <em>Sounds okay, right?</em></p><p>In theory, this wasn’t a bad idea. In practice, there were problems that arose quickly — like for example when “a typo” had the amazing effect of removing a previously “generated” value from <a href="https://json-schema.org/understanding-json-schema/reference/enum">json-schema’s enum</a>. To paint a clearer picture, this little typo had the cascading power to break all downstream data consumers, leading to broken ML models, dashboards, and reports. In short, the butterfly effect in the data world, because there was no testing for breaking changes (<em>more foreshadowing!</em>) and zero object-level governance or semantic validation at play.</p><blockquote>When faced with problems, what is better than adding additional complexity?</blockquote><p>Due to symptoms of the <a href="https://thedecisionlab.com/biases/the-sunk-cost-fallacy">sunk cost fallacy</a>, an even wilder problem was created on top of a bad solution.
Rather than continuing to utilize unified events, communicating between teams, and testing changes per release using our shiny new (fragile) system, and ultimately “sharing” reusable events across all experiences (nike.com, the Nike app on iOS/Android, and others), the product managers decided that they could simply copy (or fork) events so they could “own” their own events and not bother with a unified strategy. So where we once had 1 event, we now had between 9 and 36 versions of each event (<em>yes the math doesn’t make sense, but remember the horror theme?</em>).</p><h4>Love it or Hate It. Object-Level Governance is Incredibly Important</h4><p>Given the strain on the system, the years of events being forked (vs reused or composed), and the complexity in painting a unified analytical story across the various Nike experience event streams — we were quickly coming to a difficult realization. As you probably guessed, the <strong>retrofit was off the table</strong>.</p><p>After all, about 8 years had already gone by at this point in time, and the cracks in the system were essentially the entire system. So we decided that the best path forward was to do a rebuild. But we were going to do it right this time. <em>Even if it would kill us</em>. <em>Mission Accepted</em>.</p><p>Because we had loose definitions of events — given the type-free nature of TSV-based definitions — we effectively had <strong>zero object-level governance and could kiss semantic validation goodbye</strong>. This meant that there was no system in place to “modify” an event in a standard way. If you think about classic relational databases (OLTP), each table has a schema. If you want to modify the table’s schema, then the protocol-level invariants (promises) enforce “how” a change can occur and “how&gt;what&gt;when” the change will affect said given table.</p><blockquote>When it comes to streaming systems, like the classic event stream (clickstream), you really want to abide by similar rules for schema evolution (as with classic relational databases) even though it may feel <em>like you are moving a little slow (that will change)</em>. This includes rules about <em>type-safety</em>, field-level position (or columnar position), and how nested objects will be encoded — as well as how backwards and forwards compatibility will be achieved and even enforced.</blockquote><p>In order to provide event-level invariants, we’d need to lean on a type-safe message format that supports backwards and forwards compatibility — which again means leaning on <a href="https://protobuf.dev/">protocol buffers</a>.</p><h3>Fixing Nike’s Clickstream and Behavioral Analytics</h3><p>We had a mission. We had purpose. What we didn’t have yet was a clear plan of attack — we just knew we needed to <em>fix the sins of the past</em>.</p><h4>Establishing a Plan of Attack</h4><p>We had a kick-off brainstorming meeting between the small core team of engineers — there were 3 of us, let’s call the other two Doug and Christian (to keep them anonymous), representing Data Ingestion &amp; Platform, Apps &amp; Experiences, as well as Data, Analytics, and AI. The goal of the session was to come up with a set of standards, system expectations, as well as ways of working that we could stake our reputation on.
Given we’d be doing a lot of work asynchronously and on a limited timeline (3 months to prove things would work), ways of working were not optional — they were essential.</p><p><strong>The end result of our time, coffee, and some beers was the following:</strong></p><ol><li><strong><em>Standardize on Protobuf</em></strong>. Not simply for API services, but truly end to end across analytics and ML as well. <strong>End-to-End Protobuf</strong> was not a nice-to-have, it was <em>the way</em>. We’d need a way to ensure our protobuf was written in a standard way across the company, and we’d need to write some test harnesses to ensure backwards compatibility (see zero breaking changes below). <em>JSON was the reason the initial system fell apart — given its mutability and lax standards across the company, we wouldn’t fall prey to “loosely” structured data</em>.</li><li><strong><em>Lean on Code Generation &amp; Semantic Validation</em></strong>. Ridding ourselves of the long lead time for change in the system would mean we’d need the ability to cross-compile our event definitions so we’d have no excuse not to “reuse” and “compose” shared events in a much more governed way. In addition, we’d need a way to compile our definition of “correct” for each event to reduce the time taken from event ingestion to insight — we’d need some kind of compiled validation logic. <em>This would flip the script on the prior system where new events would take months to finally release across all experiences</em>.</li><li><strong><em>gRPC for our SDKs</em></strong>. Taking things one step further, we’d compile down our SDKs utilizing gRPC to prevent the issues we’d encountered with traditional REST for our clickstream. This way, we’d write our interfaces (using IDL for gRPC) and cross-compile for JavaScript, TypeScript, Swift, Kotlin, and Go. <em>Reducing the time required to implement new events.</em></li><li><strong><em>Zero Breaking Changes</em></strong>. We’d guarantee that all events would always be backwards compatible within a major version (1.0.0 vs 2.0.0). We would only make a breaking change when it mattered. <em>This was in direct response to the lack of governance in the prior system. Trust is built when things just work release after release</em>.</li><li><strong><em>Automate Event Ingestion to Databricks</em></strong>. We were using Databricks for our Lakehouse environment. One of the large pain points of the prior system was scaling out the JSON-based event streams — due to zero semantic validation and no governance, the streams would be corrupt more often than not. We’d fix that issue by automating the ingestion of our event streams, leaning on a priori semantic validation, to ensure that only trustworthy data would be appended to our tables and made available via <a href="https://www.databricks.com/product/unity-catalog">Unity Catalog</a>.</li></ol><p>So in short, we’d agreed to <strong>1)</strong> standardize on protobuf, <strong>2)</strong> implement code generation and semantic validation, <strong>3)</strong> utilize gRPC for our SDKs to simplify the exchange of events, <strong>4)</strong> provide compile-time guarantees and a mission of zero breaking changes, and lastly, <strong>5)</strong> we’d automate the last-mile ingestion of event data by leaning on protocol-level invariants, and high-trust built on semantic event validation at the ingestion edge.
Now the hard work would begin.</p><h3>From Mission Briefing to Production</h3><p>During the early research phase, we stumbled upon a company called <a href="https://buf.build/">Buf</a> — by accident really. We’d been digging into some newer changes enabling native protobuf support for Apache Spark and found a file called <a href="https://github.com/apache/spark/blob/v4.0.0/core/src/main/protobuf/buf.yaml">buf.yaml</a> while digging into the depths of <a href="https://spark.apache.org/docs/latest/spark-connect-overview.html">spark-connect</a> (Spark’s gRPC client).</p><blockquote>Sometimes things in life are simply serendipitous, and we couldn’t have hoped for a better find this early on in the project.</blockquote><h4>Simplified Standardization with Protobuf</h4><p>The <a href="https://buf.build/docs/configuration/v2/buf-yaml/"><strong>buf.yaml</strong></a> turned out to be a specification that worked along with the <a href="https://buf.build/docs/cli/">Buf CLI</a> — that checked off two of the known unknowns from our list. First, <em>how we’d simplify breaking change detection</em> — we learned <a href="https://buf.build/docs/reference/cli/buf/breaking/">buf breaking</a> could be used for just that (this was clutch) — and secondly, <em>how we’d provide capabilities to ensure our protobuf messages were written in a standard way</em> — this ended up being the secret behind <a href="https://buf.build/docs/reference/cli/buf/lint/">buf lint</a>.</p><pre>version: v1<br>breaking:<br>  use:<br>    - FILE<br>lint:<br>  use:<br>    - BASIC</pre><p><a href="https://buf.build/">Buf</a></p><p>So we’d <em>accidentally discovered amazing tooling</em> that we could use to drastically simplify how we’d manage our protobuf definitions over time. We’d also discovered a way to provide protobuf message-level semantic validation at the Nike edge using <a href="https://buf.build/bufbuild/protovalidate">protovalidate</a>.</p><h4>Semantic Validation with Protovalidate</h4><p>What is protovalidate, you ask? It provides a critical missing component to the protobuf specification — runtime field-level semantic validation. If you’ve used protobuf before, you are probably familiar with <strong>end-of-wire</strong> exceptions. These occur while deserializing a binary payload back into a concrete message and are more often than not caused by changing field types in a non-backwards-compatible way. I bring this up since an end-of-wire exception points to exceptions due to unknown changes in the “shape” of the protobuf message — but with perfect type-safety the message can still be invalid due to missing “fields”.</p><blockquote>If you want to understand protobuf best practices, read the <a href="https://protobuf.dev/best-practices/dos-donts/">protobuf dos/don’ts</a>. For more on protovalidate, read on.</blockquote><p>If you think about API-level contracts, some fields are marked as “required” and others are marked as “optional”. This provides the end-user with an understanding of which fields they can rely on, vs which ones may or may not exist.
Consistent values within the “required” fields (as well as the optional when they show up) is critical not just for APIs, but for event streams.</p><p>With protovalidate, we simply annotate our message (shown below), and we can compile-down the validation rules to then use for runtime semantic validation checks.</p><pre>message Order {<br>  // Each Order takes place at a point in Time<br>  google.protobuf.Timestamp order_created = 1 [(buf.validate.field).cel = {<br>    id: &quot;not_from_the_future&quot;,<br>    message: &quot;we are not ready to offer scheduled orders. Maybe in the future&quot;,<br>    // Ensure that the server&#39;s local time (utc) is used as a gating mechanism for sane timestamps<br>    expression: &quot;this &lt;= now&quot;<br>  }];<br>  // An Order can be purchased at a CoffeeCo Location, otherwise where is the coffee going to be made<br>  // It is true that the Store could be online, but that makes this reference more complicated than necessary<br>  coffeeco.v1.Store purchased_at = 2 [(buf.validate.field).required = true];<br>  // A Customer can Order from our Coffee Location<br>  coffeeco.v1.Customer customer = 3 [(buf.validate.field).required = true];<br>  // Each Order may have one or more items. We cannot have an Order without something to Purchase<br>  repeated coffeeco.v1.Product items = 4 [(buf.validate.field).required = true];<br>  // Each Order has a monetary value<br>  coffeeco.v1.Total total = 5 [(buf.validate.field).required = true];<br>}</pre><p>The Order above can now be validated at runtime.</p><pre>if err := s.Validator.Validate(req.Msg); err != nil {<br>  log.Println(&quot;validation failed:&quot;, err)<br>  response = err.Error()<br>} else {<br>  // do something with the valid protobuf object<br>} </pre><p>Finally, here is a longer example that showcases the full validation flow (at the go service level).</p><pre>func (s *CoffeeserviceServer) CoffeeOrder(ctx context.Context,<br> req *connect.Request[coffeeservicev1.CoffeeOrderRequest]) (*connect.Response[coffeeservicev1.CoffeeOrderResponse], error) {<br> log.Println(&quot;Request Headers: &quot;, req.Header())<br> var order = req.Msg.Order<br> <br> response := &quot;&quot;<br><br> if err := s.Validator.Validate(req.Msg); err != nil {<br>  log.Println(&quot;validation failed:&quot;, err)<br>  response = err.Error()<br> } else {<br>  data, err := proto.Marshal(order)<br>  if err != nil {<br>   log.Println(err)<br>  }<br>  err = s.Kafka.Produce(&amp;kafka.Message{<br>   TopicPartition: kafka.TopicPartition{<br>    Topic:     &amp;s.TopicName,<br>    Partition: kafka.PartitionAny,<br>   },<br>   Key:   []byte(order.Customer.Name),<br>   Value: data,<br>  }, nil)<br>  if err != nil {<br>   log.Println(&quot;Failed to Send Order : stream/coffeeco.v1.orders&quot;)<br>   response = fmt.Sprintf(&quot;We Failed to Send your Order, %s\n&quot;, order.Customer.Name)<br>  } else {<br>   log.Println(&quot;Order Published Successfully : stream/coffeeco.v1.orders&quot;)<br>   response = fmt.Sprintf(&quot;Thanks for the Order, %s\n&quot;, order.Customer.Name)<br>  }<br> }<br> log.Println(&quot;New Order: &quot;, order)<br> res := connect.NewResponse(&amp;coffeeservicev1.CoffeeOrderResponse{<br>  Response: response,<br> })<br> res.Header().Set(&quot;CoffeeService-Version&quot;, &quot;v1&quot;)<br> return res, nil<br>}</pre><p>The service above provides edge-level validation ensuring that any downstream consumer of “the data” can trust that the data being processed has been edge validated. 
While this may not seem like a “big deal”, in practice most data that ends up in the hands of data engineers is required to go through brutal “cleansing” steps in order to provide “non-corrupt” data for further downstream processing. <em>Why reprocess all of your data when you can instead just rely on it being “semantically valid”?</em></p><p>Around the same time, we’d hit a wall with the time-to-first-byte cost of running gRPC across our desktop experiences (due to true page-loads on nike.com vs single-page style application architecture), and while all signs were pointing in the right direction elsewhere, this was crippling. We’d hit a potential pitfall that could break the project. <em>Or so we thought!</em></p><h4>Connect is the Missing Link for Enterprise gRPC</h4><p>During our experiments with protovalidate (originally protoc-gen-validate), we’d discovered the <a href="https://connectrpc.com/">Connect protocol</a> — another hidden treasure in the Buf treasure chest. When we thought things were going to simply fall apart, and when <em>back to square one was not on the menu, </em>we found a lifeline: Connect.</p><blockquote><strong>From the Docs</strong>: Connect is a family of libraries for building browser and gRPC-compatible APIs. If you’re tired of hand-written boilerplate and turned off by massive frameworks, Connect is for you.</blockquote><p>Connect provides gRPC networking using the native networking clients for JavaScript (ECMAScript) and TypeScript (as well as Swift, Kotlin, and Go). This meant that all of our SDKs could lean on gRPC but with faster load times and more native integrations. This meant that we could spin up a Connect session on the desktop (or mobile web) without sacrificing time to first byte, and our analytics events could be emitted using our new “unified” analytics stack.</p><p>Connect was the bridge and glue to “connect” the final set of dots, and we were well on our way to production; there was just one last thing we needed to get done. <em>We needed to make good on our promise to automate the data ingestion pipelines.</em></p><h3>Automating Data Ingestion</h3><p>At this point in our journey, things were smooth sailing and the crew was having a blast. We knew we were onto something great here, and had found an incredible company to partner with — Buf. There was one last thing missing from our initial brainstorming session: we’d need to prove we could automate the ingestion of our clickstream events into Databricks without a full-time dedicated team sitting around. We’d need to make sure the new clickstream could scale out and provide value in a way that most data engineers are familiar with — at the table level.</p><h4>Data Ingestion Architecture</h4><p>The ingestion architecture that we’d come up with wasn’t really novel. It was a take on <a href="https://www.youtube.com/watch?v=_PoftYRXMlQ&amp;list=PLBJeAtDBQjJMANOGMxUd30nSIhdw9bcNt&amp;index=14">some earlier work from my days at Twilio</a>; the big change was in the available tooling and in some of the changes made to Delta Lake.</p><p>I’d been riding the Delta Lake waves since around 2018, and it provided a lot of the same protocol-level invariants that we got from protobuf, just as a columnar-oriented format vs a row-based format.
Given protobuf was natively supported in Apache Spark, and given that Delta Lake provided <a href="https://docs.delta.io/latest/delta-streaming.html">sophisticated streaming capabilities</a>, we could simply leverage the best of both worlds to provide end-to-end streaming.</p><figure><img alt="The Nike Analytics Ingestion Network. Utilizing Buf, BSR, and Connect to provide end to end consistency." src="https://cdn-images-1.medium.com/max/1024/1*goN46s8xt6u9_DGEW9zrOA.png" /><figcaption>End-to-End Data Ingestion: Circa 2024. Image by Author.</figcaption></figure><p>We were so proud of what we had achieved that a good friend of mine, Ashok Singamaneni, and I decided to present our work during the Data+AI Summit.</p><p><em>Given the length of this post, I’d suggest watching the following video if you want to dive deeper into the streaming architecture.</em></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FBbDysZ8lF0Y%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DBbDysZ8lF0Y&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FBbDysZ8lF0Y%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/7763d61508a5293e7d78ec46da858de8/href">https://medium.com/media/7763d61508a5293e7d78ec46da858de8/href</a></iframe><h3>Being Brave and Standing up against the Status Quo</h3><p>It took us three months to get to production. It then took us another year to get everyone on board across the company to ensure the strategy and vision would continue as the number of people on the project grew. The biggest revelation here was that a core group of passionate engineers can accomplish a lot more than people expect, and this doesn’t just happen by burning the midnight oil. In fact, nothing ever happens for free.</p><p>The results were astonishing though. In a matter of months, the core team was able to deliver, and we scaled through Black Friday and Cyber Week with a measly additional operating cost of around $20/day. This was down around 100x from the prior system, and we couldn’t have done any of that without the folks from Buf. They were true silent partners offering advice and support, and helping us along our way. In all reality, if it wasn’t for the tooling provided by Buf (buf generate, buf breaking, buf lint, buf image, and the Connect protocol), <em>Nike would still be suffering under the weight of prior bad decisions and we’d never be able to escape the clutches of bad data decision making</em>.</p><p>Now that we were fully committed to moving forwards, it was time to solidify our partnership with the folks over at Buf, and that will take us to <a href="https://newfrontcreative.medium.com/beyond-the-data-abyss-6bf2d1e6e34a"><strong>part two of this story</strong></a>, where we’ll talk about some of the problems (and solutions) that came about as we started to scale up our protobuf-first strategy.</p><p>Continue on to Part 2.</p><p><a href="https://newfrontcreative.medium.com/beyond-the-data-abyss-6bf2d1e6e34a">Beyond the Data Abyss</a></p><p>*: Not all data pipelines were terrible. There are always teams that are willing to work against the status quo.
There was truthfully a lot of slop though, and that happens over time, and usually comes from a place of neglect and data hoarding.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=337770a39fbc" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Modest Introduction to Analytical Stream Processing]]></title>
            <link>https://medium.com/data-science/a-modest-introduction-to-analytical-stream-processing-db58b3694263?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/db58b3694263</guid>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[distributed-system-design]]></category>
            <category><![CDATA[stream-processing]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Tue, 15 Aug 2023 14:06:57 GMT</pubDate>
            <atom:updated>2024-06-07T04:14:39.974Z</atom:updated>
<content:encoded><![CDATA[<h4>Architectural Foundations for Building Reliable Distributed Systems.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sFy1VJSPiOYaJIgkO92dgQ.png" /><figcaption>Distributed Streaming Data Networks are Unbounded and Growing at Incredible Rates. Image Created via <a href="https://www.midjourney.com/app/jobs/3cf0ebab-3ec4-487d-a0ea-24ef01387eae/">Author’s MidJourney</a></figcaption></figure><h3>Foundations of Stream Processing</h3><p>Foundations are the unshakable, unbreakable base upon which structures are placed. When it comes to building a successful data architecture, the data is the central tenet of the entire system and the principal component of that foundation.</p><blockquote>Given the most common way data flows into our data platforms is through stream processing platforms like <a href="https://kafka.apache.org/">Apache Kafka</a> and <a href="https://pulsar.apache.org/">Apache Pulsar</a>, this post covers this topic area.</blockquote><p>Therefore, it becomes critical to ensure we (as software engineers) provide hygienic capabilities and frictionless guardrails to reduce the problem space related to data quality “after” data has entered into these fast-flowing data networks.</p><p>This means that the API-level contracts surrounding our data’s <em>schema (types, and structure)</em>, field-level <em>availability (nullable, etc)</em>, and field-type <em>validity (expected ranges, etc) </em>become the critical underpinnings of our data foundation, especially given the decentralized, distributed streaming nature of today’s modern data systems.</p><p>However, to get to the point where we can even begin to establish blind-faith — or high-trust data networks — we must first establish intelligent system-level design patterns.</p><h4>Building Reliable Streaming Data Systems</h4><p>As software and data engineers, building reliable data systems is literally our job, and this means data downtime should be measured like any other component of the business. You’ve probably heard of the terms <em>SLAs</em>, <em>SLOs</em>, and <em>SLIs</em> at one point or another. In a nutshell, these acronyms are associated with the <strong>contracts</strong>, <strong>promises</strong>, and <strong>actual measures</strong> by which we grade our end-to-end systems.</p><p>As service owners, we will be held <em>accountable</em> for our own successes and failures, but a little upfront effort, standard metadata, and common standards and best practices can ensure things are running smoothly across the board.</p><p>Additionally, the same metadata can also provide valuable insights into the quality and trust of our data-in-flight, along its journey until it finds its terminal area to rest. The lineage tells a story all on its own.</p><h4>Adopting the Owner’s Mindset</h4><p>For example, <em>Service Level Agreements</em> (SLAs) between your team, or organization, and your customers (both internal and external) are used to create a binding contract with respect to the service you are providing. For data teams, this means identifying and capturing metrics (KPMs — key performance metrics) based on your <em>Service Level Objectives</em> (SLOs). The SLOs are the promises you intend to keep based on your SLAs; this can be anything from a promise of near-perfect (99.999%) service uptime (API or JDBC) to something as simple as a promise of 90-day data retention for a particular dataset.
Lastly, your <em>Service Level Indicators</em> (SLIs) are the proof that you are operating in accordance with the service level contracts and are typically presented in the form of operational analytics (dashboards) or reports.</p><p>Knowing where we want to go can help establish the plan to get there. This journey begins at the inset (or ingest point), and with the data. Specifically, with the formal structure and identity of each data point. Considering the observation that “more and more data is making its way into the data platform through stream processing platforms like Apache Kafka” it helps to have <em>compile time guarantees</em>, <em>backwards compatibility</em>, and <em>fast binary serialization</em> of the data being emitted into these data streams. Data accountability can be a challenge in and of itself. Let’s look at why.</p><h4>Managing Streaming Data Accountability</h4><p>Streaming systems operate 24 hours a day, 7 days a week, and 365 days of the year. This can complicate things if the right up front effort isn’t applied to the problem, and one of the problems that tends to rear its head from time to time is that of corrupt data, aka <em>data problems in flight</em>.</p><h3>Dealing with Data Problems in Flight</h3><p>There are two common ways to reduce data problems in flight. First, you can introduce gatekeepers at the edge of your data network that negotiate and validate data using traditional <em>Application Programming Interfaces</em> (APIs), or as a second option, you can create and compile helper libraries, or Software Development Kits (SDKs), to enforce the data protocols and enable distributed writers (data producers) into your streaming data infrastructure, you can even use both strategies in tandem.</p><h4>Data Gatekeepers</h4><p>The benefit of adding gateway APIs at the edge (in-front) of your data network is that you can enforce <em>authentication</em> (can this system access this API?), <em>authorization</em> (can this system publish data to a specific data stream?), and <em>validation</em> (is this data acceptable or valid?) at the point of data production. The diagram in Figure 1–1 below shows the flow of the data gateway.</p><figure><img alt="A Distributed Systems Architecture showing authentication and authorization layers at a Data Intake Gateway. Flowing from left to right, approved data is published to Apache Kafka for downstream processing" src="https://cdn-images-1.medium.com/max/1024/1*MNn45M-oi7lYKhXA2c119A.png" /><figcaption><strong>Figure 1–1</strong>: A Distributed Systems Architecture showing authentication and authorization layers at a Data Intake Gateway. Flowing from left to right, approved data is published to Apache Kafka for downstream processing. Image Credit by <a href="https://medium.com/u/3b4cab6af83e">Scott Haines</a></figcaption></figure><p>The <strong>data gateway service</strong> acts as the digital gatekeeper (bouncer) to your protected (internal) data network. 
Its main role is to control, limit, and even restrict unauthenticated access at the edge (see APIs/Services in figure 1–1 above), by authorizing which upstream services (or users) are allowed to publish data (commonly handled through the use of service <a href="https://docs.confluent.io/platform/current/kafka/authorization.html">ACLs</a>) coupled with a provided identity (think service identity and access <a href="https://spiffe.io/">IAM</a>, web identity and access <a href="https://jwt.io/">JWT</a>, and our old friend OAuth).</p><p>The core responsibility of the gateway service is to validate inbound data before publishing potentially corrupt or generally bad data. If the gateway is doing its job correctly, only “good” data will make its way along and into the data network, which is the conduit of event and operational data to be digested via Stream Processing. In other words:</p><blockquote>“This means that the upstream system producing data can <em>fail fast</em> when producing data. This stops corrupt data from entering the streaming or stationary data pipelines at the edge of the data network and is a means of establishing a conversation with the producers regarding exactly why, and how, things went wrong in a more automatic way via error codes and helpful messaging.”</blockquote><h4>Using Error Messages to Provide Self-Service Solutions</h4><p>The difference between a good and bad experience comes down to how much effort is required to pivot from bad to good. We’ve all probably worked with, or on, or heard of, services that just fail with no rhyme or reason (null pointer exception throws random 500).</p><p>For establishing basic trust, a little bit goes a long way. For example, getting back an HTTP 400 from an API endpoint with the following message body (seen below)</p><pre>{<br>  &quot;error&quot;: {<br>    &quot;code&quot;: 400,<br>    &quot;message&quot;: &quot;The event data is missing the userId, and the timestamp is invalid (expected a string with ISO8601 formatting). Please view the docs at http://coffeeco.com/docs/apis/customer/order#required-fields to adjust the payload.&quot;  <br>  }<br>}</pre><p>provides a reason for the 400, and empowers engineers sending data to us (as the service owners) to fix a problem without setting up a meeting, blowing up the pager, or hitting up everyone on Slack. When you can, remember that everyone is human, and we love closed-loop systems!</p><h4>Pros and Cons of the API for Data</h4><p>This API approach has its pros and cons.</p><p>The pros are that most programming languages work out of the box with HTTP (or HTTP/2) transport protocols — or with the addition of a tiny library — and JSON data is just about as universal a data exchange format as they come these days.</p><p>On the flip side (cons), one can argue that for any new data domain, there is yet another service to write and manage, and without some form of API automation, or adherence to an open specification like <a href="https://spec.openapis.org/oas/latest.html#format">OpenAPI</a>, each new API route (endpoint) ends up taking more time than necessary.</p><blockquote>In many cases, failure to provide updates to data ingestion APIs in a “timely” fashion, or compounding issues with scaling and/or API downtime, random failures, or just people not communicating, provides the necessary rationale for folks to bypass the “stupid” API, and instead attempt to directly publish event data to Kafka. 
While APIs can feel like they are getting in the way, there is a strong argument for keeping a common gatekeeper, especially after data quality problems like corrupt events, or accidentally mixed events, begin to destabilize the streaming dream.</blockquote><p>To flip this problem on its head (and remove it almost entirely), good documentation, change management (CI/CD), and general software development hygiene including actual unit and integration testing — enable fast feature and iteration cycles that don’t reduce trust.</p><blockquote>Ideally, the data itself (schema / format) could dictate the rules of their own data level contract by enabling field level validation (predicates), producing helpful error messages, and acting in its own self-interest. Hey, with a little route or data level metadata, and some creative thinking, the API could automatically generate self-defining routes and behavior.</blockquote><p>Lastly, gateway APIs can be seen as centralized troublemakers as each failure by an upstream system to emit valid data (eg. blocked by the gatekeeper) causes valuable information (event data, metrics) to be dropped on the floor. <em>The problem of blame here also tends to go both ways</em>, as a bad deployment of the gatekeeper can blind an upstream system that isn’t setup to handle retries in the event of gateway downtime (if even for a few seconds).</p><p>Putting aside all the pros and cons, using a gateway API to stop the propagation of corrupt data before it enters the data platform means that when a problem occurs (cause they always do), the surface area of the problem is reduced to a given service. This sure beat debugging a distributed network of data pipelines, services, and the myriad final data destinations and upstream systems to figure out that bad data is being directly published by “someone” at the company.</p><p>If we were to cut out the middle man (gateway service) then the capabilities to govern the transmission of “expected” data falls into the lap of “libraries” in the form of specialized SDKS.</p><h3>Software Development Kits (SDKs)</h3><p>SDKs are libraries (or micro-frameworks) that are imported into a codebase to streamline an action, activity, or otherwise complex operation. They are also known by another name, <em>clients</em>. Take the example from earlier about using good error messages and error codes. This process is necessary in order <em>to inform a client</em> that their prior action was invalid, however it can be advantageous to add appropriate guard rails directly into an SDK to reduce the surface area of any potential problems. For example, let’s say we have an API setup to track customer’s coffee related behavior through event tracking.</p><h4>Reducing User Error with SDK Guardrails</h4><p>A client SDK can theoretically include <em>all the tools necessary</em> to manage the interactions with the API server, including authentication, authorization, and as for validation, if the SDK does its job, the validation issues would go out the door. The following code snippet shows an example SDK that could be used to reliably track customer events.</p><pre>import com.coffeeco.data.sdks.client._<br>import com.coffeeco.data.sdks.client.protocol._<br><br>Customer.fromToken(token)<br>  .track(<br>    eventType=Events.Customer.Order,<br>    status=Status.Order.Initalized,<br>    data=Order.toByteArray<br>  )</pre><p>With some additional work (aka the client SDK), the problem of data validation or event corruption can just about go away entirely. 
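</p><p>To make that idea a little more concrete, here is a rough sketch of the kind of client-side guardrails an SDK could enforce before ever serializing and publishing an event. Everything here (the <em>Order</em> case class, the field rules, and the hand-off to a publisher) is illustrative, not taken from an actual SDK.</p><pre>// a minimal sketch of SDK-style validation guardrails (illustrative names only)<br>final case class Order(orderId: String, userId: String, timestamp: Long)<br>final case class ValidationError(field: String, reason: String)<br><br>def validate(order: Order): Either[List[ValidationError], Order] = {<br>  val errors = List(<br>    Option.when(order.userId.isEmpty)(ValidationError(&quot;userId&quot;, &quot;must not be empty&quot;)),<br>    Option.when(order.timestamp &lt;= 0L)(ValidationError(&quot;timestamp&quot;, &quot;expected epoch millis&quot;))<br>  ).flatten<br>  if (errors.isEmpty) Right(order) else Left(errors)<br>}<br><br>// fail fast at the producer: only valid events are ever serialized and sent<br>validate(Order(&quot;order-1&quot;, &quot;user-42&quot;, System.currentTimeMillis())) match {<br>  case Right(ok)    =&gt; println(s&quot;publishing ${ok.orderId}&quot;)  // would hand off to the SDK publisher<br>  case Left(errors) =&gt; println(s&quot;rejected: $errors&quot;)<br>}</pre><p>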
Additional problems can be managed within the SDK itself like for example how to retry sending a request in the case of the server being offline. Rather than having all requests retry immediately, or in some loop that floods a gateway load balancer indefinitely, the SDK can take smarter actions like employing exponential backoff. See “The Thundering Herd Problem” for a dive into what goes wrong when things go, well, wrong!</p><p><strong>The Thundering Herd Problem</strong><br> <br> Let’s say we have a single gateway API server. You’ve written a fantastic API and many teams across the company are sending event data to this API. Things are going well until one day a new internal team starts to send invalid data to the server (and instead of respecting your http status codes, they treat all non-200 http codes as a reason to retry. But wait, they forgot to add any kind of retry heuristics like exponential backoff, so all requests just retry indefinitely — across an ever increasing retry queue). Mind you, before this new team came on board there was never a reason to run more than one instance of the API server, and there was never a need to use any sort of service level rate limiter either, because everything was running smoothly within the agreed upon SLAs.</p><figure><img alt="A happy cartoon whale. This is what happens when the “fail whale” is out of hot water and back into their natural habitat again." src="https://cdn-images-1.medium.com/max/1010/1*wdwkzRSSfwEGvbV9MZcLAQ.png" /><figcaption>The Not-So-Fail-Whale. What can happen when you restore problems and get back out of the hot water again. Image via <a href="https://www.midjourney.com/app/jobs/fd36ca2e-848f-4916-8125-2d0105da8fb4/">Midjourney via the Author.</a></figcaption></figure><p>Well, that was before today. Now your service is offline. Data is backing up, upstream services are filling their queues, and people are upset because their services are now starting to run into issues because of your single point of failure…<br> <br> These problems all stem from a form of resource starvation coined “The Thundering Herd Problem”. This problem occurs when many processes are awaiting an event, like system resources being available, or in this example, the API server coming back online. Now there is a scramble as all of the processes compete to attempt to gain resources, and in many cases the load on the single process (api server) is enough to take the service back offline again. Unfortunately, starting the cycle of resource starvation over again. This is of course unless you can calm the herd or distribute the load over a larger number of working processes which decreases the load across the network to the point where the resources have room to breathe again. 
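</p><p>As a rough sketch, the kind of client-side retry logic that calms a herd like this (exponential backoff plus a little jitter) might look something like the following; the attempt limits and timings are illustrative, not prescriptive.</p><pre>import scala.concurrent.duration._<br>import scala.util.{Random, Try}<br><br>// retry a request with exponential backoff and jitter so clients do not retry in lockstep<br>def retryWithBackoff[T](maxAttempts: Int = 5, base: FiniteDuration = 200.millis)(request: =&gt; T): Try[T] = {<br>  def attempt(n: Int): Try[T] =<br>    Try(request).recoverWith {<br>      case _ if n &lt; maxAttempts =&gt;<br>        // double the wait on each failure, plus up to 100ms of random jitter<br>        val sleep = base * math.pow(2, n).toLong + Random.nextInt(100).millis<br>        Thread.sleep(sleep.toMillis)<br>        attempt(n + 1)<br>    }<br>  attempt(0)<br>}</pre><p>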
<br> <br> While the initial example above is more of an unintentional <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">distributed denial of service attack</a> (DDoS), these kinds of problems can be solved at the client (with exponential backoff or self-throttling) and at the API edge via load balancing and rate limiting.</p><p>Ultimately, without the right set of eyes and ears, enabled by operational metrics, monitors and system level (<a href="https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos">SLAs/SLIs/SLOs</a>) alerting, data can play the disappearing act, and this can be a challenge to resolve.</p><p>Whether you decide to add a <em>data gateway API</em> to the edge of your data network, employ a <em>custom SDK</em> for upstream consistency and accountability, or decide to take an alternative approach when it comes to dealing with getting data into your data platform it is still good to know what your options are. Regardless of the path in which data is emitted into your data streams this introduction to streaming data wouldn’t be complete without a proper discussion of data formats, protocols, and the topic of binary serializable data. Who knows we may just uncover a better approach to handling our data accountability problem!</p><h3>Selecting the Right Data Protocol for The Job</h3><p>When you think of structured data the first thing to come to mind might be JSON data. JSON data has structure, is a standard web-based data protocol, and if nothing else it is super easy to work with. These are all benefits in terms of getting started quickly, but over time, and without the appropriate safeguards in place, you could face problems when it comes to standardizing on JSON for your streaming systems.</p><h4>The Love / Hate Relationship with JSON</h4><p>The first problem is that JSON data is mutable. This means as a data structure it is flexible and therefore fragile. Data must be consistent to be accountable, and in the case of transferring data across a network (on-the-wire) the serialized format (binary representation) should be highly compactable. With JSON data, you must send the keys (for all fields) for each object represented across the payload. Inevitably this means that you’ll typically be sending a large amount of additional weight for each additional record (after the first) in a series of objects.</p><p>Luckily, this is not a new problem, and it just so happens that there are best practices for these kinds of things, and multiple schools of thought regarding what is the best strategy for optimally serializing data. This is not to say that JSON doesn’t have its merits. Just when it comes to laying a solid data foundation the more structure the better and the higher level of compaction the better as long as it doesn’t burn up a lot of CPU cycles.</p><h3>Serializable Structured Data</h3><p>When it comes to efficiently encoding and transferring binary data two serialization frameworks tend to always come up: <a href="https://avro.apache.org/">Apache Avro</a> and Google <a href="https://developers.google.com/protocol-buffers">Protocol Buffers</a> (protobuf). Both libraries provide CPU efficient techniques for serializing row-based data structures, and in addition to both technologies also provide their own <em>remote procedure call</em> (RPC) frameworks and capabilities. 
Let’s look at <em>avro</em>, then <em>protobuf</em>, and we will wrap up looking at <em>remote procedure calls</em>.</p><h4>Avro Message Format</h4><p>With <a href="https://avro.apache.org/"><em>avro</em></a>, you define declarative schemas for your structured data using the concept of records. These records are simply JSON-formatted data definition files (schemas) stored with the file type <em>avsc</em>. The following example shows a Coffee schema in the avro descriptor format.</p><pre>{<br>  &quot;namespace&quot;: &quot;com.coffeeco.data&quot;,<br>  &quot;type&quot;: &quot;record&quot;,<br>  &quot;name&quot;: &quot;Coffee&quot;,<br>  &quot;fields&quot;: [<br>    {&quot;name&quot;: &quot;id&quot;, &quot;type&quot;: &quot;string&quot;},<br>    {&quot;name&quot;: &quot;name&quot;, &quot;type&quot;: &quot;string&quot;},<br>    {&quot;name&quot;: &quot;boldness&quot;, &quot;type&quot;: &quot;int&quot;, &quot;doc&quot;: &quot;from light to bold. 1 to 10&quot;},<br>    {&quot;name&quot;: &quot;available&quot;, &quot;type&quot;: &quot;boolean&quot;}<br>  ]<br>}</pre><p>Working with avro data can take one of two paths that diverge on how you want to work at runtime. You can take the compile-time approach, or figure things out on-demand at runtime. This enables a flexibility that can enhance an interactive data discovery session. For example, avro was originally created as an efficient data serialization protocol for storing large collections of data, as partitioned files, long-term within the Hadoop file system. Given data was typically read from one location, and written to another, within HDFS, avro could store the schema (used at write time) once per file.</p><h4>Avro Binary Format</h4><p>When you write a collection of avro records to disk the process encodes the schema of the avro data directly into the file itself (once). There is a similar process when it comes to Parquet file encoding, where the schema is compressed and written as a binary file footer. We saw this process firsthand, at the end of chapter 4, when we went through the process of adding StructField level documentation to our <em>StructType</em>. This schema was used to encode our DataFrame, and when we wrote to disk it preserved our inline documentation on the next read.</p><h4>Enabling Backwards Compatibility and Preventing Data Corruption</h4><p>In the case of reading multiple files, as a single collection, problems can arise in the case of schema changes between records. Avro encodes binary records as byte arrays and applies a schema to the data at the time of deserialization (conversion back from a byte array into an object).</p><p>This means you need to take extra precautions to preserve backwards compatibility; otherwise you’ll find yourself running into issues with <em>ArrayIndexOutOfBounds</em> exceptions.</p><p>Broken schema promises can happen in other subtle ways too. For example, say you need to change an integer value to a long value for a specific field in your schema. Don’t. This will break backwards compatibility due to the increase in byte size from an int to a long. This is due to the use of the schema definition for defining the starting and ending position in the byte array for each field of a record. 
To maintain backwards compatibility, you’ll need to deprecate the use of the integer field (while preserving it in your avro definition) and add (append) a new field to the schema to use moving forwards.</p><h4>Best Practices for Streaming Avro Data</h4><p>Moving from static avro files, with their useful embedded schemas, to an unbounded stream of, well, <em>binary data</em>, the main differentiator is that you need to <em>bring your own schema to the party</em>. This means that you’ll need to support backwards compatibility (in the case that you need to rewind and reprocess data before and after a schema change), as well as forward compatibility, in the case that you have existing readers already consuming from a stream.</p><p>The challenge here is supporting both forms of compatibility given that avro doesn’t have the ability to ignore unknown fields, which is a requirement for supporting forward compatibility. In order to support these challenges with avro, the folks at Confluent open-sourced their <a href="https://docs.confluent.io/platform/current/schema-registry/index.html">schema registry</a> (for use with Kafka) which enables schema versioning at the Kafka topic (data stream) level.</p><p>When supporting avro without a schema registry, you’ll have to ensure you’ve updated any active readers (spark applications or otherwise) to use the new version of the schema prior to updating the schema library version on your writers. The moment you flip the switch otherwise, you could find yourself at the start of an incident.</p><h3>Protobuf Message Format</h3><p>With protobuf, you define your structured data definitions using the concept of messages. Messages are written in a format that feels more like defining a struct in C. These message files are written into files with the proto filename extension. Protocol Buffers have the advantage of using <em>imports</em>. This means you can define common message types and enumerations that can be used within a large project, or even imported into external projects, enabling wide scale reuse. A simple example of creating the Coffee record (message type) using protobuf is shown below.</p><pre>syntax = &quot;proto3&quot;;<br>option java_package=&quot;com.coffeeco.protocol&quot;;<br>option java_outer_classname=&quot;Common&quot;;<br><br>message Coffee {<br>  string id       = 1;<br>  string name     = 2;<br>  uint32 boldness = 3;<br>  bool available  = 4;<br>}</pre><p>With protobuf you define your messages once, and then compile down for your programming language of choice. For example, we can generate code for Scala using the coffee.proto file using the standalone compiler from the <a href="https://scalapb.github.io/">ScalaPB</a> project (<em>created and maintained by </em><a href="https://www.linkedin.com/in/nadav-samet/"><em>Nadav Samet</em></a>), or utilize the brilliance of <a href="https://buf.build/">Buf</a>, which created an invaluable set of tools and utilities around protobuf and gRPC.</p><h4>Code Generation</h4><p>Compiling protobuf enables simple code generation. The following example is taken from the /ch-09/data/protobuf directory. 
The directions in the chapter README cover how to install ScalaPB and include the steps to set the correct environment variables to execute the command.</p><pre>mkdir /Users/`whoami`/Desktop/coffee_protos<br>$SCALAPBC/bin/scalapbc -v3.11.1 \<br>  --scala_out=/Users/`whoami`/Desktop/coffee_protos \<br>  --proto_path=$SPARK_MDE_HOME/ch-09/data/protobuf/ \<br>  coffee.proto</pre><p>This process saves time in the long run by freeing you up from having to write additional code to serialize and deserialize your data objects (across language boundaries or within different code bases).</p><h4>Protobuf Binary Format</h4><p>The serialized (binary) wire format is <a href="https://developers.google.com/protocol-buffers/docs/encoding">encoded</a> using the concept of binary field-level separators. These separators are used as markers that identify the data types encapsulated within a serialized protobuf message. In the coffee.proto example, you probably noticed that there was an indexed marker next to each field type (string id = 1;); this index is used to assist with encoding / decoding of messages on / off the wire. This means there is a little additional overhead compared to the avro binary, but if you read over the <a href="https://developers.google.com/protocol-buffers/docs/encoding">encoding specification</a>, you’ll see that other efficiencies more than make up for any additional bytes (such as bit packing, efficient handling of numeric data types, and special encoding of the first 15 indices for each message). With respect to using protobuf as your binary protocol of choice for streaming data, the pros far outweigh the cons in the grand scheme of things. One of the ways in which it more than makes up for itself is with support for both backwards and forwards compatibility.</p><h4>Enabling Backwards Compatibility and Preventing Data Corruption</h4><p>There are similar rules to keep in mind when it comes to modifying your protobuf schemas like we discussed with avro. As a rule of thumb, you can change the name of a field, but you never change the type or change the position (index) unless you want to break backwards compatibility. These rules can be overlooked when it comes to supporting any kind of data in the long term and can be especially difficult as teams become more proficient with their use of protobuf. There is this need to rearrange, and optimize, that can come back to bite you if you are not careful. (See the Tip below called <em>Maintaining Data Quality Over Time</em> for more context).</p><h4>Best Practices for Streaming Protobuf Data</h4><p>Given protobuf supports both <em>backwards</em> and <em>forwards</em> compatibility, this means that you can deploy new writers without having to worry about updating your readers first. The same is true of your readers: you can update them with newer versions of your protobuf definitions without worrying about a complex deploy of all your writers. Protobuf supports forward compatibility using the notion of unknown fields. This is an additional concept that doesn’t exist within the avro specification, and it is used to track the indices and associated bytes the reader was unable to parse due to the divergence between the local version of the protobuf and the version it is currently reading. The beneficial thing here is that you can also <em>opt-in</em>, at any point, to newer changes in the protobuf definitions.</p><p>For example, say you have two streaming applications (a) and (b). 
Application (a) is processing streaming data from an upstream Kafka topic (x), enhancing each record with additional information, and then writing it out to a new Kafka topic (y). Now, application (b) reads from (y) and does its thing. Say there is a newer version of the protobuf definition, and application (a) has yet to be updated to the newest version, while the upstream Kafka topic (x) and application (b) are already updated and expecting to use some new fields available from the upgrade. The amazing thing is that it is still possible to pass the unknown fields through application (a) and on to application (b) without (a) even knowing they exist.</p><p>See <em>“Tips for maintaining good data quality over time”</em> for an additional deep dive.</p><p><strong>Tip: Maintaining Data Quality over Time</strong></p><p>When working with either <em>avro</em> or <em>protobuf</em>, you should treat the schemas no differently than you would code you want to push to production. This means creating a project that can be committed to your company’s <em>github</em> (or whatever version control system you are using), and it also means you <em>should</em> write unit tests for your schemas. Not only does this provide living examples of how to use each message type, but the more important reason for testing your data formats is to ensure that changes to the schema don’t break backwards compatibility. The icing on the cake is that in order to unit test the schemas you’ll need to first compile the (.avsc or .proto) files and use the respective library code generation. This makes it easier to create releasable library code, and you can also use release versioning (version 1.0.0) to catalog each change to the schemas.</p><p>One simple method to enable this process is by serializing and storing a binary copy of each message, across all schema changes, as part of the project lifecycle. I have found success adding this step directly into the unit tests themselves, using the test suite to create, read and write these records directly into the project test resources directory. This way each binary version, across all schema changes, is available within the code base itself.</p><p>With a little extra upfront effort, you can save yourself a lot of pain in the grand scheme of things, and rest easy at night knowing your data is safe (at least on the producing and consuming side of the table).</p><h4>Using Buf Tooling and Protobuf in Spark</h4><p>Since writing this chapter back in 2021, <strong>Buf Build</strong> (<a href="https://buf.build/">https://buf.build/</a>) has materialized into the <em>all-things-protobuf</em> company. Their tooling is simple to use, free-and-open-source, and appeared at just the right time to power a few initiatives in the Spark community. The <a href="https://spark.apache.org/">Apache Spark</a> project introduced full native support for <a href="https://github.com/apache/spark/tree/v3.4.1/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf">Protocol Buffers in Spark 3.4</a> in order to support <a href="https://spark.apache.org/docs/latest/spark-connect-overview.html">spark-connect</a>, and is using Buf for compiling gRPC services and messages. 
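</p><p>As a quick illustration of that native support (a sketch, not an excerpt from the chapter), the <em>from_protobuf</em> function added in Spark 3.4 can decode protobuf payloads directly within a DataFrame; the Kafka topic, descriptor file path, and message name below are placeholders.</p><pre>import org.apache.spark.sql.SparkSession<br>import org.apache.spark.sql.protobuf.functions.from_protobuf<br><br>val spark = SparkSession.builder().appName(&quot;coffee-orders&quot;).getOrCreate()<br>import spark.implicits._<br><br>// decode the binary Kafka value column using a compiled protobuf descriptor set<br>val orders = spark.readStream<br>  .format(&quot;kafka&quot;)<br>  .option(&quot;kafka.bootstrap.servers&quot;, &quot;localhost:9092&quot;)<br>  .option(&quot;subscribe&quot;, &quot;coffee.orders.v1&quot;)<br>  .load()<br>  .select(from_protobuf($&quot;value&quot;, &quot;Coffee&quot;, &quot;/opt/protos/descriptors.desc&quot;).alias(&quot;coffee&quot;))</pre><p>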
Spark Connect is after all a GRPC native connector for embedding Spark applications outside of the JVM.</p><p>Traditional Apache Spark application must run as a driver application somewhere, and in the past this meant using <strong>pyspark</strong> or native spark, which in both cases still run on top of a JVM process.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/862/1*K9gkV0RTK4ylsVd-cvcsKw.png" /><figcaption><a href="https://github.com/apache/spark/tree/v3.4.1/connector/connect/common/src/main">Directory structure</a> via Spark Connect. Shows the protobuf definitions, along with buf.gen.yaml and buf.work.yaml which help with code generation.</figcaption></figure><p>At the end of the day, Buf Build enables peace of mind in the build process. In order to generate the code, one must run a simple command: buf generate . For simple linting and consistent formatting, buf lint &amp;&amp; buf format -w . The icing on the cake however is the breaking change detection. buf breaking --against .git#branch=origin/main is all it takes to ensure that new changes to your message definitions won’t negatively affect anything that is currently running in production. *In the future, I will do a write up on using <strong>buf </strong>for enterprise analytics, but for now, it is time to conclude this chapter.</p><p>So where were we. You now know that there are benefits to using avro or protobuf when it comes to your long-term data accountability strategy. By using these language agnostic, row-based, structured data formats you reduce the problem of long-term language lock-in, leaving the doors open to whatever the programing language is later down the line. Cause honestly it can be a thankless task to be supporting legacy libraries and code bases. Additionally, the serialized formats help to reduce the network bandwidth costs and congestion associated with sending and receiving large amounts of data. This helps as well to reduce the storage overhead costs for retaining your data long-term.</p><p>Lastly, let’s look at how these structured data protocols enable additional efficiencies when it comes to sending and receiving data across the network using remote procedure calls.</p><h3>Remote Procedure Calls</h3><p><em>RPC</em> frameworks, in a nutshell, enable <em>client</em> applications to transparently call <em>remote</em> (server-side) methods (procedures) via local function calls by passing serialized messages back and forth. The <em>client</em> and <em>server-side</em> implementations use the same <em>public interface</em> definition to define the functional <em>RPC</em> methods and services available. The Interface Definition Language (IDL) defines the protocol and message definitions and acts as a contract between the client and server-side. 
Let’s see this in action looking at the popular open-source RPC framework <a href="https://grpc.io/">gRPC</a>.</p><h3>gRPC</h3><p>First conceptualized and created at Google, <a href="https://grpc.io/">gRPC</a>, which stands for “generic” remote procedure call, is a robust open-source framework being used for high performance services ranging from distributed database coordination, as seen with <a href="https://www.cockroachlabs.com/docs/stable/architecture/distribution-layer.html">CockroachDB</a>, to real-time analytics, as seen with <a href="https://docs.microsoft.com/en-us/azure/media-services/live-video-analytics-edge/grpc-extension-protocol">Microsoft’s Azure Video Analytics</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qz-0jiH6R4R3ytnZYGqy0g.png" /><figcaption><strong>Figure 1–2</strong>. RPC (in this example gRPC) works by passing serialized messages to and from a client and server. The client implements the same Interface Definition Language (IDL) interface and this acts as an API contract between the client and server. (photo credit: <a href="https://grpc.io/docs/what-is-grpc/introduction/)">https://grpc.io/docs/what-is-grpc/introduction/)</a></figcaption></figure><p>The diagram in Figure 1–2 shows an example of gRPC at work. The server-side code is written in C++ for speed, while clients written in both Ruby and Java can interoperate with the service using protobuf messages as their means of communicating.</p><p>Using protocol buffers for message definitions, serialization, as well as the declaration and definition of services, gRPC can simplify how you capture data and build services. For example, let’s say we wanted to continue the exercise of creating a tracking API for customer coffee orders. The API contract could be defined in a simple services file, and from there the server-side implementation and any number of client-side implementations could be built using the same service definition and message types.</p><h4>Defining a gRPC Service</h4><p>You can define a service interface, the request and response objects, as well as the message types that need to be passed between the client and server as easily as 1–2–3.</p><pre>syntax = &quot;proto3&quot;;<br><br>service CustomerService {<br>    rpc TrackOrder (Order) returns (Response) {}<br>    rpc TrackOrderStatus (OrderStatusTracker) returns (Response) {}<br>}<br><br>message Order {<br>    uint64 timestamp    = 1;<br>    string orderId      = 2;<br>    <br>    string userId       = 3;<br>    Status status       = 4;<br>}<br><br>enum Status {<br>  unknown_status = 0;<br>  initialized    = 1;<br>  started        = 2;<br>  progress       = 3;<br>  completed      = 4;<br>  failed         = 5;<br>  canceled       = 6;<br>}<br><br>message OrderStatusTracker {<br>  uint64 timestamp = 1;<br>  Status status    = 2;<br>  string orderId   = 3;<br>}<br><br>message Response {<br>    uint32 statusCode = 1;<br>    string message    = 2;<br>}</pre><p>With the addition of gRPC, it can be much easier to implement and maintain both the server-side and client-side code used within your data infrastructure. Given that protobuf supports backwards and forwards compatibility, this means that older gRPC clients can still send valid messages to newer gRPC services without running into common problems and pain points (discussed earlier under “Data Problems in Flight”).</p><h4>gRPC speaks HTTP/2</h4><p>As a bonus, with respect to modern service stacks, gRPC is able to use HTTP/2 for its transport layer. 
This also means you can take advantage of modern service meshes (like <a href="https://www.envoyproxy.io/">Envoy</a>) for proxy support, routing and service level <a href="https://www.envoyproxy.io/docs/envoy/v1.17.2/api-v2/config/filter/http/ext_authz/v2/ext_authz.proto#envoy-api-file-envoy-config-filter-http-ext-authz-v2-ext-authz-proto">authentication</a>, all while also reducing the problems of TCP packet congestion seen with standard HTTP over TCP.</p><p>Mitigating data problems in flight and achieving success when it comes to data accountability starts with the data and fans outwards from that central point. Putting processes in place when it comes to how data can enter into your data network should be considered a prerequisite to check off before diving into the torrent of streaming data.</p><h3>Summary</h3><p>The goal of this post is to present the moving parts, concepts, and background information required to arm ourselves before blindly leaping from a more traditional (stationary) batch-based mindset to one that understands the risks and rewards of working with real-time streaming data.</p><p>Harnessing data in real-time can lead to fast, actionable insights, and open the doors to state-of-the-art machine learning and artificial intelligence.</p><p>However, distributed data management can also become a data crisis if the right steps aren’t taken into consideration ahead of time. Remember that without a strong, solid data foundation, built on top of valid (trustworthy) data, the road to real-time will not be a simple endeavor, but one that has its fair share of bumps and detours along the way.</p><p>I hope you enjoyed the second half of Chapter 9. To read the first part of this series, head on over to <a href="https://medium.com/towards-data-science/a-gentle-introduction-to-stream-processing-f47912a2a2ea">A Gentle Introduction to Analytical Stream Processing</a>.</p><p><a href="https://towardsdatascience.com/a-gentle-introduction-to-stream-processing-f47912a2a2ea">A Gentle Introduction to Analytical Stream Processing</a></p><p>— — — — — — — — — — — — — — — — — — — — — — — —</p><p>If you want to dig in even deeper, please check out my book, or support me with a high five.</p><p><a href="https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512">Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications</a></p><p>If you have access to <a href="https://medium.com/u/fbfa235a954c">O’Reilly Media</a> then you can also read the book entirely for free (good for you, not so good for me), but please find the book for free somewhere if you have the opportunity, or get an ebook to save on shipping cost (or needing to find a place for a 600+ page book).</p><p><a href="https://learning.oreilly.com/library/view/modern-data-engineering/9781484274521/">Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db58b3694263" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/a-modest-introduction-to-analytical-stream-processing-db58b3694263">A Modest Introduction to Analytical Stream Processing</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Gentle Introduction to Analytical Stream Processing]]></title>
            <link>https://medium.com/data-science/a-gentle-introduction-to-stream-processing-f47912a2a2ea?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/f47912a2a2ea</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[engineering]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[mental-models]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Fri, 31 Mar 2023 14:57:35 GMT</pubDate>
            <atom:updated>2024-02-09T05:47:04.143Z</atom:updated>
            <content:encoded><![CDATA[<h4>Building a Mental Model for Engineers and Anyone in Between</h4><figure><img alt="The image captures streams naturally coalescing into a waterfall. It is gentle before things begin to move faster, this is similar to data networks and streaming environments." src="https://cdn-images-1.medium.com/max/1024/1*2RtSIbeOitXpjJrk9srIjg.png" /><figcaption>Stream Processing can be handled gently and with care, or wildly, and almost out of control! You be the judge of what future you’d rather embrace. credit: <a href="https://unsplash.com/@psalms">@psalms</a> <a href="https://unsplash.com/photos/o3Ggpo3BvqM">original_photo</a></figcaption></figure><h3>Introduction</h3><p>In many cases, processing data in-stream, or as it becomes available, can help reduce an enormous data problem (due to the volume and scale of the flow of data) into a more manageable one. By processing a smaller set of data, <em>more often</em>, you effectively divide and conquer a data problem that may otherwise be cost and time prohibitive.</p><blockquote>Transitioning from a batch mindset to a streaming mindset can be tricky, though, so let’s start small and build.</blockquote><h4>From Enormous Data back to Big Data</h4><p>Say you are tasked with building an analytics application that must process around <em>1 billion events</em> (1,000,000,000) a day. While this might feel far-fetched at first, due to the sheer size of the data, it often helps to step back and think about the intention of the application (what does it do?) and what you are processing (what does the data look like?). Ask yourself if the event data can be broken down (divided and partitioned) and processed in parallel as a streaming operation (aka in-stream), or if you must process things in series, across multiple steps. In either case, if you modify the perspective of the application to look at bounded windows of time, then you now only need to create an application that can ingest, and process, a mere <em>11.5 thousand (k) events a second</em> (or around 695k events a minute if the event stream is constant), which is an easier number to rationalize.</p><p>While these numbers may still seem out of reach, this is where distributed stream processing can really shine. Essentially, you are reducing the perspective, or scope, of the problem, to accomplish a goal over time, across a partitioned data set. While not all problems can be handled in-stream, a surprising number of problems do lend themselves to this processing pattern.</p><p><strong><em>Note</em></strong><em>: This chapter is part of my book </em><a href="https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512"><em>“Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications”</em></a><em>. 
The book takes you on the journey of building from simple scripting, to composing applications, and finally to deploying and monitoring your mission-critical Apache Spark applications.</em></p><h3>What you will learn in this Chapter</h3><p>This chapter will act as a gentle introduction to stream processing, making room for us to jump directly into building our own end-to-end Structured Streaming application in <a href="https://github.com/newfront/spark-moderndataengineering/tree/main/ch-10">chapter 10</a> without the need to backtrack and discuss a lot of the theory behind the decision-making process.</p><p>By the end of the chapter, you should understand the following (at a high level):</p><ol><li>How to Reduce Streaming Data Problems into Data Problems <em>over Time</em></li><li>The Trouble with Time, Timestamps, and Event Perspective</li><li>The Different Processing Modes for Shifting from a Batch to a Streaming Mental Model</li></ol><h3>Stream Processing</h3><p>Streaming data is <em>not stationary</em>. In fact, you can think of it as being alive (if even for a short while). This is because streaming data is data that encapsulates the <em>now</em>; it records events and actions as they occur in flight. Let’s look at a practical, albeit theoretical, example that begins with a simple event stream of sensor data. Fix into your mind’s eye the last parking lot (or parking garage) you visited.</p><h3>Use Case: Real-Time Parking Availability</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*byrdBWeJ_RK4xm84dK82zg.png" /><figcaption>Parking is a Nightmare: The problem with most parking infrastructure, or a common pain point for the customer, is more often than not finding an available spot while still being able to get places on time. <a href="https://unsplash.com/photos/k1AFA4N8O0g">Photo via Unsplash</a> and <a href="https://unsplash.com/@ryansearle">@ryansearle</a></figcaption></figure><p>Imagine you just found a parking spot all thanks to some helpful signs that pointed you to an open space. Now let’s say that this was all because of the data being emitted from a connected network of local parking sensors. Sensors that operate with the sole purpose of <em>identifying the number of available parking spaces</em> at that precise moment in time.</p><p>This is a real-time data problem where the real-time accuracy is both measurable and physically noticeable by a user of the parking structure. Enabling these capabilities all began with the declaration of the system scenario.</p><blockquote><strong>Product Pitch</strong>: “We’d like to create a system that keeps track of the status of all available parking spaces, that identifies when a car parks, how long the car remains in a given spot, and lastly this process should be automated as much as possible”</blockquote><p>Optimizing a system like this can begin with <em>a simple sensor</em> located in each parking spot (associated with a sensor.id / spot.id reference). Each sensor would be responsible for emitting data in the form of <strong>an event</strong> with a spot identifier, timestamp, and a simple bit (0 or 1) to denote if a spot is empty or occupied. This data can then be encoded into a compact message format, like the example from Listing 9–1, and be efficiently sent periodically from each parking spot.</p><p><strong>Listing 9–1</strong>. 
An example sensor event (encapsulated in the <a href="https://protobuf.dev/">Google Protocol Buffer</a> message format) is shown for clarity.</p><pre>message ParkingSensorStatus {<br>  uint32 sensor_id = 1;<br>  uint32 space_id = 2;<br>  uint64 timestamp = 3;<br>  bool available = 4;<br>}</pre><p>During the normal flow of traffic throughout the day, <strong>the state</strong> (availability of a parking spot) via the sensors would flip on or off (<strong>binary states</strong>) as cars arrive or leave each spot. This behavior is unpredictable due to the dynamic schedules of each individual driver, but patterns always emerge at scale.</p><p>Using the real-time state provided by the collected sensor data, it is entirely feasible to build real-time, real-life (IRL) “reporting” to update drivers on the active state of the parking structure: is the parking infrastructure full, or not, and if it isn’t full, there are <em>now X total available spots</em> in the garage.</p><h4>What the Sensor Data Achieves</h4><p>This data can help to automate the <em>human decision-making process</em> for drivers and could even be made available online, through a simple web service, for real-time status tracking, since ultimately drivers just want to park already and not waste time! Additionally, this data can also be used to track when each sensor last checked in (refreshed), which can be used to diagnose faulty sensors, and even track how often sensors go offline or fail.</p><p>Nowadays, more technologically advanced garages even go so far as to direct the driver (via directional signs and cues) to the available spots within the structure. This acts to reduce inter-garage traffic and congestion, which in turn raises customer satisfaction, all by simply capturing a live stream of sensor data and processing it in near-real-time.</p><h4>Surge Pricing and Data Driven Decision Making</h4><p>Given the temporal (timestamp) information gathered from these streams of sensor events, a savvy garage operation could use prior trends to even decrease or increase the daily or hourly prices, based on the demand for parking spots, with respect to current availability (number of spots left) in real-time. By optimizing the pricing (within realistic limits) an operator could find the perfect threshold where the price per hour / price per day leads to a full garage more times than it doesn’t. In other words, <em>“at what price will most people park so that spots don’t go unused?”</em></p><p>This is an example of an optimization problem that stems from the collection of real-time sensor data. It is becoming more common for organizations to look at how they reuse data to solve multiple problems at the same time. 
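</p><p>To make the parking example a bit more tangible, here is a rough sketch (not taken from the book) of how these sensor events could be aggregated in near-real-time with Spark Structured Streaming. The Kafka topic name is a placeholder, and the JSON decoding is used only to keep the sketch short; in practice the events would arrive protobuf-encoded as in Listing 9–1.</p><pre>import java.sql.Timestamp<br>import org.apache.spark.sql.{Encoders, SparkSession}<br>import org.apache.spark.sql.functions._<br><br>// mirrors the fields of ParkingSensorStatus from Listing 9-1<br>final case class ParkingSensorStatus(sensorId: Int, spaceId: Int, timestamp: Timestamp, available: Boolean)<br><br>val spark = SparkSession.builder().appName(&quot;parking-availability&quot;).getOrCreate()<br>import spark.implicits._<br><br>val events = spark.readStream<br>  .format(&quot;kafka&quot;)<br>  .option(&quot;kafka.bootstrap.servers&quot;, &quot;localhost:9092&quot;)<br>  .option(&quot;subscribe&quot;, &quot;parking.sensor.status&quot;)<br>  .load()<br>  .select(from_json($&quot;value&quot;.cast(&quot;string&quot;), Encoders.product[ParkingSensorStatus].schema).as(&quot;e&quot;))<br>  .select(&quot;e.*&quot;)<br><br>// count the open-spot reports per one-minute window (a real job would first<br>// deduplicate to the latest status per space before counting)<br>val availability = events<br>  .withWatermark(&quot;timestamp&quot;, &quot;2 minutes&quot;)<br>  .groupBy(window($&quot;timestamp&quot;, &quot;1 minute&quot;))<br>  .agg(sum(when($&quot;available&quot;, 1).otherwise(0)).as(&quot;open_spots&quot;))<br><br>availability.writeStream.outputMode(&quot;update&quot;).format(&quot;console&quot;).start()</pre><p>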
The <em>Internet of Things</em> (IoT) use cases are just one of the numerous possible streams of data you could be working with when writing streaming applications.</p><p>Earlier in the book we discussed “creating a system that could take information about Coffee store occupancy, which would inform folks which shop nearest to them has seating for a party of their size”. At that point in the story we simply created a synthetic table that could be joined to showcase this example, but this is another problem that can be solved with sensors, or something as simple as a check-in system, that emits relevant event data to be passed reliably downstream via our friend the streaming data pipeline.</p><p>Both examples discussed here (parking infrastructure and coffee empire expansion) employ basic analytics (statistics) and can benefit from simple machine learning to uncover new patterns of behavior that lead to more optimal operations. Before we get too far ahead of ourselves, let’s take a short break to dive deeper into the capabilities streaming data networks provide.</p><h3>Time Series Data and Event Streams</h3><p>Moving from a stationary data mindset, about a fixed view or moment in time, to one that interprets data as it flows over time, in terms of streams of unbounded data across many views and moments in time, is an exercise in perspective but also one that can be challenging to adopt at first. Often when you think about streaming systems, the notion of streams of continuous events bubbles to the surface. This is one of the more common use cases and can be used as more of a gentle introduction to the concept of <em>streaming data</em>. Take for example the abstract time series shown in <strong>Figure 9–1</strong>.</p><figure><img alt="Time is unbounded. How we perceive time is bounded to a scope. This is represented as a contiguous line with views (windows) over time represented by w sub 1, and broken into finite moments represented by T sub 1 to T sub 4" src="https://cdn-images-1.medium.com/max/1024/1*umwtU07GaUb41dHBlL4iRw.png" /><figcaption><strong>Figure 9–1</strong>: Events occur at precise moments of time and can be collected and processed individually (t1-&gt;t4), or can be aggregated across windows of time (w1). Image Credit: Author (Scott Haines)</figcaption></figure><p>As you can see, data itself exists across various states depending on the perspective or vantage point applied by a given system (or application). Each event (T1-&gt;T4) individually understands only <em>what has occurred</em> within its narrow pane of reference, or to put that differently, events capture a limited (relative) perspective of <em>time</em>. When a series of events is processed together in a bounded collection (window), then you have a series of data points (events) that encapsulate either <em>fully realized ideas</em> or <em>partially realized ideas</em>. When you zoom out and look at the entire timeline then you can paint a more accurate story of what happened from first event to last.</p><p>Let’s take this idea one step further.</p><h3>Do Events Stand Alone?</h3><p>Consider this simple truth. Your event data exists as a complete idea, or as partial ideas or thoughts. I have found that thinking of data as a story over time helps to give life to these bytes of data. 
Each data point is therefore responsible for helping to compose a complete story, <em>as a series of interwoven ideas and thoughts that assemble or materialize over time</em>.</p><p>Data composition is a useful lens through which to look as you work on adopting a distributed data view of things. I also find it lends itself well while building up and defining new distributed data models, as well as while working on real-world data networks (fabrics) at scale. Viewed as a composition, these events come together to tell <em>a specific story</em>, whose event-based breadcrumbs can inform us of the order in which something came to be, and this is greatly enhanced with the timestamp of each occurrence. Events without time paint a flat view of how something occurred, while the addition of time grants you the notion of momentum or speed, or a slowing down and stretching of the time between events or for a full series of data points. Understanding the behavior of the data flowing through the many pipelines and data channels is essential to data operations and requires reliable monitoring to keep data flowing at optimal speeds.</p><p>Let’s look at a use case where the dimension of time helps paint a better story of a real-world scenario.</p><h3>Use Case: Tracking Customer Satisfaction</h3><figure><img alt="A welcoming, simple and clean, independent coffee shop. A barista is seen making a drink behind the bar." src="https://cdn-images-1.medium.com/max/1024/0*dtoX_xE35-AO4X4u" /><figcaption>A quiet coffee shop pouring love with every cup. Photo by <a href="https://unsplash.com/@nputra?utm_source=medium&amp;utm_medium=referral">Nafinia Putra</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><blockquote>Put yourself in the shoes of a data engineer working with the data applications feature teams in a fake coffee empire named “CoffeeCo”; the conversation is about what data paints a good story of customer satisfaction over time (time series analysis).</blockquote><p>What if I told you <strong><em>two customers</em></strong> came into our coffee shop, ordered drinks and left the store with their drinks. You might ask me why I bothered to tell you that since that is what happens in coffee shops. What if I told you that the two coffee orders were made <strong><em>around the same time</em></strong> and that <em>the first customer in the story was in and out of the coffee shop in under five minutes</em>. What if I told you it was <strong>a weekday</strong>, and this story took place <strong>during morning rush hour</strong>? What if I told you that the second customer, who happened to be next in line (or right after the first customer), was in the coffee shop for thirty minutes? You might ask if the customer stayed to read the paper or maybe use the facilities. <em>Both are valid questions</em>.</p><p>If I told you that the second customer was waiting around because of an error that occurred <em>between steps 3 and 4</em> of a four-step <strong>coffee pipeline</strong>, then we’d have a better understanding of how to streamline the customer experience in the future.</p><p>The four steps are:</p><p>1. <strong>Customer Orders: </strong>{customer.order:initialized}<br>2. <strong>Payment Made</strong>: {customer.order:payment:processed}<br>3. <strong>Order Queued</strong>: {customer.order:queued}<br>4. 
<strong>Order Fulfilled</strong>: {customer.order:fulfilled}</p><p>Whether the error was in the automation, or because of a breakdown in the real-world system (printer jam, barista missed an order, or any other reason), the result here is that the customer needed to step in (human in the loop) and inform the operation (coffee pipeline) that <em>“it appears that someone forgot to make my drink”</em>.</p><p>At this point the discussion could turn towards how to handle the customers emotional response, which could swing widely across both positive and negative reactions: from happy to help (1), to mild frustration (4), all the way to outright anger (10) at the delay and breakdown of the coffee pipeline. But by walking through a hypothetical use case, we are all now more familiar with how the art of capturing good data can be leveraged for all kinds of things.</p><h4>The Event Time, Order of Events Captured, and the Delay Between Events All Tell a Story</h4><p>Without the knowledge of <em>how much time elapsed</em> from the first event (customer.order:initialized) until the terminal event (customer.order:fulfilled), or how long each step typically takes to accomplish, we’d have no way to score the experience or really understand what happened, essentially creating a blind spot to abnormal delays or faults in the system. It pays to know the statistics (average, median, and 99th percentiles) of the time a customer typically waits for a variable sized order, as these historic data points can be used via automation to step in to fix a problem preemptively when, for example, an order is taking longer than expected. It can literally mean the difference between an annoyed customer, and a lifetime customer.</p><p>This is one of the big reasons why companies solicit feedback from their customers — be it a thumbs up / thumbs down on an experience, rewarding application-based participation (spend your points on free goods and services), and to track real-time feedback like in the case of “your order is taking longer than expected, here is $2 off your next coffee. Just use the app to redeem”. This data, collected and captured through real-world interactions, encoded as events, and processed for your benefit, are worth it in the end if it positively affects the operations and reputation of the company. Just be sure to follow data privacy rules and regulations and ultimately don’t creep out your customers.</p><p>This little thought experiment was intended to shed light on the fact that the details captured within your event data (as well as the lineage of the data story over time) can be a game changer and furthermore that time is the dimension that gives these journeys momentum or speed. There is just one problem with time.</p><h3>The Trouble with Time</h3><p>While events occur at <em>precise moments in time</em> the trouble with time is that it is also subject to the problems of time and space (location). Einstein used his theory of relativity to explain this problem on a cosmic scale, but this is also a problem on a more localized scale as well. For example, I have family living in different parts of the United States. It can be difficult to <em>coordinate time</em> where everyone’s <em>schedule syncs up</em>. This happens for simple events like catching up with everyone over video (remotely) or meeting up in the real-world for reunions (locally). 
Even when everything is all coordinated, people have a habit of just <em>running a little bit late</em>.</p><p>Zooming out from the perspective of my family, or people in general, with respect to central coordination of events, you will start to see that the problem isn’t just an issue relating to synchronization across time zones (east / central or west coast), but if you look closer you can see that time, relative to our local / physical space, is subject to some amount of temporal drift or clock skew.</p><p>Take the modern digital clock. It runs as a process on your smartphone, watch, or any number of “smart” connected devices. What remains constant is that their time stays closely in sync (even if the drift is on the order of milliseconds). Many people still have analog, non-digital clocks. These devices run the full spectrum from incredibly accurate, in the case of high-end watches (“timepieces”), to cheap clocks that sometimes need to be reset every few days.</p><p>The bottom line here is that it is rare for two systems to agree on the precise time, in the same way that two or more people have trouble coordinating within both time and space. Therefore, a central reference (point of view) must be used to synchronize the time with respect to systems running across many time zones.</p><h4><strong>Correcting Time</strong></h4><p>Servers running in any modern cloud infrastructure utilize a process called <a href="https://en.wikipedia.org/wiki/Network_Time_Protocol">Network Time Protocol</a> (NTP) to correct the problem of time drift. The <em>ntp</em> process is charged with synchronizing the local server clock using a reliable central time server. This process corrects the local time to within a few milliseconds of Coordinated Universal Time (UTC). This is an important concept to keep in mind since an application running within a large network, producing event data, will be responsible for creating timestamps, and these timestamps need to be precise in order for distributed events to line up. There is also the sneaky problem of daylight saving time (gain or lose an hour every 6 months), so coordinating data from systems across time zones as well as across local datetime semantics (globally) requires time to be viewed from this central, synchronized perspective.</p><p>We’ve looked at time as it theoretically relates to event-based data, but to round out the background we should also look at time as it relates to the <em>priority</em> in which data needs to be captured and processed within a system (streaming or otherwise).</p><h3>Priority Ordered Event Processing Patterns</h3><p>You may be familiar with the phrase “time is of the essence”. This is a way of saying something is important and a top priority. The speed to resolution matters. This sense of priority can be used as an instrument, or defining metric, to make the case for <em>real-time</em>, <em>near-real-time</em>, <em>batch</em> or <em>eventual (on-demand) processing</em> when processing critical data. Each of these four processing patterns handles time in a different way by creating a narrow, or wide, focus on the data problem at hand. The scope here is based on the speed at which a process must complete, which in turn limits the complexity of the job as a factor of time. 
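</p><p>As a rough sketch of how these patterns diverge in practice (the topic, bootstrap server, and path below are hypothetical placeholders, not from the original text), the same order events might be consumed as an unbounded stream for the real-time and near-real-time patterns, or as a bounded read of data already landed in the data lake for the batch and on-demand patterns.</p><pre>// Real-time / near-real-time: a continuous, unbounded read from Kafka.<br>val liveOrders = spark.readStream<br>  .format(&quot;kafka&quot;)<br>  .option(&quot;kafka.bootstrap.servers&quot;, &quot;localhost:9092&quot;)<br>  .option(&quot;subscribe&quot;, &quot;customer.order&quot;)<br>  .load()<br><br>// Batch / on-demand: a bounded read of the same events landed in object storage.<br>val historicalOrders = spark.read<br>  .format(&quot;parquet&quot;)<br>  .load(&quot;s3a://coffeeco/warehouse/customer_orders/&quot;)</pre><p>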
Think of these styles of processing as being deadline driven: there is only a certain amount of time in which to complete an action.</p><h3>Real-Time Processing</h3><p>The expectations of real-time systems are that end-to-end latency from the time an upstream system emits an event, until the time that event is processed and available to be used for analytics and insights, occurs in the milliseconds to low seconds. These events are emitted (written) directly to an event stream processing service, like Apache Kafka, which under normal circumstances enables listeners (consumers) to immediately use that event once it is written. There are many typical use cases for true real-time systems, including logistics (like the parking space example, or finding a table at a coffee shop), as well as processes that impact a business on a whole new level, like fraud detection, active network intrusion detection, or other bad actor detection, where a longer mean time to detection (average milliseconds / seconds to detection) can lead to devastating consequences in terms of reputation, finances, or both.</p><p>For other systems, it is more than acceptable to run in near real-time. Given that answering tough questions takes time, real-time decision making requires a performant, pre-computed, or low-latency answer to the questions it will ask. This really is pure in-memory stream processing.</p><h3>Near Real-Time Processing</h3><p>Near real-time is what most people think of when they consider real-time. A similar pattern occurs here as you just read about under Real-Time; the only difference is that the expectations of end-to-end latency are relaxed to a high number of seconds to a handful of minutes. For most systems, there is no real reason to react immediately to every event as it arrives, so while time is still of the essence, the priority of the SLA for data availability is extended.</p><p>Operational dashboards and metric systems that are kept up to date (refreshing graphs and checking monitors every 30 seconds to 5 minutes) are usually fast enough to catch problems and give a close representation of the world. For all other data systems, you have the notion of batch or on-demand.</p><h3>Batch Processing</h3><p>We covered batch processing and recurring scheduling in the last two chapters, but for clarity: periodic jobs that push data from a reliable source of truth (data lake or database) into other connected systems have been, and continue to be, how much of the world’s data is processed.</p><p>The simple reason for this is cost, which factors down to both the cost of operations and the human cost of maintaining large streaming systems.</p><p>Streaming systems demand full-time access to a variable number of resources, from CPUs and GPUs to network IO and RAM, with an expectation that these resources won’t be scarce, since delays (blockage) in stream processing can pile up quickly. Batch, on the other hand, can be easier to maintain in the long run, assuming the consumers of the data understand that there will always be a gap from the time data is first emitted upstream until the data becomes available for use downstream.</p><p>The last consideration to keep in mind is on-demand processing (or just-in-time processing).</p><h3>On-Demand or Just-In-Time Processing</h3><p>Let’s face it. 
Some questions (aka queries) are asked so rarely, or in a way so unique, that they are just not suitable to any predefined pattern.</p><p>For example, custom reporting jobs and exploratory data analysis are two styles of data access that lend themselves nicely to these paradigms. Most of the time, the backing data to answer these queries is loaded directly from the data lake, and then processed using shared compute resources, or isolated compute clusters. The data that is made available for these queries may be the by-product of other real-time or near-real-time systems, processed and stored for batch or historic analysis.</p><p>Using this pattern, data can be defrosted and loaded on-demand by importing records from slower commodity object storage like Amazon S3 into memory or across fast-access solid state drives (SSDs), or, depending on the size, format, and layout of the data, it can be queried directly from the cloud object store. This pattern can be easily delegated to Apache Spark using <em>SparkSQL</em>. This enables ad-hoc analysis via tools like Apache Zeppelin, or directly in-app through JDBC bindings using the <a href="https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html">Apache Spark thrift-server</a> and the Apache Hive Metastore.</p><p>The differentiator between these four flavors of processing is <em>time</em>.</p><p>Circling back to the notion of views and perspective, each approach or pattern has its <em>time and place</em>. Stream processing deals with events captured at specific <em>moments in time</em>, and as we’ve discussed during the first half of this chapter, how we associate time and how we capture and measure a series of events (as data) all come together to paint a picture of what is happening now, or what has happened in the past. As we move through this gentle introduction to stream processing, it is important to also talk about the foundations of stream processing. In this next section, we’ll walk through some of the common problems and solutions for dealing with continuous, unbounded streams of data. It would therefore only make sense to discuss data as a central pillar and expand outward from there.</p><p>I hope you enjoyed the first half of Chapter 9. If you’d like to go ahead and read part 2, it is linked below. 👇</p><p><a href="https://towardsdatascience.com/a-modest-introduction-to-analytical-stream-processing-db58b3694263">A Modest Introduction to Analytical Stream Processing</a></p><p>If you want to find out more, please check out my book!</p><p><a href="https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512">Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f47912a2a2ea" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/a-gentle-introduction-to-stream-processing-f47912a2a2ea">A Gentle Introduction to Analytical Stream Processing</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Working with Spark SQL Time Functions]]></title>
            <link>https://newfrontcreative.medium.com/working-with-spark-sql-time-functions-aa85ed34295e?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/aa85ed34295e</guid>
            <category><![CDATA[data-preprocessing]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[spark-sql]]></category>
            <category><![CDATA[spark]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Tue, 07 Mar 2023 16:28:00 GMT</pubDate>
            <atom:updated>2023-03-31T15:00:09.032Z</atom:updated>
            <content:encoded><![CDATA[<h3>Working with Time Functions in Spark SQL</h3><p>A Hands-On Guide to Time with Apache Spark</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1mQ1KbIJPteKYmvc10tD7Q.jpeg" /><figcaption>Time is tricky. Time is also very important for Batch and Streaming Analytics. Photo by <a href="https://unsplash.com/@jontyson?utm_source=medium&amp;utm_medium=referral">Jon Tyson</a> on <a href="https://unsplash.com/?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><blockquote>Over the next few weeks, and in preparation for the one-year anniversary of my book “<a href="https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512">Modern Data Engineering with Apache Spark: A Hands-on Guide for Building Mission Critical Streaming Applications</a>”, I will start to share more content from the book, and <a href="https://github.com/newfront/spark-moderndataengineering/tree/main/ch-12">source code</a>.</blockquote><h3>Working with Timestamps and Dates</h3><p>Working with time can be a difficult thing. You learned in <strong>chapter 9</strong> how clock drift and incorrect time zone information can cause problems (<em>the trouble with time</em>) and how issues can generally be resolved simply by standardizing on UTC and synchronizing clock drift on your servers using NTP. While standardizing can help future projects, you may find yourself needing to work with time in different ways, or even to correct and normalize timestamps.</p><h4>How to Use the Following Code Snippets</h4><p>You can install <a href="https://spark.apache.org/downloads.html">Spark Locally</a>, use <a href="https://community.cloud.databricks.com/login.html">Databricks Community Edition</a>, or use <a href="https://zeppelin.apache.org/docs/latest/quickstart/install.html#using-the-official-docker-image">Apache Zeppelin</a> via Docker (for example: docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.10.0)</p><h3>Common Date and Timestamp functions</h3><p>Starting from your empty note (in Zeppelin or Databricks), copy the code block (listing 12–3) and run the paragraph. This will import the <a href="https://spark.apache.org/docs/latest/api/sql/index.html">Spark SQL Functions</a>, and then the SQL statement generates a row with columns representing the date and time information captured by Spark at runtime.</p><pre>%spark<br>import org.apache.spark.sql.functions._<br><br>spark.sql(&quot;&quot;&quot;<br>SELECT current_timestamp() as ts,<br>current_timezone() as tz,<br>current_date() as date,<br>TIMESTAMP &#39;yesterday&#39; as yesterday,<br>TIMESTAMP &#39;today&#39; as today,<br>TIMESTAMP &#39;tomorrow&#39; as tomorrow<br>&quot;&quot;&quot;).show(6,0,true)</pre><p><strong><em>Listing 12–3</em></strong><em>. Using Spark SQL to get a sense of time</em></p><p>The output of the resulting row will be a snapshot of standard system date and time information collected when the code is evaluated.</p><pre>-RECORD 0--------------------------------<br>ts | 2021-09-12 00:54:03.691464<br>tz | Etc/UTC<br>date | 2021-09-12<br>yesterday | 2021-09-11 00:00:00<br>today | 2021-09-12 00:00:00<br>tomorrow | 2021-09-13 00:00:00</pre><p>The current_timestamp, current_timezone, current_date and the <em>TIMESTAMP</em> constants yesterday, today, and tomorrow are all higher-order Spark SQL datetime functions. Next, we will recreate the same output using the Spark DSL functions directly.</p><h3>Applying Higher-Order Functions using withColumn</h3><p>Create a new paragraph. 
Inside the paragraph we will create a single 1x1 (row/column) <em>DataFrame</em> storing only a timestamp. Listing 12–4 shows how you can wrap a simple Java <em>Instant</em> to replicate the <em>current_timestamp</em> expression.</p><pre>%spark<br>import java.time.Instant<br>val tsDf = Seq(Instant.now).toDF(&quot;ts&quot;)</pre><p><strong><em>Listing 12–4</em></strong><em>. Create a DataFrame with a single row and a single column storing a Timestamp</em></p><p>This simple technique (12–4) takes advantage of implicit conversions to encode a Scala <em>Seq[Instant]</em> as a Catalyst Row (DataFrame) with a <em>TimestampType</em> column. Using the <em>tsDf</em> DataFrame, we can now add additional columns using the withColumn method on the DataFrame.</p><pre>DataFrame.withColumn(colName, col)</pre><p>The <em>withColumn</em> method is used to add a new column iteratively across all rows of a DataFrame. The col (Column) parameter is a powerful primitive that encapsulates a columnar SQL expression and can be used to bring User Defined Functions into your applications. It is important to keep in mind that withColumn can only reference data from other columns within the same row. There are other techniques that can be used to process all rows or a subset of rows to generate derived aggregates using window functions.</p><p>The following code block (listing 12–5) adds a column literal containing the time zone as well as the derived date using the column operation to_date.</p><pre>%spark<br>import org.apache.spark.sql.functions._<br>import org.apache.spark.sql.types._<br>val dtInfoDf = tsDf<br>.withColumn(&quot;tz&quot;,<br>lit(spark.conf.get(&quot;spark.sql.session.timeZone&quot;))<br>)<br>.withColumn(&quot;date&quot;, to_date($&quot;ts&quot;))</pre><p><strong><em>Listing 12–5</em></strong><em>. Using withColumn to add a column literal, “literally this column is exactly what you see”, and a derived date column using to_date</em></p><p>The result of the transformation is a new DataFrame representing the combination of two new columns added to the source DataFrame. Remember, however, that because Spark operates lazily, no actual work will be executed until you arrive at a specific action. So, the operation from Listing 12–5 adds the promise to provide the new date column when the time comes to execute an action. The first column uses a special Column function named lit. The <em>lit</em> column wraps and encodes the underlying data type as a typed Column using an implicit typed encoder (if available). The second column (date) is referred to as a derived column. 
When you use the <em>to_date</em> function and pass a source column ($”ts”), Spark will use the reference column (ts) to generate a DateType.</p><p>Printing the schema of <em>dtInfoDf</em> at this point, you would see the following.</p><pre>root<br>|-- ts: timestamp (nullable = true)<br>|-- tz: string (nullable = false)<br>|-- date: date (nullable = true)</pre><p>Now, tackling the rest of the missing columns needed to recreate the DataFrame from Listing 12–3 leaves the columns <em>yesterday</em>, <em>today</em>, and <em>tomorrow</em>.</p><h3>Using Date Addition and Subtraction</h3><p>The datetime functions <em>date_sub</em>, <em>date_add</em>, as well as a simple <em>cast</em> expression can be combined to derive yesterday, today, and tomorrow as shown in listing 12–6.</p><pre>tsDf<br>…<br>.withColumn(&quot;yesterday&quot;,<br>date_sub($&quot;date&quot;, 1).cast(TimestampType))<br>.withColumn(&quot;today&quot;, $&quot;date&quot;.cast(TimestampType))<br>.withColumn(&quot;tomorrow&quot;,<br>date_add($&quot;date&quot;, 1).cast(TimestampType))</pre><p><strong><em>Listing 12–6</em></strong><em>. Using date subtraction, addition, and data type casting to mimic the output from listing 12–3</em></p><p>You can now create timestamps, derive dates from timestamps, and add or subtract dates. Rounding out the tour of datetime functions, we will be looking at the year, month, dayofmonth, dayofweek and dayofyear calendar functions.</p><h3>Calendar Functions</h3><p>It is common to compare explicit periods (windows) of time using the calendar for analytics and insights. Trend analysis and time-series forecasting are techniques that use statistics to measure the rate of change over time (deltas) between data points. Data sets can be partitioned (bucketed) and analyzed using Spark SQL to create aggregations broken down by seconds, minutes, hours, days, weeks, months, and even years using <em>fixed</em> or <em>relative</em> windows.</p><p><strong>Fixed and Relative Windows</strong><br> <br> A <em>fixed window</em> is defined by an explicit start and end time. For example, yesterday is a window defined by the 24-hour period beginning at 00:00:00 and ending at 23:59:59. Fixed windows are typically used to compute changes across two or more data sets, commonly computed hour over hour, day over day, week over week, and month over month. <br> <br> A <em>relative window</em> uses a non-fixed point in time to define one edge of a time-based boundary. This boundary can then be used to compute either the beginning or ending timestamp, producing an arbitrary window to observe statistics. For instance, you can use relative time to isolate a dataset encapsulating the last thirty minutes rather than splitting an hour at a fixed 30-minute interval (06:00:00–06:29:59 | 06:30:00–06:59:59).</p><p>Using the following code (listing 12–7) as reference, we will add five additional columns to the <em>tsDf</em> DataFrame.</p><pre>%spark<br>tsDf<br>…<br>.withColumn(&quot;year&quot;, year($&quot;date&quot;))<br>.withColumn(&quot;month&quot;, month($&quot;date&quot;))<br>.withColumn(&quot;day&quot;, dayofmonth($&quot;date&quot;))<br>.withColumn(&quot;day_of_week&quot;, dayofweek($&quot;date&quot;))<br>.withColumn(&quot;day_of_year&quot;, dayofyear($&quot;date&quot;))</pre><p><strong><em>Listing 12–7</em></strong><em>. 
Deriving the year, month, day, the day within the current week, and the day within the current year using a single DateTime column.</em></p><p>The final DataFrame represents the current date with columns expressing different observations generated from the initial timestamp column.</p><pre>-RECORD 0--------------------------------<br>ts | 2021-09-12 01:04:39.04086<br>tz | Etc/UTC<br>date | 2021-09-12<br>yesterday | 2021-09-11 00:00:00<br>today | 2021-09-12 00:00:00<br>tomorrow | 2021-09-13 00:00:00<br>year | 2021<br>month | 9<br>day | 12<br>day_of_week | 1<br>day_of_year | 255</pre><p><em>Using the Time Functions in a Derived DataFrame</em></p><p>You now have a solid reference to go back to whenever you need a quick refresher on working with date and time, but what about the actual time zone?</p><h3>Time Zones and the Spark SQL Session</h3><p>Working with time-based data requires conversion between time zones for more reasons than just adherence to a common time zone like UTC. You can also use a time zone as a lens to view a data set from the perspective of observers of specific events. This use case comes up when producing insights that are tied to specific geolocations and time zones.</p><h3>Configuring the Time Zone</h3><p>Spark defaults to using the local system time of its environment (your laptop or a remote server). Using the default system time can cause discrepancies when processing data. To ensure consistent behavior regardless of where the application is run, you can configure the default time zone using the config <strong>spark.sql.session.timeZone</strong>.</p><h3>Modifying the Spark Time Zone at Runtime</h3><p>The SparkSession handles timestamp conversions automatically for you globally. However, there may be times when you want to explicitly change the time zone used by a specific query. Create a new paragraph in Zeppelin and add the example code block (listing 12–8), which shows how to dynamically set the time zone.</p><pre>%spark<br>import java.time._<br>val ts = Seq(Instant.now).toDF(&quot;ts&quot;)<br>spark.conf.set(&quot;spark.sql.session.timeZone&quot;, &quot;UTC&quot;)<br>ts.show(truncate=false) // utc<br>spark.conf.set(&quot;spark.sql.session.timeZone&quot;, &quot;America/Los_Angeles&quot;)<br>ts.show(truncate=false) // pst</pre><p><strong><em>Listing 12–8</em></strong><em>. Observing time zone changes in Spark’s output of time based on the SparkSession time zone configuration</em></p><p>Running the paragraph, you will see first-hand how Spark’s observation of time changes in step with the value of the time zone configuration. 
The output captures the reference to the single immutable timestamp (ts) as observed through the lens of the UTC and PST time zones respectively.</p><pre>UTC |2021-09-12 03:02:32.434387<br>PST |2021-09-11 20:02:32.434387</pre><p>The ability to shift the observation of time using the Spark runtime config enables you to use a single source of immutable truth (UTC timestamps) for your backing data, while simplifying how downstream applications compose queries for specific time zones, without the headache of creating multiple copies of a dataset just to handle timestamp conversion between time zones.</p><h3>Using Set Time Zone</h3><p>You can also use SET TIME ZONE (shown in listing 12–9) to directly set the time zone config using Spark SQL to switch between time zones dynamically.</p><pre>SET TIME ZONE &#39;America/Los_Angeles&#39;;<br>SELECT TIMESTAMP &#39;now&#39; as now_pst;</pre><pre>SET TIME ZONE &#39;UTC&#39;;<br>SELECT TIMESTAMP &#39;now&#39; as now_utc;</pre><p><strong><em>Listing 12–9</em></strong><em>: Set time zone for your Spark Session using SET TIME ZONE</em></p><p>You can now declaratively change the time zone in Spark using the SparkSession. This simple configuration enables each application to select how time should be observed while querying and displaying time-based data. Just remember to use a default time zone for your Spark applications to ensure reliable, repeatable results.</p><h3>Seasonality, Time Zones, and Insights</h3><p>Consider the following example. You are tasked with designing a system that automatically tracks and compares the relative changes in observed customer shopping behavior across goods being sold at CoffeeCo to produce insights on the purchase trends of customers. While testing that the system is working as expected, you stumble upon what at first appears to be an anomaly in sales of a particular item. Upon further investigation, it turns out that the item is a <em>Pumpkin Spice Latte</em>, and the sales numbers went from zero average sales per day to customers buying these pumpkin lattes at record levels, as compared to all other items in the category, in only a few days. Is this an anomaly? What other data might be necessary?</p><p>Some goods are only available for a limited amount of time, or at a specific time of year. A pumpkin spice latte is seasonal since it is considered a fall (season) beverage, but what is it about the seasons (holidays) that affects sales? What about emotions?</p><p>I grew up on the northeast coast of the United States, and for me, October always marked an observable change in the seasons: leaves would fall from the trees, and the colder temperatures meant sweaters and jackets, and the peace and quiet reflection of snow in New England. With this change came anticipation of the winter holidays, which meant cider (mulling spices) and pumpkin pie, and as an adult, pumpkin spice lattes bring back fond memories. Is this unique only to me? Is there perhaps a correlation to higher sales on the east coast of the United States? Are sales driven by temperature or other weather patterns like snowstorms or rain?</p><p>While you might not have any emotional connection to fall in New England and you might hate pumpkin spice lattes, it is important to think about what kind of data can be useful to generate insights from analytical observations. 
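</p><p>As a small illustration of deriving seasonal context from the calendar functions covered earlier, the following sketch (hypothetical, not a listing from the book) buckets each event date into a coarse, northern-hemisphere season label that could later be joined against sales data.</p><pre>%spark<br>import org.apache.spark.sql.functions._<br><br>// Hypothetical example: derive a coarse &quot;season&quot; label from the month of each event date.<br>val withSeason = tsDf<br>  .withColumn(&quot;month&quot;, month(to_date($&quot;ts&quot;)))<br>  .withColumn(&quot;season&quot;,<br>    when($&quot;month&quot;.isin(12, 1, 2), &quot;winter&quot;)<br>      .when($&quot;month&quot;.isin(3, 4, 5), &quot;spring&quot;)<br>      .when($&quot;month&quot;.isin(6, 7, 8), &quot;summer&quot;)<br>      .otherwise(&quot;fall&quot;))</pre><p>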
Bottom line, think like a data detective or partner with a great data analyst to ensure that the work that goes into data collection, aggregation, and analysis can produce novel insights that can be used to drive experiences that make customers happy.</p><h3>Timestamps and Dates Summary</h3><p>Thinking about the different angles, lenses, or views you can derive from a single timestamp provides you with a path towards event-based insights. You can calculate information about the time of year (seasonality), whether a date falls on a weekend or weekday (specific to the geolocation of an event, or even the local time zone), whether the event occurred on a holiday, or even whether the date is in close proximity to any other contextually meaningful historic date. Associating what happened, and correlating data to real-world behaviors, is a driving factor behind many successful analytics and insights initiatives. If you want any more proof, search online for “diapers beer correlations” to see how retailers derive insight from customer transactions to move commonly purchased items closer together to increase sales across seemingly unrelated items. We will now look at some techniques for preparing data for analysis.</p><p>I hope you enjoyed the start of Chapter 12. If anyone is interested in a PySpark version of the chapter, then let me know (in comments, or by clapping for the post). If anyone is interested in the writing process, you can check out:</p><p><a href="https://newfrontcreative.medium.com/problems-all-writers-face-c7563c395b69">Problems All Writers Face</a></p><p>Cheers. Scott Haines</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=aa85ed34295e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data+AI 2022: Spark Inception:]]></title>
            <link>https://levelup.gitconnected.com/data-ai-2020-spark-inception-8ef5b67b9acb?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/8ef5b67b9acb</guid>
            <category><![CDATA[data-notebooks]]></category>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[scala]]></category>
            <category><![CDATA[redis]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Wed, 29 Jun 2022 07:35:24 GMT</pubDate>
            <atom:updated>2022-07-29T17:44:34.564Z</atom:updated>
            <content:encoded><![CDATA[<h4>Harnessing the Spark REPL to power Streaming Notebook Environments. For Fun. For Profit?</h4><p>This blog post essentially covers the gist of the presentation, and provides all the useful links to the <a href="https://github.com/newfront/spark-inception">source code</a> and <a href="https://github.com/newfront/spark-inception/blob/main/docs/howto.md">how-to docs</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1002/1*hjX-Q98VMZYg9M2u4owpfA.png" /><figcaption>A student solving a complex problem. Data problems can also feel like this. (<a href="https://www.istockphoto.com/photo/physics-teacher-writing-math-equations-on-a-blackboard-gm1220610073-357483632">photo credit</a>)</figcaption></figure><blockquote>A friend came to me today at 3:45pm and looked at the slides I was going to present in a matter of some odd minutes (4:45pm), and said, “what will keep me at your session, why should I care?” I (introverted engineer) was paying so much more attention to <em>how cool the code was, how easily it worked, and how neat it would be to share this new thing with the world that I missed the most critical component of a </em><strong><em>good presentation</em></strong><em>.</em></blockquote><p>He couldn’t have been more right on the nose though. <em>Thank you </em><a href="https://twitter.com/ItaiYaffe"><em>Itai Yaffee</em></a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*692xG32-LJuHXE_wvVcnMw.png" /></figure><h3>The Presentation</h3><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FND1zXumt2zU%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DND1zXumt2zU&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FND1zXumt2zU%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/3395b116fbe88631a3d20de4106adea2/href">https://medium.com/media/3395b116fbe88631a3d20de4106adea2/href</a></iframe><p>The presentation was split into three pieces.</p><h4>The History: Where did Notebook Environments Come From?</h4><p>I started with some history on why the notebook environment exists. Historically, there is a concept called the “scientific process”. In a nutshell, you have an idea (hypothesis) and in order to prove it (convert the idea into law), you must run a series of experiments (using the same inputs and outputs) and get similar, if not identical, results at the conclusion of each run.</p><h4>The Mental Model</h4><p>Next, we looked at the notebook paradigm and broke it down into component parts to help build a shared nomenclature (words) for talking about notebook environments — namely that notebooks consist of one or more paragraphs (cells) that can be “run” or “played” to trigger the code to be compiled and run <em>interactively</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o5nplCF98b420-LNc59Stw.png" /><figcaption>This is an example of a <a href="https://docs.databricks.com/notebooks/index.html">DataBricks Notebook</a></figcaption></figure><p>The above example of a Databricks notebook shows a mix of making simple SQL calls and outputting the data in a well-formatted table (rows/cols). 
Now that you understand a little more about the notebook environment, let’s go ahead and build our own.</p><h3>Building the Streaming Notebook Application</h3><p>We want to build an environment similar to the Databricks Notebook Environment, and yes, we are using Apache Spark to get there. With those little worries behind us, let’s look at the high-level application architecture to give you a better handle on my vision, so you can follow along and adopt my mental model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FHwJCTAzaz0Sq7WFFtzBDA.png" /><figcaption>The Streaming Notebook Environment: (Left: Redis (Stream)) connects to our (Middle: Spark Structured Streaming) application over a DataStreamReader of type ‘redis’. Magic happens from here on out.</figcaption></figure><p>The DataStreamReader (redis) acts as our streaming API endpoint, and is tasked with delivering the goods (commands) to the application. Once the inbound stream has been processed (Dataset[NetworkCommand]), the application takes advantage of <strong>ds.writeStream.foreachBatch</strong> to perform arbitrarily stateful operations, in this case the reading, evaluation, printing, and looping (<strong>REPL</strong>) of our dynamically generated Spark / Scala code. All while running inside of a structured streaming application. (Think of this like Level 2 in the movie Inception).</p><p>Once the new code has been interpreted, we are essentially constructing a new Spark application <em>on-the-fly</em>. This process is similar to how a more traditional notebook environment works; however, a traditional notebook isn’t driven by a streaming message bus (maybe this was the first of its kind!).</p><blockquote>Each message (you’ll see how we make those next) has a contextual header (<strong>%spark</strong> or <strong>%sql</strong>) that hints at how to interpret the code represented within the notebook paragraph:</blockquote><blockquote><strong>Example<br></strong>%spark<br>val df = spark.read.delta(&quot;s3a://…&quot;)<br>df.createOrReplaceTempView(&quot;theview&quot;)</blockquote><p>The last step in the process is just writing the output (after compiling new code or evaluating a sequence of actions) out to an external Redis table (hashset).</p><p>Now let’s dive into the code:</p><p>Open up the entire project at <a href="https://github.com/newfront/spark-inception">https://github.com/newfront/spark-inception</a>. 
Then let’s start at the edge with the <strong>SparkInceptionControllerApp</strong>.</p><pre>object SparkInceptionControllerApp extends SparkStructuredStreamingApplication[NetworkCommand, NetworkCommand] {<br>  val <em>logger</em>: Logger = Logger.<em>getLogger</em>(&quot;com.coffeeco.data.SparkInceptionControllerApp&quot;)<br><br>  <em>/**<br>   * Modify the base inputStream to<br>   */<br>  </em>override lazy val <em>inputStream</em>: DataStreamReader = {<br>    super.streamReader()<br>      .schema(Encoders.<em>product</em>[NetworkCommand].schema)<br>  }<br><br>  <em>/**<br>   * Use to control the general options on the output Stream<br>   * </em><strong><em>@param writer </em></strong><em>The DataStreamWriter reference<br>   * </em><strong><em>@param sparkSession </em></strong><em>The implicit SparkSession<br>   * </em><strong><em>@return </em></strong><em>The decorated DataStreamWriter<br>   */<br>  </em>override def outputStream(writer: DataStreamWriter[NetworkCommand])<br>                           (implicit sparkSession: SparkSession)<br>  : DataStreamWriter[NetworkCommand] = super.outputStream(writer)<br>    .trigger(Trigger.<em>ProcessingTime</em>(&quot;5 seconds&quot;))<br><br>  @transient implicit lazy val <em>sparkRemoteSession</em>: SparkRemoteSession[_] = <em>SparkRemoteSession</em>()<br>  <em>/**<br>   * Your application will take a DataStreamReader, do something with the inbound micro-batch data<br>   * and then ultimately the data will flow back out of the application, through a DataStreamWriter<br>   *<br>   * </em><strong><em>@return </em></strong><em>The StreamingQuery (which is the full source-&gt;transform-&gt;sink.start)<br>   */<br>  </em>override def <strong>runApp</strong>(): StreamingQuery = {<br>    <em>logger</em>.info(s&quot;run.app.called&quot;)<br>    import <em>sparkSession</em>.implicits._<br><br>    // the inception pipeline<br>    <em>outputStream</em>(<br>      <em>NetworkCommandProcessor</em>(<em>sparkSession</em>).process(<br>        <strong><em>inputStream</em>.load().as[NetworkCommand]</strong><br>      ).<strong>writeStream</strong><br>    ).<strong>foreachBatch</strong>((ds: Dataset[NetworkCommand], batchId: Long) =&gt; <strong><em>processBatch</em></strong>(ds, batchId)<br>    ).start()<br>  }<br><br>  <em>/**<br>   * For each micro-batch, collect the RPC command stream to the driver, process, and pass the results onto Redis<br>   * </em><strong><em>@param ds </em></strong><em>The RPC commands (via Redis)<br>   * </em><strong><em>@param batchId </em></strong><em>The batchId (can be used to skip reprocessing events if checkpoints are enabled)<br>   */<br>  </em>def <strong>processBatch</strong>(ds: <strong>Dataset[NetworkCommand]</strong>, batchId: Long): Unit = {<br>    import <em>sparkSession</em>.implicits._<br><br>    // Collect all of the Distributed Commands and bring down to the Driver<br>    val localResults = <strong>ds.collect().map</strong> { networkCommand =&gt;<br>      // this is running on the driver now (not the executors)<br>      val res = <strong><em>sparkRemoteSession</em>.processCommand</strong>(networkCommand)<br>      // wrap the execution details so we can write the results to redis<br>      <strong>NotebookExecutionDetails</strong>(<br>        networkCommand.notebookId,<br>        networkCommand.paragraphId,<br>        networkCommand.command,<br>        networkCommand.requestId,<br>        networkCommand.userId,<br>        res.commandStatus,<br>        res.consoleOutput<br>      )<br>    }.toSeq<br><br>    // generate a 
new dataframe and then write back to redis<br>    val forRedis = <strong><em>sparkSession</em>.createDataset[NotebookExecutionDetails](localResults)</strong><br>    forRedis<br>      .write<br>      .format(&quot;org.apache.spark.sql.redis&quot;)<br>      .options(<em>sparkSession</em>.sparkContext<br>        .getConf<br>        .getAllWithPrefix(appConfig.<em>SinkStreamOptions</em>).toMap[String, String])<br>      .mode(SaveMode.<em>Append</em>)<br>      .save()<br>  }<br>  run()<br>}</pre><p>The <em>SparkInceptionControllerApp</em> is responsible for reliably reading, transforming, and processing an unbounded stream of <a href="https://github.com/newfront/spark-inception/blob/main/src/main/scala/com/coffeeco/data/rpc/Command.scala#L42">NetworkConnect</a> objects. Let’s look at the <strong>processBatch</strong> function in more details next.</p><pre>def <strong>processBatch</strong>(ds: Dataset[NetworkCommand], batchId: Long): Unit = {<br>  import <em>sparkSession</em>.implicits._<br><br>  // Collect all of the Distributed Commands and bring down to the Driver<br>  val localResults = <strong>ds.collect().map</strong> { networkCommand =&gt;<br>    // this is running on the driver now (not the executors)<br>    val res = <strong><em>sparkRemoteSession</em>.processCommand(networkCommand)<br>    </strong>... (we will look at this more later)<br>}</pre><p>From the <em>processBatch</em> method, we collect the contents of our dataset back to the Spark driver application (so we can generate new dataframes and create new DAGs from our interpreted code). Let’s look at the <a href="https://github.com/newfront/spark-inception/blob/main/src/main/scala/com/coffeeco/data/controller/SparkRemoteSession.scala#L219">SparkRemoteSession</a> class for more details.</p><pre>package com.coffeeco.data.controller<br><br>import com.coffeeco.data.SparkInceptionControllerApp<br>import com.coffeeco.data.controller.SparkRemoteSession.<em>InitializationCommands<br></em>import com.coffeeco.data.rpc.Command.{<em>SparkCommand</em>, <em>SparkSQLCommand</em>}<br>import com.coffeeco.data.rpc.{Command, NetworkCommand, NetworkCommandResult, Status}<br>import com.coffeeco.data.traits.SparkApplication<br>import org.apache.commons.io.output.ByteArrayOutputStream<br>import org.apache.log4j.Logger<br>import org.apache.spark.repl.SparkILoop<br>import org.apache.spark.sql.DataFrame<br><br>import java.io.{File, PrintStream}<br>import java.net.URL<br>import java.nio.file.{Files, Paths}<br>import java.util.concurrent.atomic.AtomicBoolean<br>import scala.reflect.ClassTag<br>import scala.reflect.internal.util.ScalaClassLoader.URLClassLoader<br>import scala.tools.nsc.Settings<br>import scala.util.Properties.{javaVersion, javaVmName, <em>versionString</em>}<br>import scala.util.{Failure, Success, Try}<br><br>object SparkRemoteSession {<br>  val <em>logger</em>: Logger = Logger.<em>getLogger</em>(&quot;com.coffeeco.data.controller.SparkRemoteSession&quot;)<br>  /*<br>    The initialization command bootstraps the internal SparkILoop.<br>    See https://github.com/apache/spark/blob/v3.2.1/repl/src/main/scala-2.12/org/apache/spark/repl/SparkILoop.scala#L45<br>    for reference.<br>    - These commands are evaluated in order, creating and customizing your Spark Session.<br>    - If there are specific things you need to Ensure exist - then this technique helps you setup exactly that<br>    */<br>  val <em>InitializationCommands</em>: Seq[String] = Seq(<br>    &quot;import com.coffeeco.data.SparkInceptionControllerApp&quot;,<br>    
&quot;&quot;&quot;<br>    println(&quot;SPARK INCEPTION REPL: Initialization&quot;)<br>    &quot;&quot;&quot;,<br>    &quot;&quot;&quot;<br>    @transient val spark = SparkInceptionControllerApp.sparkSession<br>    println(spark)<br>    &quot;&quot;&quot;,<br>    &quot;import org.apache.spark.SparkContext._&quot;,<br>    &quot;import spark.implicits._&quot;,<br>    &quot;import org.apache.spark.sql.functions._&quot;<br>  )<br><br>  @transient<br>  @volatile protected[data] var <em>sparkRemoteSession</em>: SparkRemoteSession[_] = _<br><br>  def apply(replInitCommands: Seq[String] = <em>InitializationCommands</em>): SparkRemoteSession[_] = {<br>    if (<em>sparkRemoteSession </em>== null) {<br>      synchronized {<br>        if (<em>sparkRemoteSession </em>== null) {<br>          <em>sparkRemoteSession </em>= new SparkRemoteSession(<br>            app = SparkInceptionControllerApp,<br>            replInitializationCommands = replInitCommands)<br>        }<br>      }<br>    }<br>    <em>sparkRemoteSession<br>  </em>}<br><br>}<br><br>class <strong>SparkRemoteSession</strong>[T &lt;: SparkApplication : ClassTag](app: T, replInitializationCommands: Seq[String]) {<br><br>  import SparkRemoteSession.<em>logger<br><br>  </em>import scala.tools.nsc.interpreter.JPrintWriter<br>  // application enables you to get to the SparkSession, SparkConf, etc<br>  // we use this to ensure we create a shared context since the SparkRemoteSession<br>  // exists (inside of) the SparkInceptionControllerApp, and runs in its own separate<br>  // runtime Context.<br><br>  // save the initial Console output stream<br>  val <strong><em>replOutputStream</em></strong><em> </em>= new ByteArrayOutputStream()<br>  private[this] final val <strong><em>initialConsoleOutputStream</em></strong>: PrintStream = System.<em>out</em>;<br>  // forwarding std.out PrintStream to the replOutputStream (^^)<br>  private[this] final val <strong><em>consolePrintStream</em></strong><em> </em>= new PrintStream(<em>replOutputStream</em>, true)<br>  private[this] final val <strong><em>outputStream</em></strong><em> </em>= new JPrintWriter(<em>replOutputStream</em>, true)<br>  lazy val <em>sparkILoop</em>: SparkILoop = {<br>    val sets: Settings = new Settings<br>    sets.processArguments(<br>      <em>List</em>(&quot;-Yrepl-class-based&quot;, &quot;-Yrepl-outdir&quot;, s&quot;<strong>$</strong>{<em>outputDir</em>.getAbsolutePath}&quot;),<br>      processAll = true<br>    )<br>    if (sets.<em>classpath</em>.isDefault) {<br>      <em>logger</em>.info(s&quot;sets.classpath=default update:java.class.path&quot;)<br>      sets.<em>classpath</em>.value = sys.<em>props</em>(&quot;java.class.path&quot;)<br>    }<br>    sets.<em>usejavacp</em>.value = true<br><br>    // adding additional user jars<br>    if (<em>extraJarsDir</em>.nonEmpty) {<br>      <em>logger</em>.info(s&quot;spark.repl.extra.jars.dir=<strong>$</strong><em>extraJarsDir</em>&quot;)<br>      val jarsDir = new File(<em>extraJarsDir</em>)<br>      val jars: Seq[URL] = jarsDir.listFiles().map {<br>        _.toURI.toURL<br>      }.toSeq<br>      val classPathValue = sets.<em>classpath</em>.value<br>      val updateClassPathValue = (jars ++ Seq(classPathValue)).mkString(File.<em>pathSeparator</em>)<br>      <em>logger</em>.info(s&quot;updated.classpath.value: <strong>$</strong>updateClassPathValue\n&quot;)<br>      sets.<em>classpath</em>.value = updateClassPathValue<br>      updateClassLoader(new URLClassLoader(jars, Thread.<em>currentThread</em>().getContextClassLoader))<br>    }<br>    // 
set the reference classes that can be used for T variables and use in the local Spark ILoop<br>    sets.embeddedDefaults(<em>_contextClassLoader</em>)<br><br>    <em>logger</em>.info(s&quot;generating the SparkILoop&quot;)<br><br>    lazy val <strong>sparkILoop</strong> = new <strong>SparkILoop</strong>(None, <strong><em>outputStream</em></strong>) {<br>      override val <em>initializationCommands</em>: Seq[String] = <em>InitializationCommands<br><br>      </em>/* replace the standard Spark welcome message */<br>      override def printWelcome(): Unit = {<br>        import org.apache.spark.<em>SPARK_VERSION<br>        </em>echo(<br>          &quot;&quot;&quot;<br>          spark.version %s<br>          &quot;&quot;&quot;.format(<em>SPARK_VERSION</em>))<br>        val welcomeMsg = &quot;Using Scala %s (%s, Java %s)&quot;.format(<br>          <em>versionString</em>, javaVmName, javaVersion)<br>        echo(welcomeMsg)<br>        echo(&quot;Type in expressions to have them evaluated.&quot;)<br>        echo(&quot;Type :help for more information.&quot;)<br>      }<br>    }<br><br>    sparkILoop.<em>settings </em>= sets<br>    // load the jar for this class<br>    sparkILoop.createInterpreter()<br>    sparkILoop.initializeSpark()<br>    sparkILoop.initializeSynchronous()<br><br>    sparkILoop<br>  }<br>  val <em>isInitialized</em>: AtomicBoolean = new AtomicBoolean(false)<br>  // enables setting of user_jars, if empty nothing will be loaded<br>  val <strong><em>extraJarsDir</em></strong>: String = app.<em>sparkSession</em>.<em>conf</em>.get(app.appConfig.<em>ReplExtraJarsDir</em>, &quot;&quot;)<br>  // storage location for dynamic compiled classes and for replaying the console history<br>  val <strong><em>replClassDirectory</em></strong>: String = app.<em>sparkSession</em>.<em>conf</em>.get(<br>    app.appConfig.<em>ReplClassDir</em>,<br>    sys.<em>props</em>.getOrElse(&quot;java.io.tmpdir&quot;, &quot;&quot;))<br>  // use this stream to capture console output (like when printing tables)<br><br>  /* where our dynamic classes will be written out to */<br>  val <em>outputDir</em>: File = {<br>    val f = Files.<em>createTempDirectory</em>(Paths.<em>get</em>(<em>replClassDirectory</em>), &quot;spark&quot;).toFile<br>    f.deleteOnExit() // works with normal exit<br>    f<br>  }<br>  private[this] val <em>scalaTypePrefixPattern </em>= &quot;^\\$ires\\w*\\:\\W[A-Za-z]*\\W\\=\\W*&quot;.r<br>  private[this] var <em>_contextClassLoader</em>: ClassLoader = Thread.<em>currentThread</em>().getContextClassLoader<br><br>  <em>/**<br>   * Takes the resulting output from the Repl Output Buffer, cleans the buffer, and returns the output<br>   *<br>   * </em><strong><em>@return<br>   </em></strong><em>*/<br>  </em>def <strong>readOutput</strong>(): Seq[String] = {<br>    synchronized {<br>      <strong>this.<em>replOutputStream</em>.flush</strong>()<br>      <strong>val result = this.<em>replOutputStream</em>.toString(&quot;utf-8&quot;)<br>        .split(&quot;\n&quot;)<br>        .toSeq<br>        .map(filterOutput)<br>        .filter(_.nonEmpty)<br>      this.<em>replOutputStream</em>.reset()<br>      result</strong><br>    }<br>  }<br><br>  def initialize(): Unit = {<br>    this.<em>isInitialized</em>.set(true)<br>    <strong><em>sparkILoop</em></strong>.isInitializeComplete<br>  }<br><br>  <em>/**<br>   * Cleans up the application. 
Should be part of the SparkApplication shutdown process<br>   * additionally you can also just use sys.addShutdownHook { sparkRemoteSession.close() }<br>   * just make sure you close up shop after stopping the outer Spark Application<br>   */<br>  </em>def close(): Unit = {<br><br>    // these are all autoflushing - lets force flush &amp; close up shop<br>    this.<em>replOutputStream</em>.flush()<br>    this.<em>consolePrintStream</em>.flush()<br>    this.<em>outputStream</em>.flush()<br>    // clean up after ourselves<br>    this.<em>replOutputStream</em>.close()<br>    this.<em>outputStream</em>.close()<br>    this.<em>consolePrintStream</em>.close()<br>    if (this.<em>isInitialized</em>.get()) {<br>      this.<em>sparkILoop</em>.closeInterpreter()<br>    }<br>  }<br>  // sparkILoop.replOutput (gives all the information from IMain)<br>  // sparkILoop.settings (cascade of all the things)<br><br>  <em>/**<br>   * THIS IS WHERE THE MAGIC HAPPENS:<br>   * Will trigger an action (evaluating the Remote NetworkCommand in the SparkILoop via the SparkRemoteSession)<br>   *<br>   * </em><strong><em>@param cmd </em></strong><em>The NetworkCommand being processed<br>   * </em><strong><em>@return </em></strong><em>The results of processing the command<br>   */<br>  </em>def <strong>processCommand</strong>(cmd: <strong>NetworkCommand</strong>): <strong>NetworkCommandResult</strong> = {<br>    initialize()<br>    val user = cmd.userId.getOrElse(&quot;nobody&quot;)<br>    val parsed = cmd.parse()<br>    if (parsed._1 != Command.<em>UnsupportedCommand </em>&amp;&amp; authCheck(user)) {<br>      <em>logger</em>.debug(s&quot;security.gate.passed&quot;)<br>      // evaluate the command</pre><pre>      val results = <strong>Console.<em>withOut</em>(<em>consolePrintStream</em>)</strong> {<br>        <strong>System.<em>setOut</em>(Console.<em>out</em>)</strong><br>        parsed._1 match {<br>          <strong>case <em>SparkCommand </em>=&gt; processSparkScala(parsed._2)</strong><br>          <strong>case <em>SparkSQLCommand </em>=&gt; processSparkSQL(parsed._2)</strong><br>          case _ =&gt; (Status.<em>Failure</em>, s&quot;<strong>$</strong>{cmd.command} is not supported&quot;)<br>        }<br>      }<br>      <strong>System.<em>setOut</em>(<em>initialConsoleOutputStream</em>)</strong><br><br>      <strong>NetworkCommandResult</strong>(<br>        requestId = cmd.requestId,<br>        commandStatus = results._1,<br>        consoleOutput = results._2<br>      )<br>    } else NetworkCommandResult(cmd.requestId, &quot;Failure&quot;, s&quot;<strong>$</strong>user is not authorized&quot;)<br>  }<br><br>  <em>/**<br>   * Using the SparkILoop, eval code, interact with live Spark directly<br>   *<br>   * </em><strong><em>@param cmd </em></strong><em>The scala block to evaluate<br>   * </em><strong><em>@return </em></strong><em>The results of running the Scala block<br>   */<br><br>  </em>def <strong>processSparkScala</strong>(cmd: String): (String, String) = {<br>    val result = <strong><em>sparkILoop</em>.interpret</strong>(cmd, synthetic = true)<br>    val consoleOutput = <strong>readOutput</strong>()<br>    (result.toString, consoleOutput.mkString(&quot;\n&quot;))<br>  }<br><br>  <em>/**<br>   * Using the Native app.sparkSession pointer run Spark SQL commands directly<br>   *<br>   * </em><strong><em>@param cmd </em></strong><em>The SQL command will fail or succeed, stack trace will be output in the case of a failure<br>   * </em><strong><em>@return </em></strong><em>The results of interpreting the Spark 
SQL Command<br>   */<br>  </em>def <strong>processSparkSQL</strong>(cmd: String): (String, String) = {<br>    // note: In the case where you want delete protection for tables<br>    // or want to add specific limits (like limit 10 for open queries)<br>    // then you can parse the cmd string and add magic<br><br>    <em>Try</em>(<strong>app.<em>sparkSession</em>.sql(cmd)</strong>) match {<br>      case Success(df: DataFrame) =&gt;<br>        (Status.<em>Success</em>, <strong>df.toJSON.collect()</strong>.toSeq.mkString(&quot;\n&quot;))<br>      case Failure(ex: Exception) =&gt;<br>        ex.printStackTrace(<em>consolePrintStream</em>)<br>        (Status.<em>Failure</em>, readOutput().mkString(&quot;\n&quot;))<br>      case Failure(thr: Throwable) =&gt;<br>        thr.printStackTrace(<em>consolePrintStream</em>)<br>        (Status.<em>Failure</em>, readOutput().mkString(&quot;\n&quot;))<br>      case _ =&gt;<br>        (Status.<em>Failure</em>, &quot;Something went wrong&quot;)<br>    }<br>  }<br><br>  <em>/**<br>   * Basic check - not comprehensive security<br>   *<br>   * </em><strong><em>@param user </em></strong><em>The name of the user of an inbound Command<br>   * </em><strong><em>@return </em></strong><em>True if we want to process the inbound command, false for auth block<br>   */<br>  </em>def authCheck(user: String): Boolean = {<br>    user == app.<em>sparkSession</em>.sparkContext.<em>sparkUser<br>  </em>}<br><br>  <em>/**<br>   * Use this method to filter the Notebook processing output. Eg. do you care about class definitions or types?<br>   * what about formatting the output - like df.show(10, true) will convert to a console printed table.<br>   * Do you want to see that $iresN Unit: ()? probably not<br>   *<br>   * </em><strong><em>@param str </em></strong><em>The string to evaluate<br>   * </em><strong><em>@return </em></strong><em>empty string or cleaned string<br>   */<br>  </em>def filterOutput(str: String): String = {<br><br>    val out = str match {<br>      case _ if str.startsWith(&quot;##&quot;) =&gt; &quot;&quot;<br>      case st if str.startsWith(&quot;$ires&quot;) =&gt;<br>        <em>scalaTypePrefixPattern</em>.replaceFirstIn(st, &quot;&quot;)<br>      case _ =&gt; str<br>    }<br>    out.trim<br>  }<br><br>  private def updateClassLoader(classLoader: ClassLoader): Unit = {<br>    <em>_contextClassLoader </em>= classLoader<br>  }<br><br>}</pre><p>Now moving back from the <strong>SparkILoop</strong> and the <strong>SparkRemoteSession</strong> work back to the <strong><em>SparkInceptionControllerApp.processBatch</em></strong> method to close the final loop.</p><pre>def processBatch(ds: Dataset[NetworkCommand], batchId: Long): Unit = {<br>  import <em>sparkSession</em>.implicits._<br><br>  // Collect all of the Distributed Commands and bring down to the Driver<br>  val localResults = ds.collect().map { networkCommand =&gt;<br>    // this is running on the driver now (not the executors)<br>    val res = <strong><em>sparkRemoteSession</em>.processCommand</strong>(networkCommand)<br>    // wrap the execution details so we can write the results to redis<br>    NotebookExecutionDetails(<br>      networkCommand.notebookId,<br>      networkCommand.paragraphId,<br>      networkCommand.command,<br>      networkCommand.requestId,<br>      networkCommand.userId,<br>      res.commandStatus,<br>      res.consoleOutput<br>    )<br>  }.toSeq<br><br>  // generate a new dataframe and then write back to redis<br>  val forRedis = 
<em>sparkSession</em>.createDataset[NotebookExecutionDetails](localResults)<br>  forRedis<br>    .write<br>    .format(&quot;org.apache.spark.sql.redis&quot;)<br>    .options(<em>sparkSession</em>.sparkContext<br>      .getConf<br>      .getAllWithPrefix(appConfig.<em>SinkStreamOptions</em>).toMap[String, String])<br>    .mode(SaveMode.<em>Append</em>)<br>    .save()<br>}</pre><p>That is it for the main classes in the Spark Inception Controller Application. You can read the notes and use the <a href="https://github.com/newfront/spark-inception/blob/main/README.md">README.md</a> in the source code to learn more. Or ask questions in the comments.</p><h3>Running the End to End Demo</h3><h4>Data+AI Summit 2022: <a href="https://databricks.com/dataaisummit/session/spark-inception-exploiting-apache-spark-repl-build-streaming-notebooks">Spark Inception: Exploiting the Spark REPL to Build Streaming Notebooks</a></h4><p>In order to run the demo, you need to have Docker installed and to create a bridged network: `docker network create mde`.</p><ol><li>Use the following `docker-compose.yml` to get `redis` up and running.</li></ol><pre>version: &#39;3&#39;<br><br>services:<br>  minio:<br>    image: minio/minio:${_MINIO_VERSION:-latest}<br>    volumes:<br>      - ${DATA_ENGINEERING_BASEDIR}/minio/data:/data<br>    expose:<br>      - &quot;9000&quot;<br>      - &quot;9001&quot;<br>    environment:<br>      MINIO_ROOT_USER: minio<br>      MINIO_ROOT_PASSWORD: minio_admin<br>    command: server --console-address &quot;:9001&quot; /data<br>    healthcheck:<br>      test: [&quot;CMD&quot;, &quot;curl&quot;, &quot;-f&quot;, &quot;http://localhost:9000/minio/health/live&quot;]<br>      interval: 30s<br>      timeout: 20s<br>      retries: 3<br>    ports:<br>      - 9000:9000<br>      - 9001:9001<br>    networks:<br>      - mde<br>    hostname: minio<br>    restart: always<br><br>  redis:<br>    image: redis:${_REDIS_VERSION:-latest}<br>    container_name: redis<br>    hostname: redis<br>    networks:<br>      - mde<br>    ports:<br>      - 6379:6379<br>    healthcheck:<br>      test: [&quot;CMD&quot;, &quot;redis-cli&quot;, &quot;ping&quot;]<br>      interval: 5s<br>      timeout: 30s<br>      retries: 50<br>    restart: always<br><br>networks:<br>  mde:<br>    external: true<br>    name: mde</pre><p>Use docker compose to fire up the dependent services.</p><pre>docker compose -f /path/to/docker-compose.yml up -d</pre><p>2. Once Redis is up and running, open a new terminal window; it is time to spin up the <strong>spark-inception-controller</strong>. 
The following command will spin up the Spark Structured Streaming application and provides you with everything you need to be successful while evaluating the demo.</p><pre>docker run \<br>  -p 4040:4040 \<br>  --hostname spark-inception-controller \<br>  --network mde \<br>  -it newfrontdocker/spark-inception-controller:1.0.0 \<br>  /opt/spark/bin/spark-submit \<br>  --verbose \<br>  --master &quot;local[*]&quot; \<br>  --class &quot;com.coffeeco.data.SparkInceptionControllerApp&quot; \<br>  --deploy-mode &quot;client&quot; \<br>  --jars &quot;/opt/spark/app/jars/spark-inception-controller.jar&quot; \<br>  --conf &quot;spark.driver.extraClassPath=/opt/spark/app/jars/spark-inception-controller.jar&quot; \<br>  --driver-java-options &quot;-Dconfig.file=/opt/spark/app/conf/application-live.conf&quot; \<br>  /opt/spark/app/jars/spark-inception-controller.jar</pre><blockquote>The spark-inception-controller container has Apache Spark located at `/opt/spark` and the `spark-inception-controller` is located in `/opt/spark/app`. Next, you’ll spin up the Spark Structured Streaming application.</blockquote><p>Now you have a fully functioning Apache Spark application. The trouble is the application doesn’t do anything since it is essentially a blank canvas waiting for your input. Consider this no different than a Notebook environment, you need to `add some content` in order for something amazing to happen. So we will do just that.</p><h4>Interacting with the Spark Inception Controller</h4><p>We have two services running. The first is <strong><em>redis</em></strong>. Which introduced the Stream data type in Redis 5.0, which we will use as a lightweight streaming message bus, and the second service is our Spark Structured Streaming Application (which is configured to process up to 10 “commands” per micro-batch). Each “command” is the equivalent of a paragraph within a notebook (think Apache Zeppelin, Jupyter, or Databricks Notebooks).</p><p>The command RPC structure is as follows:</p><pre>NetworkCommand {<br> notebookId: String<br> paragraphId: String<br> command: String<br> requestId: String<br> userId: Option[String]<br>}</pre><p>The <strong>NetworkCommand</strong> is used to send either <strong>%spark</strong> or <strong>%sql</strong> commands to be processed in the Spark Inception Controller.</p><p>Let’s look at an example putting all the pieces together.</p><p>We will have <strong>three</strong> <strong>docker exec processes</strong> at play to trigger remote commands (redis#1) on our Spark Inception Controller (spark-inception-controller), and to receive data back (redis#2).</p><p>The results of processing our NotebookCommand:command, aka a paragraph, are fed into a Redis HashSet (used in Spark as a memory optimized columnar table) and the Status of the command “Success, Failure, or Error” is tracked in the <strong>NotebookExecutionDetails</strong> object.</p><pre>NotebookExecutionDetails {<br> notebookId: String,<br> paragraphId: String,<br> command: String,<br> requestId: String,<br> userId: Option[String],<br> commandStatus: String,<br> result: String<br>}</pre><p>This gives us all the pieces now to have fun and exploit Apache Spark.</p><h4>Run the End to End Demo</h4><ol><li>Open up a new Terminal (Console) tab for our Redis Command Line</li></ol><pre>docker exec -it redis redis-cli</pre><p>2. In another terminal window, we will monitor the commands and data being passed back and forth. 
This is our immediate feedback that a command was received, processed, and we can see what the result of running the command actually was!</p><pre>docker exec -it redis redis-cli monitor</pre><p>We will be sending the following three commands (encoded as entries on the Redis Stream via XADD):</p><p><strong>Command #1</strong></p><pre>%sql<br>show tables</pre><p><strong>Command #2</strong></p><pre>%spark<br>case class Person(name: String, age: Int)<br>val people = Seq(Person(&quot;scott&quot;,37),Person(&quot;willow&quot;,12),Person(&quot;clover&quot;,6))<br>val df = spark.createDataFrame(people)<br>df.createOrReplaceTempView(&quot;people&quot;)</pre><p><strong>Command #3</strong></p><pre>%sql<br>select * from people</pre><p>Now you can execute the following <strong>NetworkCommand</strong> in the terminal window you opened up in Step 1.</p><p>The following command creates a new <strong>case class Person</strong>, generates a sequence of <strong>People</strong>, uses the SparkSession reference inside of the SparkILoop to generate a new DataFrame (on-the-fly), and then registers the data as a temp view named people.</p><p><strong>Run the First Remote Streaming Notebook Paragraph</strong></p><pre>xadd com:coffeeco:notebooks:v1:notebook1:rpc MAXLEN ~ 3000 * notebookId notebook1 paragraphId paragraph1 command &quot;\n%spark\ncase class Person(name: String, age: Int)\nval people = Seq(Person(\&quot;scott\&quot;,37),Person(\&quot;willow\&quot;,12),Person(\&quot;clover\&quot;,6))\nval df = spark.createDataFrame(people)\ndf.createOrReplaceTempView(\&quot;people\&quot;)\n&quot; requestId request1 userId &quot;1000&quot;</pre><p>You will see lots of things happening in the process running the Spark Inception Controller. If this is the first command you’ve sent to the application, then the singleton <em>SparkILoop</em> will be scaffolded reusing the same <em>SparkSession</em> as our structured streaming application; otherwise you will see the input/output information for the application. The <em>SparkILoop</em> is generated inside of the <em>SparkRemoteSession</em> class, and is essentially a customized instance of the <strong>spark-shell</strong> running inside of our Structured Streaming Application. This enables us to <em>define new classes, construct tables, you name it</em>, all without having planned for any of it ahead of time.</p><p><strong>Let’s see the 3 items in our generated people table</strong></p><p>To view the contents of the <strong>people</strong> table we created using the first remote command, just pop back into the <strong>redis-cli</strong> (redis#1) from before, and execute the following command.</p><pre>xadd com:coffeeco:notebooks:v1:notebook1:rpc MAXLEN ~ 3000 * notebookId notebook1 paragraphId paragraph2 command &quot;\n%sql\nselect * from people&quot; requestId request2 userId &quot;1000&quot;</pre><p>You should now see the results of the people table output as newline separated JSON in redis.</p><p>From the redis-cli, run the following command.</p><pre>hget &quot;com:coffeeco:notebooks:v1:notebook1:results:paragraph2&quot; result</pre><p>This gives you the contents of the <strong>people</strong> view you created with the first command, transformed to newline-separated JSON for simplicity: `df.toJSON.collect().mkString(&quot;\n&quot;)`.</p><pre>&quot;{\&quot;name\&quot;:\&quot;scott\&quot;,\&quot;age\&quot;:37}\n{\&quot;name\&quot;:\&quot;willow\&quot;,\&quot;age\&quot;:12}\n{\&quot;name\&quot;:\&quot;clover\&quot;,\&quot;age\&quot;:6}&quot;</pre><p>That’s it. 
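</p><p>As a closing aside, if you would rather publish a <strong>NetworkCommand</strong> from application code instead of hand-escaping xadd strings in the redis-cli, a minimal sketch along the following lines should work. This is an addition on my part rather than part of the demo; it assumes the Jedis client (3.x or newer) is on the classpath, and the paragraph and request ids are just example values.</p><pre>import redis.clients.jedis.{Jedis, StreamEntryID}<br><br>// the same stream key the Spark Inception Controller is reading from<br>val streamKey = &quot;com:coffeeco:notebooks:v1:notebook1:rpc&quot;<br>val jedis = new Jedis(&quot;localhost&quot;, 6379)<br><br>// field names mirror the NetworkCommand RPC structure from earlier<br>val fields = new java.util.HashMap[String, String]()<br>fields.put(&quot;notebookId&quot;, &quot;notebook1&quot;)<br>fields.put(&quot;paragraphId&quot;, &quot;paragraph3&quot;)<br>fields.put(&quot;command&quot;, &quot;\n%sql\nselect * from people&quot;)<br>fields.put(&quot;requestId&quot;, &quot;request3&quot;)<br>fields.put(&quot;userId&quot;, &quot;1000&quot;)<br><br>// XADD with an auto-generated entry id (the `*` in the cli examples above)<br>jedis.xadd(streamKey, StreamEntryID.NEW_ENTRY, fields)</pre><p>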
You now have a Streaming Notebook environment using Spark to power the dynamic generation of new Spark applications, or just to fiddle around with and come up with whacky new ideas!</p><p>I hope you find this session companion article useful to further support the Data+AI Summit talk material.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8ef5b67b9acb" width="1" height="1" alt=""><hr><p><a href="https://levelup.gitconnected.com/data-ai-2020-spark-inception-8ef5b67b9acb">Data+AI 2022: Spark Inception:</a> was originally published in <a href="https://levelup.gitconnected.com">Level Up Coding</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Towards Analytics with Redis]]></title>
            <link>https://levelup.gitconnected.com/towards-analytics-with-redis-f2bccb6db77c?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/f2bccb6db77c</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-structures]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[redis]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Fri, 25 Feb 2022 07:11:14 GMT</pubDate>
            <atom:updated>2022-02-27T18:58:25.264Z</atom:updated>
            <content:encoded><![CDATA[<h4>Using Hashes and Binary Data Structures</h4><figure><img alt="Shows Redis Logo and SQL with a question mark? Asking the question if the Redis Data Structure Store can be used for analytics using SQL-Like Capabilities" src="https://cdn-images-1.medium.com/max/1024/1*1DgrG1ysbUABvMSXMrlbwA.jpeg" /><figcaption>Image by Author. Redis Logo via <a href="https://redis.io/">https://redis.io/</a>. Copyright Redis LTD.</figcaption></figure><p>You are probably wondering why anyone would do any kind of analytical processing directly in Redis. Why not just stick to traditional OLAP-style SQL?</p><p>I’m glad you asked. Redis has been around for a while now and is a proven, high-performance, low-maintenance database and excellent multipurpose cache. Redis is special because it is actually a <strong>Data Structure</strong> store with superior random access to complex data representations: from <a href="https://redis.io/commands#bitmap"><strong>Bitmaps</strong></a> and <a href="https://redis.io/commands#hash"><strong>Hash</strong></a> sets to the real-time <a href="https://redis.io/commands#stream"><strong>Stream</strong></a> type, which is essentially a <a href="https://redis.io/commands#sorted-set">Sorted Set</a> of rows sorted by insert time order where each row is represented by a Hash set like entry.</p><p>Getting back on track, Redis supports access patterns that aren’t available with respect to conventional analytical processing databases which means you can flex your creative muscle to solve problems that would otherwise require you to support multiple alternative databases, streaming platforms (Kafka/Pulsar), and API services to achieve the *same results.</p><p>Next, we’ll look at how to use the Bitmap, Hash, and Stream data types to solve actual problems.</p><h4>Get Redis up and Running on Docker</h4><p>You can follow along by spinning up a single node Redis cluster using Docker using the following command from your terminal (or favorite shell program).(*Requires <a href="https://www.docker.com/products/docker-desktop">Docker to be installed</a> on your machine).</p><pre>docker network create --driver bridge analytics<br>docker run --network analytics -p 6379:6379 --name redis6 -e ALLOW_EMPTY_PASSWORD=yes -dP redis redis-server</pre><p>The command (you just ran) will download the <a href="https://hub.docker.com/_/redis">latest official stable Redis image</a> (6.2.6 at the time of writing) unless you have redis:latest downloaded to your docker image cache (you can check using: docker images redis), and then it will start up the container (run) using a bridge to your host (laptop / computer) network named <em>analytics </em>as a detached -d process. The detached process just means that the stdout / stderr container logs are not piped through to your terminal, you can reattach to the container later if you need.</p><p>Next, we’ll just do a simple sanity check that Redis is running. In the same terminal window, you can use the docker exec command to access and execute the <strong>redis-cli</strong> (aka the redis shell) program.</p><pre>docker exec -it redis6 redis-cli</pre><p>You will see the <em>127.0.0.1:6379&gt;</em> prompt which means you are ready to rock and roll my friend!</p><blockquote><strong>Note on Container Behavior: </strong><em>We created a named container called</em> <strong><em>redis6</em></strong> in the docker run <em>command earlier. 
This lets us easily access the container using the docker exec command using the name you supplied (redis6) vs the random name given by the Docker runtime. The side effect is that you will have to stop and remove this container to make container config modifications. For example, if you try to run the command (from above) again you’ll see the following error message:</em></blockquote><blockquote>docker: Error response from daemon: Conflict. The container name “/redis6” is already in use by container “{long container uuid}”. You have to remove (or rename) that container to be able to reuse that name.</blockquote><blockquote><em>You can fix this little problem by simply executing the following: </em><em>docker stop redis6 &amp;&amp; docker rm redis6.</em></blockquote><blockquote><strong><em>Last little gotcha with Data Persistence</em></strong><em>. Because the Redis database (stored as an </em>rdb<em> file) is created and stored internally to the container (redis6), this more importantly means that when you execute the stop &amp; </em>remove<em> command that you’ll also be deleting any data stored in the database.</em></blockquote><blockquote><em>This is because </em>containers<em> are architected to be immutable (non-alterable) for the lifecycle of a specific tag and sha256 pair, while the container can be changed using environment variables and volumeMounts to temporarily augment the container. The actual container image has zero memory regarding your modifications. This guarantees you’ll have a reliable and consistent runtime environment with respect to the state of the container following a strict no surprises rule in terms of the file system, os, etc.</em></blockquote><blockquote><em>You can solve the data persistence problem, lack of data persistence, using </em><a href="https://docs.docker.com/storage/volumes/"><em>volumes</em></a><em> — which are persistent disks (or memory) slices attached to a particular path in the container. Volumes are commonly used to mount secrets and other configs to the common base container image.</em></blockquote><p><strong>Creating and using a Persistent Volume</strong><em><br></em>Now to modify things before we get started. You can go ahead and stop and remove the redis6 container if you created it earlier, then move ahead with the following command to create the redis6 volume.</p><pre>docker volume create redis6</pre><p>Now that you have a persistent volume created you can mount the volume to your Redis container and then you’ll have a way to get back to the data (unless you loose your disk).</p><blockquote><strong>Note</strong>: the total disk space allocated for the Docker runtime on your Host machine will be used for this volume. You can view all docker volumes using docker volume ls and specific information for the redis6 volume using docker volume inspect redis6.</blockquote><h4>Starting Redis back up with a Persistent Volume and Consistent Redis Image</h4><p>Like I covered in the note regarding container behavior above, the first quick start command created a transient Redis container that will forget any changes made to it during the runtime of the container. 
Luckily, using a persistent volume means your data will exist between stop/remove cycles so you can change container configurations, etc., with minimal concern.</p><pre>docker run \<br> --name redis6 \<br> --network analytics \<br> -v redis6:/data \<br> -p 6379:6379 \<br> -e ALLOW_EMPTY_PASSWORD=yes \<br> -dP \<br> redis:6.2.6@sha256:eaa2697033320e8b58b920f5f4aa0c56e8fa5eacf9094168504c7eeabbac8468 redis-server</pre><p>Now let’s move on and learn to store daily and weekly data using Redis Hashes.</p><h4>Data Structure Pattern: Storing Daily and Weekly Hourly Totals</h4><p>Glad you’re still reading along.</p><p><strong>Data Structure</strong>: <a href="https://redis.io/topics/data-types#hashes">Hash</a><br><strong>Key Pattern</strong>: {key-prefix}:{version}:{region}:{dataset}:{identifier_tag}:{date-pattern}<br><strong>Example Key</strong>: acme:v1:us:carts:completed:w:1</p><p>In the above example key we are using the <em>numeric week of the year</em>. This way no one needs to bother with date math, and we can let Redis efficiently compress the shorter key in the global Redis keyspace. The value of date-pattern is {w:1} for week 1 of 2022. <em>*The date pattern can also be simplified to your own use cases. Just make sure you try some things to see what works or follow other best practices like using yyyy:MM:dd style string patterns.</em></p><blockquote>Objective: Store the hour of day across 7 days in the week for the hourly totals</blockquote><p>The Redis hash set associated with our key (<strong>acme:v1:us:carts:completed:w:1</strong>) stores a collection of day of week (0–6) and hour of day (0–23) pointers, for example [0–6]:[0–23] stores the numeric value associated with the [total completed carts] for the exclusive hour of the specific day of the week.</p><p>So how do we set this value?</p><h4>Writing the Hourly Cart Data using HSET</h4><p>Using the Redis cli you can set the total number of completed cart transactions for Sunday @ 00:00:00–00:59:59 into the hash. The signature for adding one or more field/value pairs to the Hash is HSET key field value [field value...]</p><pre>127.0.0.1:6379&gt; HSET acme:v1:us:carts:completed:w:1 0:0 100000<br>~~~</pre><blockquote>Tip: You wouldn’t expect to have someone manually entering all of this data; rather, you would write this analytical data as a by-product of a pipeline job or streaming aggregation into Redis for fast access from other data systems within your data ecosystem, to accomplish the following: a.) to cache the data for more complex use cases, or b.) 
to protect more expensive calls to your source of truth data sources.</blockquote><h4>Fetching the Cart Data using HGETALL</h4><p>To collect all keys and completed cart totals (168 total possible keys if we do <em>0–6:0–23</em> for all hours of the week records) available in the hash map, or to collect a subset or partial week of data, you can use <strong><em>HGETALL</em></strong> to get all tuples in the <strong><em>Hash</em></strong> associated with the <em>Key</em>.</p><pre>127.0.0.1:6379&gt; HGETALL acme:v1:us:carts:completed:w:1<br> 1) &quot;0:0&quot;<br> 2) &quot;100000&quot;<br> 3) &quot;0:1&quot;<br> 4) &quot;120000&quot;<br> 5) &quot;0:2&quot;<br> 6) &quot;150000&quot;<br> 7) &quot;0:3&quot;<br> 8&quot;250000&quot;<br> 9) &quot;0:4&quot;<br>10) &quot;290000&quot;<br>11) &quot;0:5&quot;<br>12) &quot;450000&quot;<br>13) &quot;0:6&quot;<br>14) &quot;450000&quot;<br>15) &quot;0:7&quot;<br>16) &quot;450000&quot;<br>17) &quot;0:8&quot;<br>18) &quot;450000&quot;<br>19) &quot;0:9&quot;<br>20) &quot;1200000&quot;<br>21) &quot;0:10&quot;<br>22) &quot;1800000&quot;<br>23) &quot;0:11&quot;<br>24) &quot;2300000&quot;<br>25) &quot;0:12&quot;<br>26) &quot;2900000&quot;<br>27) &quot;0:13&quot;<br>28) &quot;3600000&quot;<br>29) &quot;0:14&quot;<br>30) &quot;2600000&quot;<br>31) &quot;0:15&quot;<br>32) &quot;2200000&quot;<br>33) &quot;0:16&quot;<br>34) &quot;1200000&quot;<br>35) &quot;0:17&quot;<br>36) &quot;900000&quot;<br>37) &quot;0:18&quot;<br>38) &quot;1000000&quot;<br>39) &quot;0:19&quot;<br>40) &quot;1200000&quot;<br>41) &quot;0:20&quot;<br>42) &quot;1600000&quot;<br>43) &quot;0:21&quot;<br>44) &quot;1300000&quot;<br>45) &quot;0:22&quot;<br>46) &quot;600000&quot;<br>47) &quot;0:23&quot;<br>48) &quot;650000&quot;<br>...</pre><p>The underlying metrics (cart totals) shown in the output above are all efficiently stored internally as unsigned integers, unless you add negative numbers <em>(which you wouldn’t want to since 0 sales is the least amount unless you start thinking about returns — but that would complicate things at this point in the explanation)</em>. Aside from being able to fetch your analytic data for use in dashboards or other APIs for internal or external customers, you can also fetch individual tuples by signature (day_of_week:hour_of_day) as a basis of comparison to compose higher order analytics (like the abandonment rate, or percentage of completed vs started carts per hour), by simply comparing other simple values stored in simple Redis data structures.</p><p>If I generate a full day [0:0…0:23] for Sunday and take a step back to look at the overhead in Redis. I can use the debug object feature to fetch the <em>encoding</em> and <em>serializedlength</em> properties of the data structure represented by the key from earlier with a minor change in the key pattern (d represents a day, while w represented a week — this pattern lets you use days of the year or weeks of the year) without changing the shared common prefix of the key (<em>acme:v1:us:carts:{…}</em>).</p><pre>127.0.0.1:6379&gt; debug object acme:v1:us:carts:completed:d:1<br>Value at:0x7ff7c4866670 refcount:1 encoding:ziplist serializedlength:217 lru:8641926 lru_seconds_idle:30</pre><p>The output of calling debug object informs us that storing one full day of unsigned integers takes up just 217 bytes per day. 
Furthermore, due to the size of the Hash Set (24 string/uint32 tuples) Redis encodes the data using <a href="https://redis.com/ebook/part-2-core-concepts/01chapter-9-reducing-memory-use/9-1-short-structures/9-1-1-the-ziplist-representation/">ziplist</a> encoding for up to a default of 512 entries.</p><p>Now, back to using the Hash data structure.</p><h4>Fetch the hourly total for Sunday at 12pm UTC</h4><p>In order to fetch an entry, we just need to know the position. Sunday is 0 (while Saturday would be 6), and the hour offset is 12 for the noon hour (12:00:00–12:59:59). You can adjust the pointers to begin with 1 for sanity’s sake as well.</p><pre>127.0.0.1:6379&gt; HGET acme:v1:us:carts:completed:w:1 0:12<br>&quot;2900000&quot;</pre><blockquote>Tip: Creating a RedisKey Builder library or Specific High Level SDK can assist with the complexities introduced by using Redis like I am showing here.</blockquote><pre>val total = Carts({version:1,region:us,path:acme})<br>  .weekOf(&quot;2022-02-21&quot;)<br>  .completed()<br>  .getOrElse(...)</pre><p>Next, we will look at storing boolean data using Bitfields.</p><h4>Data Structure Pattern: Store Daily User Activity</h4><p>This next pattern will show you how to efficiently store daily user activity using Bitmaps in Redis.</p><blockquote><strong>Note:</strong> Many of us have written or at least read about how to build a registration system. Integer based identifiers vs UUIDs can be a hot topic. Given the maximum integer value is ~2.1 billion, let’s just assume for the point of this example that you have a distributed registration system ensuring that each customer (User) in your system gets their own auto incrementing id, and your scale is less than 2.14 billion. If we use the integer to represent the User, then we get a much smaller key which can help to reduce overhead in the key space, otherwise UUID v4 is fine but will come with additional overhead.</blockquote><p><strong>Data Structure</strong>: Binary Strings (Byte Array with Bit manipulation)<br><strong>Key Pattern</strong>: {key-prefix}:{version}:{region}:{dataset}:{identifier_tag}:{date-pattern:yyyy:MM}<br><strong>Example Key</strong>: acme:v1:us:users:active:0:2022-02</p><p>The example key above corresponds to the user with the integer id of 0. If using UUIDs, then you might see a key like acme:v1:us:users:active:917dfa61–3c56–4e59-a595–37bb20c40a1d:2022–02 which has some additional memory overhead due to the length of the key.</p><h4>Setting Daily Activity Bits</h4><p>With the red tape out of the way (see the note just above), let’s dive in with User zero. We will set the active/inactive stats for the user using SETBIT.</p><p>We will start out here by setting the first 3 days of the month [0–2]: active on the 1st and the 3rd of February 2022. We will add the first record, check the memory overhead, and then add the other two days.</p><pre>127.0.0.1:6379&gt; SETBIT acme:v1:us:users:active:2022–02 0 1</pre><p>Check the memory overhead here.</p><pre>127.0.0.1:6379&gt; memory usage acme:v1:us:users:active:2022–02<br>(integer) 76</pre><p>Now add two additional days.</p><pre>127.0.0.1:6379&gt; SETBIT acme:v1:us:users:active:2022–02 1 0<br>127.0.0.1:6379&gt; SETBIT acme:v1:us:users:active:2022–02 2 1</pre><p>This process is probably getting tedious by now. Luckily, we can set multiple values at once using the <a href="https://redis.io/commands/bitfield">BITFIELD</a> command. 
Let’s set the rest of the days in the first 7 days of February using the Bitfield command.</p><pre>BITFIELD acme:v1:us:users:active:2022–02 SET u1 3 1 SET u1 4 1 SET u1 5 0 SET u1 6 0</pre><blockquote>The BITFIELD command is <em>variadic, </em>meaning it can take zero or more additional SET {encoding} {offset} {value} pairs. Here we set the days 3–6 for the user. Interestingly enough, because we are dealing with bits here, we don’t change the memory overhead required to store this information, since fewer than 8 values are set on the initial byte array. 1 byte is 8 bits and we need some buffer space too. We can confirm this by calling debug object acme:v1:us:users:active:2022–02. You’ll see the serializedlength is still 2 bytes.</blockquote><blockquote>127.0.0.1:6379&gt; debug object acme:v1:us:users:active:2022–02<br>Value at:0xf55d8330 refcount:1 encoding:raw serializedlength:2 lru:1578061 lru_seconds_idle:46</blockquote><p>Given the default value for each bit is 0, we don’t actually need to do anything when it comes to inactive days. We just need to eventually allocate a large enough binary string to hold the offsets for future days. The Redis bitmap is just access methods on a binary encoded string.</p><p>Let’s assume our User in this use case (above) goes away on vacation for a week, and then starts intermittently coming back to whatever awesome service you provide. We wouldn’t need to do anything until adding a value the next time the User is active. Let’s look at the support for “Sparse” fields next.</p><h4>Sparse Field Support</h4><p>This year February has 28 days. Our bit offsets run from 0–27. We’ve filled in active / inactive bits for the first 7 days [0–6]. We skip the following days [7–14] since the user is on vacation. They hit the service again when coming back on the 16th of the month. <em>*Yes it is just as confusing to write this as it is to read, but we don’t want to waste bits!</em></p><p>The following sets the active user bit for the 16th, 19th, 24th and the 28th of the month.</p><pre>BITFIELD acme:v1:us:users:active:2022–02 SET u1 15 1 SET u1 18 1 SET u1 23 1 SET u1 27 1</pre><p>Now we can see how the memory was allocated behind the Redis scenes, enabling the backing byte array to grow dynamically to account for our user being active on specific days of the week within the 28 days encapsulating the month of February.</p><pre>127.0.0.1:6379&gt; memory usage acme:v1:us:users:active:2022–02<br>(integer) 84</pre><pre>127.0.0.1:6379&gt; debug object acme:v1:us:users:active:2022–02<br>Value at:0xf55d8330 refcount:1 encoding:raw serializedlength:5 lru:1579962 lru_seconds_idle:53</pre><h4>Finding the Total Active Days of the Month</h4><p>Now what is interesting here is that we can figure out if a User has been active or inactive (for the entire month) because there would be no record for the User in the case where no key was created. Check. Interestingly enough, we can also figure out how many days (in total) a user has been active by checking the <a href="https://redis.io/commands/bitcount"><strong>BITCOUNT</strong></a> of the bitmap.</p><pre>127.0.0.1:6379&gt; bitcount acme:v1:us:users:active:2022–02<br>(integer) 8</pre><p>So we know that the user has been active for 8 days. The bitcount operation is O(N) but given the small size it acts like O(1). I don’t know many alternatives to constant time analytics. 
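</p><p>To pull those same numbers from application code rather than the redis-cli, a minimal sketch using the Jedis client could look like the following. This is my own illustration, not something the post depends on, and it assumes Jedis is on your classpath and Redis is reachable on localhost.</p><pre>import redis.clients.jedis.Jedis<br><br>val jedis = new Jedis(&quot;localhost&quot;, 6379)<br><br>// monthly activity bitmap for user zero, following the key pattern above<br>val key = &quot;acme:v1:us:users:active:2022-02&quot;<br><br>// BITCOUNT: total days the user was active this month<br>val activeDays = jedis.bitcount(key)<br><br>// GETBIT: was the user active on the 24th (offset 23)?<br>val activeOn24th = jedis.getbit(key, 23)<br><br>println(s&quot;active days: $activeDays, active on the 24th: $activeOn24th&quot;)</pre><p>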
Consider how much more efficient this is (yes, it is more difficult and not super simple to follow, however…) the alternatives to this approach require querying User activity in what may be a computationally expensive query over activity data stored in your Data Warehouse, or god forbid, your Data Lake!</p><h4>Figuring out if a User was active on a specific Day of the month</h4><p>This final example just shows how to use the BITFIELD command to get a specific value by offset, in our case, the day-1 index. We will check two days, we explicitly set the value of the 24th of the month, but never added a record for the 22nd.</p><pre>127.0.0.1:6379&gt; BITFIELD acme:v1:us:users:active:2022–02 GET u1 21<br>1) (integer) 0</pre><pre>127.0.0.1:6379&gt; BITFIELD acme:v1:us:users:active:2022–02 GET u1 23<br>1) (integer) 1</pre><p>We get back an array of integer response. This is because we have the option of getting multiple values returned from our Bitmap (just like the SET functionality was variadic, so is the GET).</p><pre>127.0.0.1:6379&gt; BITFIELD acme:v1:us:users:active:2022–02 GET u1 21 GET u1 23<br>1) (integer) 0<br>2) (integer) 1</pre><h4>Redis Streams</h4><p>I will cover how to use the Redis Streams data type in a book I have coming out in April 2022 called Modern Data Engineering with Apache Spark. It covers more advanced use cases than this blog post can cover, with hands-on material you can take with you no matter where your career takes you.</p><p><a href="https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512/">Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications</a></p><h4>Conclusion</h4><p>I hope you found this introduction to using Redis on Docker for analytics exciting or at least interesting. Redis is a fascinating swiss-army knife in your data engineering and engineering toolbox. It is also just lovable. I was lucky enough to “have to” use Redis back in 2012 and ten years later I still love what I am working with.</p><blockquote>*Yes. I am purposely skipping the conversation about limitations of Redis clustering, cost of memory vs flash vs ssd vs hdd, economies of scale on humongous data stores with multiple millions of keys, pros/cons of OLAP databases like <a href="https://druid.apache.org/">Druid</a> / <a href="https://pinot.apache.org/">Pinot</a> / <a href="https://clickhouse.com/">ClickHouse</a>, and substantiating my above argument regarding simplifying the data stack with Redis with actual hard facts. Data decision making and selection of technology is multi faceted and this post is intended to introduce you to a nuanced approach to how to serve data.</blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f2bccb6db77c" width="1" height="1" alt=""><hr><p><a href="https://levelup.gitconnected.com/towards-analytics-with-redis-f2bccb6db77c">Towards Analytics with Redis</a> was originally published in <a href="https://levelup.gitconnected.com">Level Up Coding</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Problems All Writers Face]]></title>
            <link>https://newfrontcreative.medium.com/problems-all-writers-face-c7563c395b69?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/c7563c395b69</guid>
            <category><![CDATA[publishing]]></category>
            <category><![CDATA[writing-life]]></category>
            <category><![CDATA[lessons-learned]]></category>
            <category><![CDATA[personal-development]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Wed, 26 May 2021 15:30:30 GMT</pubDate>
            <atom:updated>2023-06-01T05:56:26.155Z</atom:updated>
            <content:encoded><![CDATA[<h4>Learning to break mental blocks and unstuck yourself</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Uy743Kdn33-qZ_I0tXOrjQ@2x.jpeg" /></figure><p>What inpires you? Photo via <a href="https://medium.com/u/2053395ac335">Unsplash</a></p><h3>In Pursuit of Writing ✍️</h3><p>I’ve been writing off and on now for the better half of a decade. Mostly technical writing here on medium now a days, but over the course of the past few years, I’ve had a few articles published, some even showed up in printed magazines, and I was offered an opportunity in 2019 to take a risk and work on my first book.</p><p><a href="https://learning.oreilly.com/library/view/the-rise-of/9781492073697/">The Rise of Operational Analytics</a></p><h3>In it for the long haul</h3><p>I’m working on a new book now and it has me writing many nights and every weekend. This will be the longest thing I’ve ever written weighing in at around 400 pages (or more as there seems to be an endless amount of new content that fits the bill).</p><h4>Understanding My Weaknesses</h4><p>This project has taught me many things about myself, for instance I realized early on that I am terrible at estimating the actual time it takes to write each draft of a chapter. See, I have a problem context switching from left brain to right brain activities and the book I am writing is a hands-on book on Modern Data Engineering.</p><p>This means that each chapter has some flourish, setting up the chapter, connecting the dots from earlier chapters, laying a concrete foundation for more complex and advanced topics. But moving from the setup and the theory, to easily digestible “Fully functioning” code is always like pulling the ebrake at 90 miles an hour. I’m in for a bumpy and dangerous transition. Writing tutorials and trainings takes a surprising amount of time. When I used to present and do live trainings it would take around a month to put together the content and I would workshop the material at conferences and meetups sometimes for over a year like with the example below.</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FhoNa4meDX_c%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DhoNa4meDX_c&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FhoNa4meDX_c%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/9f89b3a714814bc012d246d3bce75259/href">https://medium.com/media/9f89b3a714814bc012d246d3bce75259/href</a></iframe><p>I also have ADHD and while that can help me focus, it also means I have to try harder to keep a train of thought, lest things fall apart due to simple distractions. I found that it helped to sit down to write, put on my headphones and just start writing off my outline. I wouldn’t stop to edit, reread or tweak anything until the idea was out.</p><blockquote>Just get something on the page, since you can always go back and edit it later but interrupting the flow of words is usually a costly mistake.</blockquote><h4>Better Together</h4><p>On the side of wins, I’ve had the pleasure of having conversations with other authors and editors during the past year and it’s been a great opportunity to learn about the publishing industry and what it means to be a writer in the competitive field of technology and computer science. 
I also became a technical reviewer for other embattled authors and it has been a nice 360 degree immersion into all sides and the many angles viewed and visited from first page to publication.</p><p>Through this process I’ve also had the opportunity to really reflect on my time spent teaching and giving tech talks and learned a lot about myself through the course of this journey. Part of me thinks this process has helped me grow as an author, part of me hates myself for even going down this path.</p><blockquote>What is life but a balancing act between idle time and deadlines</blockquote><h4>The Struggle</h4><p>The things I’ve struggled most with during the process have been finding inspiration, removing distractions from my writing environment, and learning to live with the healthy fear that no matter how many edits, and read throughs, that I’ll always find one or more things that I can tweak. A slight improvement here or there (I’m doing a meta edit as we speak). There is no perfect draft of a chapter.</p><p>I also learned that a good development editor is really key to help grind down and polish what feels horribly unfinished, slightly rushed and offer up simple suggestions that yield fantastic enhancements that would have never happened without the give and take.</p><p>Also communication is key, sharing my doubts, stress, writers block, new ideas and avenues of writers drift. Essentially, I brought traditional software feature creep into my writing as things were moving faster (from a technology point of view) and out pacing my ability to write and deliver new chapters that didn’t feel dated the day they were delivered.</p><p>Ultimately, I learned to share early and often, and looked forwards to the feedback, notes and conversations.</p><p>Let’s walk through what I’ve distilled as the problems all authors will face.</p><h4>Inspiration</h4><p>Most people get inspired by listening to music, going out, going on or even just planning a new and exciting adventure somewhere. Simply put we escape our normal lives and leave our emotional baggage at the door. Getting out to do something new and adventurous can naturally produce seratonin and dopamine, which trigger the brain to experience happiness, motivation and more importantly creativity. All which have been harder with the pandemic raging, so finding inspiration in the little things has been key.</p><p>This simple act can help to break up a day that could have otherwise be just a grind. If you don’t have the luxury of being able to just get up and take off then you can look for inspiration by getting up and leaving the room your writing in and just wait for inspiration to return. I will sometimes just wake up in the morning and I feel the words assembling into sentences and I have to write cause clearly it’s a words day.</p><blockquote>It feels like a words day</blockquote><h4>Words Days</h4><p>Word days, for me at least, don’t just happen all the time. I would love for them to just be everyday but as a mental process that I can’t control there is no way of arguing with it. Honestly, I think the words are really just the end result of the mind just chewing on a new idea and giving you no deadline, so it is always refreshingly serendipitous.</p><p>Unfortunately, writing has its ups and downs.</p><h3>Distraction</h3><p>There is no better problem left unsolved as there is with distraction. Since with enough distraction you’ll never remember where you began. Jokes aside, this is the part where I bring up my ADHD for the second time. 
It is easy for me to get distracted. I learned to fight it and also cater to it. I might find myself running some errand and suddenly inspiration hits in unexpected ways, and this random serendipity seemed to come alongside the crushing fear and anxiety of running up to the deadline on a deliverable. I’d much rather be focused than distracted, but sometimes distraction is what the mind needs to complete a thought without “You” interrupting it.</p><h4>In the End</h4><p>I’m not done with the book yet. I’m close though, and I’m doing the best I can to balance my career, family, friends and getting this book out without giving up cycles of rest and downtime or sacrificing my personal life in pursuit of publication. No one tells you that books are emotionally and physically demanding, and they yearn to be completed. But sometimes, it is good to just sit back, relax and write something else to help the words flow across a different medium.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c7563c395b69" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analytical Hashing Techniques]]></title>
            <link>https://medium.com/data-science/analytical-hashing-techniques-5a0393f3fc1c?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/5a0393f3fc1c</guid>
            <category><![CDATA[computer-science]]></category>
            <category><![CDATA[cryptography]]></category>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Thu, 11 Mar 2021 01:52:22 GMT</pubDate>
            <atom:updated>2021-03-11T07:46:47.976Z</atom:updated>
            <content:encoded><![CDATA[<h4>Spark SQL Functions to Simplify your Life</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UJOY9x38Az5YC6fM-yWP9g.png" /><figcaption>Photo Credit: <a href="https://unsplash.com/@swimstaralex">https://unsplash.com/@swimstaralex</a></figcaption></figure><p>Anyone working in the field of analytics and machine learning will eventually need to generate strong composite grouping keys, and idempotent identifiers, for the data they are working with. These cryptographically strong identifiers help to reduce the amount of effort required to do complex bucketing, deduplication, and a slew of other important tasks.</p><p>We will look at two ways of generating hashes:</p><ol><li>Using Base64 Encoding and String Concatenation</li><li>Using Murmur Hashing &amp; Base64 Encoding</li></ol><h4>Spark SQL Functions</h4><p>The <a href="https://spark.apache.org/docs/latest/api/sql/index.html">core spark sql functions library</a> is a prebuilt library with over 300 common SQL functions. However, looking at the functions index and simply listing things isn’t as memorable as running the code itself. If you have the <strong>spark-shell</strong>, then you can follow along and learn some analytical hashing techniques.</p><h4>Spin up Spark</h4><pre>$SPARK_HOME/bin/spark-shell</pre><figure><img alt="Image contents shows the Apache Spark Shell environment running Spark 3.1.1 on Java 11 and using Scala 2.12" src="https://cdn-images-1.medium.com/max/964/1*LvjeiFkfZc8Pz9BdJqfing.png" /><figcaption>Above: Reference to my Shell Environment</figcaption></figure><p>With the spark-shell up and running, you can follow the next step just by running <strong>:paste</strong> in the shell to paste multiline. <em>*(:paste, then paste the code, and then cmd+D to process the code)</em></p><h4>Import the Libraries and Implicits</h4><pre>import org.apache.spark.sql._<br>import org.apache.spark.sql.functions._<br>import org.apache.spark.sql.types._<br>import spark.implicits._</pre><h4>Create a DataFrame</h4><pre>val schema = new StructType()<br>  .add(StructField(&quot;name&quot;, StringType, true))<br>  .add(StructField(&quot;emotion&quot;, StringType, true))<br>  .add(StructField(&quot;uuid&quot;, IntegerType, true))</pre><pre>val df = spark<br>  .createDataFrame(<br>    spark.sparkContext.parallelize(<br>      Seq(<br>        Row(&quot;happy&quot;,&quot;smile&quot;, 1),Row(&quot;angry&quot;, &quot;frown&quot;, 2))<br>      ),<br>      schema<br>    )</pre><p>At this point you should have a very simple DataFrame that you can now apply the Spark SQL Functions to. The contents of which are shown using <strong>df.show()</strong>.</p><pre>scala&gt; df.show()</pre><pre>+-----+-------+----+<br>| name|emotion|uuid|<br>+-----+-------+----+<br>|happy|  smile|   1|<br>|angry|  frown|   2|<br>+-----+-------+----+</pre><p>Now we have a simple data frame. Next we can add a base64 encoder column to the DataFrame simply by using the <strong>withColumn</strong> function and passing in the Spark SQL Functions we want to use.</p><h4>Hashing Strings</h4><p>Base64 Encoded String Values</p><pre>val hashed = df.withColumn(<br>  &quot;hash&quot;, base64(<br>    concat_ws(&quot;-&quot;, $&quot;name&quot;, $&quot;emotion&quot;)<br>  )<br>)</pre><p>The results of this transformation yield us a new column that is the result of base64 encoding the concatenated string values from the columns <strong>name</strong> and <strong>emotion</strong>. 
This is broken down as the following flow.</p><pre>df.withColumn(&quot;concat&quot;,<br>  concat_ws(&quot;-&quot;,$&quot;name&quot;,$&quot;emotion&quot;))<br>  .select(&quot;concat&quot;)<br>  .show</pre><pre>+-----------+<br>|     concat|<br>+-----------+<br>|happy-smile|<br>|angry-frown|<br>+-----------+</pre><p>The end result of the full columnar expression is as follows.</p><pre>scala&gt; hashed.show()<br>+-----+-------+----+----------------+<br>| name|emotion|uuid|            hash|<br>+-----+-------+----+----------------+<br>|happy|  smile|   1|aGFwcHktc21pbGU=|<br>|angry|  frown|   2|YW5ncnktZnJvd24=|<br>+-----+-------+----+----------------+</pre><p>Nice. Right.</p><p>Next. We can look at a stronger technique for hashing. This uses the Murmur3 Hashing algorithm, and explicit binary transformations before feeding into the base64 encoder.</p><h4>Murmur Hashing and Binary Encoding</h4><p>There are many ways to generate a hash, and the application of hashing can be used from bucketing, to graph traversal. When you want to create strong hash codes you can rely on different hashing techniques from <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">Cyclic Redundancy Checks (CRC)</a>, to the efficient <a href="https://en.wikipedia.org/wiki/MurmurHash#MurmurHash3">Murmur Hash (v3)</a>. We will use what we can get for free in Spark which is the Murmur3.</p><pre>hashed.withColumn(&quot;binhash&quot;,<br>  base64(bin(hash($&quot;name&quot;,$&quot;emotion&quot;)))<br>)<br>.select(&quot;uuid&quot;, &quot;hash&quot;, &quot;binhash&quot;)<br>.show(false)</pre><p>Which will return the following rows (comparing the two hashing methods) based on the same input data.</p><pre>+----+----------------+--------------------------------------------+<br>|uuid|hash            |binhash                                     |<br>+----+----------------+--------------------------------------------+<br>|1   |aGFwcHktc21pbGU=|MTAxMTEwMDAxMTAwMDAwMTAwMDAwMDEwMTExMDAxMA==|<br>|2   |YW5ncnktZnJvd24=|MTEwMTAwMDEwMTExMTExMDEwMDAwMDExMDAxMTAxMA==|<br>+----+----------------+--------------------------------------------+</pre><h4>Looking at the Spark Code Generation</h4><p>If you are curious to see how Spark works behind the scenes there is a great new feature of the <strong>explain</strong> function that will enable you to view the code that Spark generates (and optimizes) for your transformations. To view this, all you need to do is the following.</p><pre>hashed.withColumn(&quot;binhash&quot;,<br>  base64(bin(hash($&quot;name&quot;,$&quot;emotion&quot;)))<br>)<br>.select(&quot;binhash&quot;)<br>.explain(&quot;codegen&quot;)</pre><p>This will output the java code and explain more about your computation.</p><figure><img alt="Output a lot of Java code that has been optimized for Spark Catalyst" src="https://cdn-images-1.medium.com/max/1024/1*3RHPyaDwgZTu0zbhN-qugA.png" /><figcaption>Above : Looking at the Spark Code Generation</figcaption></figure><p>This code is part of <a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html">Spark’s Catalyst Optimizer</a> and luckily there is a high probability you will never have to work at this lower level and can likely just go on with your life. But if you are writing custom data source readers and writers then this will likely be something you will want to deep dive into. 
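</p><p>One caveat worth adding before moving on: Murmur3 is fast and great for bucketing, but it is not a cryptographic hash. If you need stronger collision resistance for your identifiers, the same functions library also ships sha2 (along with md5 and sha1). A quick sketch, reusing the DataFrame from above:</p><pre>// sha2 takes a column plus a bit length of 224, 256, 384, or 512<br>val strongHash = df.withColumn(<br>  &quot;sha256&quot;,<br>  sha2(concat_ws(&quot;-&quot;, $&quot;name&quot;, $&quot;emotion&quot;), 256)<br>)<br><br>strongHash.select(&quot;uuid&quot;, &quot;sha256&quot;).show(false)</pre><p>Coming back to the codegen output: 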
If nothing else you can learn more about the underlying mechanics, in the example use case from above, the codegen details the use of the murmur hash library being used. This is a nice tool for debugging and for those who just want to learn in a 360 degree model.</p><h4>Summary</h4><p>You now have two more techniques that you can use in order to create strong composite keys or to use as a spring board for creating idempotent keys. I just thought it would be fun to share these techniques since they come in handy and reuse the core libraries that ship along side Spark. Happy Trails.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5a0393f3fc1c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/analytical-hashing-techniques-5a0393f3fc1c">Analytical Hashing Techniques</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Spark on Kubernetes]]></title>
            <link>https://levelup.gitconnected.com/spark-on-kubernetes-3d822969f85b?source=rss-3b4cab6af83e------2</link>
            <guid isPermaLink="false">https://medium.com/p/3d822969f85b</guid>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[apache-spark]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[engineering]]></category>
            <dc:creator><![CDATA[Scott Haines]]></dc:creator>
            <pubDate>Tue, 29 Sep 2020 06:42:56 GMT</pubDate>
            <atom:updated>2022-01-11T07:40:35.044Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<h4>A Hands-On Introduction: Getting Up and Running.</h4><figure><img alt="An Orchestra Playing a song. Violins and people fading into the background" src="https://cdn-images-1.medium.com/max/1024/1*2tjWcbrgNiubNZlyKdc3sw.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@larisabirta?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Larisa Birta</a> on <a href="https://unsplash.com/s/photos/orchestra?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>This tutorial is aimed at engineers who want to understand how to get up and running on <em>Spark on Kubernetes</em>. It is my hope that you will be able to use the skills developed across this series in order to become proficient at building and deploying Spark applications using the Kubernetes scheduler.</p><p>Given this is the first tutorial in the series, it is also naturally the most simplistic. The idea here is to introduce the concepts and components we will be using across the series, including <a href="https://kubernetes.io/">Kubernetes</a> (K8s), <a href="https://www.docker.com/">Docker</a> and <a href="http://spark.apache.org/">Spark 3.0.1</a>.</p><h4>Pre-Requisites</h4><p>A basic understanding of Apache Spark.</p><p><a href="https://databricks.com/spark/about">Apache Spark™ - What is Spark</a></p><h4>What You Will Learn</h4><p>This first tutorial will cover the following.</p><ol><li>Getting up and Running on Docker Desktop</li><li>Configuring your Docker Environment</li><li>Using the bundled Kubernetes from Docker Desktop</li><li>Installing Kubernetes Dashboard</li><li>Basic Kubernetes Commands</li><li>Local Spark Installation</li><li>How to Build a Custom Spark Docker Image</li><li>How to Push the Custom Spark Image to DockerHub</li><li>Launching your first Spark Application inside of your Local Kubernetes Cluster</li><li>Using the Kubernetes Dashboard to see the Application running</li></ol><p>The next two sections are intended to cover the basics of the technology being used, if you are in a hurry, then you can skip to <strong>Getting up and Running on Docker Desktop </strong>and come back later to learn more about what Kubernetes is, and a little about the <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md">Spark Kubernetes Operator</a> or Spark’s official Kubernetes scheduler.</p><h3>What is Kubernetes?</h3><figure><img alt="Left: Boxes describing containerized applications with one or more replicas. Middle: Ship Steering Wheel. Right: Success" src="https://cdn-images-1.medium.com/max/1024/1*6bIZkX0MBbBSBh9iS3xHTw.png" /><figcaption><strong>Figure 1–1</strong>: Deploy, auto-scale and heal containerized applications orchestrated by Kubernetes: Image Credit via <a href="https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/">Kubernetes</a> from What is Kubernetes?</figcaption></figure><p><a href="https://kubernetes.io/">Kubernetes</a> in a nutshell is an open-source infrastructure management framework that allows you to provision logical slices of your clustered compute infrastructure in order to deploy, scale and manage containerized applications of varying sizes.</p><p>At first glance parts of this mental model feels familiar. If you’ve been on the DevOps side of things for a while — like I have you have — you’ve probably seen many similar solutions in spirit. 
From the humble shell script, across different service bootstrapping libraries like <a href="https://capistranorb.com/">Capistrano</a> (remember Ruby), <a href="https://community.chef.io/">Chef</a> (…more ruby), <a href="https://www.ansible.com/">Ansible</a> (python) across in-house, pre-cloud, and cloud based deployments we have all seen things that work well in one place, and fail horribly in another.</p><blockquote>Remember the last time you or someone you knew changed a configuration, restarted a process and forgot the change was made? Only to see the same issue occur again in the next deployment :)</blockquote><h4>Automation</h4><p>We are in the age of computerized automation, and being able to automate the healing process, trigger replacements or autonomously scaling up a fleet of production machines — all based on “infrastructure as code” can be extremely beneficial. What works locally is a mere blueprint for replicating the exact environment so you can rely on durable, dependable deployments across all your environments — this is the gist of it.</p><p><strong>Infrastructure-as-code</strong> enables configurations of services, shared secrets (locations and not the actual secret…cause we don’t check passwords into github!), disk and ram allocations to be checked into your github and released alongside typical CI/CD pipelines. These configurations are Idempotent, meaning that unless there is a real change (config hash changes), then pushing the same configuration to the cluster twice won’t change the services running inside your cluster. At first glance this may seem trivial but Kubernetes is built off of years of Google sized deployments and their best practices come bundled across the stack.</p><p>For your containerized applications, Spark included, everything is wrapped inside of a versioned container that must be released first into your container registry, or to a public registry.</p><p>Kubernetes supports multiple container formats to prevent lock in, but most people you talk to will tell you about what they are doing with Docker.</p><p>See the <a href="https://kubernetes.io/docs/concepts/containers/#container-runtime">Container Runtime</a> information for more details from the official Kubernetes documentation or take a look through the book linked below.</p><h4><strong>Pods</strong></h4><p>Pods wrap your application configuration, machine layout (file system paths, installed OS, packages, etc), shared resources like cpus (cgroups), ram allocations, application configurations, system settings, and even locally accessible shared files via Volumes (python files, other scripts) along with any additional role based access controls.</p><p>The name comes from a pod, which in nature is a protected resource like a “pod of whales” or like the “pea-in-a-pod”. 
See <a href="https://kubernetes.io/docs/concepts/workloads/pods/">What is a Pod</a> for more details.</p><p>In the case of Spark applications, this application runtime logic is the standard <em>spark.conf.* </em>settings, as well as any shared volumes (disk), application dependencies (jars/python/r etc) and really anything you want fine grained access to when running your spark application.</p><h4>Recommended Reading</h4><p>I wanted to cover the basic lingo but for full end-to-end coverage of Kubernetes I’d recommend the defacto-standard book (below), or just going through the <a href="https://kubernetes.io/docs/home/">documentation</a>.</p><p><a href="https://learning.oreilly.com/library/view/kubernetes-up-and/9781492046523/">Kubernetes: Up and Running, 2nd Edition</a></p><p>Given this post isn’t a primer on Kubernetes, I will cut myself off so we can get into the meat of the tutorial. In the next section we will take a look at Spark on Kubernetes, get an idea of what it is, and then get into setting up the locally environment and running our first deployment.</p><h3>Spark on Kubernetes</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/761/1*tkWi16Clt_6RjLJNmyLVkg.png" /><figcaption><strong>Figure 1–2</strong>: Spark Driver Running inside a Pod. Image via <a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html">Spark Documentation</a></figcaption></figure><h4>The Kubernetes Scheduler</h4><p>The <a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html">Spark Kubernetes Scheduler</a> allows you to deploy your Apache Spark application inside a containerized package, alongside your application configuration, custom environment variables, shared secrets, and shared disk access via Volume mounts, as a what is know as the Driver Pod. This can be seen as an improvement over deployments on Spark Standalone for the reasons below.</p><h4>From Spark Standalone to Spark Kubernetes</h4><p>If you have been working with Spark in standalone mode this can be a bit different in terms of the overall mental model, here is how it breaks down. With <a href="http://spark.apache.org/docs/latest/spark-standalone.html">Spark Standalone</a> deploy mode, you had to deploy a stand alone Master (cluster coordinator), and for High Availability (HA) a second Master running in Standby mode. <em>These machines don’t have to be beefy machines. 2 cores and 4gb ram with decent network did the trick</em>.</p><p>This did however require <a href="https://zookeeper.apache.org/">Zookeeper</a> to manage the cluster state, eg. What applications were running, what allocations each application had assigned — aka where did my CPU and RAM go? — and lastly what executor instances where running where. For many teams wanting to run Spark there was already a good amount of DevOps overhead just to get the environment ready to run the first Spark app.</p><p>You also had to configure and setup your Spark Worker instances in a first step deployment in order to have a cluster to submit work to.</p><h4>Standalone Spark Versioning</h4><p>There was also the tricky cluster version lock in. If you had a cluster running Spark 2.3, well then you would need to deploy and migrate your applications to a new cluster for Spark 2.4 and beyond. 
<p>You also had to configure and set up your Spark Worker instances as a first deployment step in order to have a cluster to submit work to.</p><h4>Standalone Spark Versioning</h4><p>There was also the tricky problem of cluster version lock-in. If you had a cluster running Spark 2.3, then you would need to deploy a new cluster and migrate your applications to it for Spark 2.4 and beyond. There were of course corner cases where minor releases still functioned appropriately, but in general it always felt better to migrate the applications vs hoping for the best.</p><h4>New Mental Model</h4><p>Now if we think abstractly, Kubernetes is conceptually equivalent to a Spark cluster. Although this isn’t strictly true, hear me out. Kubernetes is a general-purpose infrastructure management framework that can schedule deployments, autoscale applications, and mitigate common infrastructure pitfalls like hosts going down, with self-healing. There is a rich ecosystem that enables you to easily monitor the behavior of the cluster via the Kubernetes Dashboard, and you are essentially carving out space in an elastically scalable and highly distributed ecosystem, very much like the shared compute resources in a traditional Spark cluster. <em>Sure, maybe I am reaching.</em></p><h4>BYO-Cluster</h4><p>Given you already have a Kubernetes cluster provisioned, the really big deal, and what I see as the greatest value add, is the following. Each Spark application driver essentially allocates an ad-hoc Spark cluster, as assigned through Executor Pods, based off of an immutable Docker image and an idempotent Pod config.</p><blockquote>Each Spark application driver essentially allocates an ad-hoc Spark cluster, as assigned through Executor Pods, based off of an immutable Docker image and an idempotent Pod config</blockquote><p>If you’ve been in the habit of running Spark Standalone and co-locating Spark applications across shared compute resources, then you also know the growing pains of having to migrate older applications to new versions of Spark to reduce the tech debt accumulated over the years.</p><p>This was a eureka moment, at least for me, given that my team at <a href="https://www.twilio.com/">Twilio</a> has written, and is running, over 60 different Spark applications across different versions of Spark and Scala.</p><h4>Next Steps</h4><p>Okay. There was a lot to say there and I didn’t really even scratch the surface. Let’s move on to actually doing something hands-on. In the next section we will install and get <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a> up and running.</p><h3><strong>Getting up and Running on Docker Desktop</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lafvZDmkB62xcYqGrAHGDQ.png" /><figcaption>Docker Desktop. Image via <a href="https://www.docker.com/products/docker-desktop">Docker</a></figcaption></figure><p>First things first. Let’s get <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a> installed so we are all on the same playing field.</p><h4>Installing Docker Desktop</h4><p>This part is fairly easy given the only step is to download the installer for your platform of choice.</p><ol><li>Install <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a>.</li><li>Create a <a href="https://hub.docker.com/">Docker Hub</a> account. (* only if you want to build and upload custom Spark images.)</li></ol><p>I tend to use 4 of my CPUs and 10 GB of RAM for Docker Desktop. Depending on your machine size you can increase the resources, especially if you are running <em>mainly</em> on Docker/K8s vs running mainly on your core system OS. <em>Just remember to leave some RAM for the core system</em>. I tend to leave 1 core and 2 GB of RAM free on the machine for background operations.</p>
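<p>If you want to confirm what Docker actually received after applying the settings, <em>docker info</em> reports the CPU and memory available to the Docker VM. A quick sketch; the numbers will reflect whatever you dialed in above:</p><pre># show the resources Docker Desktop was granted<br>docker info | grep -E 'CPUs|Total Memory'</pre>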
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YpUWzoBDlF1sRwfcA6Y1vw.png" /><figcaption>My laptop is a 6-core i9, 64 GB RAM, 2 TB SSD, 16-inch MacBook Pro. This setting tends to work well for demos.</figcaption></figure><h4><strong>Enabling Kubernetes</strong></h4><p>Your single-node cluster is just a click away. Seriously, this was a life saver since it reduced the cognitive burden of getting started with K8s.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pN7aunP-uYYFfCdk9lJDcQ.png" /><figcaption>Really. Just click Enable Kubernetes and you’re ready to rock and roll</figcaption></figure><h4><strong>Verifying that Kubernetes is Running</strong></h4><p>If all has gone well (which it probably has), you can open up your favorite terminal program (Terminal, iTerm, etc.) and you should have the K8s command-line interface at your disposal.</p><pre>kubectl cluster-info</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CzMMIG6SilozFa_b8HadbQ.png" /><figcaption>You should see a similar dialog if things are working well.</figcaption></figure><h4><strong>Installing the K8s Dashboard</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JASF8D2oTqN0X4fpsdJuEw.png" /><figcaption>The <a href="https://github.com/kubernetes/dashboard#install">Kubernetes Dashboard</a>: makes it easy to run things yourself</figcaption></figure><p>The most recent set of installation steps can be found on <a href="https://github.com/kubernetes/dashboard#install">GitHub</a>. Things were much easier than I had first anticipated, and hopefully they are as easy for you too. Below is the component you will be installing. Think of this Dashboard as your guide and assistant for common K8s tasks. It lets you click buttons instead of remembering the commands for common things like stopping deployments, executing commands on containers, and so much more.</p><h4>Apply the Dashboard</h4><p>Use the command-line tool to apply the Dashboard config. If you want to download the <a href="https://yaml.org/">YAML</a> and see what it is actually doing, please do so vs blindly applying this link!</p><pre>kubectl apply -f <a href="https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/deploy/recommended.yaml">https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/deploy/recommended.yaml</a></pre><p>Then just start the Kubernetes proxy.</p><p>This proxy will enable you to use localhost:8001 instead of your cluster DNS, which can make it easy to find your way back to any UI running on your cluster.</p><pre>kubectl proxy</pre><h4>Opening up the Dashboard</h4><p>Dashboard: <a href="http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/#/login">http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/#/login</a></p><p>If you try to load the Kubernetes Dashboard now you will get yelled at with <strong>No Token</strong>. This is security-by-design; you have done nothing wrong. You may be asking yourself, <em>“But wait. I just clicked a button to </em><strong><em>Enable Kubernetes</em></strong><em>. How do I fix this when I don’t even know where things are configured?”</em> That is okay. You are not alone.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KgS0itV77r0XfTupgKzQDw.png" /><figcaption>Oh snap. What do we do now? Image via K8s Dashboard UI</figcaption></figure><h4><strong>Accessing the Default Token</strong></h4><p>When your K8s environment is bootstrapped from Docker Desktop, a default system token is generated. You can use the following command to get the token to log in to the Dashboard.</p><pre>kubectl -n kube-system describe secret default</pre><p>Using the command above will yield the following. Just copy the <strong>token</strong> string and log in.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PEhpvivgZFFB8EgPqzcPsg.png" /><figcaption>Copy the token: you’ll paste it into the auth dialog</figcaption></figure>
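<p>If you would rather not fish the token out of that wall of output, here is a small sketch that prints just the token value; it builds on the same <em>describe secret</em> command shown above:</p><pre># print only the token line and strip the label<br>kubectl -n kube-system describe secret default | grep '^token' | awk '{print $2}'</pre>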
<p>Now go back, paste in the token, and log in to the dashboard.</p><h4><strong>Success</strong>.</h4><p>You should now have your Kubernetes Dashboard up and running. Take a moment to pat yourself on the back. We are learning new things, and that is really what this is all about.</p><h3>Installing Spark 3.x</h3><p>Installing Spark is fairly simple. Just go to the Downloads page below and select a pre-built release (tgz), or, for the brave, grab the full source code to build yourself. I’ll wait; it will take a few minutes to download everything.</p><p><a href="http://spark.apache.org/downloads.html">Downloads | Apache Spark</a></p><p>Hopefully you now have the tarball (tgz) downloaded.</p><h4><strong>Create an Install Location</strong></h4><p>Please note: for this tutorial I downloaded spark-3.0.1-bin-hadoop3.2.</p><pre>mkdir ~/install &amp;&amp; cd ~/install<br>mv ~/Downloads/spark-3.0.1*3.2.tgz .</pre><h4><strong>Extract the Tarball and Clean Up</strong></h4><pre>tar -xvzf spark-3.0.1-bin-hadoop3.2.tgz<br>rm -rf spark-3.0.1-bin-hadoop3.2.tgz</pre><h4><strong>Update your BashRC or ZshRC with Spark Home</strong></h4><p>You will need a simple way to reference where you’ve installed Spark. The standard is to use <strong>SPARK_HOME</strong>. I am using zsh; if you are using bash, edit ~/.bashrc instead, but you can follow the same commands.</p><p>Open and edit your ZshRC</p><pre>vim ~/.zshrc</pre><p>And add the following</p><pre>export SPARK_HOME=/Users/`whoami`/install/spark-3.0.1-bin-hadoop3.2</pre><pre>alias spark-shell="$SPARK_HOME/bin/spark-shell"</pre><p>Now ensure these settings stick in the currently open terminal window. Any new window you open will already inherit from your *rc.</p><pre>source ~/.zshrc</pre><p>At this point you have everything installed and ready.</p><h4><strong>Where are we in the process?</strong></h4><p>At this point in the tutorial we’ve done a lot of downloading, installing, and configuring, but we’ve also gotten our local environment set up and ready for Spark application development.</p><ol><li>We’ve configured the Docker settings to be 4 cores by 10 GB RAM.</li><li>We’ve enabled Kubernetes so we can use the simple single-node cluster.</li><li>We’ve installed the Kubernetes Dashboard. This will let us see our application as it is running.</li><li>We’ve downloaded and installed Spark locally on our machine.</li></ol><p>Now we are ready to get started running Spark on Kubernetes.</p><h3><strong>Setting up Spark to Run on K8s</strong></h3><p>We want to be able to build and run local Docker images inside of our Kubernetes cluster. This is essential to running Spark on Kubernetes, and also useful if you want to create a base image for Spark to reuse within your organization. It is a critical skill to be able to reuse base images and efficiently extend these shared images, for example a PySpark image with a custom runtime installation.</p>
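<p>To make that concrete, here is a rough sketch of what extending a shared base image could look like. It assumes the <em>spark-py</em> image we build later in this post and assumes the base image ships pip3 (the stock Python bindings Dockerfile installs it); the team image name and packages are purely illustrative:</p><pre># Dockerfile.custom: a hypothetical team image layered on the shared PySpark base<br>cat &gt; Dockerfile.custom &lt;&lt;'EOF'<br>FROM newfrontdocker/spark-py:v3.0.1-j14<br># switch to root just long enough to add Python packages<br>USER root<br>RUN pip3 install numpy pandas<br># drop back to the default unprivileged spark uid (185)<br>USER 185<br>EOF<br><br>docker build -t myteam/spark-py-custom:v3.0.1-j14 -f Dockerfile.custom .</pre>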
<h4>Building the Base Images</h4><p>First, let’s locate the official Spark Dockerfile. You should be able to find it in the directory of your local Spark installation.</p><pre>ls -l $SPARK_HOME/kubernetes/dockerfiles/spark</pre><p>Here is what the base Dockerfile looks like.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*q6J4qUJdadw2pQnU5bw_TQ.png" /><figcaption>The Spark Dockerfile. Via <a href="https://github.com/apache/spark/tree/branch-3.0/resource-managers/kubernetes/docker/src/main/dockerfiles/spark">Spark on GitHub</a></figcaption></figure><h4><strong>Modifications</strong></h4><p>I wanted to build Spark with Java 14. Spark 3.x finally enables you to move away from Java 8, so you can take advantage of modern Java now. I took a look on <em>Docker Hub</em> and found the <a href="https://hub.docker.com/_/openjdk">OpenJDK</a> resources. I ended up using <strong>14-slim</strong> vs the standard <strong>8-jre-slim</strong> from the default Dockerfile.</p><p><strong>Note</strong>: Spark runs as <em>spark_uid</em>=185 by default. If you are building an image you may want to use your own unprivileged user.</p><h4>Containers on a Diet</h4><p>OpenJDK 14-slim is a deliberately minimal base image. Slim because it is on a diet: most of the default command-line tools and packages are left out to reduce file size. It is worth pointing out that in the case of Docker and Kubernetes, the smaller the container, the faster things come online, thanks to the lighter bandwidth overhead. Smaller images also don’t bloat the image cache or cause other images to be evicted from it.</p><p>Using this tactic you can also create minimal Spark applications by only shipping the jars that you need and that are not already provided by the Spark runtime. It is worth taking the time to inspect the <strong>jars</strong> that are packaged with Spark.</p><pre>ls -l $SPARK_HOME/jars/</pre><p>and using <strong>provided scope</strong> in your Maven or sbt build.</p><h4><strong>Building and Pushing Custom Spark Images</strong></h4><p>If you want to download the images I built then feel free. I have them uploaded to Docker Hub.</p><figure><img alt="Image shows a reference to the DockerHub location of the newfrontdocker/spark Docker container" src="https://cdn-images-1.medium.com/max/1024/1*_EOuGJZNbak5ky3meN5yjQ.png" /><figcaption><strong>Spark</strong>: Java 14 slim via <a href="https://hub.docker.com/repository/docker/newfrontdocker/spark">DockerHub</a></figcaption></figure><figure><img alt="This image shows the DockerHub location of the newfrontdocker/spark-py Docker container" src="https://cdn-images-1.medium.com/max/1024/1*OXEu9lk1t2RlJVuuZP5Wcg.png" /><figcaption><strong>PySpark</strong>: Additional Python dependencies via <a href="https://hub.docker.com/repository/docker/newfrontdocker/spark-py">DockerHub</a></figcaption></figure><p>Otherwise, let’s go through the process of creating the Spark image.</p><h4><strong>Create the Docker Images</strong></h4><p>At the time of writing I built Spark 3.0.1 on the slim Java 14 OpenJDK. I also opted into Python3 support. The following command uses the Spark <strong>docker-image-tool.sh</strong> script.
If you want to try this out for yourself, just replace <strong>newfrontdocker</strong> with your Docker Hub username.</p><pre>$SPARK_HOME/bin/docker-image-tool.sh -r newfrontdocker -t v3.0.1-j14 -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -b java_image_tag=14-slim build</pre><p>This will run for a little while. If things worked out correctly you should see the following.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/788/1*i747jXG8ggl8Wu0AuZll8A.png" /><figcaption><strong>Command-Line</strong>: Successful output for the Java/Scala Spark Docker build</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/962/1*Md6M7HqqiRK5x2w_HUf3vw.png" /><figcaption><strong>Command-Line</strong>: Successful output for the PySpark build. Notice it skips the java image tag — this isn’t an error!</figcaption></figure><p>You should now have two new images located in your local Docker image cache. You can check them with the following docker command.</p><pre>docker images</pre><figure><img alt="shows the output of running docker images command. Listing two images built" src="https://cdn-images-1.medium.com/max/1024/1*dObQR5L-3B_il3Yerh_irg.png" /></figure><p>Having the images locally is good for testing that things work the way you expect. The end game, however, is usually to push the build up to your Docker repository. We will do that now.</p><h4><strong>Pushing your Spark Image to DockerHub</strong></h4><p>Given that all we did was build the spark &amp; pyspark images, we still need to push them to DockerHub.</p><p><strong>Tip</strong>: Given that your Kubernetes cluster can be running locally or remotely, it makes sense to push the final Spark build when you are ready to start working on things outside of your local machine.</p><p>If you want to remove the images you created, perhaps because you didn’t like the repository name, you can use the following command to do just that.</p><pre>docker image rm {containerId}</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*74soyz6TpbZNlc96ULdFIg.png" /><figcaption><strong>Example: </strong>removing an image I didn’t like.</figcaption></figure><h4>Pushing the Images</h4><p>If you are happy with things, feel free to ship the build to your container registry.</p><p>The first step is to make sure you are logged into Docker. Here are the two ways to do it.</p><h4><strong>Use the credentials from Docker Desktop</strong></h4><pre>docker login docker.io</pre><h4><strong>Fresh Login</strong></h4><pre>docker login docker.io --username newfrontdocker</pre><p>Either should end up with</p><pre>&gt; Login Succeeded</pre><h4><strong>Push the Build</strong></h4><p><strong>Spark Python Docker Image</strong>:</p><pre>docker push newfrontdocker/spark-py:v3.0.1-j14</pre><p><strong>Spark Docker Image</strong>:</p><pre>docker push newfrontdocker/spark:v3.0.1-j14</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*69D86eGfmbIQLiIXrF9bIQ.png" /><figcaption><strong>Command-Line</strong>: Pushing to Docker</figcaption></figure><p>Well, at this point in the tutorial we have arrived at the main mission. In the next section we will finally be running a Spark application on Kubernetes. If you have stuck with it up until this point, thanks.</p><h3><strong>Spark on Kubernetes: Hello World</strong></h3><p>We are now preparing to run our first Spark job on Kubernetes. This is a big deal. This is why we are here!</p>
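<p>Before we submit anything, it can help to peek inside the image and confirm the path to the examples JAR we are about to reference. A quick sketch using the image built above, overriding the entrypoint so we can simply run <em>ls</em>:</p><pre># list the example JARs baked into the Spark image<br>docker run --rm --entrypoint ls newfrontdocker/spark:v3.0.1-j14 /opt/spark/examples/jars</pre><p>You should see spark-examples_2.12-3.0.1.jar in the listing, which is exactly the path the submit below references via the local:// scheme.</p>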
<ol><li>We need to find the cluster master.</li></ol><pre>kubectl cluster-info</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CzMMIG6SilozFa_b8HadbQ.png" /><figcaption><strong>Remember this</strong>: now we are actually using the values from the cluster info</figcaption></figure><p>What we care about in our case is the <strong>Kubernetes Master</strong>, since we need to reference it in our <em>Spark Submit</em>.</p><pre>$SPARK_HOME/bin/spark-submit \<br>  --master k8s://https://kubernetes.docker.internal:6443 \<br>  --deploy-mode cluster \<br>  --name spark-pi \<br>  --class org.apache.spark.examples.SparkPi \<br>  --conf spark.executor.instances=5 \<br>  --conf spark.kubernetes.container.image=newfrontdocker/spark:v3.0.1-j14 \<br>  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar</pre><blockquote>It is worth pointing out that <strong>local:///opt/</strong> is actually referencing the JAR path inside of the Docker container. This can be useful when you want to access local configuration files, Spark environment variables, logging profiles, and so much more from within your deployment.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Tx9fK03HIwsfgCoRonqzPA.png" /><figcaption><strong>Spark Submit Output</strong>: You should see something similar when running your Spark application</figcaption></figure><h4><strong>Spark in a Pod</strong></h4><p>If you open up the <strong><em>Kubernetes Dashboard</em></strong> while your Spark application is running, you will see a similar view. It is worth noting that since we configured spark.executor.instances=5 in our Spark Submit, you will see 5 executor pods running for your application. This is equivalent to having 5 executor instances in the traditional Spark Standalone world.</p><figure><img alt="Kubernetes Dashboard: View of the Spark Executor Pods Running — there is a circular pie chart above a tabular list of pods" src="https://cdn-images-1.medium.com/max/1024/1*5F1Bd319NCtvvUFPLDaO3w.png" /><figcaption><strong>Kubernetes Dashboard</strong>: View of the Spark Executor Pods Running</figcaption></figure>
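<p>The Dashboard isn’t the only way to watch the job. The same information is available from the command line; a small sketch, where the driver pod name is whatever <em>kubectl get pods</em> shows for your run:</p><pre># watch the driver and executor pods get created and torn down<br>kubectl get pods -w<br><br># tail the driver logs, which is where the SparkPi result is printed<br>kubectl logs -f &lt;driver-pod-name&gt;<br><br># optionally forward the Spark UI from the driver pod to localhost:4040<br>kubectl port-forward &lt;driver-pod-name&gt; 4040:4040</pre>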
<h3>That’s a Wrap</h3><p>We really covered a lot of ground during this first tutorial. I really enjoy this part of the learning cycle: it is when the creative juices start flowing and new ideas come alive. It is one of the brain’s ways of rewarding the learning process, and I hope you were able to stick around up until this point. Since we moved fairly quickly, here is a rewind.</p><h4><strong>Docker Desktop</strong></h4><p>We installed and got Docker Desktop up and running locally on our computers. High fives all around. This included provisioning some space on our laptops to power the Docker environment. I went with a 4-core by 10 GB RAM slice off the old laptop. If you have more cores and more RAM, Kubernetes will gladly take them from you to power things to run quicker and with more parallelism.</p><h4><strong>Local Kubernetes</strong></h4><p>We took things further with the click of a checkbox and enabled single-node Kubernetes cluster support in our local Docker. This was the start of our Kubernetes journey together.</p><h4><strong>Kubernetes Dashboard</strong></h4><p>We fetched and applied the configuration for running the Kubernetes Dashboard so we could get some eyes on the cluster. This is a much better view of the cluster than just the <strong>Running</strong> green light on the Docker Desktop Dashboard.</p><ul><li>We explored some of the `kubectl` commands to find our token and actually log into the dashboard</li><li>We saw how to get some really important cluster information by using the `kubectl cluster-info` command</li></ul><h4><strong>Installing Apache Spark</strong></h4><p>We installed Apache Spark locally in order to gain access to the core Dockerfile that ships alongside the project. This is an important part of the process, since when we create a new Docker image we need to be able to ship the Spark source locally as well.</p><p>If we were just looking to do some basic things and get to know Spark with the spark-shell, then we could have used it out of the box to play around with the Spark runtime. $SPARK_HOME/bin/spark-shell and $SPARK_HOME/bin/pyspark respectively give you access to the local shells.</p><h4><strong>Building the Spark Docker Image</strong></h4><p>We were able to build a local version of Spark, bringing our own JDK to the table with openjdk:14-slim, because diets are nice when shipping containers. This allowed us to <strong>choose</strong> how we want to build these images without bringing in the collective kitchen sink. The ability to test locally, list local images, remove images selectively, and lastly push an image to a public repository is all in your toolbox now.</p><h4><strong>Running Spark Locally on Kubernetes</strong></h4><p>We finished up this whirlwind deep dive by running one of the example Spark applications. This was done as a <strong>HelloWorld</strong> for testing the cluster, and now we can use the work completed here to move on to bigger and better things.</p><h4>Next Steps</h4><p>In my next tutorial I am going to show you how to use these two base Spark images to simplify running Spark Scala applications as well as PySpark applications that require custom Python installations. We will look more into how Kubernetes configurations work in order to <strong><em>enable service and role based access</em></strong> to the processes running within our <em>driver</em> and <em>executor</em> pods.</p><p>Thanks for taking the time to dive into Spark on Kubernetes.</p><h4><strong>Notes and Stumbling Blocks</strong></h4><p>As an aside, going off of the Spark on Kubernetes docs provided the answers for getting started. The hard part was actually getting the images to build and upload to DockerHub. I am embarrassed to say I mashed my keyboard for over 2 hours trying to understand that the -r &lt;repo&gt; argument was meant to be just my username, e.g. newfrontdocker. Shame on me for overthinking things.</p><p>DockerHub’s registry actually helped me debug the issue, since I couldn’t create folders within a repository. Looking at the docker-image-tool.sh script from the Spark Kubernetes directory of the source was helpful to see how the artifacts were being composited. That helped me figure out that I should create two repositories: spark and spark-py.</p><h4><strong>Links and Thanks</strong></h4><p>There were a ton of resources available online that provided help as I was journeying through this process as well.
I have broken down the links by content.</p><h4>Docker</h4><p><a href="https://docs.docker.com/docker-hub/">https://docs.docker.com/docker-hub/</a></p><p><a href="https://hub.docker.com/_/openjdk?tab=tags&amp;page=1&amp;name=14-slim">https://hub.docker.com/_/openjdk?tab=tags&amp;page=1&amp;name=14-slim</a></p><p><a href="https://docs.docker.com/engine/reference/commandline/login/">https://docs.docker.com/engine/reference/commandline/login/</a></p><p><a href="https://www.docker.com/blog/docker-hub-new-personal-access-tokens/">https://www.docker.com/blog/docker-hub-new-personal-access-tokens/</a></p><p><a href="https://docs.docker.com/engine/reference/commandline/push/">https://docs.docker.com/engine/reference/commandline/push/</a></p><p><a href="https://docs.docker.com/docker-hub/repos/">https://docs.docker.com/docker-hub/repos/</a></p><p><a href="https://stackoverflow.com/questions/54410909/what-is-the-correct-upload-url-for-a-dockerhub-repo">https://stackoverflow.com/questions/54410909/what-is-the-correct-upload-url-for-a-dockerhub-repo</a></p><h4>Spark on K8s</h4><p><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html">https://spark.apache.org/docs/latest/running-on-kubernetes.html</a></p><p><a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template">https://spark.apache.org/docs/latest/running-on-kubernetes.html#pod-template</a></p><p><a href="https://www.slideshare.net/databricks/running-apache-spark-jobs-using-kubernetes">https://www.slideshare.net/databricks/running-apache-spark-jobs-using-kubernetes</a></p><p><a href="https://github.com/marcelonyc/igz_sparkk8s">https://github.com/marcelonyc/igz_sparkk8s</a></p><p><a href="https://collabnix.com/kubernetes-dashboard-on-docker-desktop-for-windows-2-0-0-3-in-2-minutes/">https://collabnix.com/kubernetes-dashboard-on-docker-desktop-for-windows-2-0-0-3-in-2-minutes/</a></p><p><a href="https://www.oreilly.com/content/how-to-run-a-custom-version-of-spark-on-hosted-kubernetes/">https://www.oreilly.com/content/how-to-run-a-custom-version-of-spark-on-hosted-kubernetes/</a></p><h4><strong>Docker Desktop Alternative</strong></h4><p>Lastly, if you just want to run <em>Kubernetes</em> then you can also use <a href="https://kubernetes.io/docs/setup/learning-environment/minikube/"><em>Minikube</em></a>.</p><hr><p><a href="https://levelup.gitconnected.com/spark-on-kubernetes-3d822969f85b">Spark on Kubernetes</a> was originally published in <a href="https://levelup.gitconnected.com">Level Up Coding</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>