F# Async Guide

Leo Gorodinski — Tue, 26 Jun 2018 16:16:41 GMT

This is a usage guide for asynchronous programming in F# using the Async type. The content should be helpful to existing F# Async users or those approaching F# concurrency from another programming language, and is complementary to existing material such as Asynchronous Programming by Scott Wlaschin, Async in C# and F# by Tomas Petricek and Async Programming in F# on MSDN.

Definition — the definition of the F# Async type, its interaction with the thread pool and then, async workflows.
Hazards — common programming hazards with F# Async and workarounds.
Related Programming Models — relationship to other programming models.
Concepts — narrative on general concepts in concurrency used throughout the post.

Definition

The F# Async type represents an asynchronous computation. It is similar in concept to System.Threading.Tasks.Task in .NET, java.util.concurrent.Future in Java, a goroutine in Go, Control.Concurrent.Async in Haskell, Event in Concurrent ML, or promise in JavaScript, with some important differences.

Overall, F# Async serves the following needs:

It allows for more efficient use of OS threads by preventing the need to block them when waiting.
It provides constructs for concurrency and parallelism in addition to sequential computation.
It indicates that a computation is long-running, or may not be expected to terminate.

Programmatically, the Async type is defined as follows:

type Async<'a> = ('a → unit) → unit

In other words, a value of type Async<'a> is a function that accepts a callback function of type 'a → unit and returns unit.

We can derive the Async type as follows. Suppose you've an operation that transmits and then waits for a response to an HTTP request:

let download (url:string) : string =
  let client = new WebClient()
  let res = client.DownloadString url
  res

In this case, the call to DownloadString is blocking - the OS thread on which the execution is taking place becomes blocked for the duration of the IO operation. When a thread is blocked, it isn't directly consuming CPU resources, however it continues to consume stack space, which it needs to resume when the operation completes. These context switches, as a thread blocks and then unblocks, are costly. We can make more efficient use of threads and processing resources by using the calling thread to invoke the operation, and when the IO operation completes, send a notification to a callback, on another thread. This can be done as follows:

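let downloadCallback 
  (url:string) 
  (callback:string -> unit) : unit =
  let client = new WebClient()
  // invoke the callback once the download completes
  client.DownloadStringCompleted 
  |> Event.add 
    (fun args -> callback args.Result)
  // DownloadStringAsync accepts a Uri rather than a string
  client.DownloadStringAsync (System.Uri url)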

In this case, the call to downloadCallback returns immediately, and the provided callback is subscribed to an event that triggers when the invoked operation completes. This allows the callback to be called from a different thread, and allows the calling thread to continue doing useful work rather than remaining blocked. If you squint a little, you can see that the type of downloadCallback url is (string → unit) → unit and if we generalize that to a generic type 'a we end up with the definition of Async<'a> above.

Using the Async type, we have the following signature for the operation:

val downloadAsync : string → Async<string>

At this point, it is possible to understand why this computation is asynchronous. It is asynchronous because there are two core steps involved — the invocation of the operation and the receipt of the response. Furthermore, we can see how the async type allows us to manage OS threads more efficiently — rather than blocking the calling thread, the calling thread remains free to do other work. We’ll cover this in more detail below.

The actual implementation of the Async type, available on the F# repo, is more involved due to the need to support exceptions, cancellations, a growing stack - some of which are discussed later on. The central 'constructor' for an Async value is the Async.FromContinuations function:

Async.FromContinuations : 
  (('a → unit) * 
   (exn → unit) * 
   (OperationCanceledException → unit) → unit) → Async<'a>

In addition to the successful completion callback 'a → unit, it takes callbacks (continuations) for errors and cancellations.
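As a rough illustration of how these continuations might be used, the sketch below lifts a callback-style operation into an Async value. The names fetchWithCallbacks and fetchAsync are hypothetical stand-ins (they are not part of the original post or of WebClient), and the "download" is simulated on a thread pool thread rather than performed over the network.

```fsharp
open System
open System.Threading

// Hypothetical callback-style API standing in for downloadCallback:
// it reports either a result or an exception from a thread pool thread.
let fetchWithCallbacks (url: string) (onOk: string -> unit) (onError: exn -> unit) : unit =
  ThreadPool.QueueUserWorkItem (fun _ ->
    if String.IsNullOrEmpty url then
      onError (ArgumentException("empty url") :> exn)
    else
      onOk ("response from " + url))
  |> ignore

// Lift the callback API into Async by wiring up the success and error
// continuations; the cancellation continuation is unused in this sketch.
let fetchAsync (url: string) : Async<string> =
  Async.FromContinuations (fun (ok, error, _cancelled) ->
    fetchWithCallbacks url ok error)
```

Running fetchAsync "x" yields the simulated response, while fetchAsync "" surfaces the ArgumentException through the error continuation.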

Thread Pool

Rather than managing threads directly, the Async type works along with the .NET Thread Pool to schedule work. The thread pool maintains a pool of threads, growing and shrinking as needed and provides the following key interface:

ThreadPool.QueueUserWorkItem : (unit → unit) → unit

This operation queues an action unit → unit to be run on a thread pool thread.

While it is possible to simply start a new thread whenever an action needs to be scheduled, using a thread pool allows thread creation costs as well as context switching costs to be amortized. Rather than blocking threads, threads are kept busy with a queue of work maintained by the thread pool. In the example above, the DownloadStringCompleted event might be triggered from a thread pool thread. This approach to scheduling work items is sometimes referred to as green threads. The relationship between Async computations and OS threads is not one-to-one: more Async computations do not automatically result in more threads, and in particular, increasing parallelism isn't achieved by increasing the number of threads, but rather by increasing the number of in-flight computations. In effect, the Async type encapsulates callbacks and the ThreadPool into a higher-level programming model as described below. With that said, in some cases, tuning the thread count limits on the ThreadPool can improve performance.
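The point about in-flight computations versus threads can be illustrated with a small sketch. The simulatedRequest function is a hypothetical stand-in for an IO operation; it uses Async.Sleep, which waits on a timer instead of blocking a thread.

```fsharp
// A hypothetical IO-like operation: waits without holding a thread.
let simulatedRequest (i: int) : Async<int> = async {
  do! Async.Sleep 100
  return i }

// 1000 computations are in flight at once, yet the thread pool does not
// need 1000 threads: a sleeping computation occupies no thread.
let results : int[] =
  [ for i in 1 .. 1000 -> simulatedRequest i ]
  |> Async.Parallel
  |> Async.RunSynchronously
```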

Async Workflows

F# async workflows provide a syntax that permits expressing sequential workflows in terms of Async computations. For example, given the downloadAsync operation above, we may want to perform another download based on the result of the first and then perform a transformation on both results:

let callApi (url:string) = async {
  let! data1 = downloadAsync url
  let! data2 = downloadAsync (url + data1)
  return data1,data2 }

While this workflow is expressed sequentially, the underlying computation runs asynchronously and avoids blocking an OS thread during the processing of downloadAsync. This is achieved by translating the workflow syntax into Bind and Return operations defined on the Async type.

The result of the call to downloadAsync is passed to Bind, and the portion of the workflow after the call to downloadAsync is passed to Bind as the continuation. The Return operation takes a value, in this case the pair of data1 and data2, and lifts it into an Async value. The async workflow in F# also defines operations to support other control flow constructs - loops, delayed execution and exception handling.
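The translation described above can be sketched roughly as follows. This is an approximation of what the compiler produces, written by hand against the async builder; downloadAsync here is a hypothetical in-memory stand-in so that the example is self-contained.

```fsharp
// Hypothetical stand-in for the downloadAsync operation from the text.
let downloadAsync (url: string) : Async<string> =
  async.Return (sprintf "<%s>" url)

// The workflow as written with the async { } syntax.
let callApiSugared (url: string) : Async<string * string> = async {
  let! data1 = downloadAsync url
  let! data2 = downloadAsync (url + data1)
  return data1, data2 }

// Approximately the same workflow expressed with explicit builder calls:
// each let! becomes a Bind whose second argument is the rest of the workflow.
let callApiDesugared (url: string) : Async<string * string> =
  async.Delay (fun () ->
    async.Bind (downloadAsync url, fun data1 ->
      async.Bind (downloadAsync (url + data1), fun data2 ->
        async.Return (data1, data2))))
```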

A chain of bind operations forms a sequential computation. Much of the theory of computation (i.e. Turing machines, lambda calculus) models sequential computation, where steps happen one after another. Of course, things wouldn’t be much fun if we were limited to sequential computation. The F# Async type also allows us to express parallel computations, using the Async.Parallel : Async<'a>[] → Async<'a[]> operation, for example. This operation takes an array of async computations and returns a single async computation that will yield their aggregated results, thereby expressing fork-join parallelism. The "fork" part is the starting of the provided computations, and the "join" part is awaiting their results.

Another way to express parallelism is with the Async.StartChild : Async<'a> → Async<Async<'a>> operation. This operation starts a computation and returns a 'handle' to a computation that can be used to rendezvous with the result at a later time. This makes it possible to start multiple computations to be run in parallel, but still cleanly gather their results without any low-level threading constructs in play. This in turn can be used to implement an operation such as Async.Parallel : Async<'a> * Async<'b> → Async<'a * 'b>. This operation can also be implemented using the sequential workflow with the Bind operation; however, the provided computations would then run sequentially rather than in parallel, changing the semantics significantly.
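A minimal sketch of both styles of parallelism follows. The square function and the pairwise combinator parallel2 are hypothetical names introduced for illustration; only Async.Parallel and Async.StartChild come from FSharp.Core.

```fsharp
// A hypothetical computation that takes a little while to finish.
let square (x: int) : Async<int> = async {
  do! Async.Sleep 50
  return x * x }

// Fork-join: start all computations, then gather their results in order.
let squares : Async<int[]> =
  [| 1; 2; 3 |] |> Array.map square |> Async.Parallel

// A pairwise combinator built from Async.StartChild: both children are
// started before either result is awaited, so they run in parallel.
let parallel2 (a: Async<'a>) (b: Async<'b>) : Async<'a * 'b> = async {
  let! aHandle = Async.StartChild a
  let! bHandle = Async.StartChild b
  let! ra = aHandle
  let! rb = bHandle
  return ra, rb }
```

Swapping the two StartChild calls for direct let! bindings on a and b would produce the same result type but run the computations one after the other.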

Hazards

There are several hazards to programming with F# Async. Some are already covered in Tomas Petricek’s excellent article Async in C# and F#: Asynchronous gotchas in C#, but we discuss a few more here.

Async.RunSynchronously

The Async.RunSynchronously : Async<'a> → 'a operation provides a way to start an async computation and obtain its result. The name of the operation is deliberately cumbersome to type because it must be used judiciously. F# novices, or those new to functional programming in general, often struggle with this as they seek to access the value produced by a computation and end up "cheating" by calling Async.RunSynchronously. Ideally, Async.RunSynchronously would only be invoked once for the entire program and passed an async computation representing the program. Most importantly, calls to Async.RunSynchronously from within loops should be avoided. The reason is that Async.RunSynchronously is implemented by blocking the calling thread until the async computation completes. This in effect undoes much of the benefit of using the Async type in the first place, but is necessary in order for the async computation to take effect. If the call is made only once for the entire program, only one thread remains blocked waiting for the program to complete, which of course is fine.

Frequent calls to Async.RunSynchronously, however, don't play well with the .NET ThreadPool. Blocking threads will pressure the ThreadPool to create more threads, eventually causing it to reach its limits, inducing a high number of context switches and wasted stacks. Instead of calling Async.RunSynchronously, either use async workflows or operations on the Async type such as async.Bind to access the produced value.

See also: Asynchronous Everything by Joe Duffy

Summary

Avoid calling Async.RunSynchronously except at the entry point of the executable.
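To make the contrast concrete, here is a small sketch with hypothetical names (fetchLength, totalBlocking, totalAsync). The first version blocks a thread on every iteration; the second composes the steps into one Async value that a caller can run once.

```fsharp
// Hypothetical async operation standing in for real IO.
let fetchLength (s: string) : Async<int> = async { return s.Length }

// Discouraged: blocks the calling thread once per element.
let totalBlocking (inputs: string list) : int =
  inputs |> List.sumBy (fun s -> fetchLength s |> Async.RunSynchronously)

// Preferred: stay inside Async and let the caller decide when to run it.
let totalAsync (inputs: string list) : Async<int> =
  let rec loop acc rest = async {
    match rest with
    | [] -> return acc
    | s :: tail ->
      let! n = fetchLength s
      return! loop (acc + n) tail }
  loop 0 inputs
```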

Async.Start

The Async.Start : Async<unit> → unit operation starts an async computation without waiting for the result by scheduling the computation on a ThreadPool thread. This operation is what actually puts the async computation chain constructed by calls to operations as defined above into motion. As described, the call to Async.RunSynchronously is implemented by starting a computation which stores its result in a wait handle on which the calling thread waits. Async.Start is akin to forking a thread. The related operations Async.StartChild : Async<'a> → Async<Async<'a>> and Async.StartChildAsTask also start an operation without awaiting the result; however, they also return a handle making it possible to await the result. Care should be taken with these operations because they can result in overly non-deterministic executions. They may cause too many operations to be running in parallel, potentially degrading performance. Moreover, exceptions raised by computations passed to Async.Start aren't propagated to the caller and are easily overlooked. In fact, Async.Start should rarely be needed in application code. Instead, favor calls to Async.Parallel or Async.ParallelThrottled for expressing parallelism.

For example, suppose you’ve a sequence of async computations that need to be run. One way to run them is to iterate the sequence, starting each computation with Async.Start. However, this:

May cause more than the desired number of computations to be run in parallel.
Doesn’t provide a way to await the completion of the sequence.
Leaves exceptions thrown by individual computations unhandled.

// a sequence of computations
let comps : Async<unit> seq = ...

// start each computation
// do not await the results
comps |> Seq.iter Async.Start

Instead, it is possible to run the computations in parallel using a call to Async.Parallel which will address the aforementioned issues:

// run computations in parallel, 
// await the results, exceptions
// escalated to the caller
do! comps |> Async.Parallel |> Async.Ignore

Another way a need to call Async.Start may come up is to start a background process of some sort. For example, a program may have a health check or reporting process to be run along the core logic. If however this background process is run using Async.Start, exceptions raised by the background process may be left unhandled, preventing the program from reporting its health.

val coreProcess : Async<unit>
val backgroundProcess : Async<unit>
Async.Start backgroundProcess
Async.RunSynchronously coreProcess

If this is undesirable, the fate of the background process should be tied to the fate of the core logic of the program using Async.Parallel : Async<'a> → Async<'b> → Async<'a * 'b> :

Async.Parallel 
  [coreProcess; backgroundProcess]

With the alternative approach, if exceptions raised by the background process should be discarded without causing the program to crash, this can be done explicitly by catching the exceptions and logging as appropriate.
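One possible way to discard background failures explicitly is sketched below, using Async.Catch. The helper name logAndDiscard and the log parameter are hypothetical; a real program would plug in its own logging.

```fsharp
// Hypothetical wrapper: runs a computation, passes any exception to the
// supplied logger, and completes normally either way.
let logAndDiscard (log: exn -> unit) (comp: Async<unit>) : Async<unit> = async {
  let! result = Async.Catch comp
  match result with
  | Choice1Of2 () -> ()
  | Choice2Of2 ex -> log ex }

// Ties the background process to the core process while keeping
// background failures from failing the whole program.
let runBoth (log: exn -> unit) (core: Async<unit>) (background: Async<unit>) : Async<unit> =
  [ core; logAndDiscard log background ]
  |> Async.Parallel
  |> Async.Ignore
```

With this shape, the decision to swallow an exception is visible at the call site instead of being a side effect of Async.Start.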

Summary

Consider using a higher-level construct before using Async.Start.
Determine whether exceptions raised by computations started with Async.Start should affect the calling computation.
Be sure to propagate a CancellationToken to Async.Start if applicable.

Async.Parallel

As described above, Async.Parallel is a way to express fork-join parallelism. However, an important consideration when using this operation is the number of input computations provided. If the number of input computations is too high, then the call to Async.Parallel may create too much contention for both memory and IO, resulting in performance degradation. Additionally, if the sequence of computations is unbounded, the call to Async.Parallel will run out of memory before starting any of the computations because internally, it allocates an array to store the result of each computation. Instead, consider using either Async.ParallelThrottled : int → Async<'a>[] → Async<'a[]> or Async.ParallelThrottledIgnore : int → Async<unit> seq → Async<unit>. The former is like Async.Parallel except it bounds the degree of parallelism, and the latter also bounds parallelism, but doesn't store the result of computations, only the count of the number completed, making it possible to use with unbounded sequences of computations. Care must be taken to tune for the appropriate degree of parallelism, especially for IO-bound computations where there aren't rules of thumb such as those for CPU-bound computations (e.g. a thread per core). The best value may depend on the nature of the computations and may even change over time. An even more ideal scheduler would automatically control the degree of parallelism with a strategy to either maximize throughput or minimize latency.

The Async.ParallelThrottledIgnore operation can be implemented as follows:

https://medium.com/media/6184f2387b912a6c74e4a4af8c8844e7/href
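For readers who want a feel for throttling without following the link, here is a minimal sketch of a result-preserving throttled variant. It is not the author's implementation: the name parallelThrottled is hypothetical, and the approach (gating each computation on a SemaphoreSlim) is only one of several possible designs. Because it still hands every computation to Async.Parallel, it bounds concurrency but not memory, so it does not suit unbounded sequences.

```fsharp
open System.Threading

// Hypothetical sketch: at most `parallelism` computations run their bodies
// at the same time; the rest wait asynchronously on the semaphore.
let parallelThrottled (parallelism: int) (comps: Async<'a>[]) : Async<'a[]> = async {
  use gate = new SemaphoreSlim(parallelism)
  let throttled (comp: Async<'a>) : Async<'a> = async {
    do! gate.WaitAsync() |> Async.AwaitTask
    try
      return! comp
    finally
      gate.Release() |> ignore }
  return! comps |> Array.map throttled |> Async.Parallel }
```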

Summary

Ensure that the number of input computations passed to Async.Parallel is bounded.
Consider using a throttled variant as described above to reduce contention.
Consider using a non-Async based parallelization mechanism for compute-bound computations which don’t use Async.

Compute-Bound Computations

While it is possible to express parallelism with Async, as described in the previous section, using this approach for compute-bound computations may not be the most efficient. A compute-bound computation is one where a majority of time is spent on computational tasks rather than awaiting IO operations. In these cases, it is better to use something like Parallel.For or PLINQ to take advantage of parallelism. This method avoids the overhead involved in the Async continuation mechanism. However, it is important to note that if a compute-bound operation does make an IO request, using Async.RunSynchronously to await it will cause blocking and may reduce performance compared to using Async.Parallel.
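As a small sketch of the non-Async route, the following counts primes with Parallel.For. The functions isPrime and countPrimes are hypothetical examples of compute-bound work, not code from the original post.

```fsharp
open System.Threading.Tasks

// Purely CPU-bound check; no IO is awaited anywhere.
let isPrime (n: int) : bool =
  n > 1 && Seq.forall (fun d -> n % d <> 0) (seq { 2 .. int (sqrt (float n)) })

// Hypothetical helper: each iteration writes to its own array slot, so the
// loop body needs no locking.
let countPrimes (upTo: int) : int =
  let flags : bool[] = Array.zeroCreate (upTo + 1)
  Parallel.For (0, upTo + 1, (fun i -> flags.[i] <- isPrime i)) |> ignore
  flags |> Array.filter id |> Array.length
```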

MailboxProcessor

As described above, the MailboxProcessor (MBP) provides an actor-based concurrent programming model. However, for most applications, this model is fairly low-level and requires considerable care to avoid common pitfalls. The MBP is best suited for implementing higher-level library constructs, but it should be avoided in domain code for reasons described below. One of the most common hazards with the MBP is that it is easy to overlook exceptions thrown by the processing computations. These exceptions are published on the Error event; however, this event needs to be explicitly subscribed to in order to observe the errors. Even if the error is caught, it may not be clear how to proceed as the context is lost. Next, the PostAndAsyncReply operation together with the AsyncReplyChannel type do not provide a way to propagate exceptions, forcing users to express exceptions using an explicit Result value or by using a TaskCompletionSource instead.

For example:

let rec proc (mbp:MailboxProcessor<string * AsyncReplyChannel<_>>) = async {
  let! (data,replyCh) = mbp.Receive ()
  let! result = .... // logic
  replyCh.Reply result
  return! proc mbp }

let mbp = 
  MailboxProcessor.Start proc

let handle (data:string) : Async<_> =
  mbp.PostAndAsyncReply 
    (fun replyCh -> data,replyCh)

Here, if the processing logic throws an exception, the caller in handle will be suspended indefinitely and the exception will be swallowed. Moreover, the MailboxProcessor will halt and be unable to process any additional messages. One might instead expect the exception to be escalated to the caller, and for the MailboxProcessor to continue processing. This can be done by explicitly catching exceptions inside of the processing loop and then propagating to the caller, either using an explicit Result value or through a TaskCompletionSource rather than an AsyncReplyChannel. For example:

let postAndAwaitResult 
  (mbp:MailboxProcessor<'a>) 
  (f:TaskCompletionSource<'b> -> 'a) = async {
    let ivar = TaskCompletionSource<_>()
    mbp.Post (f ivar)
    return! ivar.Task |> Async.AwaitTask }

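let rec proc (mbp:MailboxProcessor<string * TaskCompletionSource<_>>) = async {
  let! (data,ivar) = mbp.Receive ()
  try
    let! result = ....
    ivar.SetResult result
  with ex ->
    ivar.SetException ex
  // continue processing after both success and failure
  return! proc mbp }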

let mbp = MailboxProcessor.Start proc

// exceptions will be escalated 
// to the caller
let handle (data:string) : Async<_> =
  postAndAwaitResult mbp 
    (fun ivar -> data,ivar)
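For cases where the processing loop is not restructured as above, the Error event mentioned earlier is the only place an unhandled exception shows up. A minimal sketch of subscribing to it follows; failingAgent, failureSeen and lastError are hypothetical names used only for this demonstration.

```fsharp
open System.Threading

// Signals when the agent reports a failure; used only for this demo.
let failureSeen = new ManualResetEventSlim(false)
let lastError : exn option ref = ref None

// A hypothetical agent whose processing logic always throws.
let failingAgent =
  MailboxProcessor<string>.Start (fun inbox -> async {
    let! msg = inbox.Receive ()
    failwithf "cannot process %s" msg })

// Without this subscription the exception would go unnoticed.
failingAgent.Error.Add (fun ex ->
  lastError.Value <- Some ex
  failureSeen.Set())

failingAgent.Post "hello"
```

Note that the handler only observes the failure; the agent has still stopped and will not process further messages.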

Another thing to keep in mind with MBP is that the mailbox is unbounded and therefore, has the potential to overflow. In the context of a producer-consumer scenario, the producer may produce messages at a higher rate than the consumer is able to consume them, resulting in an unstable system. An explicit backpressure mechanism is needed to coordinate the consumer and the producer for preventing overflow. One way to do this is using the BoundedMb type which places a bound on the number of messages in the mailbox. If the bound is reached, the BoundedMb exerts back-pressure on the producer.

Beyond these nuances with exceptions and back-pressure, the MailboxProcessor programming model can lead to needless layers of indirection. In the example above, if the desired outcome is to invoke the processing logic, it is much more reasonable to simply invoke the logic directly rather than routing through the MBP. Of course the MBP can do more than simply forward messages, but if more complex behaviors are required, it is better to encapsulate these behaviors in a reusable data structure.

Examples of higher-level async structures that can be implemented with MBP are:

MVar — a serialized variable with lazy initialization, akin to a ref but with support for serialized, async-based mutation. Beware of deadlocks when mutating!
SVar — like MVar but with an additional tap operation which returns an AsyncSeq of values stored.
Channel — synchronizes a producer and a consumer of a message. Similar in spirit to channels in Go and Concurrent ML, however without support for selective communication.
BoundedMb — a bounded mailbox, similar in functionality to BlockingCollection, however using Async to represent waiting. This is an effective way to include back-pressure for producer-consumer scenarios.
BatchProcessingAgent — a buffer which forms and publishes batches of messages.

In many cases, it is better to rely on these data structures rather than implementing a custom MBP for a domain-specific use-case.

Another way to approach this programming model is to turn the processing logic “inside out” using AsyncSeq. First, we repurpose the MBP to act solely as a mailbox:

let mbp : MailboxProcessor<'a> = 
  MailboxProcessor.Start 
    (fun _ -> async.Return())

Then we represent the incoming messages as a stream using AsyncSeq:

let stream : AsyncSeq<'a> = 
  AsyncSeq.replicateInfiniteAsync mbp.Receive

Now we can publish messages to the mailbox asynchronously, and consume the resulting AsyncSeq explicitly. This allows us to use existing operations on AsyncSeq to filter, transform and buffer the messages. It also allows us to merge the stream with other streams, and represents the process explicitly as an Async operation such that we can join it with other operations:

let proc : Async<unit> =
  stream
  |> AsyncSeq.bufferByTimeAndCount 100 100
  |> AsyncSeq.iterAsync processBatch

This approach makes the processing logic explicit and provides a more convenient way to handle exceptions.

Summary

Beware of exceptions raised by processing logic used inside a MailboxProcessor.
Consider using TaskCompletionSource rather than AsyncReplyChannel to signal from within a MailboxProcessor, particularly when exceptions may be raised.
Consider using or implementing a higher-level component rather than using a MailboxProcessor for domain-specific code.

CancellationToken

A CancellationToken is used to cancel computations in response to cancellation requests that are external to the computation itself. Several of the operations on Async, such as Async.Start and Async.RunSynchronously, are parameterized with an optional CancellationToken, such that if a cancellation is requested on that token, the computation can be notified, allowing it to terminate. There are many reasons to cancel a computation. One of the most common is to impose a timeout on a computation. More generally, the reason could be as a response to new information, invalidating the inflight computation. Care must be taken to ensure that a computation will actually respond to a cancellation request. In many cases, this is done automatically by machinery inside Async itself. For example, before each async.Bind is invoked, the cancellation token is checked. Also, calls to Async.Sleep will be cancelled as expected. However, if an async computation has a prolonged compute-bound section, the cancellation token must be checked manually.

Each async computation is bound to a CancellationToken, which is accessible with Async.CancellationToken : Async<CancellationToken>. If a token isn't provided explicitly as described above, Async.DefaultCancellationToken is used. The default cancellation token can be cancelled by calling Async.CancelDefaultToken; however, this will signal a cancellation for all computations bound to this token. To explicitly bind an async computation to a token, the token can be passed along with the computation to Async.Start or other operations.
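A manual check inside a compute-bound section might look like the sketch below. The function sumSquares is a hypothetical example; the check interval of 1000 iterations is an arbitrary choice for illustration.

```fsharp
open System
open System.Threading

// Hypothetical compute-bound computation that cooperates with cancellation.
let sumSquares (n: int) : Async<int64> = async {
  // Obtain the token this computation is bound to.
  let! ct = Async.CancellationToken
  let compute () =
    let mutable total = 0L
    for i in 1 .. n do
      // No binds happen inside this loop, so the token is polled by hand.
      if i % 1000 = 0 then ct.ThrowIfCancellationRequested ()
      total <- total + int64 i * int64 i
    total
  return compute () }
```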

As a convenience:

https://medium.com/media/a5ef3b97b99babb6b46b1d8dac7a9b40/href

Note how in this case, the argument CancellationToken is linked with the ambient CancellationToken, and the linked token is passed to Async.Start. As a result, the computation will be cancelled in response to either the argument CancellationToken or in response to the ambient CancellationToken. This may not be desired in all cases.

Cancellation tokens are not a first-class concept within the Async type and require special treatment. In some cases, it is possible to use a first-class selective communication mechanism, or at least a best-effort attempt. What would it mean for cancellation to be first-class? A cancellation token establishes a race between two computations: the core computation at hand and the computation that represents a cancellation. For example, a timeout can be viewed as a race between a computation and a timer.

More generally, we can implement cancellations using the Async.Choice : Async<'a option> seq → Async<'a option> operation. Given a sequence of input computations, this operation will start all of them, return the result of the first one to complete, and cancel the others. However, cancellation is a best-effort attempt, and therefore, does not represent true selective communication. For example, if we apply Async.Choice to the Receive operations on two MailboxProcessor instances, the message received from the second one of the two to complete will be lost. A more elaborate synchronization mechanism is required to implement true selective communication wherein the message remains in the second mailbox.
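As a sketch of the timeout-as-race idea using Async.Choice, consider the hypothetical helper withTimeoutChoice below. One subtlety is worth noting: Async.Choice only treats a Some result as a winner, so the timer branch must also return Some for the race to end when the timer fires first. The sketch therefore wraps both outcomes in an extra option layer.

```fsharp
// Hypothetical helper: races a computation against a timer and returns
// None if the timer wins.
let withTimeoutChoice (timeoutMs: int) (comp: Async<'a>) : Async<'a option> =
  let work = async {
    let! r = comp
    return Some (Some r) }
  let timer = async {
    do! Async.Sleep timeoutMs
    // Some None: the timer "wins" the race but carries no value.
    return Some None }
  async {
    let! winner = Async.Choice [ work; timer ]
    return defaultArg winner None }
```

As the text points out, the losing computation is cancelled on a best-effort basis only, so any side effects it has already performed are not undone.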

Summary

Be explicit about propagating cancellation tokens when calling Async.Start and related operations accepting a cancellation token.
Avoid calling Async.CancelDefaultToken to avoid interference with unrelated computations.
Be sure to extract and reference the ambient cancellation token via Async.CancellationToken when a computation has an extensive compute-bound section to ensure that it is properly cancelled.
Consider using Async.Choice in scenarios requiring first-class flow control.
Take note of the issue when using Async.AwaitTask on cancelled Task instances as described in the next section.

Async.AwaitTask

The Async.AwaitTask : Task<'a> → Async<'a> operation translates a Task value to an Async value. Many asynchronous operations in the .NET Framework return Task and this operation is used to map them to Async. In versions of F# prior to 4.1, the implementation of Async.AwaitTask had a bug wherein cancellations to Task computations would be lost, resulting in indefinitely suspended Async computations. This would lead to difficult to find bugs in the program. Many have encountered this when using HttpClient from F#. Indefinitely suspended Async computations are a broader hazard discussed next.

Another hazard involving Task and Async is in attempting to use selective communication among them. For example, suppose you've a component such as a Socket or state representing a node's view of a cluster. We can represent the state of this component using a TaskCompletionSource which is set to the Completed state when the component is closed, or to the Faulted state when the component fails. Suppose also that you've component-dependent Async operations, such as sends and receives. We'd like to cancel an in-flight operation whenever the component is closed or faulted, so that they can be retried on a new component instance. This calls for selective communication - we'd like to select between awaiting the completion of an operation or the closing of a resource. More precisely, we're looking for a function of type chooseTaskOrAsync : Task<'a> → Async<'a> → Async<'a> where the first argument would correspond to the component state and the second to the operation. If the component is closed, we'd like to raise an exception, and to do that, we could use Task.ContinueWith. However, since for each instance of a component we might have a large number of component-dependent operations, we'd add a large number of continuations to the Task corresponding to the component. If those continuations aren't properly cleaned up, we end up with a memory leak. The Task.WhenAny operation on the other hand ensures that orphaned continuations are properly cleaned up and allows us to avoid a memory leak.

Summary

Ensure that you’re using a correct implementation of Async.AwaitTask to await Task instances which may be cancelled.

Indefinite Suspension

Nothing in the Async type ensures that the computation terminates. It is possible to impose a timeout, as described in the previous section, but this isn't done automatically. As a result, it is quite possible to end up with an async computation that never terminates, causing an indefinite suspension in the program. On the one hand, this accurately depicts the nature of asynchrony, but on the other hand, it can lead to some adventurous bug hunting.

A helpful operation to impose timeouts is as follows:

https://medium.com/media/7e616249457c5635bc42d27bbfa45e66/href

This operation can be applied onto top-level handler functions where it isn’t certain whether internal operations take care of timeouts, but where there is an evident upper bound on the time the operation should require. Of course some computations are deliberately non-terminating, such as a heartbeating process, for example. In this case, timeouts aren’t needed, and it may be helpful to explicitly signal this fact by returning a constructor-less Void type from the computation.
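A timeout operation of this kind might be sketched with the optional timeout argument of Async.StartChild, as below. This is not necessarily how the linked operation is written; the name withTimeout is hypothetical, and this version signals the timeout by raising TimeoutException.

```fsharp
open System

// Hypothetical helper: fails with TimeoutException if the computation
// does not complete within the given number of milliseconds.
let withTimeout (timeoutMs: int) (comp: Async<'a>) : Async<'a> = async {
  let! child = Async.StartChild (comp, timeoutMs)
  return! child }
```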

Summary

Consider imposing a limit on the duration of an async computation.
Take care to propagate all forms of completion for an async computation, including errors and cancellations.

Laziness

While F# is, by default, eagerly evaluated, Async computations are lazy, albeit with important exceptions. Laziness implies that simply having a reference to an Async computation does not imply that that computation is running. This is in contrast to Task, for example, which usually represents a computation which is already running. In addition, unlike lazy evaluation in languages like Haskell, Async computations are not memoized, which means they will be reevaluated each time they are run. This is again in contrast to Task, which is idempotent - once it completes, the produced value is memoized. The lazy nature of Async is evident through the async.Delay : (unit → Async<'a>) → Async<'a> operation which takes a function producing an async value, and represents it as an async value. The function will be evaluated each time the Async computation is evaluated. The Delay operation is used as part of a syntactic transformation of an async workflow, making everything inside an async block lazy. However, it is also possible to explicitly memoize an Async computation and it is impossible to determine whether a given async computation is memoized or not. For example, an Async computation can be memoized by using a TaskCompletionSource to store its result:

https://medium.com/media/7ebfe954b4bac2e96804ef17f441e046/href
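One way such memoization could be written is sketched here. It is an illustration under assumptions, not a copy of the linked code: the name memoizeAsync is hypothetical, and the first caller's cancellation token governs the single underlying run.

```fsharp
open System.Threading.Tasks

// Hypothetical sketch: the wrapped computation runs at most once and its
// outcome is stored in a TaskCompletionSource shared by all callers.
let memoizeAsync (comp: Async<'a>) : Async<'a> =
  let tcs = TaskCompletionSource<'a>()
  let started = ref false
  async {
    // Only the first caller flips the flag and starts the computation.
    let shouldStart =
      lock tcs (fun () ->
        if started.Value then false
        else
          started.Value <- true
          true)
    if shouldStart then
      Async.StartWithContinuations (
        comp,
        (fun v -> tcs.SetResult v),
        (fun ex -> tcs.SetException ex),
        (fun _ -> tcs.SetCanceled ()))
    return! Async.AwaitTask tcs.Task }
```

From the outside the result still has type Async<'a>, which is why a caller cannot tell whether a given computation is memoized.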

Another example where an Async computation is already in flight is the result of the Async.StartChild : Async<'a> → Async<Async<'a>> operation. When the outer Async computation is bound, the input computation is started, and the inner Async computation is a handle to the started computation, which when bound, awaits its result. Awaiting the inner computation multiple times does not reevaluate the input computation.

The (mostly) lazy nature of Async can lead to unexpected results. For example, suppose you want to run two Async computations in parallel, and be notified when the first one completes, but also be able to retrieve the result of the second computation once it completes. Using the Async.choose operation as defined above would cause the second computation to be cancelled. If the calling code were to await its result, the computation would be reevaluated. Instead, the following operation might be better suited to this task:

https://medium.com/media/3dddfd61c6a2a77400ff053f55c52a1a/href

The Async.race operation explicitly memoizes the result of the second computation. We can compare this with the Task.WhenAny operation, which also returns the first computation to complete; however, the other computations are not cancelled and can still be awaited by the caller.

Thread Local Storage

As described in the Thread Pool section, async computations aren’t bound to specific threads, and a given workflow may execute across several thread pool threads throughout its lifecycle. As such, the Thread Local Storage (TLS) mechanism can’t be used to store contextual data for a workflow. However, cross-cutting concerns often require a notion of workflow-local storage, for example to store a tracing context. Even though this mechanism isn’t provided out of the box, it is possible to implement it explicitly by building a workflow for the following type:

type Context = Dictionary<string, obj>

// An async computation explicitly
// depending on a context
type AsyncEnv<'a> = Context → Async<'a>

This type can be treated in the same way as the existing Async type by implementing a computation workflow, however it can also provide operations for reading and writing into the context. In fact, the existing Async type already stores the ambient CancellationToken in its context and it should be possible to extend the implementation to support arbitrary data items. Note that workflow context should be used judiciously as it can lead to unexpected results and leaks.
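
To make this concrete, here is a minimal sketch of such a workflow. The Context key and value types, the AsyncEnv module functions, and the asyncEnv builder are all assumptions for illustration; a production version would also need a thread-safe context.

```fsharp
open System.Collections.Generic

type Context = Dictionary<string, obj>
type AsyncEnv<'a> = Context -> Async<'a>

module AsyncEnv =
  let ret (a: 'a) : AsyncEnv<'a> = fun _ -> async.Return a

  // Threads the same context through both steps.
  let bind (f: 'a -> AsyncEnv<'b>) (m: AsyncEnv<'a>) : AsyncEnv<'b> =
    fun ctx -> async.Bind(m ctx, fun a -> f a ctx)

  let read (key: string) : AsyncEnv<obj option> =
    fun ctx ->
      async {
        match ctx.TryGetValue key with
        | true, v -> return Some v
        | _ -> return None }

  let write (key: string) (value: obj) : AsyncEnv<unit> =
    fun ctx -> async { ctx.[key] <- value }

  let run (ctx: Context) (m: AsyncEnv<'a>) : Async<'a> = m ctx

type AsyncEnvBuilder() =
  member __.Return (a: 'a) : AsyncEnv<'a> = AsyncEnv.ret a
  member __.ReturnFrom (m: AsyncEnv<'a>) : AsyncEnv<'a> = m
  member __.Bind (m: AsyncEnv<'a>, f: 'a -> AsyncEnv<'b>) : AsyncEnv<'b> =
    AsyncEnv.bind f m

let asyncEnv = AsyncEnvBuilder()

// Usage: a value written early in the workflow is visible later on,
// regardless of which thread pool thread each step runs on.
let traced : AsyncEnv<string> =
  asyncEnv {
    do! AsyncEnv.write "traceId" (box "abc-123")
    let! stored = AsyncEnv.read "traceId"
    let text = match stored with Some v -> string v | None -> "missing"
    return text }

let traceId = traced |> AsyncEnv.run (Context()) |> Async.RunSynchronously
```

The context is passed explicitly as a function argument, so it follows the workflow rather than the thread.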

Summary

Don’t rely on thread-local storage from within Async computations.
If you need workflow-local storage, consider implementing an extended Async computation workflow.

Related Programming Models

In this section, we compare the Async type to similar concepts in .NET and other programming languages.

.NET System.Threading.Tasks.Task

The System.Threading.Tasks.Task type in the .NET Framework serves a very similar purpose to Async. It also represents a computation that eventually produces a value. Async has operations to map to and from Task. However, there are some important differences. First, a Task is idempotent (monotonic): once it produces a value, the task is completed and will no longer perform additional computation. Async on the other hand can be evaluated many times. It is possible to cache the result of an Async computation, however this must be done explicitly. Second, in most cases, a Task represents an in-progress computation, whereas an Async represents a computation which must be explicitly evaluated. The Task.ContinueWith operation is similar to async.Bind - it binds a continuation to the result of the computation. Since Task is monotonic and idempotent, it is important to note that Task.ContinueWith adds the continuations to a list in the target computation, whereas async.Bind returns a copy of the workflow which will be reevaluated. As a shoutout to the monad people, Task.ContinueWith is actually the comonadic extend operation, whereas async.Bind is the monadic bind operation. Task has the additional Unwrap operation corresponding to the monadic join. It is possible to map between Async and Task using the Async.StartAsTask and Async.AwaitTask operations. In F# this is commonly done to interact with existing C# libraries, or to take advantage of Task in scenarios where it is a better fit.
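
The difference in evaluation can be seen by counting how often a computation body runs. This is a small illustrative experiment rather than a recommended pattern.

```fsharp
open System.Threading

let evaluations = ref 0

let work = async { return Interlocked.Increment(evaluations) }

// Each run of the Async reevaluates the computation.
let a1 = Async.RunSynchronously work
let a2 = Async.RunSynchronously work

// Starting a Task evaluates the computation once; the result is then kept.
let started = Async.StartAsTask work
let t1 = started.Result
let t2 = started.Result

// Async.AwaitTask maps the Task back into an Async without rerunning it.
let t3 = started |> Async.AwaitTask |> Async.RunSynchronously
```

The two Async runs produce 1 and 2, while every read of the started task produces 3.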

Java java.util.concurrent.Future

The Future type in Java is essentially the same as the Task type above.

Akka

Akka is an actor framework for the JVM. It is heavily inspired by Erlang, and in addition to the actor model itself, provides facilities for routing, fault tolerance and distribution. As described in the MailboxProcessor section, the actor model is too low-level for many use-cases, making it easy to make mistakes. To that end, Akka also provides a Future type to express request-reply interactions. The FSharp.Akka library is a wrapper for the Akka.net port of Akka.

Go Goroutine

A Goroutine is very similar to F# Async. The Go concurrency model is heavily inspired by CSP, and in addition to goroutines, it includes channels. A channel is a junction across which goroutines can exchange messages. The select statement provides selective communication amongst channels. Note that selective communication is not an entirely trivial concept.

JavaScript Promise

A JavaScript promise is essentially the same thing as Task and Future, and also similar to Async. NodeJS users are familiar with the pain of callback-style programming, and JavaScript promises adapt it to the more convenient sequential flow control style.

Haskell Control.Concurrent.Async

The Haskell Async type is a thin layer atop the IO monad and is very similar to the F# Async type. There are additional constructs in the Control.Concurrent namespace, such as MVar, IVar and Chan. IVar is essentially TaskCompletionSource and MVar is described above in the MailboxProcessor section. Chan is similar to channels in Go and Concurrent ML. In addition, Haskell has other concurrent programming models such as Software Transactional Memory (STM) and Transactional Events. Simon Marlow's book Parallel and Concurrent Programming in Haskell offers a wealth of information on concurrent programming in Haskell.

Concurrent ML

Concurrent ML is a concurrency library for the ML programming language. The Event construct is very similar to F# Async, however on closer inspection it supports a richer set of operations. In particular, Event and the accompanying Channel construct in ML support selective communication. Selective communication forms a proper disjunction between computations, committing to one and ensuring the other is not committed to. Hopac is an implementation of Concurrent ML in F#, with a vast array of operations and types. In essence, it is an implementation of the pi-calculus.

Joinads

Joinads is a research extension of F# based on the join-calculus programming model. Joinads also include a syntactic construct extending the existing match syntax in F#, allowing the expression of join patterns among multiple channels. This provides a richer and more convenient set of synchronization mechanisms beyond F# Async - in particular, selective communication. With any luck, the programming model will make it into the core F# language at some point.

Hopac

Hopac is an implementation of Concurrent ML in F#. It provides a much richer set of operations than the F# Async type, in particular for selective communication. It is also more efficient than F# Async or Task for many workloads. In addition, the library is accompanied by a wealth of documentation which is useful for programmers in any language.

Clojure Async

F# Async is similar to Clojure Async and is motivated by many of the same concerns.

Concepts

This section is a narrative on concepts of concurrent and parallel programming used throughout the post.

Concurrency & Parallelism

Concurrency refers to the absence of ordering information among events. In other words, given two events, if we don’t know which came first, we call the events concurrent. Furthermore, even if we impose a total order on the events in the system, operations, consisting of an invocation and completion event, are regarded as concurrent when they overlap. Even though one operation may start before the other, overlap in their spans makes the ordering between operations a partial order. Concurrent programming refers to programming in the face of absence of ordering information among some subset of events in the system. Various models of concurrency have been developed in order to better understand the semantics of concurrency and/or to provide a programming model suited to concurrent domains. We shall discuss a few of these models and relate them to F#.

One model of concurrency from the process calculi family is called Communicating Sequential Processes (CSP). CSP models a concurrent system as a collection of independent, sequential processes (i.e. threads) which interact at explicit junctions. An interaction event is a point of synchronization between processes, allowing the exchange of information. Another model of concurrency is the actor model wherein actors, which are sequential threads of control, are a core computational primitive. Both processes in CSP and actors in the actor model interact using explicit message passing, rather than through shared memory, such as in the PRAM model. Note however that this distinction between shared memory and message passing becomes blurred since interactions with shared memory can also be modeled using message passing. Indeed, it takes a non-negligible amount of time to send a read request across the memory bus, and moreover, modern memory systems rely on cache coherence protocols in order to provide consistent guarantees. Both CSP and the actor model are notable because they’ve been very influential in the design of programming models for concurrency. The actor model is well known through the Erlang programming language, or the Akka actor framework on the JVM. CSP influenced the Concurrent ML programming model as well as the concurrency model in Go.

In .NET, we’ve the fundamental synchronization primitives which include locks, synchronization events, wait handles, interlocked operations, etc. A lock or mutex, for example, facilitates interaction among threads by delimiting a section of code — called the critical section — that can only be accessed by one thread at a time, providing mutual exclusion. Multiple threads can execute a critical section, but just one at a time, which makes it much easier to reason about memory access and mutation. Synchronization events also facilitate interaction among threads by allowing one thread to wait on a signal from another thread or process. Interlocked operations are essentially locks at the hardware level. The introduction of concurrent collections in .NET provided access to the higher-level producer-consumer pattern. The TaskCompletionSource type is similar to a synchronization event, however the signal can be accompanied by data, and waiting is expressed using the Task type.
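
As an illustration of the last point, the sketch below uses a TaskCompletionSource as a signal that carries data, with the waiting side expressed as an Async. The names are made up for the example.

```fsharp
open System.Threading.Tasks

let signal = TaskCompletionSource<string>()

// Waiter: completes once the signal has been set, without blocking a thread.
let waiter = async {
  let! msg = Async.AwaitTask signal.Task
  return "received: " + msg }

// Signaler: sets the value after a delay, from a thread pool thread.
let signaler = async {
  do! Async.Sleep 50
  signal.SetResult "ready" }

Async.Start signaler
let outcome = Async.RunSynchronously waiter
```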

In F# we also have the MailboxProcessor (MBP) which, as the name suggests, consists of a mailbox and a processor. The mailbox can be posted to and received from, and the processor is a thread of control interacting with the mailbox. Semantically, the MailboxProcessor can be associated with the actor model of concurrency, though typical actor model implementations (such as Akka.NET) are accompanied by support for distribution as well as a range of facilities for routing and fault-tolerance. The MBP manages concurrency by (FIFO) ordering messages posted to the mailbox. The thread of control processes a single message at a time without any need to consider parallelism in the implementation as only a single message is processed from the queue at any point in time. MBPs are particularly useful for implementing higher-level constructs such as producer-consumer queues, buffers, channels, etc.
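
A small counter shows the shape of a typical MailboxProcessor. The message type and names are illustrative.

```fsharp
type CounterMsg =
  | Add of int
  | Get of AsyncReplyChannel<int>

let counter =
  MailboxProcessor.Start(fun inbox ->
    // The loop owns the state; messages are handled one at a time.
    let rec loop total = async {
      let! msg = inbox.Receive()
      match msg with
      | Add n -> return! loop (total + n)
      | Get reply ->
        reply.Reply total
        return! loop total }
    loop 0)

// Messages are processed in the order they were posted.
[1..100] |> List.iter (fun n -> counter.Post (Add n))
let sum = counter.PostAndReply Get
```

No locks are needed around total because only the processing loop ever touches it.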

Concurrency and parallelism are related notions and are often used interchangeably. However, upon a closer inspection, their relationship is more of a duality. Parallelism is the idea of launching operations to be run in parallel. This in turn results in events, generated by those operations, which are concurrent, because ordering information is absent. Concurrency, on the other hand, typically refers to synchronization among concurrent events. Speaking loosely, parallelism generates disorder and concurrency synchronizes it. As an example, the Async.Parallel operation involves both - it first parallelizes the input computations, but then it synchronizes the parallel computations into a single converged result.
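
For instance, in the following sketch Async.Parallel forks a list of computations and then joins their results into a single array, in input order:

```fsharp
let square (n: int) = async {
  do! Async.Sleep 10
  return n * n }

let squares =
  [1; 2; 3; 4]
  |> List.map square
  |> Async.Parallel          // fork: the computations may run in any order
  |> Async.RunSynchronously  // join: one converged result
```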

Asynchronous & Synchronous

The Async type is so called because it enables controlled use of asynchrony by decoupling the invocation of an action from the handling of its result, while retaining sequential flow control. Asynchrony allows for more efficient use of threads, as well as for expression of parallelism and concurrency. A related notion is that of an asynchronous network wherein there is no bound on message transmission delay. The underlying substrate is that of asynchrony: the event that represents a message being transmitted is decoupled from the event representing receipt or completion, resulting in temporal decoupling. However, complete asynchrony wouldn’t be of much use without synchronization. In terms of events, synchronization is the act of combining multiple events into one. For example, an interaction between two processes can be represented by two events, one at each process. In the theory of concurrency this is known as synchronous rendezvous. In .NET, TaskCompletionSource is a way to implement a form of rendezvous between threads, with one thread waiting for a value and another signaling the value. In Go and Hopac, for example, channels are used as a rendezvous mechanism. It should be noted that synchronization requires coordination among participants. This can be costly in the context of a single process and even more so across network boundaries. As such, systems should be designed to be asynchronous to the extent possible, but with principled use of synchronization where it is required, keeping locality in mind.

Selective Communication

Selective communication is a concept involving channels, as seen in Go, Haskell, Concurrent ML, and F# Hopac. Selective communication is the idea of selecting a message from a set of channels, picking the first one to produce a message, while leaving the others intact. A critical component of selective communication is that only one channel is picked and received from, with the others left intact. Simply invoking a receive operation from multiple channels in parallel doesn’t quite do the trick since it may cause multiple channels to dequeue a message where only one will be received by the caller. F# Async doesn’t provide a selective communication mechanism out of the box. More broadly in .NET, we’ve the BlockingCollection.TakeFromAny operation, but of course BlockingCollection uses blocking as its synchronization mechanism. The need for selective communication is quite common. Whenever a choice needs to be made among a set of possible events, there's a need for selective communication. In this sense, selective communication is the dual to parallelism. However, selective communication is typically implemented in ad-hoc ways; in .NET it is usually done using CancellationToken.
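
The blocking flavour of selective communication in .NET can be sketched with BlockingCollection.TakeFromAny. The collections and values here are made up for the example.

```fsharp
open System.Collections.Concurrent

let left = new BlockingCollection<string>()
let right = new BlockingCollection<string>()

left.Add "kept for later"
right.Add "from right"
left.Add "also kept"

// Blocks the calling thread until one collection yields an item.
// Exactly one item is removed, from exactly one collection.
let index, item = BlockingCollection<string>.TakeFromAny([| left; right |])

let remaining = left.Count + right.Count
```

Whichever collection is chosen, the other is left intact, which is the essential property of selection: of the three items added, two remain.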

Acknowledgements

Thanks to Gustavo Leon, Eirik Tsarpalis, Ruben Bartelink and many others at Jet for comments, edits, suggestions.

Scaling Event-Sourcing at Jet

Leo Gorodinski — Tue, 24 Oct 2017 17:45:51 GMT

At Jet, we’ve been using event-sourcing since the very beginning and learned some lessons along the way. There are several dimensions along which we had to scale our event-sourcing platform. The one which most teams using event-sourcing have to overcome early on is scaling reads — as streams increase in size it becomes prohibitive to read the entire stream to perform an operation. Another dimension of scaling is redundancy — in order to function continuously, the platform needs to tolerate not only failures of individual machines in a data center, but failures of an entire data center. The projection system needs to be scaled to support a growing number of consumers with varying workloads. Meanwhile, as the number of moving parts increases, it becomes essential to verify safety and liveness guarantees advertised by the system. Of course in time, the benefits of a highly modular architecture afforded by event-sourcing start to weigh on our ability to obtain accurate pictures of system states. To that end, we need a tracing platform which, in addition to request-reply, must support tracing of asynchronous interactions. All things considered, the challenges of operating an event-sourcing platform are noteworthy, but its sound foundational principles continue to pay dividends as we evolve.

Event Sourcing

There are a few definitions of event sourcing floating around. Martin Fowler’s is perhaps the most cited one and it states that:

Event sourcing is a paradigm where changes to application state are recorded as a series of events.

To make this more concrete, it is helpful to model applications using IO automata. An IO automaton is defined as a set of states, a special starting state, a set of input events, a set of output events and a transition function taking pairs of state and input event to pairs of state and output event:

State | — a set of states.
S∅ | — a starting state.
Input | — a set of inputs.
Output | — a set of outputs.
τ : State × Input → State × Output | — a transition function.

The hosting service manages state as well as interaction with input and output mediums. During an operation, the service receives an input event from the input medium, retrieves state corresponding to the input, executes the transition function, persists the resulting state and sends the output event to the output medium. A service typically manages multiple state machines concurrently— one for each entity (aggregate) in the system.

For example, consider a shopping cart system. State corresponds to the states of individual shopping carts, each consisting of a list of items, prices and promo code information. Input corresponds to requests to perform actions on the cart, such as adding items, or checking out. Output corresponds to changes in the cart, such as items being added or removed. Finally, the transition function τ encodes the logic for handling requests on a given cart.

Event-sourcing makes the observation that rather than persisting state, we can persist the output events. An instance of state can be reconstituted by running a fold over past outputs using a delta function:

Δ : State × Event → State |— defines how an event changes state.

We assign sequence numbers to output events and define the sequence number of an instance of state as the sequence number of the last event used to derive it. The transition function defined above can be factored into a delta function and an execute function:

ε : State × Input → Event |— takes inputs to outputs at a state.
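
The factoring of τ into ε and Δ can be sketched in F# for the shopping cart example. The concrete types and cases below are assumptions chosen for brevity, not the model used at Jet.

```fsharp
type Item = string

type CartState = { Items: Item list; IsCheckedOut: bool }

type CartInput =
  | AddItem of Item
  | RemoveItem of Item
  | CheckOut

type CartEvent =
  | ItemAdded of Item
  | ItemRemoved of Item
  | CheckedOut
  | Rejected of string

let initial = { Items = []; IsCheckedOut = false }

// Δ : State × Event → State
let delta (state: CartState) (ev: CartEvent) : CartState =
  match ev with
  | ItemAdded item -> { state with Items = item :: state.Items }
  | ItemRemoved item -> { state with Items = List.filter ((<>) item) state.Items }
  | CheckedOut -> { state with IsCheckedOut = true }
  | Rejected _ -> state

// ε : State × Input → Event
let execute (state: CartState) (input: CartInput) : CartEvent =
  match input with
  | _ when state.IsCheckedOut -> Rejected "cart already checked out"
  | AddItem item -> ItemAdded item
  | RemoveItem item when List.contains item state.Items -> ItemRemoved item
  | RemoveItem _ -> Rejected "item not in cart"
  | CheckOut -> CheckedOut

// τ recovered from ε and Δ: the output event and the state it produces.
let transition (state: CartState) (input: CartInput) : CartState * CartEvent =
  let ev = execute state input
  delta state ev, ev

// State is reconstituted by folding Δ over the stored events.
let reconstitute (events: CartEvent list) : CartState =
  List.fold delta initial events

let cart = reconstitute [ ItemAdded "book"; ItemAdded "pen"; ItemRemoved "book" ]
```

Only the events need to be persisted. The cart value above is derived entirely from them.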

In order to run an event-sourced service, we need a storage mechanism that can store event streams for each entity in the system. These capabilities can be summarized as follows:

get : StreamId × SN → Events |— returns events in a stream.

add : StreamId × SN × Events → Ack |— adds new events to a stream.

The get operation returns the set of events in a stream starting at the specified sequence number SN. The add operation appends a set of events to a stream at a specified sequence number. If the sequence number does not match the stream an error is returned — optimistic concurrency control.
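
A minimal in-memory sketch of these two operations follows. It assumes sequence numbers are zero-based positions in the stream, so an add must name the position the next event will occupy. The InMemoryStore type and AddResult cases are hypothetical and stand in for a real, durable event store.

```fsharp
open System.Collections.Generic

type StreamId = string
type SN = int

type AddResult =
  | Ack of SN            // next expected sequence number after the write
  | WrongSequence of SN  // the stream's actual next sequence number

type InMemoryStore<'e>() =
  let streams = Dictionary<StreamId, ResizeArray<'e>>()
  let gate = obj ()

  let stream (sid: StreamId) =
    match streams.TryGetValue sid with
    | true, s -> s
    | _ ->
      let s = ResizeArray<'e>()
      streams.[sid] <- s
      s

  /// get : StreamId × SN → Events
  member __.Get (sid: StreamId, sn: SN) : 'e list =
    lock gate (fun () ->
      let s = stream sid
      [ for i in sn .. s.Count - 1 -> s.[i] ])

  /// add : StreamId × SN × Events → Ack
  member __.Add (sid: StreamId, sn: SN, events: 'e list) : AddResult =
    lock gate (fun () ->
      let s = stream sid
      // Optimistic concurrency: reject writers with a stale view.
      if s.Count <> sn then WrongSequence s.Count
      else
        s.AddRange events
        Ack s.Count)

let store = InMemoryStore<string>()
let r1 = store.Add("cart-1", 0, ["ItemAdded book"])
let r2 = store.Add("cart-1", 0, ["ItemAdded pen"])   // stale writer
let r3 = store.Add("cart-1", 1, ["ItemAdded pen"])   // retry at the right position
let tail = store.Get("cart-1", 1)
```

The second write fails because another event already occupies position 0. The writer is expected to re-read the stream, re-run its logic and retry.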

Figure 1: Event stream index.

In addition, the event store should also provide access to the log of all events in a collection:

log : LSN → Events |— returns all events in a partition.

LSN herein refers to a logical point in the log of all events in a collection, and it may be a sequence number, or a more complex structure such as a vector if the log is partitioned. The log enables service orchestration — downstream services can perform operations in response to events in an upstream system, or the upstream system itself can be replicated — state-machine replication. Moreover, the log allows for communication to be consistent with respect to the state — events are used to reconstitute state and notify downstream systems of changes in the upstream system. Without a log, care must be taken to prevent missed communications, or communications with respect to uncommitted states.

Figure 2: Event log.

The collection of streams depicted in Figure 1 is an index of the log depicted in Figure 2.

One data store that has these capabilities is EventStore and it is used for many systems at Jet. With this definition of event-sourcing in mind, we can characterize the different ways that we’ve scaled our event-sourcing platform.

Scaling Reads

Recall that the get function, defined above, returns the events in a stream starting at a specified sequence number. Throughout the lifetime of a system, streams can get arbitrarily large, eventually making reads of the entire stream prohibitive during an operation. A common way to scale reads with event-sourcing is using a technique called a rolling snapshot. A snapshot captures the state of a stream at a particular point in time and is constructed using the delta function defined above. Then, only events occurring after this point in time need to be read in order to reconstitute the last known state.
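
In code, reading with a snapshot amounts to resuming the fold from the snapshot rather than from the initial state. The sketch below assumes a snapshot records the sequence number of the last event applied, and uses a hypothetical readFrom function in place of the store's get operation.

```fsharp
type Snapshot<'s> = { State: 's; SN: int }

/// Brings a snapshot up to date by folding only the events that follow it.
let restore (delta: 's -> 'e -> 's) (readFrom: int -> 'e list) (snapshot: Snapshot<'s>) : Snapshot<'s> =
  let events = readFrom (snapshot.SN + 1)
  { State = List.fold delta snapshot.State events
    SN = snapshot.SN + List.length events }

// Usage with a toy stream where state is the running sum of the events.
let stream = [1; 2; 3; 4; 5]
let readFrom sn = List.skip sn stream

let snapshot = { State = 6; SN = 2 }   // covers events 0 through 2
let latest = restore (+) readFrom snapshot
```

The result is itself a snapshot, so persisting it again gives the rolling behaviour.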

Snapshotting is not to be confused with distributed snapshots using a snapshot algorithm— a different, albeit somewhat related notion. A snapshot algorithm approximates a global state of a distributed system, and is most often used for asserting stable properties, such as termination or deadlock.

Snapshots can be managed in a few different ways. A service can persist a snapshot when performing operations. Alternatively, snapshots can be generated by downstream services consuming the log. Snapshots can be generated for every event, or based on an interval. Snapshots can be stored in another stream, or in an entirely different data store. (Snapshots can also be stored alongside events in a stream, however this requires an ability to read streams backwards, couples the snapshotting interval to the read access pattern, and interleaves state, which is a particular interpretation of events, with the events themselves.) Before an operation is performed, a snapshot can be read, followed by a read of any remaining events to reconstitute the latest state. Alternatively, the operation can be performed speculatively with respect to the retrieved state, relying on the optimistic concurrency control of the event store to ensure consistency of the underlying stream. By regarding state snapshots as an optimization mechanism rather than a core storage pattern we had flexibility in terms of how reads could be scaled, all while retaining the entire history of events.

Scaling Projections

In the context of event-sourcing, a projection refers to a running fold of events. In essence, a projection embodies a state-machine whose inputs are events from the log. Its output events may form another stream, or the projection may be used solely for computing a state. A projection may rely on state, or it may be stateless. One type of projection is a filter — it forms a stream of events matching a predicate. Another is a transformer — it either enriches events, or translates them into another form. Since projections are state-machines, they can also perform aggregations and joins.

π : State × Event → State × Output
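
A projection of this shape can be sketched as a fold that also collects outputs. To express filters, the sketch assumes a projection may emit zero or more outputs per event; the Projection alias and helper names are illustrative.

```fsharp
/// π : State × Event → State × Output, here with a list of outputs.
type Projection<'s, 'e, 'o> = 's -> 'e -> 's * 'o list

/// Runs a projection over a sequence of events, collecting outputs in order.
let project (pi: Projection<'s, 'e, 'o>) (initial: 's) (events: seq<'e>) : 's * 'o list =
  let state, outputs =
    Seq.fold
      (fun (s, acc) e ->
        let s', out = pi s e
        s', List.rev out @ acc)
      (initial, [])
      events
  state, List.rev outputs

/// A stateless filter: emits the events matching a predicate.
let filter (predicate: 'e -> bool) : Projection<unit, 'e, 'e> =
  fun () e -> (), (if predicate e then [e] else [])

/// A stateful aggregation: emits the running total after each event.
let runningTotal : Projection<int, int, int> =
  fun total n -> total + n, [total + n]

let _, evens = project (filter (fun n -> n % 2 = 0)) () [1 .. 6]
let finalTotal, totals = project runningTotal 0 [1 .. 4]
```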

The EventStore projection system is quite handy and has several built-in filters, such as a stream prefix filter, an event-type filter and a projection running custom JavaScript. An issue with EventStore projections is that they haven’t worked well on a cluster. As such, the first step to scaling the projection system was running projections on a replica EventStore instance, downstream from the cluster. This instance could run as a single node and its sole purpose would be to generate and distribute projections. An asynchronous replication service would consume the log to populate the projection node.

Another issue using EventStore for projections is that its log isn’t partitioned, and as such, the single reader thread becomes a bottleneck. To scale the projection system, we introduced Kafka as the distribution medium. A service executes projection state machines, and emits outputs to Kafka topics. This service can run filtering projections as defined above, but it can also run more complex transformations. For example, a projection can be defined to translate between internal and external contracts of a system. Stream snapshots can also be computed using a projection.

Kafka serves well as a distribution medium, however we don’t rely on it as a source of truth. The projected topics have a retention policy, and architecturally, the projection system is designed to tolerate failures in the Kafka cluster, either by reading the upstream event store, awaiting rehydration of the cluster or failing over to another region as described next.

Geo-Replication

Another dimension of scaling is redundancy — continuous operation becomes increasingly critical as the business grows. Redundancy of storage systems is typically achieved using state-machine replication wherein data is replicated across a cluster, tolerating failures of some number of machines. It is quite common for database products to support clustering within a data center. It is much less common for database products to support clustering both within a datacenter and between data centers. This isn’t simply a matter of growing a cluster to include nodes across different data centers — the latency differences between a LAN and a WAN must be taken into account and reflected in the replication protocol.

EventStore runs as a cluster using a synchronous replication protocol which ensures consistency, or more precisely linearizability, among a quorum of nodes. A clustered mode of operation is essential in a cloud environment where individual VMs are routinely restarted for maintenance. EventStore however does not support cross datacenter clustering, which we’ve had to implement as a bolt-on component. Since EventStore exposes the underlying log, this was possible.

The bolt-on design augmented a single-region system with asynchronous cross-region replication. Since cross-region replication is asynchronous, there is a possibility of data loss, which was taken as acceptable during regional failures. However, the regional failover and fail-back processes still need to take the system through consistent states.

Consider a system with two regions — a primary and a secondary. Each region contains a cluster, and there is an asynchronous replication channel from the primary to the secondary. The primary region accepts all writes. The secondary region can’t accept writes — this would result in conflicts — but its log can be consumed by downstream systems, including a projection system. During a failure in the primary region, the secondary region can be turned into a primary, re-routing all writes to it. At this point, the system can continue to operate, though possibly in a compromised state. For example, a failure to the secondary region cannot be tolerated. Moreover, some downstream systems may only operate in the primary region and must therefore await its recovery.

In order to fail-back and recover the primary region the asynchronous replication channel must be reversed and directed into a suitable replica. The logs between the primary and secondary regions may have diverged and conflicts would result if replication is reversed into the original primary. A suitable replica can be obtained by restoring a backup of the secondary region in the primary region and then reversing the asynchronous replication channel.

A more graceful way to achieve the fail-back is to extend the chain to replicate from the secondary region back to the primary region, but into a 3rd replica cluster. This makes it possible to fail-back to the 3rd replica in the primary region — turning the tail into a head. Meanwhile, a new tail can be bootstrapped resulting in a continuous rotation of the chain. This design provides a tradeoff between the costs of operating a 3rd cluster and recovery time.

In essence, this design is akin to chain replication. In chain replication, nodes are organized into linearly ordered chains, wherein a head node accepts writes, which are propagated across the chain, the last node of which is the tail node. Reads can be served by any node in the chain, depending on recency and availability needs. Reads in our case are reads of the log performed by the projection system.

The following diagram depicts the architecture of the Jet event-sourcing platform:

Figure 3: Jet Event-Sourcing Platform

The diagram illustrates the chain paradigm described earlier. The head of the chain is in the primary region and it accepts all writes and reads used for writes. The secondary region hosts the middle of the chain, and an F# service asynchronously replicates events from the head. A third replica is again hosted in the primary region. The projection system is situated downstream from a replica in each region, and because it is asynchronous by nature, it doesn’t need to consume the log of the head node. In the diagram, the projection system emits outputs to Kafka, however it can just as well emit outputs to another system. Moreover, we can rely on Kafka’s streaming component to form downstream systems.

A nice property of this architecture is that it allows downstream systems to inherit geo-replication capabilities from the event store. For example, Kafka is not a geo-replicated system, and while tools exist to make it so, it is much easier to reason about the system if the source-of-truth itself is geo-replicated. Moreover, rather than geo-replicating each downstream component using its own replication mechanism, all components can piggy back on a single platform. In addition to Kafka, we’ve used this approach to add geo-replication to several other systems, including SQL databases, ElasticSearch clusters, Redis caches, etc.

Consistency Verification

In an asynchronous system, independent parts operate independently, which also means they fail independently. The regional replication system and the projection system are both asynchronous systems, and we needed a way to monitor their consistency with respect to the upstream event store. We did this by building an out-of-band verification system, which would compare a log to a downstream system. One configuration of the system compares EventStore logs, and another compares an EventStore log to a Kafka topic. The system checks to make sure that:

All applicable events are transmitted
That they’re transmitted in the correct order
And with an expected latency
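
The three checks can be illustrated with a small offline comparison of two recorded sequences. This is only a sketch under simplifying assumptions (events identified by sequence number, timestamps available on both sides); the types and the verify function are hypothetical and do not describe the actual verification system.

```fsharp
open System

type Observation = { SN: int; At: DateTime }

type Violation =
  | Missing of int      // never seen downstream
  | OutOfOrder of int   // seen downstream after a later event
  | Late of int         // seen downstream, but beyond the allowed latency

let verify (maxLatency: TimeSpan) (upstream: Observation list) (downstream: Observation list) : Violation list =
  let received = downstream |> List.map (fun r -> r.SN, r.At) |> Map.ofList
  let missingOrLate =
    upstream
    |> List.choose (fun u ->
      match Map.tryFind u.SN received with
      | None -> Some (Missing u.SN)
      | Some at when at - u.At > maxLatency -> Some (Late u.SN)
      | Some _ -> None)
  let outOfOrder =
    downstream
    |> List.pairwise
    |> List.choose (fun (a, b) -> if b.SN < a.SN then Some (OutOfOrder b.SN) else None)
  missingOrLate @ outOfOrder

// Usage: event 0 arrives late and after event 1; event 2 never arrives.
let t0 = DateTime(2020, 1, 1)
let upstream = [ for sn in 0 .. 2 -> { SN = sn; At = t0 } ]
let downstream =
  [ { SN = 1; At = t0.AddSeconds 1.0 }
    { SN = 0; At = t0.AddSeconds 10.0 } ]

let violations = verify (TimeSpan.FromSeconds 5.0) upstream downstream
```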

This verification system helped us find bugs in our bolt-on projection and replication systems, in EventStore as well as our Kafka client Kafunk. Moreover, it provides continuous monitoring of safety and liveness properties.

Distributed Tracing

Understanding and debugging systems involving multiple nodes is difficult. Understanding and debugging systems involving multiple nodes and asynchronous interactions is even more difficult. As a result, distributed tracing in an event-sourced system is particularly important, and unlike many existing tracing platforms, it must support tracing of asynchronous interactions. The verification system described above provides a degree of confidence in the system. However, it doesn’t provide the level of granularity and scope sufficient for all scenarios. For example, we may wish to inspect the handling of a particular external request across various systems. The verification system can tell us that all events are suitably replicated, but it doesn’t record information about particular traces. A trace is a collection of events associated with a key, and the events denote domain-specific system state changes. The trace key is a unique id, typically generated at a system boundary, and propagates across communication mediums in accordance with the tracing protocol. The tracing system collects and indexes tracing events.

Ongoing Work

Cool Storage: while we’ve scaled reads as described above, the issue of ever-growing streams remains. A cool storage mechanism archives older events into cheaper storage mediums.
Projections using Azure functions: the projection system can reference Azure functions to support execution of arbitrary logic. While care must be taken to ensure that the resulting system is well-behaved, we can expand the scope to allow declarative definitions of microservices.
Event-Sourcing Engine: while we’ve gotten quite far with EventStore, we’ve set out to build a replacement event-sourcing data store to continue to meet our scaling demands. With this data store, we’re looking to have built-in support for geo-replication.
Causally-consistent Geo-replication: as noted above, geo-replication is asynchronous and therefore susceptible to data loss. For some operations, we would like to synchronously replicate events before acknowledgement. This would provide causal consistency with respect to individual streams across regions.

Conclusion

Event-sourcing is founded on sound principles, and while there certainly are challenges to building such a platform, as evidenced herein, the benefits outweigh the risks. A notable benefit for the systems engineer is the stability of the architecture: it is possible to scale individual components without changing the core. Teams can build their systems autonomously, but also integrate seamlessly when required. From a theoretical standpoint, the log at the heart of event-sourcing allows disparate components to reach consensus in a non-blocking manner. Moving forward, we will continue to enhance the event-sourcing platform to continue meeting the demands of a world-class shopping experience at Jet.

Acknowledgements

The event-sourcing platform was made possible by efforts of many individuals across several teams at Jet.

Contributors: Cole Dutcher, Andrew Duch, Erich Ess, Lev Gorodinski, Scott Havens, Mike Hanrahan, Gina Maini, Brian Mitchell, John Turek, Ido Samuelson.

We’re Hiring

We’re hiring — if you’d like to get involved in some of these efforts, reach out to Jet Technology Careers or to me directly.

References

Event-Sourcing — Martin Fowler’s definition of event-sourcing.
EventStore — The open-source, functional database with Complex Event Processing in JavaScript.
CQRS Documents — CQRS/Event-sourcing article by Greg Young.
Apache Kafka
Chain Replication for Supporting High Throughput and Availability — introduces the chain replication protocol.
Using I/O Automata for Developing Distributed Systems — the IO automaton formalism introduced by Stephen Garland and Nancy Lynch.
The Log: What every software engineer should know about real-time data’s unifying abstraction
Bottled Water: Real-time integration of PostgreSQL and Kafka — demonstrates the use of a log in a data store to integrate with Kafka.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure — Google’s distributed tracing platform.
Kafunk — F# Kafka client.
Event-sourcing with F# and EventStore.
Universality of Consensus — for details on the relationship between event-sourcing and consensus.

Universality of Consensus

Leo Gorodinski — Wed, 18 Oct 2017 12:14:43 GMT

Consensus is a fundamental problem in distributed computing and in this post we will see exactly why that is the case. In the spirit of modularity, we investigate the essence of distributed computation and seek fundamental building blocks with which we can compose larger systems. A consensus object is one such building block forming the core of a universal construction which provides a linearizable wait-free implementation of any other object given its sequential specification. In what follows, we shall define the notions of wait-freedom, linearizability and consensus. Furthermore, we discuss sequential specifications and what it means for one object to implement another. That consensus is at the heart of a universal construction is a reasonably natural notion. After all, the difficulty of distributed computation comes from the absence of common knowledge about the system and consensus gives us just that — common knowledge. To make our intuitions more precise, we begin by establishing a system model on which we base our assertions.

System Model

The system consists of objects accessed concurrently by processes, and both can be modeled by IO automata. In essence, an IO automaton is a state machine, consisting of a set of states and a set of events depicting transitions between states. A process is a sequential thread of control and its events model interactions with objects: an invoke event corresponds to the invocation of an operation on an object and a receive event corresponds to the receipt of a response from an object. An object is a data structure shared by processes and its events model invocations of operations by processes: an invoke event corresponds to the invocation of an operation and a respond event corresponds to the response. An operation is therefore delimited by two events, the invocation and the response. The invoke event on a process is an output event (an outgoing communication), while the invoke event on an object is an input event (an incoming communication). Similarly, the response event on a process is an input event, and the response event on an object is an output event. Notice how an invoke event on a process and object are output and input events respectively, forming a symmetric matching pair when they refer to the same process and object. Automata can be composed by matching their respective events. The states of the resulting composite automaton are tuples of states of the constituent automata, and the set of events is the union of events of the constituent automata. For example, a queue is an object with two operations: enqueue which given an object, adds it to the queue, and dequeue which returns the object at the head of the queue, or a null value if the queue is empty. A queue may be accessed concurrently by multiple processes which together form a concurrent system.

Histories

A collection of events resulting from execution of processes and objects in a system is called a history. Events at individual automata are totally ordered, which means that given any pair of events we can determine which came first. We can extend this total order of events to a partial order on operations. An operation O comes before operation P if the response event to O is received before the invocation event of P. This ordering is partial because some operations may overlap. That is to say, operations may be concurrent. Recall that we referred to a process as a sequential thread of control. We can formalize this notion using histories. Given a history H, a sub-history belonging to a specific automaton A is denoted H|A. A history of events for a process H|P consists of alternating, matching invocation-response pairs. In other words, a process consists of a totally ordered set of operations. A totally ordered history of operations is called a sequential history. The difficulty in implementing concurrent objects is that in general, histories are not sequential and operations from different processes can overlap, resulting in a possibility of ill-defined states. For example, the dequeue operation on a queue might first check if the queue is empty and if it isn’t return the item at the head of the queue, otherwise return null. However, if a process invokes the dequeue operation while a dequeue operation from another process is still in progress, they could both dequeue the same item, violating the invariant of a queue.

Sequential Specifications

A sequential specification is a specification of an object assuming a sequential history. Under these circumstances, we can regard an object as a state machine with a transition function δ, such that δ(s,op(args)) — given a state and an invocation of an operation — returns a pair consisting of a new state s’ and a value res to return to the calling process. The sequential specification can then be defined as a set of pre-conditions and post-conditions on object states before and after the execution of operations. For example, we might say that the state of a queue object doesn’t contain an item before an enqueue operation, and does contain the enqueued item after an enqueue operation. Sequential specifications serve as a recipe for implementing an object in a universal construction.
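To make the transition function concrete, here is a minimal sketch of δ for the queue used as the running example. The names QueueOp, delta and run are our own, and an option value stands in for the null returned by an empty dequeue.

```fsharp
// Operations of a FIFO queue, represented as data rather than as invocations.
type QueueOp<'a> =
  | Enqueue of 'a
  | Dequeue

// Transition function: given a state and an operation, return the new state
// and the value handed back to the caller (None stands in for null).
let delta (state : 'a list) (op : QueueOp<'a>) : 'a list * 'a option =
  match op with
  | Enqueue item -> state @ [ item ], None
  | Dequeue ->
    match state with
    | [] -> [], None
    | head :: tail -> tail, Some head

// Run a sequential history from the empty queue, collecting each response.
let run (ops : QueueOp<'a> list) : 'a list * 'a option list =
  let step (state, responses) op =
    let state', res = delta state op
    state', responses @ [ res ]
  List.fold step ([], []) ops
```

Because delta is a pure function of the state and the operation, any two processes that apply the same sequence of operations arrive at the same state, which is the property the universal construction relies on.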

Wait Freedom

A simple way to make a concurrent implementation of an object is through the use of locks. Locks or mutexes enclose critical sections wherein only a single process can execute at a given time. This allows the programmer to use sequential reasoning to assert the correctness of an algorithm. The use of locks however isn’t ideal: a faulty process can stall or arbitrarily delay the execution of other processes. The absence of locks in an implementation is called lock-freedom. An implementation is lock-free if at least one thread is guaranteed to make progress. Use of a lock makes this guarantee impossible: if two threads are contending for a resource, and the one that acquires it first stalls, the second thread will be blocked indefinitely. Wait-freedom is a stronger progress condition which guarantees that each process can make progress in a finite number of steps regardless of the behavior of other processes. Thus any wait-free implementation is also lock-free, but not necessarily so the other way around. In what follows, we will consider wait-free implementations of an object based on another object. Progress conditions are a particularly important consideration in a distributed system, where independent failures are more common and the scheduler may consist of multiple disparate components.

Wait-freedom is also an important notion for theoretical reasons because it applies to all processes and it is independent of the process scheduler — so long as processes are scheduled, wait-freedom guarantees progress for all processes. By contrast, lock-freedom only guarantees progress for some process. And a lock-based procedure relies on the scheduler to provide starvation-freedom, by ensuring that processes eventually leave the critical section. Wait-freedom is therefore starvation-freedom in presence of failures.
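The difference between the two progress conditions can be seen in a shared counter. The sketch below is illustrative only: the Counter type is our own, and the wait-free claim assumes that the platform's atomic increment completes in a bounded number of steps, which depends on the hardware.

```fsharp
open System.Threading

type Counter() =
  let mutable value = 0

  // Lock-free: a failed compare-and-swap means some other thread succeeded,
  // so the system as a whole progresses, but this caller may retry unboundedly.
  member this.IncrementLockFree () : int =
    let mutable result = 0
    let mutable finished = false
    while not finished do
      let seen = Volatile.Read(&value)
      if Interlocked.CompareExchange(&value, seen + 1, seen) = seen then
        result <- seen + 1
        finished <- true
    result

  // Wait-free under the stated assumption: a single atomic step, no retry loop.
  member this.IncrementWaitFree () : int =
    Interlocked.Increment(&value)

  member this.Value = Volatile.Read(&value)
```

Neither method takes a lock, so a stalled thread cannot prevent others from incrementing. Only the second method bounds the number of steps taken by each individual caller.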

Linearizability

In order to be able to transfer the correctness of a sequential system to a concurrent system, we must define a correctness condition for concurrent objects. Linearizability is such a correctness condition and it states that a concurrent object is linearizable if all possible histories involving that object and any number of processes can be linearized. A linearization of a concurrent history H is a valid sequential history S wherein the partial order on operations imposed by H is respected by S. Therefore, S completes the partial order defined by H, but agrees with H on orderings that H does define. Linearizability is useful because it allows us to transfer the reasoning about a sequential system to reasoning about a concurrent system, significantly simplifying analysis. Moreover, linearizability is a local property in that if a system consists of linearizable objects, the system is itself linearizable. In other words, linearizability composes.

Figure 1 below depicts two processes interacting with an object. The solid disks represent events, histories are horizontal lines, operations are delimited by invoke and respond events and form matching pairs on process and object histories. The interaction involves two operations, and the operations overlap. Even though we may impose a total order on events across all three histories, the two operations are concurrent. There are therefore two linearizations of this history: one where operation 1 happens first and another where operation 2 happens first.

Figure 1. Space-time diagram with processes, objects, events and operations.
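The counting argument above can be checked mechanically. The following sketch enumerates the total orders that extend the partial order of a history. The Operation record and its integer timestamps are assumptions made for illustration, and the check covers only the ordering requirement; a full linearizability check would also validate each candidate against the sequential specification.

```fsharp
// An operation is delimited by its invocation and response events; the
// integer timestamps are an assumed stand-in for positions in the history.
type Operation = { Name : string; Invoke : int; Respond : int }

// O precedes P when the response to O occurs before the invocation of P.
let precedes (o : Operation) (p : Operation) = o.Respond < p.Invoke

// All ways of inserting x into a list, used to enumerate total orders.
let rec insertions (x : 'a) (ys : 'a list) : 'a list list =
  match ys with
  | [] -> [ [ x ] ]
  | y :: rest -> (x :: ys) :: (insertions x rest |> List.map (fun zs -> y :: zs))

let rec permutations (xs : 'a list) : 'a list list =
  match xs with
  | [] -> [ [] ]
  | x :: rest -> permutations rest |> List.collect (insertions x)

// A total order respects the history if it never reverses a precedence.
let respects (order : Operation list) =
  let indexed = List.indexed order
  List.allPairs indexed indexed
  |> List.forall (fun ((i, o), (j, p)) -> not (precedes o p) || i < j)

// Candidate linearizations: total orders that extend the partial order.
let candidates (history : Operation list) =
  permutations history |> List.filter respects
```

Two overlapping operations yield two candidates, while two operations separated in time yield exactly one, matching the intuition that only concurrent operations leave the order undetermined.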

Consensus Objects

A consensus object is a concurrent object with a single operation propose which behaves in accordance with the following sequential specification:

let propose value =
  if state = null then
    state <- value
  state

A consensus object must adhere to the following conditions:

Consistency: all processes invoking the consensus object receive the same value.
Integrity: the value returned by the consensus object is a value proposed by some process.
Termination: the operation is wait-free.

The first two conditions are safety properties — they assert that consensus does what one would expect it to do. The last condition is a liveness property — it asserts that the operation completes and tolerates failures in participating processes. The transition function δ in the sequential specification of a consensus object would take the state as the first argument, and the proposed value as the second argument and execute in accordance with the code above.
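The sequential specification above is only correct when calls do not overlap. As a sketch of one concurrent realization, the class below relies on compare-and-swap, which the next section places at the top of the consensus hierarchy. The Consensus type is our own, values are restricted to strings for brevity, and proposing null is not supported because null marks the undecided state.

```fsharp
open System.Threading

type Consensus() =
  // null marks the undecided state, as in the specification above.
  let mutable state : string = null

  member this.Propose (value : string) : string =
    // Atomically install value only if nothing has been decided yet;
    // CompareExchange returns whatever was stored before the call.
    let prior = Interlocked.CompareExchange<string>(&state, value, null)
    if isNull prior then value else prior
```

Every call completes after a single atomic step regardless of what other callers do, the returned value is always one that was proposed, and all callers observe the first installed value.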

Consensus Numbers

To show that there is a wait-free implementation of an object using another, we can map the objects to their consensus number and compare the numbers. A consensus number is the maximum number of processes for which an object can solve consensus. If object X can solve consensus for at least as many processes as object Y, then X can be used to implement Y. For example, a shared register cannot be used to solve consensus for even two processes. A test-and-set object can be used to solve consensus for at most two processes. A compare-and-exchange object on the other hand can be used to solve consensus for any number of processes, so its consensus number is ∞. In this way, a hierarchy is formed enumerating objects by their synchronization power:

|----------------|------------------|
| Object         | Consensus Number |
|----------------|------------------|
| MRSW register  | 1                |
| test-and-set   | 2                |
| cas            | ∞                |
| queue w/ peek  | ∞                |
|----------------|------------------|

Consensus numbers were introduced by Maurice Herlihy in Wait-Free Synchronization. It is a natural progression from his earlier work on Linearizability. A central result proved in the paper is the following theorem:

If [object] X has consensus number n, and Y has consensus number m < n, then there exists no wait-free implementation of X by Y in a system of more than m processes.

The proof proceeds by contradiction — it assumes the existence of an implementation, then represents the implementation as a composite IO automaton and finally shows the contradiction. This theorem is significant because it mathematically defines the implementable-by relation. Herein, it allows us to establish that a universal construction can wait-free implement any other object.
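As an illustration of the test-and-set row in the hierarchy above, the following sketch solves consensus for two processes. It assumes that Interlocked.Exchange on a flag can play the role of test-and-set, and the TwoProcessConsensus type is our own. The protocol breaks down with a third process, because a loser would not know which of the other two proposals won.

```fsharp
open System.Threading

type TwoProcessConsensus() =
  // One proposal cell per process, written only by its owner.
  let proposals : string[] = Array.zeroCreate 2
  let mutable flag = 0

  // i must be 0 or 1, and each process may call Propose at most once.
  member this.Propose (i : int, value : string) : string =
    // Publish our proposal before competing for the flag.
    proposals.[i] <- value
    // Interlocked.Exchange plays the role of test-and-set: it sets the flag
    // and reports the previous value, so exactly one caller observes 0.
    if Interlocked.Exchange(&flag, 1) = 0 then value
    else proposals.[1 - i]
```

The loser can safely read the winner's cell because the winner wrote its proposal before setting the flag.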

Universal Constructions

An object is universal if it can implement any other object given a sequential specification, and this implementation is both wait-free and linearizable. Such implementations are called universal constructions. One such object is the consensus object defined above. An implementation may also make use of any number of shared read/write registers.

Figure 2. A visual depiction of a universal construction.

The following is a universal construction due to Michel Raynal and it is based on the state-machine replication paradigm. Each process maintains a copy of the constructed object, and uses a consensus object to keep the copies consistent. The construction consists of two parts: the operation and a background helper process. The operation assigns the proposal and waits for the background process to assign a result. The background process runs in a loop, checking if a proposal has been assigned. If it has, then it sends that proposal to a consensus object which returns the first received proposal for that round. The proposal itself consists of the operation and its arguments, and the proposing process. The background process then executes the state machine transition function on the latest state and the proposed operation. Each process keeps a local copy of the state and the consensus protocol ensures that operations are applied to local states in the same order. As such, all processes have a common view of the object. If the proposal that was returned by the consensus object is from the calling process, we assign the result so that the operation can return it to the caller.

The operation is defined as follows:

1. let perform args =
2.  result[i] <- null
3.  proposal[i] <- ("op(args)",i)
4.  wait (result[i] != null)
5.  result[i]

The local variables result[i] and proposal[i] refer respectively to the result and proposal of process i. On line 3 process i stores its proposal which is proposed by the background process below.

The operation on line 3 is quoted to indicate that we’re referring to a representation of an operation rather than an invocation. In a programming language, this can be implemented in a few different ways, such as using reflection, quotations as in F#, or using a free monad.

The background process:

1. while true do
2.  if proposal[i] != null then
3.    k[i] <- k[i] + 1
4.    exec[i] <- CONS[k[i]].propose(proposal[i])
5.    (s[i],res) <- δ(s[i],exec[i].op)
6.    if (i = exec[i].proc) then 
7.      proposal[i] <- null
8.      result[i] <- res

The local variable k[i] refers to the execution round for process i, CONS refers to the shared consensus object indexed by the round. The local variable exec stores the result of the proposal. The pair s and res store the state and the result of the state machine transition function δ applied to the proposed operation and the current state.
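For readers who prefer running code, here is a simplified sketch of this first construction, specialized to a queue of integers. It departs from the pseudocode in two labeled ways: the operation drives the consensus rounds itself rather than delegating to a background process, and the consensus objects are assumed to be compare-and-swap cells held in a fixed-size array. The names Slot, Replica and Perform are our own.

```fsharp
open System.Threading

type QueueOp =
  | Enqueue of int
  | Dequeue

// A proposal pairs a representation of the operation with its proposer.
type Proposal = { Proc : int; Op : QueueOp }

// Sequential specification of the replicated object (a FIFO queue).
let delta (state : int list) (op : QueueOp) : int list * int option =
  match op with
  | Enqueue item -> state @ [ item ], None
  | Dequeue ->
    match state with
    | [] -> [], None
    | head :: tail -> tail, Some head

// One-shot consensus cell; compare-and-swap is an assumed building block.
type Slot() =
  let mutable decided : Proposal = Unchecked.defaultof<Proposal>
  member this.Propose (p : Proposal) : Proposal =
    let prior =
      Interlocked.CompareExchange<Proposal>(&decided, p, Unchecked.defaultof<Proposal>)
    if isNull (box prior) then p else prior

// One replica per process; a replica is not shared between threads.
type Replica(id : int, slots : Slot[]) =
  let mutable state : int list = []   // local copy s[i]
  let mutable round = 0               // next round k[i]

  member this.Perform (op : QueueOp) : int option =
    let mine = { Proc = id; Op = op }
    let mutable result : int option = None
    let mutable finished = false
    while not finished do
      // Every replica applies the decided proposals in round order.
      let exec = slots.[round].Propose mine
      let next, res = delta state exec.Op
      state <- next
      round <- round + 1
      if exec.Proc = id then
        result <- res
        finished <- true
    result

  member this.State = state
```

A replica catches up on rounds decided by others before its own proposal wins, so all replicas apply the same operations in the same order. Like the pseudocode, this variant gives no bound on how many rounds a given process may lose.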

The problem with the construction above is that it is not wait-free. Depending on the scheduler, the background processing loop may never terminate. We can fathom an adverse schedule wherein a given process never “wins” consensus, thereby creating an unbounded delay. In order to make the construction wait-free we have to make a few adjustments. To account for the possibility of failures of other processes, each process can keep an array of sequence numbers representing the last operation applied by every other process. Only if sequence numbers of other processes are increasing — as communicated by a shared register — do they participate in proposals. Therefore, if a process fails, its proposals no longer count and can’t prevent other processes from making progress.

The operation is defined much in the same way as before, however each process stores the last operation it applied in a shared register LAST_OP and keeps track of the sequence number of the operations applied across other processes using the local variable last_sn.

1. let perform args =
2.   result[i] <- null
3.   LAST_OP[i] <- ("op(args)",last_sn[i][i] + 1)
4.   wait (result[i] != null)
5.   result[i]

The background process constructs a proposal by collecting operations executed by processes since the last round. The local variable last_sn keeps track of the operations applied across all other processes. Then, for each pair of operation and process in the decided proposal, the background task executes the transition function δ and increments last_sn for the corresponding process. If the process is the calling process we assign the result as before.

 1. while true do
 2.   prop[i] <- []
 3.   for j in [1..n] do
 4.     if (LAST_OP[j].sn > last_sn[i][j]) then
 5.       prop[i] += (LAST_OP[j].op,j)
 6.   if (prop[i] != []) then
 7.     k[i] <- k[i] + 1
 8.     exec[i] <- CONS[k[i]].propose(prop[i])
 9.     for r = 1 to |exec[i]| do
10.       (s[i],res) <- δ(s[i],exec[i][r].op)
12.       let j = exec[i][r].proc
13.       last_sn[i][j] <- last_sn[i][j] + 1
14.       if (i = j) then 
15.         result[i] <- res

To prove that this construction is wait-free we must ascertain that for a non-failing process i, line 4 in the definition of the operation eventually completes, which means that line 15 in the background task must eventually execute. We claim that there is a decided proposal containing process i. If the claim is true, then at some point, the test on line 14 will be true and the desired outcome is achieved. We can prove the claim by considering the loop on line 3. It traverses the entire list of processes and includes the operation of each process in the proposal of the calling process. This means that at some point, the proposals of all processes contain process i, proving our claim.

Related Work

The Glitch Phenomenon: Leslie Lamport regards consensus to be an intrinsic puzzle of the universe and gave this idea a topological presentation, which of course the author is quite fond of. The idea is perhaps even more fundamental than the consensus object considered herein. Consider an arbiter, an electronic device which must decide which input signal from a processor arrived first in order to determine whom to grant access to a shared resource. If the requests arrive within a short duration of each other, the arbiter may have a hard time deciding. While an engineering solution was found at some point in the 70s, the underlying problem endures. Lamport provides a formalism wherein the device is represented as a continuous function between inputs and outputs, where inputs and outputs are themselves time-dependent function spaces (behaviors for those familiar with FRP). The continuity of this function representing the device is what ultimately allows for the possibility of an unstable state. It can be shown that the space of outputs for which a timely decision is made is not connected, only becoming connected if time is allowed to tend to infinity. If we assume that the input space is connected, a continuous mapping between the input space and the output space requires use of a zero-function, which corresponds to an inability to decide.
Event-Sourcing: Among other nice properties, event-sourcing enables state-machine replication. Recall that in the consensus hierarchy described above, a queue with a peek operation has an infinite consensus number. Such a queue is what we get from the log at the heart of an event-sourcing architecture. This log is not just a simple queue, because rather than providing a dequeue operation, it allows consuming processes to traverse the queue in order without mutating the queue itself. As a result, each process can reach a common state without interfering with other processes. With event-sourcing however, it is more typical for this common state to be reached asynchronously, thus foregoing linearizability with respect to the log, but it still maintains order and therefore sequential consistency.

Summary

We’ve defined a universal construction, a mechanism for providing a linearizable wait-free implementation of any object given its sequential specification. This result exhibits a great degree of generality and demonstrates the fundamental nature of a consensus object at the core of the construction. Intuitively, it makes sense: consensus is fundamental because it is fundamentally lacking in a distributed system where each participant has their own, independent world view. Somewhat ironically, while consensus is universal it is also impossible as we have seen earlier. So what are we to conclude from the impossibility and universality of consensus? It is important to regard consensus as a limiting notion: we can have consensus but with weaker termination properties, or stronger termination properties, but with a weaker notion of consensus. But once we have a consensus object, we can use it to implement any other object, and this implementation will be wait-free outside of any restrictions on wait-freedom imposed by the consensus object itself. Moreover, as we have seen in the aside on event-sourcing, consensus is useful with a weaker consistency condition.

The Asynchronous Computability Theorem

Leo Gorodinski — Mon, 18 Sep 2017 13:18:20 GMT

In this post I describe the Asynchronous Computability Theorem, which uses tools from Algebraic Topology to show whether a task is solvable in a distributed system. The theorem yields quite readily to intuition, however we will take care to make our statements as precise as possible without getting too caught up in the details. Loosely speaking, the theorem states that a task is solvable with a distributed system if its input space maps continuously into its output space. In particular, a continuous mapping fails to exist when the spaces at hand are incompatible with respect to holes within them. The beauty of the theorem lies in that it provides a geometric interpretation of the execution of a distributed protocol, allowing us to draw on our innate intuition for space to tame the combinatorial explosion of states in concurrent computations. In what follows, we shall answer the following questions: What is a task? What does it mean for a task to be solvable with a distributed system? What are input and output spaces? What is a continuous map? Finally, we shall state the Asynchronous Computability Theorem and use it to demonstrate the impossibility of consensus.

The Impossibility of Consensus

To gain intuition for tasks and solvability, we start with a well known impossibility result: Impossibility of Distributed Consensus with One Faulty Process. Known colloquially as the FLP result, it states that processes communicating over an asynchronous network can’t all agree on a proposal if even one of the processes fails. An asynchronous network is one where communication delays can be arbitrarily long. The task in this case is consensus. Solvability corresponds to whether there exists a terminating algorithm which the participating processes can run to come to consensus. Since it is impossible to distinguish between a communication delay and a failed process, any consensus algorithm may not be able to terminate if a process fails. More generally, k-set agreement refers to tasks where processes must agree on at most k distinct values. The consensus task in FLP is 1-set agreement. It has been shown that in a system of n processes, as long as k < n, the k-set agreement task is unsolvable with a wait-free protocol [8]. That the algorithm is deterministically terminating is a key stipulation of this impossibility result. In the next section, we discuss solutions to consensus wherein the termination property is relaxed.
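The safety side of k-set agreement is easy to state as a check on a finished execution. The sketch below is illustrative and the function name is our own. It verifies that at most k distinct values were decided and that every decided value was some process's input. It says nothing about termination, which is where the impossibility lies.

```fsharp
// inputs.[i] is the value proposed by process i and outputs.[i] the value
// it decided; both lists are assumed to have one entry per process.
let satisfiesKSetAgreement (k : int) (inputs : int list) (outputs : int list) : bool =
  let decided = List.distinct outputs
  // Agreement: no more than k distinct decisions.
  List.length decided <= k
  // Validity: every decision was proposed by some process.
  && decided |> List.forall (fun v -> List.contains v inputs)
```

With k = 1 this is the safety condition of consensus, and larger values of k admit progressively more disagreement.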

The Alpha & Omega of Consensus

In the analysis of distributed algorithms, we make assertions about the correctness of system states — safety properties, and the temporality of system states — liveness properties. Safety ensures that nothing bad happens, and liveness ensures that something eventually happens. The problem revealed by the FLP result is that of liveness — due to arbitrary timing delays, we can’t assert that a consensus algorithm will eventually terminate. Consensus protocols such as Paxos and Raft overcome this limitation through elaborate use of quorums and timeouts, or more generally, failure detectors [4]. Indeed, one can decompose the problem of consensus along safety and liveness properties — the alpha and omega of consensus [3]. Alpha captures the safety properties of consensus — agreement, validity, and integrity, whereas omega captures liveness properties — the eventual termination of the algorithm. It is this latter component of consensus that does not admit a wait-free protocol.

Wait Freedom

Consensus is solvable, but as demonstrated by FLP and the Asynchronous Computability Theorem, it is not wait-free solvable. Wait freedom is a particular type of progress condition:

No process can be prevented from completing an operation by undetected halting failures of other processes, or by arbitrary variations in their speed. [6]

In this way, wait-freedom gives us a type of fault-tolerance particularly important in a distributed system where components fail independently. This is a remarkable global reasoning property that ensures that each non-faulty process of a distributed protocol will make progress regardless of failures of other processes, and do so in a bounded number of steps. It should be noted that the wait freedom property is perhaps too strong: the absence of a wait-free protocol for a task does not mean it can’t be solved with weaker progress guarantees. For example, an achievable progress property for consensus is a non-blocking guarantee which states that eventually progress can be made, albeit slowed arbitrarily by retries in case of collision.

System Model

The system model used in defining the Asynchronous Computability Theorem is based on I/O automata. An I/O automaton is defined as a set of states, events and a transition relation. Individual processes are represented as automatons — Turing machines — with well-defined start and finish events, as well as events corresponding to interaction with a shared memory object. The shared memory object is a linearizable atomic read/write memory, represented as an automaton with events corresponding to read and write operations. Processes have exclusive write access to their memory cell, but may read the contents of memory cells belonging to other processes. A protocol is defined as a set of processes and a shared memory object.

Interestingly, the methodology used to establish the result in this post has been used to unify the synchronous and asynchronous message passing models.

Tasks

Informally, a task is a problem in a distributed system. Formally, a task is a tuple ⟨I,O,Δ⟩ where I is a set of input vectors, O a set of output vectors and Δ a specification relation mapping inputs to outputs. An input vector represents a possible assignment of input values to the processes of a distributed system. Similarly, an output vector represents process outputs at the end of execution. Components of input and output vectors correspond to processes, such that I[i] represents the input to the i-th process and O[i] its output. If I[i] = ⊥ then process i does not participate in the execution. If a process i fails, then O[i]=⊥. The task specification relation Δ maps input vectors to sets of allowable output vectors.

Take for example the binary consensus task. Each process starts with an input value of 0 or 1 and at the end of execution, chooses 0 or 1 such that they all agree and the chosen value was some process’s input. A task specification for a two process binary consensus task is shown in Figure 1 below:

https://medium.com/media/bffd697988186f079a04ba87bb606c27/href
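The specification relation Δ can also be written as a predicate over input and output vectors. The sketch below is our own reading of the definitions above rather than a transcription of the figure: None plays the role of ⊥, and an output vector is allowed when all decided values agree, each was proposed by a participant, and non-participants produce no output.

```fsharp
// None stands for ⊥: a process that does not participate (in an input
// vector) or that fails before deciding (in an output vector).
let allowed (input : int option list) (output : int option list) : bool =
  let proposed = input |> List.choose id
  let decided = output |> List.choose id |> List.distinct
  List.length input = List.length output
  // Agreement: at most one distinct decided value.
  && List.length decided <= 1
  // Validity: the decided value is some participant's input.
  && decided |> List.forall (fun v -> List.contains v proposed)
  // A process that did not participate cannot produce an output.
  && List.forall2 (fun i o -> Option.isSome i || Option.isNone o) input output
```

For the input vector [Some 0; Some 1], for example, the predicate accepts [Some 0; Some 0], [Some 1; Some 1] and the same vectors with one entry replaced by None, and rejects [Some 0; Some 1].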

Solvability

Recall that a protocol is a collection of processes interacting through shared memory to solve a task. A protocol solves a task if 1) no process takes an infinite number of steps to finish, and 2) the output value of a process is consistent with the task specification Δ. A protocol wait-free solves a task if it solves the task even if all but one of the processes fail.

Topological Definitions

Topology is a branch of mathematics which studies continuous transformations of space. It abstracts the concept of the familiar Euclidean space to spaces that are considerably more general. In particular, a topological space need not impose a notion of distance between its points, while still retaining a rigorous treatment of continuity and connectedness. The points in a topological space can represent points of a Euclidean space, but they can also represent functions, other topological spaces, and in our present discussion, configurations of a distributed system.

Connectedness refers to a space being “all in one piece”, and allows us to formally classify spaces according to the number of “holes” they contain. A familiar example is the correspondence of a bagel and a coffee mug: topologically speaking, these spaces are equivalent because one can be continuously transformed into the other. On the other hand, a bagel is incompatible with a solid ball because the former contains a hole. If we represent the state of a distributed system as a particular type of topological space, we can characterize computability as compatibility between spaces. In the next section, we will introduce the concept of a simplex, a member of a topological space known as a simplicial complex. We will then represent a configuration of a distributed system of n + 1 processes as an n-dimensional simplex, a set of possible configurations as a simplicial complex, and wait-free solvability as a continuous mapping between an input complex and an output complex. Despite the generality of topological spaces, we can still appeal to our geometric intuitions in order to visualize their properties. Indeed, every topological space X has a geometric realization |X| which regards the space as a subset of a high-dimensional Euclidean space.

Simplexes

A simplex is a generalization of a triangle to arbitrary dimensions. A triangle is a 2-simplex, a 1-simplex is a line segment, a 0-simplex a point, a 3-simplex a tetrahedron, etc. Simplices themselves are built from lower dimensional simplices called (proper) faces of the simplex. For example, a triangle can be formed by gluing together line segments at the appropriate ends. A line segment can be formed by gluing its two endpoints.

Figure 2: a 0-simplex, 1-simplex, 2-simplex, 3-simplex.

More precisely, a simplex consists of a set of vertices which are affinely independent. In essence, this means that if we pick any one vertex as the origin, the vectors pointing from it to the remaining vertices are linearly independent, so no vertex lies in the flat spanned by the others. In the case of a 2-simplex, this means that there doesn’t exist a line on which all three vertices lie.
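Affine independence can be checked mechanically. The following Python sketch is not part of the source material: it assumes vertices are given as tuples of integer or rational coordinates, and the helper names rank and affinely_independent are made up for illustration. It subtracts the first vertex from the others and tests the resulting difference vectors for linear independence using Gaussian elimination.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (a list of rows) via Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    cols = len(m[0]) if m else 0
    for c in range(cols):
        # Find a row at or below position r with a non-zero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def affinely_independent(points):
    """True if the differences p_i - p_0 are linearly independent."""
    if len(points) <= 1:
        return True
    base = points[0]
    diffs = [[a - b for a, b in zip(p, base)] for p in points[1:]]
    return rank(diffs) == len(diffs)


triangle = [(0, 0), (1, 0), (0, 1)]      # spans a 2-simplex
collinear = [(0, 0), (1, 1), (2, 2)]     # all three points on one line
```

Here affinely_independent(triangle) holds while affinely_independent(collinear) does not, which matches the observation that three points on a line do not form a 2-simplex.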

In a distributed system, a simplex can be used to represent a configuration, that is, an assignment of states to each process. For example, a triangle can represent a system of three processes, each vertex corresponding to a particular state of one process. The dimension of a simplex corresponds to the notion of a process participating in a protocol. If a process fails, it ceases to participate, and the resulting simplex has a smaller dimension.

Simplicial Complexes

A simplicial complex is a collection of simplices K, such that every face of every simplex in K is also a simplex of K, and every intersection of a pair of simplexes is also a simplex of K. Due to the second property, the formation of a complex does not result in any new simplices. A simplicial complex thereby forms a topological space, with simplexes as the points.

For our purposes, a simplicial complex corresponds to a set of possible configurations of a distributed system. We take individual configurations and connect them at their intersections. In this way, we obtain a geometric interpretation of similar system states — similar configurations are those which intersect at their boundaries. Moreover, unlike graph-theoretic models, we have the dimension of the intersection as the degree of similarity between states.

The following diagram depicts a 2-dimensional complex (a 2-complex) consisting of two simplices, ⟨P0, Q1, R2⟩ and ⟨P3, Q1, R2⟩. These in turn consist of 1-simplices such as ⟨P0, Q1⟩ and 0-simplices such as ⟨P0⟩. The intersection of the 2-simplices is the 1-simplex ⟨Q1, R2⟩ and represents the fact that the two system states are similar: they differ only in the state of process P.

Figure 3: a complex for a 2 state system with 3 processes. [Credit 1]

The simplex ⟨P0, Q1, R2⟩ can represent a configuration of three processes, P, Q and R, with local states 0, 1 and 2 respectively. The simplex ⟨P3, Q1, R2⟩ represents a configuration different only in that P has local state 3.
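The complex described above is small enough to write out directly. The following Python sketch is an illustration rather than anything taken from the source: it assumes a simplex is modeled as a frozenset of vertex labels such as "P0", and the helper names faces, complex_from and dimension are hypothetical. It builds the complex by closing the two 2-simplices under taking faces and then computes their intersection.

```python
from itertools import combinations


def faces(simplex):
    """All non-empty faces of a simplex, including the simplex itself."""
    vs = sorted(simplex)
    return {frozenset(c) for k in range(1, len(vs) + 1) for c in combinations(vs, k)}


def complex_from(simplices):
    """Smallest simplicial complex containing the given simplices."""
    result = set()
    for s in simplices:
        result |= faces(s)
    return result


def dimension(simplex):
    """A simplex with n + 1 vertices has dimension n."""
    return len(simplex) - 1


s1 = frozenset({"P0", "Q1", "R2"})
s2 = frozenset({"P3", "Q1", "R2"})
K = complex_from([s1, s2])

# The two configurations differ only in the state of P, so they share an edge.
shared = s1 & s2
```

The shared face is the 1-simplex with vertices Q1 and R2, and it is itself a member of K, as the definition of a simplicial complex requires.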

Simplicial Maps

A simplicial map between complexes K and L carries vertexes of K to vertexes of L and also simplexes of K to simplexes of L. In other words, if a set of vertexes span a simplex in K, their image under a simplicial map will also span a simplex in L. A simplicial map can however collapse a simplex by sending two or more of its vertexes to the same vertex, so that the image has a smaller dimension than the original. A simplicial map that does not collapse any simplices is called non-collapsing. We shall make use of simplicial maps both to formalize the notion of associating processes with vertexes and to represent transitions between system states.
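Both properties can be tested on finite complexes. The sketch below is a hypothetical Python illustration, assuming a complex is a set of frozensets of vertex labels and a map is a dictionary from vertexes of K to vertexes of L; the function names are not from the source.

```python
from itertools import combinations


def closure(simplices):
    """Smallest simplicial complex containing the given simplices."""
    return {
        frozenset(c)
        for s in simplices
        for k in range(1, len(s) + 1)
        for c in combinations(sorted(s), k)
    }


def image(vertex_map, simplex):
    """Image of a simplex under a vertex map."""
    return frozenset(vertex_map[v] for v in simplex)


def is_simplicial(vertex_map, K, L):
    """Every simplex of K must land on a simplex of L."""
    return all(image(vertex_map, s) in L for s in K)


def is_non_collapsing(vertex_map, K):
    """No simplex of K may lose dimension under the map."""
    return all(len(image(vertex_map, s)) == len(s) for s in K)


edge = closure([{"a", "b"}])            # a single 1-simplex
target = closure([{"x", "y"}])          # another 1-simplex
two_points = closure([{"x"}, {"y"}])    # two vertexes with no edge between them

identity_like = {"a": "x", "b": "y"}
collapsing = {"a": "x", "b": "x"}
```

The map identity_like is simplicial and non-collapsing into target, collapsing is simplicial but collapses the edge to a vertex, and identity_like is not simplicial into two_points because the image of the edge is not a simplex there.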

Chromatic Complexes

A coloring of an n-dimensional complex K is a non-collapsing simplicial map from K to an n-dimensional simplex. The vertexes of the target simplex correspond to distinct colors, and a coloring is thus a labeling of the vertexes such that no two neighboring vertexes have the same color. Take note of the non-collapsing property of the simplicial map at play: if two neighboring vertexes in the complex were to map to the same vertex in the simplex (in other words, share a color), it would result in the collapse of the simplex formed by the vertexes in the complex. A complex K together with a coloring χ is called a chromatic complex (K,χ). A simplicial map is color-preserving if the color of a source vertex is the same as the color of the destination vertex. Coloring provides a mechanism to associate the vertexes of a complex to individual processes. Color-preserving maps represent state transitions wherein the processes maintain their identity.
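As a rough illustration, the coloring of the complex from Figure 3 by process name can be checked in a few lines of Python. This is a sketch under assumptions: vertex labels such as "P0" are taken to encode the process in their first character, and the helper names are hypothetical.

```python
from itertools import combinations


def closure(simplices):
    """Smallest simplicial complex containing the given simplices."""
    return {
        frozenset(c)
        for s in simplices
        for k in range(1, len(s) + 1)
        for c in combinations(sorted(s), k)
    }


def is_coloring(color, K):
    """Non-collapsing condition: vertexes of any one simplex get distinct colors."""
    return all(len({color[v] for v in s}) == len(s) for s in K)


def is_color_preserving(vertex_map, source_color, target_color):
    """Each vertex must map to a vertex of the same color."""
    return all(source_color[v] == target_color[w] for v, w in vertex_map.items())


K = closure([{"P0", "Q1", "R2"}, {"P3", "Q1", "R2"}])

# Color every vertex by the process it belongs to.
by_process = {v: v[0] for s in K for v in s}

# A faulty labeling that gives the neighbors P0 and Q1 the same color.
bad = dict(by_process, Q1="P")

# A state transition in which P moves from local state 0 to 3.
transition = {"P0": "P3", "Q1": "Q1", "R2": "R2"}
```

The labeling by_process is a valid coloring, bad is not because the edge between P0 and Q1 would collapse, and transition is color-preserving since every process keeps its identity.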

We can now provide a topological rendering to the formal definition of tasks described above as triples ⟨I,O,Δ⟩. We represent an input vector i ∈ I as a simplex S(i) whose vertices are pairs of processes and their values. An output vector o ∈ O is represented as the simplex S(o) accordingly. The topological task specification is the set of all pairs (S(i), S(o)) wherein Δ holds.
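To make this concrete, here is a minimal Python sketch of the topological specification of 2-process binary consensus. It is an illustration only: it assumes both processes participate, encodes a vertex as a (process, value) pair, and uses hypothetical names (simplex, delta, task) that do not appear in the source.

```python
from itertools import product

PROCESSES = ("P", "Q")
VALUES = (0, 1)


def simplex(vector):
    """S(v): pair each process with its value in the vector."""
    return frozenset(zip(PROCESSES, vector))


def delta(inputs, outputs):
    """Consensus: everyone decides the same value, and it was somebody's input."""
    return len(set(outputs)) == 1 and outputs[0] in inputs


vectors = list(product(VALUES, repeat=len(PROCESSES)))

# All pairs (S(i), S(o)) for which the task relation holds.
task = {(simplex(i), simplex(o)) for i in vectors for o in vectors if delta(i, o)}
```

With equal inputs there is exactly one admissible output simplex, while with distinct inputs there are two, one for each value the processes may agree on.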

Subdivisions

A subdivision of a complex K is a complex σ(K) such that:

Each simplex of σ(K) is contained in a simplex in K
Each simplex of K can be formed by combining simplices in σ(K)

A subdivision can be thought of as a triangulation of a complex. The carrier of a simplex S in a subdivision σ(K), denoted by carrier(S,K), is the unique smallest simplex T in the original complex K such that S is contained in T. A chromatic subdivision of a chromatic complex (K,χ) is a subdivision σ(K) together with a coloring such that, for all S in σ(K), the set of colors of S is a subset of the colors of its carrier. A chromatic subdivision is therefore a subdivision wherein all constituent simplices are colored in accordance with the colors assigned to their carriers. In Figure 4, the 2-simplex shaded orange on the right hand side of the diagram is a simplex of a chromatic complex. The chromatic subdivision of this complex is shown on the left, together with a shaded 2-simplex of the subdivision. The shaded 2-simplex on the right is its carrier. Moreover, the colors of the vertexes of the 2-simplex on the left are taken from the colors of its carrier.

Figure 4: a simplex in a subdivision (left) and its carrier (right). [Credit 1]

Chromatic subdivisions model the execution of a protocol. If a protocol starts with a given input simplex, a round of execution subdivides the simplex, such that the subdivision represents the reachable states of the system after the round. These reachable states form the protocol complex which reflects the structure of the protocol.

To illustrate the unfolding of a protocol complex, consider the following 2-process task. The two processes, P and Q, have inputs p and q respectively, represented by the input simplex in Figure 5. The processes interact using a shared memory object with two cells, one for each process. Each process first writes its input to shared memory and then scans the entire memory object. The processes run concurrently, but there are only three possible outcomes:

If P reads before Q writes then P’s view is (p,⊥) and Q’s is (p,q).
If both P and Q read after both have written then both have view (p,q).
If Q reads before P writes then Q’s view is (⊥,q) and P’s is (p,q).

Each execution is represented as a simplex in the diagram below:

Figure 5: protocol complex for a simple protocol. [Credit 1]

Execution 1 is the 1-simplex ⟨P(p,⊥),Q(p,q)⟩ at the bottom of Figure 5, execution 2 is ⟨P(p,q),Q(p,q)⟩ and execution 3 is ⟨P(p,q),Q(⊥,q)⟩. These three simplexes form the protocol complex. If we consider the original input simplex, then the protocol complex is a subdivision, dividing the input simplex into three pieces. Note also how the vertex P(p,q) is an intersection of two 1-simplexes. This is a geometric representation of the fact that in this state, P cannot distinguish between executions 2 and 3.
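The three simplexes can also be recovered by brute force. The following Python sketch is an illustration under stated assumptions: each write and each scan is treated as a single atomic step, None stands in for ⊥, and the names run and schedules are hypothetical. It enumerates every interleaving in which each process writes before it scans and collects the resulting pairs of views.

```python
from itertools import permutations


def run(schedule, inputs):
    """Execute one interleaving and return each process's view of memory."""
    memory = {name: None for name in inputs}   # None plays the role of ⊥
    views = {}
    for proc, op in schedule:
        if op == "write":
            memory[proc] = inputs[proc]
        else:
            # An atomic scan of all cells, ordered by process name.
            views[proc] = tuple(memory[n] for n in sorted(memory))
    return views


def schedules(procs):
    """All interleavings in which every process writes before it scans."""
    steps = [(p, op) for p in procs for op in ("write", "scan")]
    for order in permutations(steps):
        if all(order.index((p, "write")) < order.index((p, "scan")) for p in procs):
            yield order


inputs = {"P": "p", "Q": "q"}

# Each distinct outcome is a 1-simplex whose vertexes are (process, view) pairs.
protocol_complex = {
    frozenset(run(s, inputs).items()) for s in schedules(sorted(inputs))
}

# The vertex P(p,q) and the simplexes that contain it.
p_sees_both = ("P", ("p", "q"))
containing = [s for s in protocol_complex if p_sees_both in s]
```

Although there are six interleavings, they yield only three distinct simplexes, and the vertex P(p,q) lies on two of them, which is the indistinguishability noted above.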

Connectivity

A graph is connected if there is a path between every pair of vertices. Similarly, a simplicial complex is 0-connected if there is a path between every pair of vertices. More generally, a complex is 1-connected if, in addition, any loop in it can be continuously deformed to a point, 2-connected if the same also holds for any sphere, and so on. This generalization can be visualized with the following diagram:

Figure 6: left: contractible loop, right: not contractible. [Credit 1]

One loop in Figure 6 can indeed be contracted to a point. The other loop cannot, because of the “hole” in the way. Therefore, while the shape in Figure 6 is 0-connected, it isn’t 1-connected. We can also view connectivity from a combinatorial perspective: if K and L are n-connected complexes such that their intersection K ∩ L is (n-1)-connected, then their union K ∪ L is n-connected. Thus we can infer connectivity of a composite complex via connectivity of its constituents.

Connectivity has the pleasant property that if a complex is n-connected, the complex with a dimension removed (a faulty process) is (n-1)-connected, thereby extending all assertions to configurations where some processes have faulted. A complex is disconnected if it fails to be connected. For example, a complex representing the final states of a system can be disconnected if certain configurations are not admitted under the task specification, creating “holes” in the complex. These holes, in turn, present a challenge to constructing a simplicial map to this complex. In the context of distributed systems, connectivity maps to the notion of reachability of system states. A particularly notable property of a wait-free protocol is that the protocol complex for any input n-simplex is n-connected [1].

The Asynchronous Computability Theorem

The definitions provided so far should suffice to state the Asynchronous Computability Theorem:

A task ⟨I,O,Δ⟩ has a wait-free protocol if and only if there exists a chromatic subdivision σ of I and a color-preserving simplicial map

μ : σ(I) → O

such that for each vertex s ∈ σ(I), μ(s) ∈ Δ(carrier(s,I)).

The task tuple consists of an input complex, an output complex and a task specification determining valid outputs. The subdivision reflects the execution of the protocol. The existence of the map, and specifically of a simplicial map, is what governs the possibility of a wait-free solution. The beauty of this theorem lies in that it gives a purely static, topological approach to problems which are typically modeled operationally as computations unfolding in time.

The theorem can be visualized using Figure 7 below:

Figure 7: The Asynchronous Computability Theorem. [Credit 1]

On the left is an input complex, a 3-simplex (tetrahedron). We have a subdivision of the input complex and one simplex in the subdivision is shaded orange. We also have an output complex, a torus (this actually represents an output complex for a renaming task). The simplicial map μ takes a simplex in the subdivision of the input complex to a simplex in the output complex. Moreover, this simplex in the output complex is a member of the sub-complex which represents the acceptable outputs of the task under Δ given the input simplex containing the subdivision (colored white).

Impossibility of Consensus, Topologically

We can now demonstrate the impossibility of consensus topologically, using the Asynchronous Computability Theorem. We begin by illustrating the challenges presented by consensus in a geometric way. Suppose we have two processes P and Q which are given inputs 0 or 1. A state of a system where process P has input 0 and only process P executes, is represented as a vertex labeled P0. A configuration wherein both P and Q participate, starting with inputs both equal to 0 is represented as a 1-simplex ⟨P0,Q0⟩. The decision of the protocol in both of these configurations is simple — it always decides 0 since there are no alternatives to choose from. The situation gets more complicated when the processes have distinct inputs. According to the task, they must both decide one value or the other. The following diagram shows the input and output complexes for this example:

Figure 8: Complexes for 2-process consensus. [Credit 1]

The input complex for the 2-process binary consensus task consists of four vertexes, corresponding to the pairs of process and input value. The four edges correspond to the four possible 2-process configurations. The output complex captures the fact that the processes must decide on the same value. The output complex is disconnected and therefore incompatible with the input complex which is connected. Since subdivisions and simplicial maps preserve connectivity, invocation of the Asynchronous Computability Theorem leads us to a contradiction — if we suppose such a map μ exists, it cannot be simplicial.
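The mismatch in connectivity can be confirmed by counting connected components. The following Python sketch is an illustration, not part of the source: it assumes the complexes are given by their edges, with vertex labels such as "P0" for process P holding value 0, and uses a hypothetical components helper based on union-find.

```python
def components(edges):
    """Connected components of a graph given as an iterable of 2-element edges."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Input complex: every combination of inputs for P and Q is a possible start.
input_edges = [(f"P{a}", f"Q{b}") for a in (0, 1) for b in (0, 1)]

# Output complex: both processes must decide the same value.
output_edges = [("P0", "Q0"), ("P1", "Q1")]

input_components = components(input_edges)
output_components = components(output_edges)
```

The input complex forms a single component (a cycle through all four vertexes), whereas the output complex splits into two, one per decision value.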

Given this intuition, we can depict the impossibility of consensus in a more rigorous manner. In fact, we shall make a stronger statement by showing the impossibility of k-set agreement described above. Recall that k-set agreement is a task wherein the participating processes are allowed to select up to k distinct values. If k is less than the number of processes, then this task is impossible to solve wait-free. Consensus is 1-set agreement. To express this result using the Asynchronous Computability Theorem, we make use of a standard result from algebraic topology known as Sperner’s Lemma.

Sperner’s Lemma

Sperner’s Lemma is a combinatorial analog of the Brouwer Fixed Point Theorem:

Let σ(S) be a subdivision of an n-simplex S. If f : σ(S) → S is a map sending each vertex of σ(S) to a vertex in its carrier, then there is at least one n-simplex T in σ(S) such that f maps the vertexes of T to n + 1 distinct vertexes of S.

A good way to think of Sperner’s lemma is in terms of subdivisions as discussed earlier. Take the chromatic subdivision shown on the left of Figure 4. Sperner’s lemma states that there always exists a simplex in this subdivision whose vertexes are all distinct colors. This simplex with distinctly colored vertexes is the “fixed point” of the subdivision, and the lemma states that such a fixed point always exists. The following diagram is an example:

Figure 9: Sperner’s lemma. [Credit: Wikipedia]

The vertexes of the outer 2-simplex are colored red, green and blue. The 2-simplexes in a subdivision are colored with colors in this set, and the 2-simplexes shaded gray make use of all three colors. Sperner’s lemma guarantees that at least one such simplex exists in the subdivision.
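The lemma can be checked experimentally in two dimensions. The Python sketch below is an illustration under assumptions: it uses a regular triangular grid of side n as the subdivision (not the chromatic subdivision of Figure 4), identifies grid points by integer barycentric coordinates, and colors each point at random with a color of its carrier. The function name sperner_count is hypothetical.

```python
import random


def sperner_count(n, seed):
    """Count small triangles with all three colors in a randomly labeled grid."""
    rng = random.Random(seed)

    # A grid point (a, b) has barycentric coordinates (a, b, n - a - b).
    # Its carrier is spanned by the corners whose coordinate is non-zero,
    # so only those colors are allowed.
    label = {}
    for a in range(n + 1):
        for b in range(n + 1 - a):
            coords = (a, b, n - a - b)
            allowed = [i for i, x in enumerate(coords) if x > 0]
            label[(a, b)] = rng.choice(allowed)

    # The n * n small triangles of the grid.
    triangles = []
    for a in range(n):
        for b in range(n - a):
            triangles.append(((a, b), (a + 1, b), (a, b + 1)))
            if a + b <= n - 2:
                triangles.append(((a + 1, b), (a, b + 1), (a + 1, b + 1)))

    return sum(1 for t in triangles if {label[v] for v in t} == {0, 1, 2})
```

For any grid size and any seed the count is at least one, and in fact always odd, which is the stronger form of the lemma for this setting.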

Sperner’s lemma applies to impossibility of consensus in the following way. Consider a configuration with three processes each with a distinct input value. Suppose also that k is 2 — we can allow up to two distinct output values. We can represent this configuration as a 2-simplex I colored with three distinct colors and labeled with three distinct values. The output complex O consists of simplexes labeled with up to two distinct values. Assume by way of contradiction that a protocol does exist. This means there is a subdivision σ(I) and a simplicial map μ : σ(I) → O. This map satisfies the conditions of Sperner’s lemma, and therefore carries some simplex in the subdivision to an output simplex labeled with all three values. However, the output complex has no such simplex — we would thus map to a “hole” in the output complex. Note that when using Sperner’s lemma we considered the labeling of vertexes not only with the process it represents, but also with the local state of the process.
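The "hole" in the output complex can be seen by enumeration. The following Python sketch is an illustration, assuming three processes named P, Q and R, values 0, 1 and 2, and full participation; the function name k_set_outputs is hypothetical.

```python
from itertools import product


def k_set_outputs(processes, values, k):
    """Output simplexes in which at most k distinct values are decided."""
    return [
        frozenset(zip(processes, decided))
        for decided in product(values, repeat=len(processes))
        if len(set(decided)) <= k
    ]


outputs = k_set_outputs(("P", "Q", "R"), (0, 1, 2), k=2)

# Output simplexes labeled with all three values: the targets Sperner's lemma demands.
rainbow = [s for s in outputs if len({value for _, value in s}) == 3]
```

Of the 27 possible decision vectors, 21 are admissible under 2-set agreement, and none of them carries all three values, so the simplex that Sperner’s lemma forces to exist has nowhere to go.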

Proof Sketch

For the complete details of the proof, refer to [1] where Herlihy and Shavit prove both directions — that the conditions of the theorem are necessary and sufficient.

Necessity states that a decision task has a wait-free protocol only if there is a subdivision and simplicial map as required by the theorem. Informally, the proof of necessity is based on the notion of a protocol complex described earlier. Recall that a protocol complex represents the possible executions of a protocol. It is shown that a protocol complex P(I) corresponds to a subdivision of the input complex, and that the simplicial map is induced by a final decision transition in the protocol called the decision map. The simplicial map μ is then expressed as a composition of φ : σ(I) → P(I), the map defining the protocol complex, and δ : P(I) → O, the decision map. The proof proceeds by stating and proving lemmas about connectivity, which as we discussed earlier corresponds to reachability in a distributed system.

Sufficiency states that if we have such a subdivision and simplicial map, then there is a protocol that solves the task wait-free. This is done by providing a construction of an algorithm for any task satisfying the conditions of the theorem. The authors proceed by reducing the problem to approximate agreement, which is solved by the participating set protocol [10], and invoking the Simplicial Approximation Theorem, which essentially states that a solution can be attained after a sufficient number of execution rounds. Interestingly, this line of reasoning leads to results about the complexity of asynchronous computations [12].

Related Work

The CALM Principle states that if a program is expressed using monotonic logic — or loosely speaking logic where you can’t revoke past statements — then it can be made eventually consistent in a distributed system in a way that does not require coordination. So, a protocol refuting an earlier claim necessitates coordination amongst the involved processes. With the Asynchronous Computability Theorem in mind, one would expect that protocols expressed using monotonic logic are able to solve their tasks in a wait-free manner.
Coordination Avoidance in Database Systems presents necessary and sufficient conditions for safe, coordination-free execution. Similarly, Highly Available Transactions: Virtues and Limitations discusses consistency guarantees which can be provided in a highly-available manner. If we express an application consistency requirement as a task using the framework presented herein, it should be possible to assess wait-free solvability of this task using the Asynchronous Computability Theorem. If the task is wait-free solvable, it should be possible to implement it without coordination.
In our discussion of connectivity, the combinatorial view allowed us to infer the connectivity of a complex from the connectivity of its constituent complexes. This notion of crafting global observations from local ones is captured by the gluing axiom which forms the foundation of mathematical structures called sheaves. Using sheaves to provide a semantics for distributed systems is attempted in Sheaves, Objects, and Distributed Systems.
Geometric renderings of distributed systems have been around for quite a while. Indeed one can regard Minkowski space (the mathematical formulation of Special Relativity) as a model for a distributed system. See also: Modeling Concurrency with Geometry, Homotopy Theory and Concurrency, and Interaction Categories. All make use of tools for understanding space as a means for understanding concurrency. Perhaps if we look at Homotopy Type Theory as the logic of space, we can inch closer toward semantic foundations for concurrency and distributed systems.
Perhaps by computing the Fundamental Group of a topological space we can automate the reasoning required to determine whether a task is solvable, and allow tasks to adapt dynamically to their environment.

Acknowledgements

Most of the theoretical content and diagrams in this post are taken from The Topological Structure of Asynchronous Computability by M. Herlihy and N. Shavit.

References

[1] The Topological Structure of Asynchronous Computability

[2] Impossibility of Distributed Consensus with One Faulty Process

[3] The Alpha of Indulgent Consensus

[4] The weakest failure detector for solving consensus

[5] Linearizability: A Correctness Condition for Concurrent Objects

[6] Wait-free Synchronization

[7] Concurrent Programming: Algorithms, Principles, and Foundations

[8] Wait-free k-set agreement is impossible: the topology of public knowledge

[9] Sheaves, Objects, and Distributed Systems

[10] Immediate atomic snapshots and fast renaming

[11] Unifying synchronous and asynchronous message-passing models

[12] Towards a Topological Characterization of Asynchronous Complexity

[13] Algebraic Topology
