Showing posts with label concurrency. Show all posts
Showing posts with label concurrency. Show all posts

Wednesday, July 15, 2009

Improving performance in Jython

About two weeks ago I published a writeup about my findings about the performance of synchronization primitives in Jython from my presentation at JavaOne. During the presentation I said that these performance issues were something that I was going to work on, and improve. And indeed I did. I cannot take full credit for this, Jim Baker played a substantial part in this work as well. The end result is still something I'm very proud of since we managed to improve the performance of this benchmark as much as 50 times.

The benchmarks

The comparisons were performed based on the execution of this benchmark script invoked with:

  • JAVA_HOME=$JAVA_6_HOME jython synchbench.py
  • JAVA_HOME=$JAVA_6_HOME jython -J-server synchbench.py
# -*- coding: utf-8 -*-
from __future__ import with_statement, division

from java.lang.System import nanoTime
from java.util.concurrent import Executors, Callable
from java.util.concurrent.atomic import AtomicInteger

from functools import wraps
from threading import Lock

def adder(a, b):
    return a+b


count = 0
def counting_adder(a, b):
    global count
    count += 1 # NOT SYNCHRONIZED!
    return a+b


lock = Lock()
sync_count = 0
def synchronized_counting_adder(a, b):
    global sync_count
    with lock:
        sync_count += 1
    return a+b


atomic_count = AtomicInteger()
def atomic_counting_adder(a,b):
    atomic_count.incrementAndGet()
    return a+b


class Task(Callable):
    def __init__(self, func):
        self.call = func

def callit(function):
    @Task
    @wraps(function)
    def callable():
        timings = []
        for x in xrange(5):
            start = nanoTime()
            for x in xrange(10000):
                function(5,10)
            timings.append((nanoTime() - start)/1000000.0)
        return min(timings)
    return callable

def timeit(function):
    futures = []
    for i in xrange(40):
        futures.append(pool.submit(function))
    sum = 0
    for future in futures:
        sum += future.get()
    print sum

all = (adder,counting_adder,synchronized_counting_adder,atomic_counting_adder)
all = [callit(f) for f in all]

WARMUP = 20000
print "<WARMUP>"
for function in all:
    function.call()
for function in all:
    for x in xrange(WARMUP):
        function.call()
print "</WARMUP>"

pool = Executors.newFixedThreadPool(3)

for function in all:
    print
    print function.call.__name__
    timeit(function)
pool.shutdown()

glob = list(globals())
for name in glob:
    if name.endswith('count'):
        print name, globals()[name]

And the JRuby equivalent for comparison:

require 'java'
import java.lang.System
import java.util.concurrent.Executors
require 'thread'

def adder(a,b)
  a+b
end

class Counting
  def initialize
    @count = 0
  end
  def count
    @count
  end
  def adder(a,b)
    @count = @count + 1
    a+b
  end
end

class Synchronized
  def initialize
    @mutex = Mutex.new
    @count = 0
  end
  def count
    @count
  end
  def adder(a,b)
    @mutex.synchronize {
      @count = @count + 1
    }
    a + b
  end
end

counting = Counting.new
synchronized = Synchronized.new

puts "<WARMUP>"
10.times do
  10000.times do
    adder 5, 10
    counting.adder 5, 10
    synchronized.adder 5, 10
  end
end
puts "</WARMUP>"

class Body
  def initialize
    @pool = Executors.newFixedThreadPool(3)
  end
  def timeit(name)
    puts
    puts name
    result = []
    40.times do
      result << @pool.submit do
        times = []
        5.times do
          t = System.nanoTime
          10000.times do
            yield
          end
          times << (System.nanoTime - t) / 1000000.0
        end
        times.min
      end
    end
    result.each {|future| puts future.get()}
  end
  def done
    @pool.shutdown
  end
end

body = Body.new

body.timeit("adder") {adder 5, 10}
body.timeit("counting adder") {counting.adder 5, 10}
body.timeit("synchronized adder") {synchronized.adder 5, 10}

body.done

Where we started

A week ago the performance of this Jython benchmark was bad. Compared to the equivalent code in JRuby, Jython required over 10 times as much time to complete.

When I analyzed the code that Jython and JRuby generated and executed, I came to the conclusion that the reason Jython performed so badly was that the call path from the running code to the actual lock/unlock instructions introduced too much overhead for the JVM to have any chance at analyzing and optimizing the lock. I published this analysis in my writeup on the problem. It would of course be possible to lower this overhead by importing and utilizing the pure Java classes for synchronization instead of using the Jython threading module, but we like how the with-statement reads for synchronization:

with lock:
    counter += 1

Getting better

Based on my analysis of the how the with-statement compiles and the way that this introduces overhead I worked out the following redesign of the with-statement context manager interaction that would allow us to get closer to the metal, while remaining compatible with PEP 434:

  • When entering the with-block we transform the object that constitutes the context manager to a ContextManager-object.
  • If the object that constitutes the context manager implements the ContextManager interface it is simply returned. This is where context managers written in Java get their huge benefit by getting really close to the metal.
  • Otherwise a default implementation of the ContextManager is returned. This object is created by retrieving the __exit__ method and invoking the __enter__ method of the context manager object.
  • The compiled code of the with-statement then only invokes the __enter__ and __exit__ methods of the returned ContextManager object.
  • This has the added benefit that even for context managers written in pure Python the ContextManager could be optimized and cached when we implement call site caching.

This specification was easily implemented by Jim and then he could rewrite the threading module in Java to let the lock implementation take direct benefit of the rewritten with-statement and thereby get the actual code really close to the locking and unlocking. The result were instantaneous and beyond expectation:

Image

Not only did we improve performance, but we passed the performance of the JRuby equivalent! Even using the client compiler, with no warm up we perform almost two times better than JRuby. Turn on the server compiler and let the JIT warm up and perform all it's compilation and we end up with a speedup of slightly more than 50 times.

A disclaimer is appropriate here. With the first benchmark (before this was optimized) I didn't have time to wait for a full warmup. This because of the fact that the benchmark was so incredibly slow at that point and the fact that I was doing the benchmarks quite late before the presentation and didn't have time to leave it running over the night. Instead I turned down the compilation threshold of the Hotspot server compiler and ran just a few warmup iterations. It is possible that the JVM could have optimized the previous code slightly better given (a lot) more time. The actual speedup might be closer to the speedup from the first code to the new code using the client compiler and no warmup. But this is still a speedup of almost 20 times, which is still something I'm very proud of. There is also the possibility that I didn't run/implement the JRuby version in the best possible way, meaning that there might be ways of making the JRuby version run faster that I don't know about. The new figures are still very nice, much nicer than the old ones for sure.

The current state of performance of Jython synchronization primitives

It is also interesting to compare how the current implementation compares to the other versions in Jython that I included in my presentation:

Image

Without synchronization the code runs about three times as fast as with synchronization, but the counter does not return the correct result here due to race conditions. It's interesting from the point of view of analyzing the overhead added by synchronization but not for an actual implementation. Two times overhead is quite good in my opinion. What is more interesting to see is that the fastest version from the presentation, the one using AtomicInteger, is now suffering from the overhead of reflection required for the method invocations compared to the synchronized version. In a system with more hardware threads (commonly referred to as "cores") the implementation based on AtomicInteger could still be faster though.

Where do we proceed from here?

Now that we have proven that it was possible to get a nice speedup from this redesign of the code paths the next step is to provide the same kind of optimizations for code written in pure Python. Providing a better version of contextlib.contextmanager that exploits these faster code paths should be the easiest way to improve context managers written in Python. Then there are of course a wide range of other areas in Python where performance could be improved through the same kind of thorough analysis. I don't know at this point what we will focus on next, but you can look forward to many more performance improvements in Jython in the time to come.

Sunday, June 28, 2009

Performance of Synchronization primitives in Jython

As I pointed out in my previous blog post about JavaOne there are a few points where the performance of Jython is much worse than that of the “competitors”. In this post I provide my analysis of the the performance of Jython in the micro benchmark I used in my presentation. I will compare Jython to JRuby here for three reasons. The first reason is the obvious, Python and Ruby are quite similar languages. This takes us to the second reason, the performance of JRuby and Clojure was not that different in comparison, and it's therefore better to choose the most similar language for comparison. Finally, the third reason, JRuby is a darn good, highly optimized dynamic language implementation. That makes it a good baseline for comparison.

The first diagram shows Jython and JRuby with no contention at all. We can clearly see, as early as in this picture that Jython needs a bit more work in the performance department. I still feel excited about this though. The recent release of Jython, version 2.5, focused heavily on Python compatibility. The next versions will be where we start focusing on performance. The work towards Jython 3.0 is even more exciting here. But that excitement will be saved for a later post.

Image

The second slide shows another version of pretty much the same code, but with a shared, contended resource. Here we find that Jython performs much worse than JRuby. While the JRuby version is about 10 times slower with a contended shared resource, Jython is almost 100 times slower. Why is this?

Image

The effect of the implementation

I have already mentioned the first reason for JRuby performing better than Jython. It applies here as well. The JRuby team has spent more time on performance tuning than the Jython team, this is something we will be able to improve. In fact this plays a bigger part than one might think.

The JRuby Mutex implementation is written in about a page of plain old Java (not that important) that gets compiled to a small number of simple bytecodes (more important). Because a closure or block in JRuby is just an object with a method containing the compiled bytecode (compared to a function that contains argument parsing logic before that), the dispatch from the synchronization code to the guarded code is just a simple virtual call. For the call into the synchronization code there are two VM level calls, since this is a JRuby method. First there is the call to the dispatch logic and argument parsing, then the call to the actual code for the synchronization code. Much of the work of the first call is cached by JRubys call site cache from the first invocation so that subsequent invocations are much faster.

The Jython implementation on the other hand has no call site caching. So each call does full dispatch and argument parsing. The call structure is also different. A with statement in Python is just a try/except/finally block with simplified syntax, where the context manager (a lock in this case) is called for setup before the block inside the with statement and then invoked again after the with statement block completes.

In shorter terms: JRuby has much lower call overhead on the VM level. This makes a difference because a call is a call, and even when in-lined it imposes some overhead. It is also important because the JVM is better at making optimizations across few levels of calls than across several. Still, this only explains a small part of the difference seen in the chart.

The incredible benefit of closures

Most JVMs has built in support for analyzing and optimizing locks. In order to be able to do that it needs to have the the entire synchronized region in one compilation unit, i.e. both the lock instruction and the unlock instruction needs to be in the same compilation unit. Due to the fact that code analysis is super-linear there is a limit to how big a compilation unit is allowed to be. Initially a compilation unit corresponds to a Java bytecode method, but may grow as an effect of code in-lining (up to the compilation unit size limit). The key to get good performance from synchronized regions is therefore to either have both the lock and unlock instructions in the same method, or at least in a shallow call hierarchy from the same method.

Compare the two following snippets of pseudo bytecode (indented regions are the body of the invoked methods).

JRuby:

lock()
invoke closure_body
    // do stuff...
unlock()

Jython:

invoke __enter__
    Do Jython dispatch
        Do Jython to Java dispatch
            Do Java reflection dispatch
                lock()
// do stuff...
invoke __exit__
    Do Jython dispatch
        Do Jython to Java dispatch
            Do Java reflection dispatch
                unlock()

The key insight here is that in the first code snippet the lock and unlock instructions are in the same compilation unit. In the second example they are in two different call paths. The Jython dispatch logic is three levels of calls, and the Jython to Java dispatch logic is two levels, then there is the reflection dispatch that is a number of calls as well. Not only that, but there is quite a lot of code in those call paths as well: parsing of arguments, setting up call frame reflection objects, and more. Add all this together and there is no chance for the JVM to see both the lock and unlock instructions in the same compilation unit. Compared to the situation in the JRuby implementation where they are in the same compilation unit before any in-lining.

Having un-optimized locks make a huge difference for applications running on the JVM. This together with the fact that JRuby is more optimized in general, accounts for most of the difference in execution time for these examples. If we could fix this, we would get a substantial improvement of performance in Jython.

For further details on how we intend to improve this situation in Jython, you are very welcome to attend my presentation on "A better Python for the JVM" at EuroPython, this Tuesday (June 30, 2009) at 15:30. I will also continue posting here on specific improvements and plans, so stay tuned on Twitter or subscribe to my feed.

Disclaimers: The understanding of a JVM in this post is mostly based on the Hotspot JVM. Other JVMs might work slightly different, but the basic understanding should be at least similar.
The descriptions of both Jython and JRuby are somewhat simplified, the synchronization in JRuby is for example even slightly better optimized than what I have outlined here, but the full description would make the post overly complicated. The essentials are still the same.
In my presentation at JavaOne some numbers suffered from classic micro benchmark errors, ask me if you want to know more about that.

Wednesday, December 17, 2008

On Double Checked Locking in Java

You have probably all heard/seen the “Double-Checked Locking is Broken” declaration. And if you haven't I'll just tell you that before the improved memory model of Java 5, double checked locking didn't work as expected. Even in Java 5 and later there are things you need to consider to get it to work properly, and even more things to consider if you want it to perform well.

Joshua Bloch suggests in his book Effective Java, and keeps repeating in several presentations that there is one way to implement double checked locking and that this code snippet should be copied to every place where it's needed. I disagree. Not with the part about there only being one way of doing it, but with the copying part. In my eyes this is perfect use case for an abstract class:

package java.util.concurrent;

public abstract class LazyLoaded<T> {
private volatile T value;
// Get the value, guard the loading by double checked locking
public final T getValue() {
T result = value;
if (result == null) { // First check
synchronized (this) { // Lock
result = value;
if (result == null) { // Second check
result = loadValue();
value = result;
}
}
}
return result;
}
protected abstract T loadValue();
}

This class would then be used like this:

package org.thobe.example;

import java.util.concurrent.LazyLoaded;

class UsesDoubleCheckedLocking {
// The generics even makes it read well:
private final LazyLoaded<Something>() {
@Override protected Something loadValue() {
return new Something();
}
};
// ... your code will probably do something useful here ...
}

Just expressing my opinion. I hope someone important reads this and makes sure that it gets included in Java7. And while you're at it, make sure we get some standardized ways of composing Iterable/Iterators as well...