
I am trying to speed up some code with multiprocessing in Python, but I cannot understand one point. Assume I have the following dumb function:

import time
from multiprocessing.pool import Pool

def foo(_):
    for _ in range(100000000):
        a = 3 

When I run this code without multiprocessing (see the code below) on my laptop (Intel, 8-core CPU), the time taken is ~2.31 seconds.

t1 = time.time()
foo(1)
print(f"Without multiprocessing {time.time() - t1}")

Instead, when I run this code using Python's multiprocessing library (see the code below), the time taken is ~6.0 seconds.

pool = Pool(8)
t1 = time.time()
pool.map(foo, range(8))
print(f"Sample multiprocessing {time.time() - t1}")

To the best of my knowledge, when using multiprocessing there is some time overhead, mainly caused by the need to spawn the new processes and to copy the memory state. However, this should happen just once, when the processes are initially spawned at the very beginning, and should not be that large.

So what am I missing here? Is there something wrong with my reasoning?

Edit: I think it is better to be more explicit about my question. What I expected here was for the multiprocessed code to be slightly slower than the sequential one. It is true that I don't split the whole work across the 8 cores, but I am using 8 cores in parallel to do the same job (hence, in an ideal world, the processing time should stay more or less the same). Considering the overhead of spawning new processes, I expected a total increase in time of some (not too big) percentage, but not the ~2.60x increase I got here.

  • You're not doing anything, and you're spawning 8 extra processes to do nothing. I'm really not sure what it is you want to accomplish with this... Commented May 19, 2020 at 21:19
  • Also, don't try to use time to work out how long code is taking; use something like timeit or another performance testing module. Commented May 19, 2020 at 21:24
  •
    @match what is wrong with using the time module to time things? Commented May 20, 2020 at 0:03
  • @Makoto I tried to be more explicit on my question, hope it can help. Commented May 20, 2020 at 10:28
  •
    @match Yes, I guess you're right. If I am not wrong, timeit should do multiple comparisons by executing the same code multiple times, so measurements should be more reliable, right? Commented May 20, 2020 at 10:31
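As the comments suggest, timeit runs the statement several times and reports each pass, which smooths out one-off noise from other processes on the machine. A minimal sketch (the `busy_loop` function and the iteration count are illustrative, scaled down from the question's function to keep the demo quick):

```python
import timeit

def busy_loop():
    # A scaled-down version of the question's foo(): pure CPU-bound busywork.
    for _ in range(1_000_000):
        a = 3

# repeat=3 gives three independent timings of one call each; the minimum
# is conventionally the least-disturbed measurement.
times = timeit.repeat(busy_loop, number=1, repeat=3)
print(min(times))
```

Taking the minimum of several repeats, rather than a single `time.time()` delta, reduces the influence of background load on the result.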

2 Answers


Well, multiprocessing can't possibly make this faster: you're not dividing the work across 8 processes, you're asking each of 8 processes to do the entire thing. Each process will take at least as long as your code doing it just once without using multiprocessing.

So if multiprocessing weren't helping at all, you'd expect it to take about 8 times as long (it's doing 8x the work!) as your single-processor run. But you said it's not taking 2.31 * 8 ~= 18.5 seconds, but "only" about 6. So you're getting better than a factor of 3 speedup.

Why not more than that? Can't guess from here. That will depend on how many physical cores your machine has, and how much other stuff you're running at the same time. Each process will be 100% CPU-bound for this specific function, so the number of "logical" cores is pretty much irrelevant - there's scant opportunity for processor hyper-threading to help. So I'm guessing you have 4 physical cores.
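One way to check the guess about physical cores: the stdlib's `os.cpu_count()` reports logical cores (hyper-threads included), while the physical count needs a third-party package such as psutil, shown here as an optional check on the assumption it may be installed:

```python
import os

# os.cpu_count() counts *logical* cores, so a 4-physical-core machine
# with hyper-threading reports 8.
print("logical cores:", os.cpu_count())

# psutil (third-party, not in the stdlib) can distinguish physical cores.
try:
    import psutil
    print("physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    pass
```

For a purely CPU-bound loop like `foo`, the physical count is the one that bounds the achievable speedup.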

On my box

Sample timing on my box, which has 8 logical cores but only 4 physical cores, and otherwise left the box pretty quiet:

Without multiprocessing 2.468580484390259
Sample multiprocessing 4.78624415397644

As above, none of that surprises me. In fact, I was a little surprised (but pleasantly) at how effectively the program used up the machine's true capacity.


2 Comments

Thank you. Good catch on measuring by physical cores rather than logical ones (and indeed I have 4 physical cores!). I get similar results to yours with 4 processes, but we still have an increase of ~95%. Is this entirely due to the time taken to spawn new processes and other stuff the laptop is doing? In my mind I expected something like a 2% increase from spawning new processes (no big things to copy into the new processes here) and +20% because each processor might be assigned to the task at 80% if not doing other stuff.
Modern processors, and OSes, are extremely complex, and there's no real reason to believe anything you see here is specific to Python. Try, e.g., firing off 8 copies of the same CPU-bound program as fast as you can (via, say, a shell script that just repeats the same command line 8 times), written in any language - even assembler. Timings will likely be all over the map across runs. On my "quiet" box just now, there are over 200 processes and 2000 threads, mostly idle. There's no guessing when they'll wake up, what the cache effects will be, when processors will decide to switch clock speeds, etc.

@TimPeters already answered that you are actually just running the job 8 times across the 8 Pool subprocesses, so it is slower, not faster.

That answers the issue but does not really address your underlying question. It is clear from your surprise at this result that you were expecting the single job to somehow be automatically split up and run in parts across the 8 Pool processes. That is not the way it works. You have to tell it how to split up the work.

Different kinds of jobs need to be subdivided in different ways, but to continue with your example you might do something like this:

import time
from multiprocessing.pool import Pool

def foo(_):
    for _ in range(100000000):
        a = 3 

def foo2(job_desc):
    start, stop = job_desc
    print(f"{start}, {stop}")

    for _ in range(start, stop):    
        a = 3 

def main():
    t1 = time.time()
    foo(1)
    print(f"Without multiprocessing {time.time() - t1}")

    pool_size = 8
    pool = Pool(pool_size)

    t1 = time.time()

    top_num = 100000000
    size = top_num // pool_size
    job_desc_list = [[size * j, size * (j+1)] for j in range(pool_size)]
    # this is in case the upper bound is not a multiple of pool_size
    job_desc_list[-1][-1] = top_num

    pool.map(foo2, job_desc_list)
    print(f"Sample multiprocessing {time.time() - t1}")


if __name__ == "__main__":
    main()

Which results in:

Without multiprocessing 3.080709171295166
0, 12500000
12500000, 25000000
25000000, 37500000
37500000, 50000000
50000000, 62500000
62500000, 75000000
75000000, 87500000
87500000, 100000000
Sample multiprocessing 1.5312283039093018

As this shows, splitting the job up does allow it to take less time. The speedup will depend on the number of CPUs. For a CPU-bound job you should try to limit the pool size to the number of CPUs. My laptop has plenty more CPUs, but some of the benefit is lost to overhead. If the jobs were longer, the speedup should look more pronounced.

1 Comment

Thank you. Indeed I was referring to doing the whole work across all the processors, but my questions was not very explicit (sorry about that). I edited it now!
