Sunday, July 14, 2013

Effect of GIL release on Numpy array operation

Release of GIL

At present, NumPy releases the Global Interpreter Lock (GIL) for all one- and two-operand loops. The macro NPY_BEGIN_THREADS saves the Python thread state and releases the GIL, so it can be placed right before code that does not need the Python interpreter. For example, in ufunc_object.c, trivial_three_operand_loop and trivial_two_operand_loop use it around the inner loop.

Not so good for small ones

But for short arrays it produces relative overhead instead; releasing the GIL for small operations brings no benefit at all.
Nathaniel mentioned a few points about this:
  • The vast majority of numpy code is single-threaded, so dropping the GIL is pure overhead.
  • Dropping the GIL for microseconds at a time probably produces no benefit even for multi-threaded code, since by the time the other thread gets started and starts producing useful work, the numpy loop is done.
  • Most numpy code calls + a lot more than it calls sin or even ** or /.


Profiling data

The time taken by the addition operation was measured for array lengths up to 100000. A datasheet embedded in this post shows the times with and without the GIL released. For arrays of fewer than about 1000 elements, releasing the GIL only creates overhead and delay.

Implementations

NPY_BEGIN_THREADS_THRESHOLDED(loop_size), a thresholded variant of the NPY_BEGIN_THREADS macro, has been created in ndarraytypes.h. The threshold has been taken conservatively as 500; the loss from guessing it too low is negligible.
#define NPY_BEGIN_THREADS_THRESHOLDED(loop_size) do { if (loop_size > 500) { _save = PyEval_SaveThread();} } while (0);

More

  • The datasheet is at G-drive.
  • Pull request for this is #3521.
  • Though the overall speedup is not very significant, there is a definite improvement in trivial_three_operand_loop. Please see the cumulative time improvement via the call-graph for a clearer picture: http://goo.gl/9zldJ

3 comments:

  1. Hi,
    I have to say, I'm surprised. You mean we should release the GIL for small values? If not, the spreadsheet is not clear enough.
    I have other concerns:
    - are the time differences statistically relevant? Except for the first and last, the difference is less than 1% (and usually, we say in that case that they are the same value)
    - why a do {} while(0)? Why not just {} ?

  2. - Sorry! The spreadsheet column name was written wrongly. I meant to say we should not release the GIL for small values, as the cost of saving the Python state, releasing the GIL, and then restoring it is relatively significant for smaller arrays.
    - Though the overall speedup is not very significant, there is a definite improvement in trivial_three_operand_loop. Please see the cumulative time improvement via the call-graph for a clearer picture: https://docs.google.com/file/d/0B3Pqyp8kuQw0ZU9pUmN0UHJFajg/edit?usp=sharing
    - As for why the macro is implemented that way, there is a good answer at http://stackoverflow.com/a/257425

  3. Matthieu Brucher, July 16, 2013 at 7:16 PM

    Thanks for the header fix.

    I guess that the number of calls is more relevant than the timings you gave, as timing difference is really, really small. Also, the number of calls inside your optimized function would be more relevant if the number of calls to the function could be identical to see the changes not only in this function but also if it has an impact on other ones (it sometimes does in numerical applications).

    I never bothered with the semicolon difference; I guess this is a reason why I hate macros so much ;)
