Multi-threading on Multi-CPU machines
Garry Taylor #1 / 12
 Multi-threading on Multi-CPU machines
Hello,

I am attempting to make a multi-threading function in one of my programs in an effort to gain a speed increase, but I'm getting quite the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten me as to why? My code is below:

--------
import thread
import time

ThreadCounter = 0
Iterations = 100

def Threader():
    global ThreadCounter
    global Iterations
    Counter = 0
    temptime = time.time()
    while Counter < Iterations:
        Counter = Counter + 1
        thread.start_new_thread(TakesTime, ())
    while ThreadCounter < Iterations:
        pass
    print "Threaded: " + str(time.time() - temptime)

def TakesTime():
    global ThreadCounter
    Text = "Test"
    Counter = 0
    while Counter < 20:
        Text = Text + Text
        Counter = Counter + 1
    ThreadCounter = ThreadCounter + 1

def NoThreader():
    global Iterations
    temptime = time.time()
    Counter = 0
    while Counter < Iterations:
        Counter = Counter + 1
        TakesTime()
    print "Non-Threaded: " + str(time.time() - temptime)

Threader()
NoThreader()
--------

This does the same thing, threaded and then not, but on all of my machines the multi-threaded version is slower. What can I do about this?

Thanks,
garry
Fri, 24 Dec 2004 23:02:54 GMT
Steve #2 / 12
 Multi-threading on Multi-CPU machines
Quote:
> Hello,
> I am attempting to make a multi-threading function in one of my
> programs in an effort to gain a speed increase, but I'm getting quite
> the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> me as to why, my code is below:
.....
> This does the same thing, threaded and then not, but on all of my
> machines, the multi-threaded is slower, what can I do about this?
Does the Python interpreter run on both CPUs? That is, do you have the Python threads executing concurrently on both CPUs? I'd imagine that they would be, but just wondering...

I tested on my machine, which is a single AthlonXP 1900, and get times of about 3.6 and 3.4 for the Threader and NoThreader versions respectively, and I'd attribute a fair bit of the difference to the creation time for a thread, though I don't have any real timing to back that up. I did change it slightly so that the function passed did no work, so it was basically just the overhead of setting up and finishing a thread, and the threaded version took much longer than the non-threaded (as you'd expect on a single-CPU machine).

As it is, although you can have a CPU running each task, you still do very frequent checking of variables that are global to the whole thing. That constant polling of ThreadCounter is going to introduce an overhead that isn't present in the non-threaded version, as is the increment operation in each thread; the non-threaded version doesn't have the overhead of changing a shared variable from whatever CPU each thread is sitting on.

If there were a method running half the threads on one CPU and half on the other, not communicating with that 'parent' until the very end, when all of each thread-spawner's threads had completed, then you might see a performance increase. I'm sure someone with a better knowledge of threading in Python can give you a better answer...

Steven
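[A minimal sketch of removing the busy-wait Steven mentions, not from the original posts: the higher-level threading module lets you join() each thread instead of spinning on a shared counter. As the replies below explain, this cuts polling overhead but still won't run pure Python work on both CPUs.]

--------
import threading
import time

Iterations = 100

def TakesTime():
    # Same CPU-bound work as in the original post.
    Text = "Test"
    Counter = 0
    while Counter < 20:
        Text = Text + Text
        Counter = Counter + 1

def JoinThreader():
    temptime = time.time()
    threads = []
    for i in range(Iterations):
        t = threading.Thread(target=TakesTime)
        t.start()
        threads.append(t)
    # Block until every worker finishes, instead of
    # spinning on a shared ThreadCounter variable.
    for t in threads:
        t.join()
    print "Joined: " + str(time.time() - temptime)

JoinThreader()
--------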
Sat, 25 Dec 2004 00:27:17 GMT
Joseph A Knapka #3 / 12
 Multi-threading on Multi-CPU machines
Quote:
> Hello,
> I am attempting to make a multi-threading function in one of my
> programs in an effort to gain a speed increase, but I'm getting quite
> the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> me as to why,
Yes. CPython threads cannot utilize multiple CPUs, due to the Global Interpreter Lock, which may only be acquired by one thread at a time. Apparently Jython threads do not have this limitation, as the GIL doesn't exist in Jython, or so I'm told. So if you simply ran your program under Jython you might see an improvement.

Cheers,

-- Joe
Sat, 25 Dec 2004 04:57:52 GMT
Aahz #4 / 12
 Multi-threading on Multi-CPU machines
Quote:
>I am attempting to make a multi-threading function in one of my
>programs in an effort to gain a speed increase, but I'm getting quite
>the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
>me as to why, my code is below:
Pure Python code will always slow down when threaded; in order to gain a speedup, you must call an extension that releases the GIL. All I/O functions in Python release the GIL, for example. For more info, see the slides on my home page.

--
Project Vote Smart: http://www.vote-smart.org/
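[A minimal sketch of the I/O point, not from the original posts, using only the standard threading and time modules: time.sleep() blocks without holding the GIL, so it stands in here for blocking I/O.]

--------
import threading
import time

def FakeIO():
    # time.sleep() releases the GIL, so these waits overlap.
    time.sleep(1)

temptime = time.time()
threads = [threading.Thread(target=FakeIO) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Roughly 1 second, not 10: the threads waited concurrently.
print "Threaded I/O wait: " + str(time.time() - temptime)
--------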
Sat, 25 Dec 2004 12:14:01 GMT
Garry Taylor #5 / 12
 Multi-threading on Multi-CPU machines
Quote:
> > Hello,
> > I am attempting to make a multi-threading function in one of my
> > programs in an effort to gain a speed increase, but I'm getting quite
> > the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> > me as to why,
> Yes. CPython threads cannot utilize multiple CPUs, due to the
> Global Interpreter Lock, which may only be acquired by one
> thread at a time. Apparently Jython threads do not have
> this limitation, as the GIL doesn't exist in Jython, or so
> I'm told. So if you simply ran your program under Jython
> you might see an improvement.
> Cheers,
> -- Joe
Thank you both for your answers. Unfortunately, running under Jython is not an option, as the whole program I am writing runs to about 5,000 lines and uses lots of Python modules which I don't really fancy trying to get to work under Jython.

So, am I correct in thinking that there is nothing I can do about this and still use standard Python? I understand that Solaris has a very good threading library, but from the comments above, I assume that this would make no difference? Do you have any tips/ideas on how I could make use of multiple processors in a Python program?

Thanks again,
Garry
Sat, 25 Dec 2004 17:02:38 GMT
Alex Martelli #6 / 12
 Multi-threading on Multi-CPU machines
...

Quote:
> that this would make no difference? Do you have any tips/ideas how I
> could make use of multiple processors in a Python program?
Use multiple *processes* rather than multiple *threads* within just one process. Multiple processes running on the same machine can share data very effectively via module mmap (you do need separate process-synchronization mechanisms if the shared data structures need to be written, of course), and you can use other fast same-machine mechanisms such as pipes, in addition of course to general distributed programming approaches that offer further scalability, since they also run on multiple machines on the same network as well as within a single machine (pyro, corba, etc etc).

Optimal-performance architectures will be different for multiple processes than for a single multi-thread process (and different for really-distributed versus single-machine), but the key issue tends always to be: who shares/sends what data with/to whom. If your problem is highly parallelizable anyway, the architectural distinction between multithread, multiprocess and distributed can sometimes boil down to using larger "slices" to farm out to workers, to reduce the per-slice communication overhead.

Say for example that your task is to perform some pointwise computation cpuintensivefunction(x) on each point x of some huge array (assume without loss of generality that the array is one-dimensional -- the pointwise assumption allows that).

With a multithreaded approach you might keep the array in memory and have the main thread farm out work requests to worker threads via a bounded queue. You want the queue a bit larger than the number of worker threads, and you can determine the optimal size for a work request (could be one item, or maybe two, or, say, 4) via some benchmarking. Upon receiving a work request from the Queue, a worker thread would:

-- get a local copy of the relevant points from the large array,
-- enter the C-coded computation function, which
   -- releases the GIL,
   -- does the computations producing the new points,
   -- acquires the GIL again,
-- put back the resulting new points to the same area of the large array where the input came from,

then go back to peel one more work request from the Queue. If you can't release the GIL during the computation, e.g. because your computation is in Python or anyway requires you to interact with the interpreter, then multithreading will give no speedup and should not be used for that purpose.

A similar architecture might work for a single-machine multi-process design IF multiple processes can use mmap to read and write different regions of a shared-memory array at the same time, without locking (I don't think mmap ensures that on all platforms, alas). "Get the next work request" would become a bit less simple than just peeling an item off a queue, which makes it likely that a rather larger size of work request might be optimal -- depending on what guarantees you can count on for simultaneous reads and writes from/to pipes or message queues, those might provide the Queue equivalents.

Alternatively, wrap the data with a dedicated process which knows how to respond to requests for "next still-unassigned slice of work please" and (no return-acknowledgment needed) "here's the new computed data for the slice at coordinate X". pyro might be a good mechanism for such a task, and it would scale from one multi-CPU machine running multiple processes to a network (you might want to build in sanity checking, most particularly for the network case -- if a node goes down, then after a while without a response from it, the slices that had been assigned to it should be farmed out to others...).

Of course, most parallel-computing cases are far more intricate than simple albeit CPU-intensive computations on a pointwise basis, but I hope this very elementary overview can still help!-)

Alex
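[A minimal Python sketch of the bounded-Queue architecture described above, not from the original posts. The compute() function here is a hypothetical stand-in for the C-coded, GIL-releasing extension, so any real dual-CPU speedup depends entirely on that C function.]

--------
import threading
import Queue   # the standard bounded-queue module (Python 2 name)

NWORKERS = 4
data = range(10000)                             # stand-in for the "huge array"
work_queue = Queue.Queue(maxsize=NWORKERS + 2)  # a bit larger than NWORKERS
results = [None] * len(data)

def compute(points):
    # Placeholder: in the real design this would be a C extension
    # that releases the GIL around the heavy computation.
    return [x * x for x in points]

def worker():
    while True:
        request = work_queue.get()
        if request is None:          # sentinel: no more work
            break
        start, points = request
        new_points = compute(points)
        # Each request owns a distinct slice, so no extra lock is needed.
        results[start:start + len(new_points)] = new_points

threads = [threading.Thread(target=worker) for i in range(NWORKERS)]
for t in threads:
    t.start()

SLICE = 100   # work-request size, to be tuned by benchmarking
for start in range(0, len(data), SLICE):
    work_queue.put((start, data[start:start + SLICE]))
for i in range(NWORKERS):
    work_queue.put(None)             # one sentinel per worker
for t in threads:
    t.join()
--------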
Sat, 25 Dec 2004 17:29:51 GMT
Duncan Booth #7 / 12
 Multi-threading on Multi-CPU machines
Quote:
> So, am I correct in thinking that there is nothing I can do about
> this, and still use standard Python? I understand that Solaris has a
> very good threading library, but from the comments above, I assume
> that this would make no difference? Do you have any tips/ideas how I
> could make use of multiple processors in a Python program?
Can you split your program into several communicating processes? Each process has its own GIL, so if you can run multiple processes they can make better use of the CPUs.

The only other option really is to see if you can isolate the CPU-intensive sections and rewrite them in C; then you might be able to release the GIL enough to get a useful speedup. Then again, it may be possible to get enough speed improvement by modifying the existing code. I find it can be quite hard working out exactly where Python is spending all its time. Do you know where in your current code most of the CPU is actually used?

--
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3" "\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
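[On the question of where the CPU time goes: the standard profile module gives per-function timings. A minimal sketch, not from the original posts, re-using the CPU-bound function from the original program:]

--------
import profile

def TakesTime():
    # The CPU-bound function from the original post.
    Text = "Test"
    Counter = 0
    while Counter < 20:
        Text = Text + Text
        Counter = Counter + 1

def NoThreader():
    for i in range(100):
        TakesTime()

# Prints per-function call counts and cumulative times,
# showing where the CPU is actually spent.
profile.run('NoThreader()')
--------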
Sat, 25 Dec 2004 20:50:29 GMT
anton wilson #8 / 12
 Multi-threading on Multi-CPU machines
Quote:
> With a multithreaded approach you might keep the array in memory
> and have the main thread farm out work requests to worker threads
> via a bounded queue. You want the queue a bit larger than the
> number of worker threads, and you can determine the optimal size
> for a work request (could be one item, or maybe two, or, say, 4)
> via some benchmarking. Upon receiving a work request from the
> Queue, a worker thread would:
> -- get a local copy of the relevant points from the
>    large array,
> -- enter the C-coded computation function which
>    -- releases the GIL,
>    -- does the computations producing the new points,
>    -- acquires the GIL again,
If the bounded queue were declared in a C extension module, would a thread doing the calculations really have to reacquire the GIL every time that thread accessed this C data structure? Could mutexes be used instead?

Quote:
> -- put back the resulting new points to the same area
>    of the large array where the input came from,
> then go back to peel one more work request from the Queue.
Sat, 25 Dec 2004 21:52:59 GMT
Christopher Saunter #9 / 12
 Multi-threading on Multi-CPU machines
: Hello,
: I am attempting to make a multi-threading function in one of my
: programs in an effort to gain a speed increase, but I'm getting quite
: the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
: me as to why, my code is below:

<snip code>

Hi Garry,

As other people have said, 'native' Python code does not benefit from multiple CPUs in one PC due to the GIL. Depending on what you are doing with your threads, you may be able to utilise more than one processor by splitting the threads into multiple programs, running them simultaneously and communicating between them somehow (MPI etc...) - from what I have seen it requires a little more effort, but can work well. This is mainly useful for 'number crunching' threads etc.

---
cds
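[A minimal sketch of the multiple-programs idea, not from the original posts: it uses os.fork and a pipe rather than MPI, so it is Unix-only and simplified to two processes. The worker function is taken from the original post.]

--------
import os

def TakesTime():
    # CPU-bound work from the original post.
    Text = "Test"
    Counter = 0
    while Counter < 20:
        Text = Text + Text
        Counter = Counter + 1
    return len(Text)

# Each process has its own interpreter and its own GIL,
# so the two halves of the work can run on both CPUs.
read_end, write_end = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: do half the iterations, report back, and exit.
    for i in range(50):
        result = TakesTime()
    os.write(write_end, "done:" + str(result))
    os._exit(0)
else:
    # Parent: do the other half, then wait for the child.
    for i in range(50):
        result = TakesTime()
    os.waitpid(pid, 0)
    print "child said: " + os.read(read_end, 64)
--------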
Sat, 25 Dec 2004 17:31:11 GMT
Alex Martelli #10 / 12
 Multi-threading on Multi-CPU machines
Quote:
>> With a multithreaded approach you might keep the array in memory
>> and have the main thread farm out work requests to worker threads
>> via a bounded queue. You want the queue a bit larger than the
>> number of worker threads, and you can determine the optimal size
>> for a work request (could be one item, or maybe two, or, say, 4)
>> via some benchmarking. Upon receiving a work request from the
>> Queue, a worker thread would:
>> -- get a local copy of the relevant points from the
>>    large array,
>> -- enter the C-coded computation function which
>>    -- releases the GIL,
>>    -- does the computations producing the new points,
>>    -- acquires the GIL again,
> If the bounded queue were declared in a C extension module, would a thread
> doing the calculations really have to reacquire the GIL every time that
> thread accessed this C data structure? Could mutexes be used instead?
C code talking to other C code, with Python *nowhere* in the picture, does not need the GIL but may make its own arrangements. However, it's hard to see how the Python data placed in the queue would get turned into C-usable data WITHOUT using some of the Python API -- and whenever ANY use of the Python API is made, the thread making such use must hold the GIL. (Of course Python can't _guarantee_ that EVERY such GIL-less use will crash the program, burn the CPU AND raze the machine room to the ground, unfortunately, but you should still program AS IF that were the case.)

Given that a C-coded function is called from Python, it IS holding the GIL when it starts executing -- what it must do is RELEASE the GIL as soon as it's finished doing calls to the Python API, in order to let other threads use the Python interpreter, then acquire the GIL again before it returns control to the Python that called it.

There is no benefit that I can see in duplicating the Queue module in C with all the attendant locking headaches &c -- moving the loop itself into C seems to be a tiny, irrelevant speedup anyway.

Alex
Sat, 25 Dec 2004 23:28:47 GMT
Tim Churches #11 / 12
 Multi-threading on Multi-CPU machines
Quote:
> > > Hello,
> > > I am attempting to make a multi-threading function in one of my
> > > programs in an effort to gain a speed increase, but I'm getting quite
> > > the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> > > me as to why,
> > Yes. CPython threads cannot utilize multiple CPUs, due to the
> > Global Interpreter Lock, which may only be acquired by one
> > thread at a time. Apparently Jython threads do not have
> > this limitation, as the GIL doesn't exist in Jython, or so
> > I'm told. So if you simply ran your program under Jython
> > you might see an improvement.
> > Cheers,
> > -- Joe
> Thank you both for your answers, unfortunately running under Jython is
> not an option, as the whole program which I am writing runs to about
> 5,000 lines and uses lots of Python modules which I don't really fancy
> trying to get to work under Jython.
> So, am I correct in thinking that there is nothing I can do about
> this, and still use standard Python? I understand that Solaris has a
> very good threading library, but from the comments above, I assume
> that this would make no difference? Do you have any tips/ideas how I
> could make use of multiple processors in a Python program?
As someone else suggested, consider using MPI, which can be used to parallelise code on shared-memory SMP machines as well as networked clusters. Installing user-mode LAM/MPI is very easy, although other forms of MPI such as MPICH may be a bit more difficult. However, once you have MPI installed, there are a number of Python MPI modules around which make using it a cinch. May I recommend PyPar, by Ole Nielsen at the Australian National University, as being particularly easy to use? See http://datamining.anu.edu.au/~ole/pypar/

I am pretty sure Ole has been using MPI and PyPar on multi-CPU Solaris machines as well as Linux Beowulf clusters and the hybrid shared/distributed-memory APAC supercomputer at ANU.

Tim C

Quote:
> Thanks again
> Garry
> --
> http://mail.python.org/mailman/listinfo/python-list
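[A minimal sketch of the message-passing style, not from the original posts; it assumes PyPar's basic rank/size/send/receive interface, so check the PyPar documentation for the exact signatures before relying on it.]

--------
# Run with something like: mpirun -np 4 python worker.py
# Assumes PyPar's basic interface: size(), rank(), send(), receive().
import pypar

def TakesTime():
    # CPU-bound work from the original post.
    Text = "Test"
    for i in range(20):
        Text = Text + Text
    return len(Text)

nprocs = pypar.size()   # number of MPI processes started by mpirun
myid = pypar.rank()     # this process's id, 0 .. nprocs-1

# Each process does its own share of the 100 iterations.
for i in range(myid, 100, nprocs):
    result = TakesTime()

if myid == 0:
    # Collect a result from every other process.
    for source in range(1, nprocs):
        print "got", pypar.receive(source)
else:
    pypar.send(result, 0)   # report back to process 0

pypar.finalize()
--------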
Sun, 26 Dec 2004 03:55:38 GMT
Garry Taylor #12 / 12
 Multi-threading on Multi-CPU machines
Quote:
> >I am attempting to make a multi-threading function in one of my
> >programs in an effort to gain a speed increase, but I'm getting quite
> >the opposite, even on a dual-CPU Intel/Linux box. Can anyone enlighten
> >me as to why, my code is below:
> Pure Python code will always slow down when threaded; in order to gain a
> speedup, you must call an extension that releases the GIL. All I/O
> functions in Python release the GIL, for example. For more info, see
> the slides on my home page.
Thanks to those who replied. I think releasing the GIL would appear to be my best bet, as I don't want to add yet another dependency to the program, i.e. MPI and PyPar. Also, by 'shared memory' I take it you mean NUMAFlex machines and similar, rather than a little 2x1GHz P4 Dell server? The kind of machines my program will run on will max out at 4-way, I would expect, and it's not math-intensive or anything; I just want to speed up fairly average tasks, which only take around 10 seconds on a single processor.

Thanks again,
Garry
Sun, 26 Dec 2004 17:59:52 GMT