I'm writing a simple script that executes a system command on a sequence of files. To speed things up, I'd like to run the commands in parallel, but not all at once: I need to control the maximum number of simultaneously running commands. What would be the easiest way to approach this?
- @unholysampler: This question is neither related to multithreading nor to thread pools. Threads might be one solution to the given problem, but a bad one in my opinion. I will remove these tags again. (Sven Marnach, Feb 14, 2011 at 13:25)
- related: stackoverflow.com/questions/3194018/… (tokland, Feb 14, 2011 at 13:35)
- @S.Lott: Limiting the maximum number of processes seems reasonable. Imagine you have 100k processes to launch: would you want to spawn all of them at once, even if the OS could cope with it? (tokland, Feb 14, 2011 at 13:43)
- @S.Lott: If the processes being launched are database intensive, you might get a speed-up by running a small number in parallel, but after a certain point contention will result in a slowdown. (Andrew Wilkinson, Feb 14, 2011 at 14:07)
- @S.Lott: If the system command is sftp, for example, then you might want to run a limited number of processes in parallel. Given that the question references a system command, my reference to a database was probably not helpful, but that's why I've been in this situation in the past. (Andrew Wilkinson, Feb 14, 2011 at 14:32)
7 Answers
If you are calling subprocesses anyway, I don't see the need to use a thread pool. A basic implementation using the subprocess module would be
import subprocess
import os
import time

files = []  # fill in with the list of file names
command = "/bin/touch"
processes = set()
max_processes = 5

for name in files:
    processes.add(subprocess.Popen([command, name]))
    if len(processes) >= max_processes:
        # Block until any child process terminates (Unix only).
        os.wait()
        # Drop every process that has finished from the set.
        processes.difference_update([
            p for p in processes if p.poll() is not None])
On Windows, os.wait() is not available (nor is any other method of waiting for any child process to terminate). You can work around this by polling at regular intervals:
for name in files:
    processes.add(subprocess.Popen([command, name]))
    while len(processes) >= max_processes:
        # Sleep briefly, then drop every process that has finished.
        time.sleep(.1)
        processes.difference_update([
            p for p in processes if p.poll() is not None])
The time to sleep for depends on the expected execution time of the subprocesses.
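As a rough sketch, the polling variant can be wrapped in a function that also waits for the trailing processes and reports exit codes. The names run_limited, poll_interval and reap are made up for this illustration, and the child commands in the usage part are hypothetical stand-ins for a real workload.

```python
import subprocess
import sys
import time


def run_limited(commands, max_processes=5, poll_interval=0.1):
    """Run each argv list in `commands`, keeping at most `max_processes` alive.

    Returns the exit codes in the same order as `commands`.
    """
    commands = list(commands)
    exit_codes = [None] * len(commands)
    running = {}  # index into `commands` -> Popen object

    def reap():
        # Record and drop every child that has exited.
        for index, proc in list(running.items()):
            code = proc.poll()
            if code is not None:
                exit_codes[index] = code
                del running[index]

    for index, argv in enumerate(commands):
        # Wait for a free slot before starting the next child.
        while len(running) >= max_processes:
            time.sleep(poll_interval)
            reap()
        running[index] = subprocess.Popen(argv)

    # Wait for the trailing children that are still running.
    while running:
        time.sleep(poll_interval)
        reap()
    return exit_codes


if __name__ == "__main__":
    # Hypothetical workload: ten short-lived Python children, three at a time.
    jobs = [[sys.executable, "-c", "import time; time.sleep(0.2)"]] * 10
    print(run_limited(jobs, max_processes=3))
```

Because it only polls, this should work on both Unix and Windows, at the cost of up to poll_interval of idle time per slot.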
The answer from Sven Marnach is almost right, but there is a problem. Once one of the last max_processes processes ends, the main program goes on to start the next process, finds none left, and the for loop ends. The main process then exits, which can in turn kill the child processes that are still running. For me, this happened with the screen command.
On Linux, the code looks like this (and will only work on Python 2.7):
import subprocess
import os
import time

files = []  # fill in with the list of file names
command = "/bin/touch"
processes = set()
max_processes = 5

for name in files:
    processes.add(subprocess.Popen([command, name]))
    if len(processes) >= max_processes:
        os.wait()
        processes.difference_update(
            [p for p in processes if p.poll() is not None])

# Check if all the child processes were closed
for p in processes:
    if p.poll() is None:
        p.wait()
You need to combine a Semaphore object with threads. A Semaphore is an object that lets you limit the number of threads that are running in a given section of code. In this case we'll use a semaphore to limit the number of threads that can run the os.system call.
First we import the modules we need:
#!/usr/bin/python
import threading
import os
Next we create a Semaphore object. The number four here is the number of threads that can acquire the semaphore at one time. This limits the number of subprocesses that can be run at once.
semaphore = threading.Semaphore(4)
This function simply wraps the call to the subprocess in calls to the Semaphore.
def run_command(cmd):
    semaphore.acquire()
    try:
        os.system(cmd)
    finally:
        semaphore.release()
If you're using Python 2.6+ this can become even simpler as you can use the 'with' statement to perform both the acquire and release calls.
def run_command(cmd):
    with semaphore:
        os.system(cmd)
Finally, to show that this works as expected we'll call the "sleep 10" command eight times.
for i in range(8):
    threading.Thread(target=run_command, args=("sleep 10", )).start()
Running the script using the 'time' program shows that it only takes 20 seconds as two lots of four sleeps are run in parallel.
aw@aw-laptop:~/personal/stackoverflow$ time python 4992400.py
real 0m20.032s
user 0m0.020s
sys 0m0.008s
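If you also need the exit codes, or want to wait until every command has finished, one possible variation is sketched below. It swaps os.system for subprocess.call so each command is passed as an argument list, and the names run_all and max_parallel are invented for this example rather than taken from the answer above.

```python
import subprocess
import sys
import threading


def run_all(commands, max_parallel=4):
    """Run every argv list in `commands`, at most `max_parallel` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    commands = list(commands)
    semaphore = threading.Semaphore(max_parallel)
    exit_codes = [None] * len(commands)

    def run_command(index, argv):
        # Only `max_parallel` threads can be inside this block at once.
        with semaphore:
            exit_codes[index] = subprocess.call(argv)

    threads = [threading.Thread(target=run_command, args=(i, argv))
               for i, argv in enumerate(commands)]
    for thread in threads:
        thread.start()
    # Block until every command has finished.
    for thread in threads:
        thread.join()
    return exit_codes


if __name__ == "__main__":
    # Hypothetical workload: eight short sleeps, four at a time.
    jobs = [[sys.executable, "-c", "import time; time.sleep(0.5)"]] * 8
    print(run_all(jobs))
```

Each thread writes only to its own slot of exit_codes, so no extra lock is needed for the results. This still starts one thread per command up front.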
A drawback is that with a very large number of processes, you will unconditionally start a whole lot of threads first.
I merged the solutions by Sven and Thuener into one that waits for trailing processes and also stops if one of the processes crashes:
import logging
import shlex
import subprocess
import time

def removeFinishedProcesses(processes):
    """ given a list of (commandString, process),
        remove those that have completed and return the result
    """
    newProcs = []
    for pollCmd, pollProc in processes:
        retCode = pollProc.poll()
        if retCode is None:
            # still running
            newProcs.append((pollCmd, pollProc))
        elif retCode != 0:
            # failed
            raise Exception("Command %s failed" % pollCmd)
        else:
            logging.info("Command %s completed successfully" % pollCmd)
    return newProcs

def runCommands(commands, maxCpu):
    processes = []
    for command in commands:
        logging.info("Starting process %s" % command)
        proc = subprocess.Popen(shlex.split(command))
        procTuple = (command, proc)
        processes.append(procTuple)
        # block while the maximum number of processes is running
        while len(processes) >= maxCpu:
            time.sleep(.2)
            processes = removeFinishedProcesses(processes)
    # wait for all processes
    while len(processes) > 0:
        time.sleep(0.5)
        processes = removeFinishedProcesses(processes)
    logging.info("All processes completed")
What you are asking for is a thread pool: a fixed number of threads that can be used to execute tasks. When a thread is not running a task, it waits on a task queue to get a new piece of code to execute.
There is this thread pool module, but there is a comment saying it is not considered complete yet. There may be other packages out there, but this was the first one I found.
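For what it's worth, Python 3's standard library includes a thread pool in concurrent.futures, which is not the module linked above. A minimal sketch using it, where run_in_pool and max_workers are names chosen for this example, might look like this:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_pool(commands, max_workers=4):
    """Run each argv list with at most `max_workers` commands active at once.

    Returns the exit codes in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; each worker thread blocks on one child.
        return list(pool.map(subprocess.call, commands))


if __name__ == "__main__":
    # Hypothetical workload: eight short sleeps, four at a time.
    jobs = [[sys.executable, "-c", "import time; time.sleep(0.5)"]] * 8
    print(run_in_pool(jobs))
```

Unlike one thread per command, the pool only ever creates max_workers threads, so this should scale to long command lists.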
This answer is very similar to other answers present here but it uses a list instead of sets. For some reason when using those answers I was getting a runtime error regarding the size of the set changing.
from subprocess import PIPE
import subprocess
import time

def submit_job_max_len(job_list, max_processes):
    sleep_time = 0.1
    processes = list()
    for command in job_list:
        print('running {n} processes. Submitting {proc}.'.format(
            n=len(processes), proc=str(command)))
        processes.append(subprocess.Popen(command, shell=False, stdout=None,
                                          stdin=PIPE))
        # block while the maximum number of processes is running
        while len(processes) >= max_processes:
            time.sleep(sleep_time)
            processes = [proc for proc in processes if proc.poll() is None]
    # wait for the remaining processes to finish
    while len(processes) > 0:
        time.sleep(sleep_time)
        processes = [proc for proc in processes if proc.poll() is None]

cmd = '/bin/bash run_what.sh {n}'
job_list = ((cmd.format(n=i)).split() for i in range(100))
submit_job_max_len(job_list, max_processes=50)