running several system commands in parallel in Python

Question

I am writing a simple script that executes a system command on a sequence of files. To speed things up, I'd like to run the commands in parallel, but not all at once: I need to control the maximum number of simultaneously running commands. What would be the easiest way to approach this?

@unholysampler: This question is neither related to multithreading nor to thread pools. Threads might be one solution to the given problem, but a bad one in my opinion. I will remove these tags again.
– Sven Marnach, Commented Feb 14, 2011 at 13:25
@S.Lott. Limiting the maximum number of processes seems reasonable. Imagine you have 100k processes to launch: would you want to spawn all of them at once, even if the OS could cope with it?
– tokland, Commented Feb 14, 2011 at 13:43
@S.Lott If the processes that are being launched are database intensive you might get a speed up by running a small number in parallel, but after a certain point contention will result in a slow down.
– Andrew Wilkinson, Commented Feb 14, 2011 at 14:07
@S.Lott If the system command is sftp, for example, then you might want to run a limited number of processes in parallel. Given the question references a system command my reference to a database was probably not helpful, but that's why I've been in this situation in the past.
– Andrew Wilkinson, Commented Feb 14, 2011 at 14:32

Sven Marnach · Accepted Answer · 2014-03-31 18:50:02Z

36

If you are calling subprocesses anyway, I don't see the need to use a thread pool. A basic implementation using the subprocess module would be

import subprocess
import os
import time

files = ["a.txt", "b.txt", "c.txt"]  # placeholder names; replace with your list of file names
command = "/bin/touch"
processes = set()
max_processes = 5

for name in files:
    processes.add(subprocess.Popen([command, name]))
    if len(processes) >= max_processes:
        os.wait()
        processes.difference_update([
            p for p in processes if p.poll() is not None])

On Windows, os.wait() is not available (nor any other method of waiting for any child process to terminate). You can work around this by polling at regular intervals:

for name in files:
    processes.add(subprocess.Popen([command, name]))
    while len(processes) >= max_processes:
        time.sleep(.1)
        processes.difference_update([
            p for p in processes if p.poll() is not None])

How long to sleep depends on the expected execution time of the subprocesses.
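
The polling variant can also be wrapped in a reusable function. The following is a minimal sketch rather than part of the original answer: the name `run_limited`, its parameters, and the use of `sys.executable` as a stand-in command are assumptions made for illustration. Unlike the loops above, it waits for a free slot before launching, waits for the remaining children after the loop, and returns each exit code.

```python
import subprocess
import sys
import time


def run_limited(commands, max_processes=5, poll_interval=0.1):
    """Run each command (a list of arguments) with at most max_processes
    children alive at once. Returns the exit codes in submission order.

    Hypothetical helper for illustration; it only polls, so it does not
    depend on os.wait().
    """
    running = []   # Popen objects not yet seen to have finished
    started = []   # every Popen object, in submission order

    for cmd in commands:
        # Block until a slot is free before starting the next child.
        while len(running) >= max_processes:
            time.sleep(poll_interval)
            running = [p for p in running if p.poll() is None]
        proc = subprocess.Popen(cmd)
        running.append(proc)
        started.append(proc)

    # Wait for the stragglers so the parent does not exit early.
    for proc in started:
        proc.wait()
    return [proc.returncode for proc in started]


if __name__ == "__main__":
    # Stand-in commands: each child is a Python interpreter exiting with code i.
    cmds = [[sys.executable, "-c", "import sys; sys.exit(%d)" % i]
            for i in range(4)]
    print(run_limited(cmds, max_processes=2))
```

Building a new list on every poll, instead of mutating a set while a generator iterates over it, also sidesteps the "Set changed size during iteration" error discussed in the comments.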

edited Mar 31, 2014 at 18:50

answered Feb 14, 2011 at 13:23

Sven Marnach


7 Comments

michal Over a year ago

Thanks! This seems to be what I need, and it is very simple. However, I should have pointed out that I'm on Windows, and it seems os.wait() is not supported there. Is there an easy workaround?

Sven Marnach Over a year ago

@user476983: Windows unfortunately does not allow waiting for the termination of any child. You can work around this by polling all child processes once per second or so (depending on how long the execution of the child processes takes).

Mannaggia Over a year ago

It seems there is a problem with the line "processes.difference_update( p for p in processes if p.poll() is not None)". This causes "RuntimeError: Set changed size during iteration".

Mannaggia Over a year ago

Solution: "tmp=p for p in processes if p.poll() is not None)" and then "processes.difference_update(tmp)". This is a bit strange but it works. I am using python 2.7.

Sven Marnach Over a year ago

@Mannaggia: Your suggested code has mismatched parens. Assigning the generator expression to a temporary variable shouldn't make a difference. Turning it into a list comprehension should fix the problem -- I'll update the answer. (The error is probably caused by a rare race condition. Editing the code and trying again won't tell you whether the race condition is fixed. It might just not have occurred in that particular run, but would occur again in the next one.)


Pete · Accepted Answer · 2015-06-01 13:24:36Z

20

The answer from Sven Marnach is almost right, but there is a problem. Once the last file has been submitted, the for loop ends while up to max_processes child processes may still be running. The main process then exits, which can in turn terminate the child processes. For me, this behavior happened with the screen command.

On Linux the code looks like this (and will only work on Python 2.7):

import subprocess
import os
import time

files = ["a.txt", "b.txt", "c.txt"]  # placeholder names; replace with your list of file names
command = "/bin/touch"
processes = set()
max_processes = 5

for name in files:
    processes.add(subprocess.Popen([command, name]))
    if len(processes) >= max_processes:
        os.wait()
        processes.difference_update(
            [p for p in processes if p.poll() is not None])
# Wait for any child processes that are still running
for p in processes:
    if p.poll() is None:
        p.wait()

edited Jun 1, 2015 at 13:24

Pete


answered Aug 23, 2012 at 5:29

Thuener


2 Comments

CornSmith Over a year ago

I think you should delete this and add it to Sven's answer via an edit. Is this bad form on SO?

deepelement Over a year ago

Glory be to those that answer

Andrew Wilkinson · Accepted Answer · 2011-02-14 13:32:52Z

8

You need to combine a Semaphore object with threads. A Semaphore is an object that lets you limit the number of threads that are running in a given section of code. In this case we'll use a semaphore to limit the number of threads that can run the os.system call.

First we import the modules we need:

#!/usr/bin/python

import threading
import os

Next we create a Semaphore object. The number four here is the number of threads that can acquire the semaphore at one time. This limits the number of subprocesses that can be run at once.

semaphore = threading.Semaphore(4)

This function simply wraps the call to the subprocess in calls to the Semaphore.

def run_command(cmd):
    semaphore.acquire()
    try:
        os.system(cmd)
    finally:
        semaphore.release()

If you're using Python 2.6+ this can become even simpler as you can use the 'with' statement to perform both the acquire and release calls.

def run_command(cmd):
    with semaphore:
        os.system(cmd)

Finally, to show that this works as expected we'll call the "sleep 10" command eight times.

for i in range(8):
    threading.Thread(target=run_command, args=("sleep 10", )).start()

Running the script using the 'time' program shows that it only takes 20 seconds as two lots of four sleeps are run in parallel.

aw@aw-laptop:~/personal/stackoverflow$ time python 4992400.py

real    0m20.032s
user    0m0.020s
sys     0m0.008s
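
With many commands, the loop above starts one thread per command up front, and most of them simply block on the semaphore. A possible variation, sketched here under assumptions that are not in the original answer (the helper name `run_all`, its `max_parallel` parameter, and the Python stand-in for `sleep`), is to acquire the semaphore in the main thread before each thread is created, so that threads are only started as slots become free:

```python
import subprocess
import sys
import threading


def run_all(commands, max_parallel=4):
    """Run each command (a list of arguments), at most max_parallel at once.

    Hypothetical helper: returns the exit codes in submission order.
    """
    commands = list(commands)
    semaphore = threading.BoundedSemaphore(max_parallel)
    exit_codes = [None] * len(commands)

    def worker(index, cmd):
        # The slot was acquired by the main thread; free it when done.
        try:
            exit_codes[index] = subprocess.call(cmd)
        finally:
            semaphore.release()

    threads = []
    for index, cmd in enumerate(commands):
        # Blocks while max_parallel commands are running, so a new thread
        # is only created once a slot is free.
        semaphore.acquire()
        thread = threading.Thread(target=worker, args=(index, cmd))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return exit_codes


if __name__ == "__main__":
    # Stand-in for "sleep 10": eight short-lived Python children.
    cmds = [[sys.executable, "-c", "import time; time.sleep(0.2)"]] * 8
    print(run_all(cmds, max_parallel=4))
```

Joining every thread at the end also keeps the main program alive until all commands have finished.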

answered Feb 14, 2011 at 13:32

Andrew Wilkinson


3 Comments

Sven Marnach Over a year ago

I don't like using threads for this. They are completely unnecessary -- you are starting subprocesses anyway.

Andrew Wilkinson Over a year ago

Threads are cheap though, and a semaphore makes tracking the number of running processes extremely simple.

Sven Marnach Over a year ago

Yeah, the code looks nice, especially when using the with statement. A drawback is that with a very large number of processes, you will unconditionally start a whole lot of threads first.

Devashish Das · Accepted Answer · 2014-07-25 11:23:02Z

I merged the solutions by Sven and Thuener into one that waits for trailing processes and also stops if one of the processes crashes:

import logging
import shlex
import subprocess
import time

def removeFinishedProcesses(processes):
    """ given a list of (commandString, process),
        remove those that have completed and return the result
    """
    newProcs = []
    for pollCmd, pollProc in processes:
        retCode = pollProc.poll()
        if retCode is None:
            # still running
            newProcs.append((pollCmd, pollProc))
        elif retCode != 0:
            # failed
            raise Exception("Command %s failed" % pollCmd)
        else:
            logging.info("Command %s completed successfully" % pollCmd)
    return newProcs

def runCommands(commands, maxCpu):
    processes = []
    for command in commands:
        logging.info("Starting process %s" % command)
        proc = subprocess.Popen(shlex.split(command))
        procTuple = (command, proc)
        processes.append(procTuple)
        # block until fewer than maxCpu processes are running
        while len(processes) >= maxCpu:
            time.sleep(.2)
            processes = removeFinishedProcesses(processes)

    # wait for all processes
    while len(processes) > 0:
        time.sleep(0.5)
        processes = removeFinishedProcesses(processes)
    logging.info("All processes completed")

unholysampler · Accepted Answer · 2011-02-14 13:09:04Z

2

What you are asking for is a thread pool. There is a fixed number of threads that can be used to execute tasks. When a thread is not running a task, it waits on a task queue in order to get a new piece of code to execute.

There is this thread pool module, but there is a comment saying it is not considered complete yet. There may be other packages out there, but this was the first one I found.
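
In Python 3.2 and later, the standard library ships a thread pool in `concurrent.futures`. The following is a minimal sketch using it; it is not the module linked above, and the helper names `run_one` and `run_pool` as well as the stand-in commands are assumptions for illustration. Each worker thread just blocks on one subprocess, so `max_workers` caps the number of commands running at once.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd):
    """Run one command (a list of arguments) and return its exit code."""
    return subprocess.call(cmd)


def run_pool(commands, max_workers=4):
    """Run commands with at most max_workers of them running at once.

    Exit codes are returned in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands commands to idle workers and preserves input order.
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Stand-in commands: Python children that exit with code 0 or 1.
    cmds = [[sys.executable, "-c", "import sys; sys.exit(%d)" % (i % 2)]
            for i in range(6)]
    print(run_pool(cmds, max_workers=3))
```

Leaving the `with` block waits for all submitted commands, so the main program does not exit while children are still running.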

answered Feb 14, 2011 at 13:09

unholysampler


Jakob Bowyer · Accepted Answer · 2011-02-14 13:15:13Z

2

If you're running system commands, you can just create the process instances with the subprocess module and call them as you want. There shouldn't be any need to use threads (it's unpythonic), and multiprocessing seems a tad overkill for this task.

answered Feb 14, 2011 at 13:15

Jakob Bowyer


skeept · Accepted Answer · 2012-12-17 18:30:13Z

2

This answer is very similar to other answers present here but it uses a list instead of sets. For some reason when using those answers I was getting a runtime error regarding the size of the set changing.

from subprocess import PIPE
import subprocess
import time


def submit_job_max_len(job_list, max_processes):
  sleep_time = 0.1
  processes = list()
  for command in job_list:
    print('running {n} processes. Submitting {proc}.'.format(n=len(processes),
        proc=str(command)))
    processes.append(subprocess.Popen(command, shell=False, stdout=None,
      stdin=PIPE))
    while len(processes) >= max_processes:
      time.sleep(sleep_time)
      processes = [proc for proc in processes if proc.poll() is None]
  while len(processes) > 0:
    time.sleep(sleep_time)
    processes = [proc for proc in processes if proc.poll() is None]


cmd = '/bin/bash run_what.sh {n}'
job_list = ((cmd.format(n=i)).split() for i in range(100))
submit_job_max_len(job_list, max_processes=50)

answered Dec 17, 2012 at 18:30

skeept


1 Comment

knowone Over a year ago

Quick query man. I was trying your solution. Basically trying to pass one shell script with multiple commands in the solution listed above. The value mentioned in the range(100), it just executes 1 command for 100 times. Which basically doesn't satisfy what a parallel approach should be. Please correct me if I'm wrong; just starting Python so lot of confusions. Appreciate the help.
