Running Python scripts in parallel
How to launch multiple simultaneous Windows command-line processes
As a data-oriented researcher or practitioner, you'll sometimes need to process multiple datasets using a program or script. Or, especially if you're working with probabilistic models, you might want to execute the same script multiple times, using different input parameter values for each run. In that way you can generate a set of outputs which collectively represent a sampling of some possibility space under the model.
A case of the latter sort arose for me recently, in the context of computational structural biology. I'd developed a command-line script for manipulating protein structures, as represented by Protein Data Bank (PDB) files, and amongst its input parameters were an initial seed value and a final seed value. The script included code for various operations making use of Python's random module, and this code was contained in a for loop that iterated over the range of seed values defined by the initial seed value and final seed value parameters. Every time the code within the for loop ran, a different seed value was used and hence the random functions returned a different set of values.
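To make that structure concrete, here's a minimal sketch of the pattern. The argument handling and the per-seed work shown are hypothetical stand-ins; the real script does far more in each iteration:

import random
import sys

# Hypothetical skeleton: take an initial and a final seed value from the
# command line, then perform one run of the core logic per seed.
initial_seed = int(sys.argv[1])
final_seed = int(sys.argv[2])

for seed in range(initial_seed, final_seed + 1):
    random.seed(seed)  # a different seed per iteration...
    sample = [random.random() for _ in range(3)]  # ...gives a different random sequence
    print(f"run for seed {seed}: {sample}")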
These details about random functions and seed values are all in a sense by the by. The point is that each time I executed the script, I was initiating a series of runs of the code in the for loop. If the initial seed value was 1 and the final seed value was 10, then the code in the for loop ran 10 times, one after another. If the protein in question was large, each run (corresponding to a particular seed value) could take some hours to complete, and 10 runs in succession might take literally days. How could things be sped up?
The need for speed
I needed to exploit the fact that modern CPUs have multiple cores, with each core typically presenting two logical processors (via simultaneous multithreading, which Intel calls Hyper-Threading). When I looked at how much of the CPU's power was being harnessed (using Task Manager) while running a single script instance, I found that utilization didn't exceed a rather paltry 14%. The full power of the CPU was going unused, even though my script was computationally quite demanding (involving 3D matrix transformations acting on thousands of atom records, and iterating over those atoms to a variety of ends in a pretty intense fashion).
Parallel processes
The answer is to execute multiple processes simultaneously. That means opening a new command window for each process, and then in each window entering the command to run your script along with the relevant input parameters. If you open six command windows and launch your script in each one — supplying different parameter values of course (in my case meaning different random seed values) — you end up, if process execution takes a significant amount of time, with six processes running in parallel. If you check out the CPU utilization now, you will see a rather healthier percentage than you saw when just one script instance was running.
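Incidentally, if you'd rather not dig through Task Manager to find out how many logical processors your machine presents, Python can tell you directly, since os.cpu_count() reports the number of logical processors:

import os

# Reports the number of logical processors; on my six-core i5 this prints 12.
print(os.cpu_count())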
Exactly how much utilization you get will depend on how many cores, and logical processors, your CPU has. Mine, an Intel Core i5, features six cores, each presenting two logical processors. This means I don't see 100% CPU utilization until I have 12 processes running in parallel, which in turn means 12 command windows open and running my script. If there's a case for running the same script twelve times in parallel under different parameter values, it soon becomes tedious to open all those command-line windows manually and type the script execution command (including all its parameters) into each. What to do? It turns out that there's a solution to this new challenge: batch files.
Batch files to the rescue!
You might remember batch (.bat) files from days of old, when dinosaurs roamed the Earth and DOS ruled our PCs. Using a single batch file we can automate the creation of all those command windows and the launching of our multiple processes. A double click of the mouse on the relevant batch file name in Windows Explorer is then all it takes. So what is the batch file magic that needs to be invoked? An example will tell you everything you need to know. Here’s a batch file for launching 12 copies of my protein unstructuring script (called unstruct_v2.py):
start cmd /k python unstruct_v2.py "1hrc.pdb" "1" "15" "1500" "1" "1"
start cmd /k python unstruct_v2.py "1hrc.pdb" "2" "15" "1500" "2" "2"
start cmd /k python unstruct_v2.py "1hrc.pdb" "3" "15" "1500" "3" "3"
start cmd /k python unstruct_v2.py "1hrc.pdb" "4" "15" "1500" "4" "4"
start cmd /k python unstruct_v2.py "1hrc.pdb" "5" "15" "1500" "5" "5"
start cmd /k python unstruct_v2.py "1hrc.pdb" "6" "15" "1500" "6" "6"
start cmd /k python unstruct_v2.py "1hrc.pdb" "7" "15" "1500" "7" "7"
start cmd /k python unstruct_v2.py "1hrc.pdb" "8" "15" "1500" "8" "8"
start cmd /k python unstruct_v2.py "1hrc.pdb" "9" "15" "1500" "9" "9"
start cmd /k python unstruct_v2.py "1hrc.pdb" "10" "15" "1500" "10" "10"
start cmd /k python unstruct_v2.py "1hrc.pdb" "11" "15" "1500" "11" "11"
start cmd /k python unstruct_v2.py "1hrc.pdb" "12" "15" "1500" "12" "12"
Each line opens a new command window (the /k switch keeps the window open afterwards, because I wanted to see what was being written to standard output), runs the Python script against the PDB file 1hrc.pdb (which is cytochrome c, in case you're interested) and supplies it with a specific set of input parameters. You don't need to worry about what most of them are; the important ones are the last two, the initial seed value and the final seed value. You can see that in each line they have the same value, meaning that each process runs the script for just a single seed value. Contrast this with the single command for non-parallel execution, which to perform the same 12 runs (in series) would just be:
python unstruct_v2.py "1hrc.pdb" "12" "15" "1500" "1" "12"
The batch file enables me to run 12 process instances of my script in parallel, thereby harnessing all 12 of my CPU’s logical processors for 100% CPU utilization. (But remember that you could use this batch file script execution technique to launch different scripts in parallel if you so wished.)
Here's graphic evidence of that full CPU utilization:
[Task Manager screenshot showing CPU utilization at 100% across all 12 logical processors]
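Incidentally, rather than typing out those 12 start lines by hand, you could generate the batch file with a few lines of Python. Here's a minimal sketch that reproduces the file above; launch_parallel.bat is just a name I've made up, and I'm assuming the same parameter layout as in my example:

# Sketch: write a batch file that launches one process per seed value.
# The parameter layout mirrors the example above; adjust it to suit your script.
N_PROCESSES = 12

with open("launch_parallel.bat", "w") as bat:
    for seed in range(1, N_PROCESSES + 1):
        bat.write(f'start cmd /k python unstruct_v2.py "1hrc.pdb" '
                  f'"{seed}" "15" "1500" "{seed}" "{seed}"\n')

A double click on the generated file then launches all 12 processes, exactly as before.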
In-parallel + in-series execution
In the case of my protein script, each script run can handle a range of seed values, processing them in series, i.e. one after another, as mentioned above. So if I were to specify 10 seed values for each process instance launched, via the initial seed value and final seed value parameters, then with 12 script process instances parallelized via a suitable batch file I could in one fell swoop launch 120 runs with different seed values. This is proper 'fire and forget' (or 'fire and come back tomorrow to harvest the results') computing!
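The seed ranges for such a scheme are easy to compute mechanically. Here's a quick sketch of the arithmetic, assuming 12 processes with 10 seeds apiece:

N_PROCESSES = 12
SEEDS_PER_PROCESS = 10

# Process 1 gets seeds 1-10, process 2 gets 11-20, ..., process 12 gets 111-120.
for proc in range(N_PROCESSES):
    initial_seed = proc * SEEDS_PER_PROCESS + 1
    final_seed = initial_seed + SEEDS_PER_PROCESS - 1
    print(f"process {proc + 1}: seeds {initial_seed} to {final_seed}")

Pair those ranges with the batch-file generator sketched earlier and you have your 120-run launcher.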
The time savings of parallel process execution can be significant. Running the unstructuring script to perform artificially short test simulations (just to compare timings) for 12 seed values in series resulted in a total run time of around 647 seconds, i.e. about 54 seconds per seed value. However, running the same 12 simulations in parallel, by assigning one seed value to each process, took only around 130 seconds in total (or 10.8 seconds per seed value). That represents a very worthwhile five-fold speed-up.
So there we are: running scripts in parallel on Windows isn’t difficult and can deliver big time savings. I hope this overview has been helpful.