Converting script to execute external program using multiple cores

I'm a real beginner at Python, but I have the following script working successfully. It first creates a list of .xml files and then executes the program for each one.

Each .xml takes 2-4 minutes to complete and I need to run thousands, so I've been trying to speed up my script by using multiprocessing, but it appears beyond my skills.

Any suggestions on how to modify it would be greatly appreciated.

# import modules
import os, fnmatch, subprocess
from datetime import datetime

# Set variables
project_folder = r"T:\erin\indivs_sample"
phoenix_exe_file = r'C:\Phoenix\Phoenix.exe'

# Create definitions

def runPhoenix(project_file):
    print("Running Phoenix @: " + datetime.now().strftime("%a, %d %b %Y %H:%M:%S") + " - " + project_file)
    process = subprocess.Popen([phoenix_exe_file, project_file])
    process.wait()  # block until this Phoenix run has finished
    print("Phoenix Complete @: " + datetime.now().strftime("%a, %d %b %Y %H:%M:%S"))

# Create list of XMLs

project_files = []

for file_name in os.listdir(project_folder):
    if fnmatch.fnmatch(file_name,'*.xml'):
        file_path = os.path.join(project_folder, file_name)
        project_files.append(file_path)

# run project files

for project_file in project_files:
    runPhoenix(project_file)


print "completed"

1 answer

  • answered 2018-04-17 05:06 tlfong01

    Your question looks a bit complicated. Let me see if I understand your Python program correctly. Your program does two main things.

    1. Look into a project folder, find the xml files that match the *.xml pattern, and create a list of the matched files' full paths.

    2. Call the runPhoenix function on each xml file in that list, one after another, to run the Phoenix program on it.

    I know very little about HTML and XML, and nothing at all about the Phoenix program.

    But I think your problem, in general, is trying to speed up a list of time-consuming jobs by executing them in parallel.

    Let me give a specific example of your general problem. You have, say, 1,000 text files in English, and you want to translate them into Spanish. For now, you have only one translator doing the job sequentially, and it takes a very long time.

    Now you would like to get, say, 4 translators, each translating 1000/4 = 250 files.

    One possible solution is to use the Python multiprocessing package, which can create a pool of, say, 4 translator worker processes doing the jobs at the same time. This way, you can be up to 4 times faster.

    If I have understood your problem correctly, then the rough sketch below shows how the same pool idea could be applied directly to your script, rather than to the translation example.
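
    This is only a minimal sketch, not a tested solution: it assumes Phoenix tolerates several copies of itself running at once, and the worker count of 4 is just a starting point to tune against your machine's core count. Note that on Windows the multiprocessing package requires the pool to be created inside an if __name__ == '__main__': guard, and the worker function must be defined at module level.

        import os, fnmatch, subprocess
        from datetime import datetime
        from multiprocessing import Pool

        project_folder = r"T:\erin\indivs_sample"
        phoenix_exe_file = r'C:\Phoenix\Phoenix.exe'

        def runPhoenix(project_file):
            # each worker process launches one copy of Phoenix and waits for it
            print("Running Phoenix @: " + datetime.now().strftime("%a, %d %b %Y %H:%M:%S") + " - " + project_file)
            process = subprocess.Popen([phoenix_exe_file, project_file])
            process.wait()
            print("Phoenix Complete @: " + datetime.now().strftime("%a, %d %b %Y %H:%M:%S"))

        if __name__ == '__main__':
            # required on Windows: child processes re-import this module,
            # so pool creation must not run at import time
            project_files = [os.path.join(project_folder, f)
                             for f in os.listdir(project_folder)
                             if fnmatch.fnmatch(f, '*.xml')]

            pool = Pool(processes=4)  # 4 workers is an assumption; tune to your CPU
            pool.map(runPhoenix, project_files)
            pool.close()
            pool.join()

            print("completed")

    pool.map hands each idle worker the next file in the list, so up to 4 Phoenix runs overlap instead of queueing one behind another. This only pays off if your machine has enough cores and memory to run several Phoenix instances at once.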