Running a queue of MPI calls in parallel with SLURM and limited resources

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.

I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.

Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:

#!/bin/bash

#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while <"KeepRunning.txt” == 1>
do
  for i in {0..40}
  do
    if <“RunJob_i.txt” == 1>
    then
      mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
    fi
  done
done

wait

This approach doesn't work (just crashes), but I don't know why (probably overutilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. This will be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem would be starting a batch of 5 8-core jobs and having 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.

Since the optimization will likely require upwards of 5000 mpi calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls on a large SLURM job? I would really appreciate any help.