Pragmatic Concurrency for Python


  1. Eilif Muller (eilif dot mueller at epfl dot ch)
  2. Zbigniew Jędrzejewski-Szmek (zbyszek at in dot waw dot pl)

Topics covered

  • Using a compute cluster
  • Basic parallel programming concepts
  • ipython (ipcluster)
  • multiprocessing
  • mpi4py (Message Passing Interface for Python)

Using a compute cluster

First, we need to get familiar with an compute cluster environment. To this end, we are fortunate to be able to offer an Alces Flight HPC cluster ( running on Amazon EC2 credits generously donated by the AWS Cloud Credits for Research program.

We will assign a number 1..15 for each pair, to login to the cluster head-node as follows:

$ ssh student<1..15>

When working on clusters, a very convenient tool is GNU screen. It allows you to keep multiple shells open with only one ssh connection, and you can reconnect if the ssh connection dies, without disturbing your long running jobs.

$ screen

Screen quick ref: - CTRL-A ? - help/listing of the various key commands - CTRL-A c - create a new screen - CTRL-A n - next screen - CTRL-A p - prev screen - CTRL-A k - kill the screen - CTRL-A d - detach the screen but it stays running To reattach later (even logout)

$ screen -d -r 

The following command will request an interactive shell from the scheduler

$ qrsh -pe smp 18 -now no

Now we're going to launch a jupyter notebook.
First we need anaconda3, which is a module to be loaded:

$ module load apps/anaconda3/2.5.0/bin

Next, we need to fetch the IP addr of the compute node, so we know the IP address of the jupyter notebook server we will launch:

dig +short

Copy the IP address determined by this command onto the clipboard.
Clone the materials git, and launch the jupyter server (specify the port according to your user id):

git clone
cd ParallelExercises/jupyter
jupyter notebook --no-browser --port=800<1..15>

In another local terminal (on your laptop) enable a port forward to the running jupyter notebook

$ ssh -N -f -L127.0.0.1:8000:<1..15> student<1..15>@<paste-ip-addr-from-clipboard>

Now point the browser to http://localhost:8000.
Click on the IPClusters tab, increase # engines to 18 and click start.
Click on the Files tab, launch ipyparallel_demo.ipynb and follow along the lecture.

Further reading:


The purpose of these exercises is to run and modify a few examples, become comfortable with APIs, and implement some simple parallel programs.

1) Running MPI programs

We're going to use the job scheduler (our AWS cluster uses Sun Grid Engine)

In the course material parallel/hello_world there is a simple python program using the mpi4py module which imports mpi4py.MPI and displays the COMM_WORLD.rank, size and MPI.Get_processor_name() on each process. It is always handy to have such a program around to verify that the MPI environment is working as expected. In a distributed environment, the processor name will further inform you that your MPI execution was spawned accross machine boundaries, and how many processes are allocated per machine.

Note: To run a program which uses mpi4py, it can be started as if it was any MPI program. First load the openmpi module

$ module load apps/anaconda3/2.5.0/bin
$ module load mpi/openmpi/1.8.5/gcc-4.8.5
$ mpirun -np 8 python

This is what you should see:

Hello from login1. 0 of 10
Hello from login1. 1 of 10
Hello from login1. 3 of 10
Hello from login1. 4 of 10
Hello from login1. 6 of 10
Hello from login1. 9 of 10
Hello from login1. 5 of 10
Hello from login1. 7 of 10
Hello from login1. 2 of 10
Hello from login1. 8 of 10

Note, all processes are running on login1. Now let's use the scheduler to run on the compute nodes:

$  qsub -pe mpislots 8

Further reading:

Commands to look into: qrsh, qsub, qstat, qhost, qacct

Command line options to consider: -l h_rt=24:00:00 -l h_vmem -pe smp 2 -pe mpislots 64

2) Matrix Multiplication

  • Configure them to have the same matrix sizes, and compare speeds. It would be nice to also look at speedup and scaling, for those of you with (remote) access to machines with more than 4 cores (true cores, not hyper-threads) with the appropriate software installed.

For ipython, you need to start an ipcluster:


$ ipcluster start -n X  # or similar for your ipython version, see lecture notes

Where -n X is the number of slave processes to start.

3) Parallelization of mandelbrot

  1. Using similar decomposition techniques to the Matrix Multiplication example, parallelize the serial implementation of the mandelbrot plotter provided in the examples, using mpi4py, ipython and multiprocessing.
  2. Load balancing - the mandelbrot compuation has the property that computing some pixels take much longer than others.

First, quantify the degree of inbalance by gathering and plotting the distribution of execution times per pixel. Assuming you used chunked decomposition as for matrix multiplication, how does this per-pixel imbalance translate into a per-chunk inbalance?

Second, Can you modify the decomposition of the problem to provide each worker with work-loads which are more equal?


ipython: read-up on the LoadBalancedView here: mpi4py: a pure mpi4py approach is more tricky. One method might be to use asynchronous messaging (Isend, Irecv) to a set of workers and let a master (e.g. rank 0) re-assign work as workers complete.

4) IPython map-reduce

Using the ipython approach, get a collection of processes to count the occurrences of a word in a collection of documents, and then reduce the results to a total count per word on the master process.

See also:,

Lecture material