# When parallelization does not help: the starving CPUs problem

##### Executive summary

Computational scientists should know that most of the time their CPUs are waiting for data to arrive. Knowing where the low-level bottlenecks are, and knowing what can be done to ameliorate them, may save hours of frustration when trying to understand why apparently well-written programs perform poorly.

I'll talk about why current CPUs are starving for data, and about the different techniques that can be used to address this issue on modern computers.

##### Contents

1. Motivation

2. The Data Access Issue

• Why Are Modern CPUs Starving?
• Caches And The Hierarchical Memory Model
• Techniques For Fighting Data Starvation

3. High Performance Libraries

Preliminary slides here. The multiprocessing script for NumPy is here.

##### Exercises

Fetch the guidelines in TXT or PDF format.

You will need these files:

Obviously, the solutions are not provided (yet!).

##### Server for exercises
You should see the following:
```
Host key fingerprint is 08:43:b6:f6:ca:fb:7d:de:4a:19:9a:31:70:fd:1b:eb
+--[ RSA 2048]----+
|    o            |
|   o .   .       |
|    = . . .      |
|   . + +   .     |
|      o S . o    |
|   . .   = o +   |
|    o   o o o    |
|     . . ..o     |
|    ... .oo.E    |
+-----------------+
```

The exercises need to be performed on a machine more powerful than a typical laptop. To access the server you need to connect on a non-standard port (to bypass the firewall :( ):

`ssh -p 443 student@escher.fuw.edu.pl -o VisualHostKey=yes`
##### CPU vs Memory Benchmark

Save the following as `cpu_vs_mem.py` and execute it. Thanks to Stéfan van der Walt for contributing this script.

cpu_vs_mem.py
```python
import numpy as np
import timeit
import sys

N = 10 * 1000 * 1000
x = np.linspace(1, 10, N)


def raw():
    """Straightforward NumPy evaluation of the polynomial."""
    return (((.25 * x) + .75) * x - 1.5) * x - 2


def inplace(block_size=20000):
    """Blocked, in-place evaluation of the polynomial."""
    y = np.empty(len(x))
    for k in range(len(x) // block_size + 1):
        b, e = k * block_size, (k + 1) * block_size
        y[b:e] = x[b:e]
        y[b:e] *= .25
        y[b:e] += .75
        y[b:e] *= x[b:e]
        y[b:e] -= 1.5
        y[b:e] *= x[b:e]
        y[b:e] -= 2
    return y


def bench():
    """Break up a computation in chunks and benchmark.

    Small blocks fit into cache easily, but the NumPy overhead and the
    outer Python for-loop take long to execute.  With large blocks, the
    overhead for NumPy and the for-loop is negligible, but the blocks no
    longer fit into cache, resulting in delays.

    Returns
    -------
    block_sizes : list
        Sizes of the different data chunks.
    times : list
        Execution times.

    """
    times = []
    blocks = np.round(np.logspace(3, 6, num=50)).astype(int)
    for b in blocks:
        times.append(timeit.timeit('cpu_vs_mem.inplace(block_size=%d)' % b,
                                   'import cpu_vs_mem', number=1))
        print('Block size: %d  Execution time: %.2f' % (b, times[-1]))
        sys.stdout.flush()
    return blocks, times


if __name__ == "__main__":
    import matplotlib.pyplot as plt

    blocks, times = bench()
    plt.semilogx(blocks, times, 'o-')
    plt.xlabel('Block size [b]')
    plt.ylabel('Execution time [s]')
    plt.title('CPU vs Memory Benchmark')
    plt.show()
```
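
As a point of comparison for the hand-blocked loop above, libraries such as numexpr evaluate whole array expressions in cache-sized chunks automatically, avoiding both the Python-loop overhead and the temporary arrays that the plain NumPy version creates. The sketch below assumes numexpr is installed (`pip install numexpr`); it evaluates the same polynomial in a single call:

```python
import numpy as np
import numexpr as ne  # assumed installed; not part of the original script

N = 10 * 1000 * 1000
x = np.linspace(1, 10, N)

# Same polynomial as raw()/inplace(), but numexpr compiles the whole
# expression and streams x through it in cache-friendly blocks,
# optionally using several threads.
y = ne.evaluate("((0.25 * x + 0.75) * x - 1.5) * x - 2")
```

On machines where the benchmark above is memory-bound, this typically runs at least as fast as the best hand-picked block size, without having to tune one.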