Problem-set: Advanced NumPy at Zurich Summer School 2013

These files may be downloaded from https://python.g-node.org/wiki/numpy

Please do explore beyond the problems given, and feel free to ask questions at any time.

Note

Solutions to some problems are provided, but try to avoid peeking at them until you've made an attempt!

Before we start, load NumPy:

import numpy as np

The NumPy array

These exercises explore some of the more advanced features of NumPy. If you prefer to learn the basics from scratch, have a look at this introduction.

  1. Let's explore the basics of a NumPy array. Create a 3x4 array of random values (using np.random.random), and call it x. Create another array as follows: y = x[2]. What happens when you modify y--does x also change?

  2. In the array above, have a look at its different properties: x.shape, x.ndim, x.dtype, x.strides. What does each property tell you about the array?

  3. Compare the strides property of x.T to the above. What is x.T and can you explain the new strides?

  4. Compute the sum of the columns of x. You may find the function to do that by using np.<TAB> in IPython, print dir(np) in standard Python or np.lookfor('sum'). Remember that you can always look at the docstring of any function in Python (use np.func_name?<ENTER> in IPython or help(np.func_name) in standard Python).

  5. Construct the array x = np.array([0, 1, 2, 3], dtype=np.uint8) (here, uint8 represents a single byte in memory, an unsigned integer between 0 and 255). Can you explain the results obtained by x + 1 and x / 2? Also try x.astype(float) + 1 and x.astype(float) / 2.

  6. What is the maximum number of dimensions a NumPy array can have? Use one of the array constructors (np.zeros, np.empty, np.random.random, etc.) to find out.

Structured Arrays

  1. Design a data-type for storing the following record:

    • Timestamp in nanoseconds (a 64-bit unsigned integer)
    • Position (x- and y-coordinates, stored as floating point numbers)

    Use it to represent the following data:

    dt = np.dtype(<your code here>)
    x = np.array([(100, (0, 0.5)),
                  (200, (0, 10.3)),
                  (300, (5.5, 15.1))], dtype=dt)
    

    Have a look at the np.dtype docstring if you need help. After constructing x, try to print all the x values for which time is greater than 100 (hint: something of the form y[y > 100]).

  2. Place the following data in a text file:

    % rank         lemma (10 letters max)      frequency       dispersion
    21             they                        1865844         0.96
    42             her                         969591          0.91
    49             as                          829018          0.95
    7              to                          6332195         0.98
    63             take                        670745          0.97
    14             you                         3085642         0.92
    35             go                          1151045         0.93
    56             think                       772787          0.91
    28             not                         1638883         0.98
    

Now, design a suitable structured data type, then load the data from the text file. Examine the data you got, for example: extract words only, extract the 3rd row, print all words with rank < 30.

Sort the data according to frequency (see np.sort). Save the result to a Numpy data file sorted.npz with np.savez and load back with np.load. Do you get back what you put in?

Broadcasting

  1. Take a look at the following. Many of you may know it as a meshgrid:

    x, y = np.mgrid[:10, :5]
    z = x + y
    

    Now, reproduce z from the above snippet, but this time use NumPy's ogrid together with broadcasting.

  2. In our solution, broadcasting is used "behind the scenes". To see what happens more explicitly, apply np.broadcast_arrays on the x and y given by ogrid ogrid. This should correspond to the x and y produced by mgrid.

  3. Benchmark the two approaches (one using mgrid, the other ogrid), using IPython's timeit function, on a larger grid. Can you explain the difference in execution time?

  4. Given a list of XYZ-coordinates, p,

    [[1.0, 2.0, 10],
     [3.0, 4.0, 20],
     [5.0, 6.0, 30],
     [7.0, 8.0, 40]]
    

    Normalise each coordinate by dividing with its Z (3rd) element. For example, the first row becomes:

    [1/10, 2/10, 10/10]
    

    Hint: extract the last column as y, and then change its dimensions so that p / y works.

Indexing

  1. Create a 3x3 ndarray. Use fancy indexing to slice out the diagonal elements.

  2. Generate a 10 x 3 array of random numbers (in range [0,1]). From each row, pick the number closest to 0.75. Make use of np.abs and np.argmax to find the column j which contains the closest element in each row.

  3. Predict and verify the shape of the following slicing operation.

    x = np.empty((10, 8, 6))
    
    idx0 = np.zeros((3, 8)).astype(int)
    idx1 = np.zeros((3, 1)).astype(int)
    idx2 = np.zeros((1, 1)).astype(int)
    
    x[idx0, idx1, idx2]
    

  4. Advanced: This is not strictly speaking a question on indexing, but it's a fun exercise either way.

    Construct an array

    x = np.arange(12, dtype=np.int32).reshape((3, 4))
    

    so that x is

    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])
    

    Now, provide to np.lib.stride_tricks.as_strided the strides necessary to view a sliding 2x2 window over this array. The output should be

    array([[[[ 0,  1],
             [ 4,  5]],
    
            [[ 1,  2],
             [ 5,  6]],
    
            [[ 2,  3],
             [ 6,  7]]],
    
    
           [[[ 4,  5],
             [ 8,  9]],
    
            [[ 5,  6],
             [ 9, 10]],
    
            [[ 6,  7],
             [10, 11]]]], dtype=int32)
    

    The code is of the form

    z = as_strided(x, shape=(2, 3, 2, 2),
                      strides=(..., ..., ..., ...))
    

    This sort of stride manipulation is very handy to implement techniques such as region based statistics, convolutions, etc.

Array Interface

Documentation for NumPy's __array_interface__ may be found in the online docs.

  1. An author of a foreign package (array_interface/mutable_str.py) provides a string class that allocates its own memory:

    In [1]: from mutable_str import MutableString
    
    In [2]: s = MutableString('abcde')
    
    In [3]: print s
    abcde
    

    You'd like to view these mutable (mutable means the ability to modify in place) strings as ndarrays, in order to manipulate the underlying memory.

    1. Add an __array_interface__ dictionary attribute to s, then convert s to an ndarray. Use the given array_interface/template.py as a guide.
    2. Numerically add "1" to the array. Now print the original string to ensure that its value was modified.

Advanced Exercises

  1. Construct the following two arrays:

    x = np.array([[1, 2], [3, 4]], order='C', dtype=np.uint8)
    y = np.array([[1, 2], [3, 4]], order='F', dtype=np.uint8)
    

    Compare the bytes they store in memory by using

    [ord(c) for c in x.data]
    

    Note that, even though these arrays store data in different memory order, they are identical from the user's perspective.

    print x
    print y
    

  2. Examine the bytes stored by the following array (using the "ord" trick shown above).

    x = np.array([[1, 2], [3, 4]], dtype=np.uint32)
    

    Note that, on most laptops, the byte order will be little Endian, i.e. least significant byte first.

  3. Attempt the Fortran-ordering quiz.