NB: NumPy Continued

Quick Refresher on Shape

The shape of an array is represented a tuple, e.g. (n, m) for a 2D array.

The length of the tuple is number of dimensions (i.e. axes).
The values of the tuple are the number of elements in each dimension (axis).

Consider the array foo:

import numpy as np

foo = np.ones((6,4))

foo

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

It was created with the \(2\) dimensions. * The first dimension, axis 0, has \(6\) elements. In this case, these elements are arrays. * The second dimension, axis 1, has \(4\) elements. Its elements are scalars (numbers in this case).

The shape of foo[0] is \(4\). It contains \(1\) axis with \(4\) elements.

foo[0], foo[0].shape

(array([1., 1., 1., 1.]), (4,))

It has a shape of \(1\) and not \(4 \times 1\) because it is a vector, not a matrix.

SO, there is a difference between a vector and a 1-column matrix.

Reshaping

If we want to make it into a 1-column matrix, we need to reshape it using np.reshape().

Note that the first value of the shape argument is \(-1\). This means use the length of the vector that is passed to it.

## np.reshape?

foo2 = np.reshape(foo[0], (-1, 1))

foo2, foo2.shape

(array([[1.],
        [1.],
        [1.],
        [1.]]),
 (4, 1))

When indexing an array, think of the positions of the comma-delimitted tuple as the axis.

The values are the element offsets in the containing array. The

foo[2:, 2:]
##   ^   ^ 
##   0   1  <- AXIS NUMBER

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

Example: The Normal Equation

Sometimes, you need to convert vectors into 1-column matrices to make certain linear algebraic functions work.

Consider two random variables, \(x\) and \(y\).

x = np.random.randn(100)
y = np.random.randn(100)

We can fit a regression line using the normal equation, which appears in this week’s homework.

\(\begin{aligned} \hat\beta_i=(x^Tx)^{−1}x^Ty \end{aligned}\)

Here is a standward way of expressing it in NumPy:

get_beta1 = lambda x, y: (np.linalg.inv(x.T.dot(x))) * (x.T.dot(y))

However, it will fail if we pass it our two variables, x and y.

The reason is that it expects x to be a matrix, since it is designed to handle n-dimension predictor variables, usually represented as \(\textbf{X}\).

get_beta1(x, y)

LinAlgError: 0-dimensional array given. Array must be at least two-dimensional

The revised function will work with a vector as x:

\(\hat\beta_i = \large\frac{1}{x^Tx} \small(x^Ty)\)

get_beta2 = lambda x, y: (1 / x.T.dot(x)) * (x.T.dot(y))

get_beta2(x, y)

0.025226057106433126

We can fix the problem in the general case by converting our vector into a matrix using np.reshape().

x_as_matrix = np.reshape(x, (-1, 1))

get_beta1(x_as_matrix, y)

array([[0.02522606]])

get_beta1(x_as_matrix, y)[0][0]

0.025226057106433126

get_beta1(x_as_matrix, y)[0][0] == get_beta2(x, y)

True

One take-away here is that there is a difference betweek a scalar value and a 1 x 1 array.

Broadcasting

What happens when you try to perform an element-wise operation on two arrays of different shape?

NumPy will convert a low-dimensional array into a high-dimensional array to allow the operation to take place.

This is called broadcasting.

Let’s look at at our array foo:

foo

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

If we multiply it by 5, the scalar is converted into an array of the same shape as foo with the value 5 broadcast to populate the entire array.

foo * 5

array([[5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.]])

If we want to multiply an array by a vector, the vector is broadcast to become a 2D array.

foo * np.array([5, 10, 6, 8])

array([[ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.]])

Note that NumPy can’t always make the adjustment:

foo * np.array([5, 10])

ValueError: operands could not be broadcast together with shapes (6,4) (2,)

Array-Oriented Programming

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions without writing loops.

This practice of replacing explicit loops with array expressions is referred to by some people as vectorization.

Vectorized array operations are often significantly faster than their pure Python equivalents.

They are also visually concise and elegant, although loops have the virtue of visualizing what’s under the hood in an algorithm.

Expressing Conditional Logic as Array Operations

`np.where()`

The np.where function is a vectorized version of the ternary expression x if condition else y.

Suppose we had a boolean array and two arrays of values:

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]

result

[1.1, 2.2, 1.3, 1.4, 2.5]

Here is the vectorized version:

result = np.where(cond, xarr, yarr)

result

array([1.1, 2.2, 1.3, 1.4, 2.5])

arr = np.random.randn(4, 4)

arr

array([[-0.07558452,  0.06060905,  0.08395713, -0.50584862],
       [-1.02703874, -2.08227842, -0.70058386,  0.42922211],
       [ 1.65264793, -0.0081504 , -0.88014697,  0.53720441],
       [ 0.98719467, -2.26285173,  0.51148161, -0.1798714 ]])

arr > 0

array([[False,  True,  True, False],
       [False, False, False,  True],
       [ True, False, False,  True],
       [ True, False,  True, False]])

np.where(arr > 0, 2, -2)

array([[-2,  2,  2, -2],
       [-2, -2, -2,  2],
       [ 2, -2, -2,  2],
       [ 2, -2,  2, -2]])

np.where(arr > 0, 2, arr) # set only positive values to 2

array([[-0.07558452,  2.        ,  2.        , -0.50584862],
       [-1.02703874, -2.08227842, -0.70058386,  2.        ],
       [ 2.        , -0.0081504 , -0.88014697,  2.        ],
       [ 2.        , -2.26285173,  2.        , -0.1798714 ]])

Mathematical and Statistical Methods

Statistical computations are aggregate functions apply to vectors within an array.

In a 2D array, they can be applied to rows or columns, i.e. axis \(0\) or axis \(1\).

Let’s create an array of random values. We can think of it is a table of observations and random variables.

arr = np.random.randn(5, 4)

arr

`.mean()`

arr.mean()

np.mean(arr)

arr.mean(axis=1)

arr.mean(axis=0)

`.sum()`

arr.sum()

Row wise aggregration

arr.sum(axis=0)

arr.mean(axis=1)

Column-wise aggregration

`.cumsum()`

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])

arr.cumsum()

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

arr

arr.cumsum(axis=0)

arr.cumprod(axis=1)

Methods for Boolean Arrays

`.sum()`

Since booleans are \(0\)s and \(1\), we can sum them to get a total truth count.

arr = np.random.randn(100)
(arr > 0).sum() # Number of positive values

bools = np.array([False, False, True, False])

`.any()`

bools.any()

`.all()`

bools.all()

Sorting

`.sort()`

arr = np.random.randn(6)

arr

arr.sort()

arr

arr = np.random.randn(5, 3)

arr

arr.sort(1)

arr

large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile

0.05 * len(large_arr)

Unique and Other Set Logic

`np.unique()`

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)

sorted(set(names))

ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

`np.in1d()`

Tests whether each element of a 1-D array is also present in a second array.

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])

File Input and Output with Arrays

`np.save()`

Save an array to a binary file in NumPy .npy format.

Automatically adds the .npy file extension.

arr = np.arange(10)
np.save('some_array', arr)

`np.load()`

np.load('some_array.npy')

`np.savetxt()`

Save an array to a text file.

## np.savetxt?

np.savetxt('some_array.txt', arr)

## !more some_array.txt

`np.savez()`

Save several arrays into a single file in uncompressed .npz format.

np.savez('array_archive.npz', a=arr, b=arr)

arch = np.load('array_archive.npz')
arch['b']

np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)

Clean up …

!rm some_array.npy
!rm some_array.txt
!rm array_archive.npz
!rm arrays_compressed.npz

Linear Algebra

`.dot()`

Dot product of two arrays. Specifically, - If both a and b are 1-D arrays, it is inner product of vectors (without complex conjugation). - If both a and b are 2-D arrays, it is matrix multiplication, but using matmul() or a @ b is preferred. - If either a or b is 0-D (scalar), it is equivalent to multiply() and using numpy.multiply(a, b) or a * b is preferred. - If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b. - If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:

    dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])

x, y

x.dot(y)

np.dot(x, y)

np.dot(x, np.ones(3))

In NumPy, the @ operator means matrix multiplication.

x @ np.ones(3)

`np.linalg.inv()`

from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
mat = X.T.dot(X)
inv(mat)

mat.dot(inv(mat))
q, r = qr(mat)

Pseudorandom Number Generation

`np.random.normal()`

samples = np.random.normal(size=(4, 4))
samples



from random import normalvariate
N = 1000000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)

np.random.seed(1234)

rng = np.random.RandomState(1234)

rng.randn(10)

Example: Random Walks

Let simulate a random walk. The walk will be represented as a vector.

We’ll do it first as loop, then with vectorization.

Loops

import random
position = 0
walk = [position] # initialize the walk
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1  # Coin toss
    position += step
    walk.append(position)

import matplotlib.pyplot as plt

plt.figure()
plt.plot(walk[:100]);

Vectors

np.random.seed(12345)

nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk2 = steps.cumsum()

plt.plot(walk[:100]);

walk2.min(), walk2.max()

Simulating Many Random Walks at Once

nwalks = 5000
nsteps = 1000

draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)

draws

walks

walks.max(), walks.min()

hits30 = (np.abs(walks) >= 30).any(1)
hits30

hits30.sum() # Number that hit 30 or -30

crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()

steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))

Feel free to experiment with other distributions for the steps other than equal-sized coin flips. You need only use a different random generator method, like standard_normal to generate normally distributed steps with some mean and standard deviation:

draws = 0.25 * rng.standard_normal((nwalks, nsteps)) # Walks: observations, Steps: features

sample_walk = np.random.choice(len(draws))
plt.plot(draws[sample_walk])
plt.title(f"Walk #{sample_walk}");