NB: NumPy Functions

Programming for Data Science

NumPy provides a wide variety of functions and methods that take advantage of the multi-dimensional array data structure it introduces.

These are grouped into modules according to purpose, such as linalg for linear algebra and math for standard mathematical functions.

Here, we go over some of these tools, but our goal is more to give you a taste of what NumPy offers.

You will want to frequent the documentation to discover the full breadth of functions and methods.

More important, the goal is to introduce how NumPy thinks of functions and methods.

As mentioned earlier, NumPy offers an approach to applying functions to collections of data without resorting to looping, nesting, etc.

Let’s start with this.

Universal Functions

NumPy introduces the concept of a universal function.

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays.

Universal functions use the same pattern we saw with arithmetic operations, applying a function to elements.

You may think of these as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

Let’s look at some examples.

import numpy as np

Let’s first define an array that we want to operate on.

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

To perform an element-wise square root operation on this, we might do the following:

from math import sqrt
np.array([sqrt(x) for x in arr])
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

However, NumPy offers a more concise approach:

np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

It’s also faster:

%timeit np.array([sqrt(x) for x in arr])
%timeit np.sqrt(arr)
2.26 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
710 ns ± 3.97 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Here’s an element-wise exponent operation:

np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

Some vectorized functions can take two arrays.

x = np.random.randn(8)
y = np.random.randn(8)
x, y
(array([-0.02865278, -1.47829863,  0.820916  ,  0.86317255,  0.85184202,
        -0.52943286,  0.79710862,  0.43352724]),
 array([ 0.79527927,  0.98962373,  0.30685449, -1.88671886,  0.24585142,
        -0.79768499, -1.30316536,  0.84871309]))

This takes the maximum between the corrresponding elements in x and y:

np.maximum(x, y)
array([ 0.79527927,  0.98962373,  0.820916  ,  0.86317255,  0.85184202,
       -0.52943286,  0.79710862,  0.84871309])

Let’s create another array of random floats and apply the modulus function.

This returns two values.

arr_rand = np.random.randn(7) * 5
arr_rand
array([ 6.18511759, -2.17250696,  2.74951073,  3.81623877, -1.56924266,
        8.44800585,  3.22968695])
remainder, whole_part = np.modf(arr_rand)
remainder
array([ 0.18511759, -0.17250696,  0.74951073,  0.81623877, -0.56924266,
        0.44800585,  0.22968695])
whole_part
array([ 6., -2.,  2.,  3., -1.,  8.,  3.])

What happens when we try to get the square root of a negative number?

np.sqrt(arr_rand)
/tmp/ipykernel_946251/3237590434.py:1: RuntimeWarning: invalid value encountered in sqrt
  np.sqrt(arr_rand)
array([2.48698966,        nan, 1.65816487, 1.95351959,        nan,
       2.90654535, 1.79713298])

nan is a special value in NumPy.

Vectorization

Beyond universal functions, NumPy expresses many kinds of data processing tasks as concise array-oriented expressions without writing loops.

This practice of replacing explicit loops with array expressions is referred to by some people as vectorization.

Vectorized array operations are often significantly faster than their pure Python equivalents.

They are also visually concise and elegant, although loops have the virtue of visualizing what’s under the hood in an algorithm.

Conditional Logic

np.where()

The np.where function is a vectorized version of the ternary expression x if condition else y.

Suppose we had a boolean array and two arrays of values and we wanted to pick between the two arrays based on the boolean.

We might do something like this:

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result
[1.1, 2.2, 1.3, 1.4, 2.5]

A quick note on zip().

This function takes two or more iterables (e.g. lists) of the same length and returns their grouped elements by index.

In effect, it binds a group of lists as columns in a table and returns their rows.

For example,

A = [1, 2, 3, 4]
B = ['a', 'b', 'c', 'd']
C = [(a, b) for a, b in zip(A, B)]
C
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

Here is the vectorized version:

result = np.where(cond, xarr, yarr)
result
array([1.1, 2.2, 1.3, 1.4, 2.5])

And it’s faster, too:

%timeit [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
%timeit np.where(cond, xarr, yarr)
1.42 µs ± 6.33 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.03 µs ± 8.67 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Here is another example, where we use a boolean expression that returns an array to conditionally operate on a \(2\)-D array.

arr = np.random.randn(4, 4)
arr
array([[-0.37372765, -0.07008402, -0.9804428 , -0.4962722 ],
       [-0.48452222, -1.63528018,  2.02413952,  0.95929382],
       [ 0.85871743, -0.32967862, -0.22090242, -0.95784047],
       [ 0.31219406,  0.98487624,  1.00231528, -0.12983138]])
arr > 0
array([[False, False, False, False],
       [False, False,  True,  True],
       [ True, False, False, False],
       [ True,  True,  True, False]])
np.where(arr > 0, 2, -2)
array([[-2, -2, -2, -2],
       [-2, -2,  2,  2],
       [ 2, -2, -2, -2],
       [ 2,  2,  2, -2]])
np.where(arr > 0, 2, arr)
array([[-0.37372765, -0.07008402, -0.9804428 , -0.4962722 ],
       [-0.48452222, -1.63528018,  2.        ,  2.        ],
       [ 2.        , -0.32967862, -0.22090242, -0.95784047],
       [ 2.        ,  2.        ,  2.        , -0.12983138]])

Mathematical and Statistical Methods

Statistical computations are aggregate functions applied to vectors within an array.

In a \(2\)-D array, they can be applied to rows or columns, i.e. axis \(0\) or axis \(1\).

To demonstrate, let’s create an array of random values.

We can think of it is a table of observations and random variables.

arr = np.random.randn(5, 4)
arr
array([[ 0.85124258,  1.26274113, -1.41403354, -1.92964977],
       [-0.9302931 ,  0.21807627, -0.19301232, -1.69534407],
       [ 1.06757947, -0.34279466, -0.33276255,  1.68059036],
       [ 0.38249005,  0.17731579, -0.46004143, -0.00862738],
       [-1.7617584 , -0.17263553,  0.56952929, -0.50669218]])

.mean()

We can get the mean of all the numbers in the array by using a method or a function.

arr.mean()
-0.17690399891686628
np.mean(arr)
-0.17690399891686628

Here we get the mean for each row.

arr.mean(axis=1)
array([-0.3074249 , -0.6501433 ,  0.51815315,  0.02278426, -0.4678892 ])

And here for each column.

arr.mean(axis=0)
array([-0.07814788,  0.2285406 , -0.36606411, -0.49194461])

.sum()

We can do the same with summing.

np.sum(arr), arr.sum()
(-3.5380799783373256, -3.5380799783373256)
arr.sum(axis=0)
array([-0.39073939,  1.142703  , -1.83032055, -2.45972304])
arr.mean(axis=1)
array([-0.3074249 , -0.6501433 ,  0.51815315,  0.02278426, -0.4678892 ])

.cumsum()

And here is cumulative summing.

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()
array([ 0,  1,  3,  6, 10, 15, 21, 28])
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
arr.cumsum(axis=0)
array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])
arr.cumprod(axis=1)
array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]])

Boolean Arrays

.sum()

Since booleans are \(0\)s and \(1\), we can sum them to get a total truth count.

arr = np.random.randn(100)
(arr > 0).sum() # Number of positive values
43
bools = np.array([False, False, True, False])
bools.sum()
1

.any() and .all()

bools.any(), bools.all()
(True, False)

Sorting

NumPy provides several sorting options. Here we look at the most basic forms, the sort function np.sort() and the method .sort().

Here’s the method option.

arr = np.random.randn(6)
arr
array([-0.13048996, -0.72462279, -0.06963874,  0.3639324 ,  2.22785044,
       -0.47912818])
arr.sort()
arr
array([-0.72462279, -0.47912818, -0.13048996, -0.06963874,  0.3639324 ,
        2.22785044])

Note that this performs an in-place operation — it changes the state of the object.

The method also does not return anything.

If you just want to get a sorted version of the vector without changing the state of the object, use the function.

This does return a value.

np.sort(arr)
array([-0.72462279, -0.47912818, -0.13048996, -0.06963874,  0.3639324 ,
        2.22785044])

To reverse sort, you have to slice the result and walk backwords:

arr[::-1]
array([ 2.22785044,  0.3639324 , -0.06963874, -0.13048996, -0.47912818,
       -0.72462279])

Here we sort a \(2\)-D array by row.

That is, each row is sorted.

arr = np.random.randn(5, 3)
arr
array([[ 0.75982535, -1.44121737,  0.55483639],
       [-2.06462226,  1.14922865, -0.18207867],
       [-2.47882161,  0.16483045,  1.21204657],
       [ 0.12048109,  0.885755  ,  1.77209294],
       [-1.39199022,  0.87544244,  1.42260675]])
arr.sort(axis=1)
arr
array([[-1.44121737,  0.55483639,  0.75982535],
       [-2.06462226, -0.18207867,  1.14922865],
       [-2.47882161,  0.16483045,  1.21204657],
       [ 0.12048109,  0.885755  ,  1.77209294],
       [-1.39199022,  0.87544244,  1.42260675]])

NumPy defaults to sorting on the last axis, which you can specify by using \(-1\).

arr.sort(-1)
arr
array([[-1.44121737,  0.55483639,  0.75982535],
       [-2.06462226, -0.18207867,  1.14922865],
       [-2.47882161,  0.16483045,  1.21204657],
       [ 0.12048109,  0.885755  ,  1.77209294],
       [-1.39199022,  0.87544244,  1.42260675]])

Here we use sort to compute the \(5\%\) quantile of a vector of floats.

large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))]
-1.6779175381412759
0.05 * len(large_arr)
50.0

Set Logic

NumPy provides some methods to perform set membership operations.

np.unique()

This returns a sorted array of distinct values in an array.

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)
array(['Bob', 'Joe', 'Will'], dtype='<U4')
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)
array([1, 2, 3, 4])

np.in1d()

This method tests whether each element of a \(1\)-D array is also present in a second array.

values = np.array([6, 0, 0, 3, 2, 5, 6])
np.in1d(values, [2, 3, 6])
array([ True, False, False,  True,  True, False,  True])

File Input and Output

NumPy provides methods and functions to read and write data from disk.

np.save()

Save an array to a binary file in NumPy .npy format.

Automatically adds the .npy file extension.

arr = np.arange(10)
np.save('some_array', arr)

np.load()

This reads in a file in NumPy format.

np.load('some_array.npy')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.savetxt() and np.loadtxt()

These read from and write to plain text files.

np.savetxt('some_array.txt', arr)
np.loadtxt('some_array.txt')
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

np.savez()

This saves several arrays into a single file in uncompressed .npz format.

np.savez('array_archive.npz', a=arr, b=arr)
arch = np.load('array_archive.npz')
arch['b']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

This uses a compressed format.

np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)

Clean up …

!rm some_array.npy
!rm some_array.txt
!rm array_archive.npz
!rm arrays_compressed.npz

Linear Algebra

NumPy also offers a number of linear algebraic methods.

Because linear algebra is so fundamental to numeric analysis, many functions are in the main NumPy namespace. These include: dot, vdot, inner, outer, matmul, tensordot, einsum, einsum_path and kron.

Functions present in numpy.linalg are the following:

  • Matrix and vector products: multi_dot, matrix_power
  • Decompositions: cholesky, qr, svd
  • Matrix eigenvalues: eig, eigh, eigvals, eigvalsh
  • Norms and other numbers: norm, cond, det, matrix_rank, slogdet
  • Solving equations and inverting matrices: solve, tensorsolve, lstsq, inv, pinv, tensorinv
  • Exceptions: LinAlgError

We’ll look at just a couple of these.

.dot() and np.dot()

This gives you the dot product of two arrays.

It returns a result based on the shape of the input data.

So, given two arrays \(a\) and \(b\):

If Then
\(a\) and \(b\) are \(1\)-D (vector) inner product
\(a\) and \(b\) are \(2\)-D (matrix) marix multiplaction
\(a\) or \(b\) are \(0\)-D (scalar) equivalent to multiply()
\(a\) is \(N\)-D and \(b\) is \(1\)-D sum product over the last axis of \(a\) and \(b\)
\(a\) is \(N\)-D and \(b\) is \(M\)-D (where \(M >= 2\)) sum product over the last axis of \(a\) and the second-to-last axis of \(b\):
dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])

Let’s look at an example.

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
array([[1., 2., 3.],
       [4., 5., 6.]])
y
array([[ 6., 23.],
       [-1.,  7.],
       [ 8.,  9.]])
x.dot(y)
array([[ 28.,  64.],
       [ 67., 181.]])
np.dot(x, y)
array([[ 28.,  64.],
       [ 67., 181.]])

Here we get the dot product of a matrix and a vector.

This gets the sum product over the rows of x.

np.dot(x, np.ones(3))
array([ 6., 15.])

We can also use the the @ operator in NumPy for matrix multiplication.

x @ np.ones(3)
array([ 6., 15.])

np.linalg.inv()

Computes the inverse of a matrix.

from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
X
array([[ 0.00657667,  0.94280294, -0.38778753, -1.27782778,  1.11744125],
       [ 0.11611194,  0.38160734, -1.14998116,  1.1665558 , -0.41054575],
       [-1.36436756, -0.56012045,  1.13364082, -0.52002208,  1.65062773],
       [ 1.3858369 , -0.46831126,  0.07717   ,  0.31919749, -2.12694842],
       [ 1.54317047,  0.50438473, -0.52536121, -0.1092352 ,  1.24429125]])
inv(X)
array([[-0.39878244, -0.54773915, -0.46683041, -0.19885435,  0.45676946],
       [-5.62085111, -7.96286851, -9.36991848, -8.63364179,  0.0922498 ],
       [-5.14910918, -7.016499  , -7.4090965 , -6.95651434,  0.24652277],
       [-3.08196362, -3.16558159, -3.87235952, -3.84772023,  0.28306779],
       [ 0.32842762,  0.66673685,  0.90894543,  0.47139604,  0.32872684]])

numpy.linalg.qr()

Here we compute the QR factorization of a matrix.

That is, we decompose a matrix into an orthogonal matrix \(Q\) and an upper triangular matrix \(R\).

Q, R = qr(X)
Q
array([[-0.00264618, -0.72711785, -0.38911205, -0.50671846,  0.25123863],
       [-0.04671863, -0.28092006,  0.81095117,  0.03350105,  0.51003644],
       [ 0.54896491,  0.27144918, -0.31983195,  0.19796682,  0.69531974],
       [-0.55760328,  0.52508801, -0.05515592, -0.52941709,  0.36060578],
       [-0.62090779, -0.20731974, -0.29260103,  0.65010827,  0.25146753]])
R
array([[-2.48534567, -0.37985398,  0.96025135, -0.44675298,  1.33576317],
       [ 0.        , -1.29524778,  1.06218467,  0.65051628, -1.62392066],
       [ 0.        ,  0.        , -0.99479547,  1.62391435, -1.54243298],
       [ 0.        ,  0.        ,  0.        ,  0.34362934,  1.68175458],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.76497413]])

Random Number Generation

We have already used NumPy to generate random numbers.

Here’s another function in this domain.

np.random.normal()

samples = np.random.normal(size=(4, 4))
samples
array([[-0.19696082, -0.26304121,  0.6363801 ,  0.20438421],
       [-1.55414504,  0.9324663 ,  1.51351311,  0.42318343],
       [ 0.60250211, -1.26817884,  0.05698393, -0.36026335],
       [-0.67891835,  0.96903582, -0.7618861 ,  0.60346314]])

Let’s compare its speed to a more basic approach.

from random import normalvariate

N = 1000000
%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)
486 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30.4 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)