import numpy as np
NB: NumPy Functions
Programming for Data Science
NumPy provides a wide variety of functions and methods that take advantage of the multi-dimensional array data structure it introduces.
These are grouped into modules according to purpose, such as linalg
for linear algebra and math
for standard mathematical functions.
Here, we go over some of these tools, but our goal is more to give you a taste of what NumPy offers.
You will want to frequent the documentation to discover the full breadth of functions and methods.
More important, the goal is to introduce how NumPy thinks of functions and methods.
As mentioned earlier, NumPy offers an approach to applying functions to collections of data without resorting to looping, nesting, etc.
Let’s start with this.
Universal Functions
NumPy introduces the concept of a universal function.
A universal function, or ufunc
, is a function that performs element-wise operations on data in ndarrays.
Universal functions use the same pattern we saw with arithmetic operations, applying a function to elements.
You may think of these as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
Let’s look at some examples.
Let’s first define an array that we want to operate on.
= np.arange(10)
arr arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
To perform an element-wise square root operation on this, we might do the following:
from math import sqrt
for x in arr]) np.array([sqrt(x)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
However, NumPy offers a more concise approach:
np.sqrt(arr)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
It’s also faster:
%timeit np.array([sqrt(x) for x in arr])
%timeit np.sqrt(arr)
2.26 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
710 ns ± 3.97 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Here’s an element-wise exponent operation:
np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
Some vectorized functions can take two arrays.
= np.random.randn(8)
x = np.random.randn(8)
y x, y
(array([-0.02865278, -1.47829863, 0.820916 , 0.86317255, 0.85184202,
-0.52943286, 0.79710862, 0.43352724]),
array([ 0.79527927, 0.98962373, 0.30685449, -1.88671886, 0.24585142,
-0.79768499, -1.30316536, 0.84871309]))
This takes the maximum between the corrresponding elements in x
and y
:
np.maximum(x, y)
array([ 0.79527927, 0.98962373, 0.820916 , 0.86317255, 0.85184202,
-0.52943286, 0.79710862, 0.84871309])
Let’s create another array of random floats and apply the modulus function.
This returns two values.
= np.random.randn(7) * 5
arr_rand arr_rand
array([ 6.18511759, -2.17250696, 2.74951073, 3.81623877, -1.56924266,
8.44800585, 3.22968695])
= np.modf(arr_rand)
remainder, whole_part remainder
array([ 0.18511759, -0.17250696, 0.74951073, 0.81623877, -0.56924266,
0.44800585, 0.22968695])
whole_part
array([ 6., -2., 2., 3., -1., 8., 3.])
What happens when we try to get the square root of a negative number?
np.sqrt(arr_rand)
/tmp/ipykernel_946251/3237590434.py:1: RuntimeWarning: invalid value encountered in sqrt
np.sqrt(arr_rand)
array([2.48698966, nan, 1.65816487, 1.95351959, nan,
2.90654535, 1.79713298])
nan
is a special value in NumPy.
Vectorization
Beyond universal functions, NumPy expresses many kinds of data processing tasks as concise array-oriented expressions without writing loops.
This practice of replacing explicit loops with array expressions is referred to by some people as vectorization.
Vectorized array operations are often significantly faster than their pure Python equivalents.
They are also visually concise and elegant, although loops have the virtue of visualizing what’s under the hood in an algorithm.
Conditional Logic
np.where()
The np.where
function is a vectorized version of the ternary expression x if condition else y
.
Suppose we had a boolean array and two arrays of values and we wanted to pick between the two arrays based on the boolean.
We might do something like this:
= np.array([1.1, 1.2, 1.3, 1.4, 1.5])
xarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
yarr = np.array([True, False, True, True, False]) cond
= [(x if c else y) for x, y, c in zip(xarr, yarr, cond)] result
result
[1.1, 2.2, 1.3, 1.4, 2.5]
A quick note on zip()
.
This function takes two or more iterables (e.g. lists) of the same length and returns their grouped elements by index.
In effect, it binds a group of lists as columns in a table and returns their rows.
For example,
= [1, 2, 3, 4]
A = ['a', 'b', 'c', 'd']
B = [(a, b) for a, b in zip(A, B)]
C C
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
Here is the vectorized version:
= np.where(cond, xarr, yarr) result
result
array([1.1, 2.2, 1.3, 1.4, 2.5])
And it’s faster, too:
%timeit [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
%timeit np.where(cond, xarr, yarr)
1.42 µs ± 6.33 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
1.03 µs ± 8.67 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Here is another example, where we use a boolean expression that returns an array to conditionally operate on a \(2\)-D array.
= np.random.randn(4, 4)
arr arr
array([[-0.37372765, -0.07008402, -0.9804428 , -0.4962722 ],
[-0.48452222, -1.63528018, 2.02413952, 0.95929382],
[ 0.85871743, -0.32967862, -0.22090242, -0.95784047],
[ 0.31219406, 0.98487624, 1.00231528, -0.12983138]])
> 0 arr
array([[False, False, False, False],
[False, False, True, True],
[ True, False, False, False],
[ True, True, True, False]])
> 0, 2, -2) np.where(arr
array([[-2, -2, -2, -2],
[-2, -2, 2, 2],
[ 2, -2, -2, -2],
[ 2, 2, 2, -2]])
> 0, 2, arr) np.where(arr
array([[-0.37372765, -0.07008402, -0.9804428 , -0.4962722 ],
[-0.48452222, -1.63528018, 2. , 2. ],
[ 2. , -0.32967862, -0.22090242, -0.95784047],
[ 2. , 2. , 2. , -0.12983138]])
Mathematical and Statistical Methods
Statistical computations are aggregate functions applied to vectors within an array.
In a \(2\)-D array, they can be applied to rows or columns, i.e. axis \(0\) or axis \(1\).
To demonstrate, let’s create an array of random values.
We can think of it is a table of observations and random variables.
= np.random.randn(5, 4) arr
arr
array([[ 0.85124258, 1.26274113, -1.41403354, -1.92964977],
[-0.9302931 , 0.21807627, -0.19301232, -1.69534407],
[ 1.06757947, -0.34279466, -0.33276255, 1.68059036],
[ 0.38249005, 0.17731579, -0.46004143, -0.00862738],
[-1.7617584 , -0.17263553, 0.56952929, -0.50669218]])
.mean()
We can get the mean of all the numbers in the array by using a method or a function.
arr.mean()
-0.17690399891686628
np.mean(arr)
-0.17690399891686628
Here we get the mean for each row.
=1) arr.mean(axis
array([-0.3074249 , -0.6501433 , 0.51815315, 0.02278426, -0.4678892 ])
And here for each column.
=0) arr.mean(axis
array([-0.07814788, 0.2285406 , -0.36606411, -0.49194461])
.sum()
We can do the same with summing.
sum(arr), arr.sum() np.
(-3.5380799783373256, -3.5380799783373256)
sum(axis=0) arr.
array([-0.39073939, 1.142703 , -1.83032055, -2.45972304])
=1) arr.mean(axis
array([-0.3074249 , -0.6501433 , 0.51815315, 0.02278426, -0.4678892 ])
.cumsum()
And here is cumulative summing.
= np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr arr.cumsum()
array([ 0, 1, 3, 6, 10, 15, 21, 28])
= np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr arr
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
=0) arr.cumsum(axis
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]])
=1) arr.cumprod(axis
array([[ 0, 0, 0],
[ 3, 12, 60],
[ 6, 42, 336]])
Boolean Arrays
.sum()
Since booleans are \(0\)s and \(1\), we can sum them to get a total truth count.
= np.random.randn(100)
arr > 0).sum() # Number of positive values (arr
43
= np.array([False, False, True, False])
bools sum() bools.
1
.any()
and .all()
any(), bools.all() bools.
(True, False)
Sorting
NumPy provides several sorting options. Here we look at the most basic forms, the sort function np.sort()
and the method .sort()
.
Here’s the method option.
= np.random.randn(6)
arr arr
array([-0.13048996, -0.72462279, -0.06963874, 0.3639324 , 2.22785044,
-0.47912818])
arr.sort() arr
array([-0.72462279, -0.47912818, -0.13048996, -0.06963874, 0.3639324 ,
2.22785044])
Note that this performs an in-place operation — it changes the state of the object.
The method also does not return anything.
If you just want to get a sorted version of the vector without changing the state of the object, use the function.
This does return a value.
np.sort(arr)
array([-0.72462279, -0.47912818, -0.13048996, -0.06963874, 0.3639324 ,
2.22785044])
To reverse sort, you have to slice the result and walk backwords:
-1] arr[::
array([ 2.22785044, 0.3639324 , -0.06963874, -0.13048996, -0.47912818,
-0.72462279])
Here we sort a \(2\)-D array by row.
That is, each row is sorted.
= np.random.randn(5, 3)
arr arr
array([[ 0.75982535, -1.44121737, 0.55483639],
[-2.06462226, 1.14922865, -0.18207867],
[-2.47882161, 0.16483045, 1.21204657],
[ 0.12048109, 0.885755 , 1.77209294],
[-1.39199022, 0.87544244, 1.42260675]])
=1)
arr.sort(axis arr
array([[-1.44121737, 0.55483639, 0.75982535],
[-2.06462226, -0.18207867, 1.14922865],
[-2.47882161, 0.16483045, 1.21204657],
[ 0.12048109, 0.885755 , 1.77209294],
[-1.39199022, 0.87544244, 1.42260675]])
NumPy defaults to sorting on the last axis, which you can specify by using \(-1\).
-1)
arr.sort( arr
array([[-1.44121737, 0.55483639, 0.75982535],
[-2.06462226, -0.18207867, 1.14922865],
[-2.47882161, 0.16483045, 1.21204657],
[ 0.12048109, 0.885755 , 1.77209294],
[-1.39199022, 0.87544244, 1.42260675]])
Here we use sort to compute the \(5\%\) quantile of a vector of floats.
= np.random.randn(1000)
large_arr
large_arr.sort()int(0.05 * len(large_arr))] large_arr[
-1.6779175381412759
0.05 * len(large_arr)
50.0
Set Logic
NumPy provides some methods to perform set membership operations.
np.unique()
This returns a sorted array of distinct values in an array.
= np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names np.unique(names)
array(['Bob', 'Joe', 'Will'], dtype='<U4')
= np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
ints np.unique(ints)
array([1, 2, 3, 4])
np.in1d()
This method tests whether each element of a \(1\)-D array is also present in a second array.
= np.array([6, 0, 0, 3, 2, 5, 6])
values 2, 3, 6]) np.in1d(values, [
array([ True, False, False, True, True, False, True])
File Input and Output
NumPy provides methods and functions to read and write data from disk.
np.save()
Save an array to a binary file in NumPy .npy
format.
Automatically adds the .npy
file extension.
= np.arange(10)
arr 'some_array', arr) np.save(
np.load()
This reads in a file in NumPy format.
'some_array.npy') np.load(
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.savetxt()
and np.loadtxt()
These read from and write to plain text files.
'some_array.txt', arr) np.savetxt(
'some_array.txt') np.loadtxt(
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
np.savez()
This saves several arrays into a single file in uncompressed .npz
format.
'array_archive.npz', a=arr, b=arr) np.savez(
= np.load('array_archive.npz')
arch 'b'] arch[
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
This uses a compressed format.
'arrays_compressed.npz', a=arr, b=arr) np.savez_compressed(
Clean up …
!rm some_array.npy
!rm some_array.txt
!rm array_archive.npz
!rm arrays_compressed.npz
Linear Algebra
NumPy also offers a number of linear algebraic methods.
Because linear algebra is so fundamental to numeric analysis, many functions are in the main NumPy namespace. These include: dot
, vdot
, inner
, outer
, matmul
, tensordot
, einsum
, einsum_path
and kron
.
Functions present in numpy.linalg
are the following:
- Matrix and vector products:
multi_dot
,matrix_power
- Decompositions:
cholesky
,qr
,svd
- Matrix eigenvalues:
eig
,eigh
,eigvals
,eigvalsh
- Norms and other numbers:
norm
,cond
,det
,matrix_rank
,slogdet
- Solving equations and inverting matrices:
solve
,tensorsolve
,lstsq
,inv
,pinv
,tensorinv
- Exceptions:
LinAlgError
We’ll look at just a couple of these.
.dot()
and np.dot()
This gives you the dot product of two arrays.
It returns a result based on the shape of the input data.
So, given two arrays \(a\) and \(b\):
If | Then |
---|---|
\(a\) and \(b\) are \(1\)-D (vector) | inner product |
\(a\) and \(b\) are \(2\)-D (matrix) | marix multiplaction |
\(a\) or \(b\) are \(0\)-D (scalar) | equivalent to multiply() |
\(a\) is \(N\)-D and \(b\) is \(1\)-D | sum product over the last axis of \(a\) and \(b\) |
\(a\) is \(N\)-D and \(b\) is \(M\)-D (where \(M >= 2\)) |
sum product over the last axis of \(a\) and the second-to-last axis of \(b\): dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m]) |
Let’s look at an example.
= np.array([[1., 2., 3.], [4., 5., 6.]])
x = np.array([[6., 23.], [-1, 7], [8, 9]]) y
x
array([[1., 2., 3.],
[4., 5., 6.]])
y
array([[ 6., 23.],
[-1., 7.],
[ 8., 9.]])
x.dot(y)
array([[ 28., 64.],
[ 67., 181.]])
np.dot(x, y)
array([[ 28., 64.],
[ 67., 181.]])
Here we get the dot product of a matrix and a vector.
This gets the sum product over the rows of x
.
3)) np.dot(x, np.ones(
array([ 6., 15.])
We can also use the the @
operator in NumPy for matrix multiplication.
@ np.ones(3) x
array([ 6., 15.])
np.linalg.inv()
Computes the inverse of a matrix.
from numpy.linalg import inv, qr
= np.random.randn(5, 5)
X X
array([[ 0.00657667, 0.94280294, -0.38778753, -1.27782778, 1.11744125],
[ 0.11611194, 0.38160734, -1.14998116, 1.1665558 , -0.41054575],
[-1.36436756, -0.56012045, 1.13364082, -0.52002208, 1.65062773],
[ 1.3858369 , -0.46831126, 0.07717 , 0.31919749, -2.12694842],
[ 1.54317047, 0.50438473, -0.52536121, -0.1092352 , 1.24429125]])
inv(X)
array([[-0.39878244, -0.54773915, -0.46683041, -0.19885435, 0.45676946],
[-5.62085111, -7.96286851, -9.36991848, -8.63364179, 0.0922498 ],
[-5.14910918, -7.016499 , -7.4090965 , -6.95651434, 0.24652277],
[-3.08196362, -3.16558159, -3.87235952, -3.84772023, 0.28306779],
[ 0.32842762, 0.66673685, 0.90894543, 0.47139604, 0.32872684]])
numpy.linalg.qr()
Here we compute the QR factorization of a matrix.
That is, we decompose a matrix into an orthogonal matrix \(Q\) and an upper triangular matrix \(R\).
= qr(X) Q, R
Q
array([[-0.00264618, -0.72711785, -0.38911205, -0.50671846, 0.25123863],
[-0.04671863, -0.28092006, 0.81095117, 0.03350105, 0.51003644],
[ 0.54896491, 0.27144918, -0.31983195, 0.19796682, 0.69531974],
[-0.55760328, 0.52508801, -0.05515592, -0.52941709, 0.36060578],
[-0.62090779, -0.20731974, -0.29260103, 0.65010827, 0.25146753]])
R
array([[-2.48534567, -0.37985398, 0.96025135, -0.44675298, 1.33576317],
[ 0. , -1.29524778, 1.06218467, 0.65051628, -1.62392066],
[ 0. , 0. , -0.99479547, 1.62391435, -1.54243298],
[ 0. , 0. , 0. , 0.34362934, 1.68175458],
[ 0. , 0. , 0. , 0. , 0.76497413]])
Random Number Generation
We have already used NumPy to generate random numbers.
Here’s another function in this domain.
np.random.normal()
= np.random.normal(size=(4, 4))
samples samples
array([[-0.19696082, -0.26304121, 0.6363801 , 0.20438421],
[-1.55414504, 0.9324663 , 1.51351311, 0.42318343],
[ 0.60250211, -1.26817884, 0.05698393, -0.36026335],
[-0.67891835, 0.96903582, -0.7618861 , 0.60346314]])
Let’s compare its speed to a more basic approach.
from random import normalvariate
= 1000000
N %timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.normal(size=N)
486 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
30.4 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)