import numpy as np
NB: NumPy Operations
Programming for Data Science
Element-wise Arithmetic
NumPy arrays can be transformed with arithmetic operations.
These are all element-wise operations.
Let’s start with a couple of \(2\)-D arrays.
arr1 = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr1, arr2
(array([[1., 2., 3.],
[4., 5., 6.]]),
array([[ 0., 4., 1.],
[ 7., 2., 12.]]))
If we multiply these two arrays with *, NumPy multiplies each pair of cells that share the same index or coordinate. Note that this is not matrix multiplication.
arr1 * arr2
array([[ 0., 8., 3.],
[28., 10., 72.]])
You can think of it this way:
coordinate   arr1   arr2   arr1 * arr2
0, 0         1.     0.     0.
0, 1         2.     4.     8.
0, 2         3.     1.     3.
...
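The coordinate table can be reproduced with an explicit loop. The following is a minimal sketch, not how NumPy works internally; the name `manual` is introduced here purely for illustration.

```python
import numpy as np

arr1 = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

# Build the product one coordinate at a time (illustrative only).
manual = np.empty_like(arr1)
for i in range(arr1.shape[0]):
    for j in range(arr1.shape[1]):
        manual[i, j] = arr1[i, j] * arr2[i, j]

# The loop and the vectorized expression produce the same array.
same = np.array_equal(manual, arr1 * arr2)
print(same)
```

The vectorized form `arr1 * arr2` is shorter and usually much faster, so the loop is useful only as a mental model.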
Of course, this works for the other operations, too.
arr1 - arr2
array([[ 1., -2., 2.],
[-3., 3., -6.]])
arr2 / arr1
array([[0. , 2. , 0.33333333],
[1.75 , 0.4 , 2. ]])
1 / arr1
array([[1. , 0.5 , 0.33333333],
[0.25 , 0.2 , 0.16666667]])
arr1 ** arr2
array([[1.00000000e+00, 1.60000000e+01, 3.00000000e+00],
[1.63840000e+04, 2.50000000e+01, 2.17678234e+09]])
arr2 ** 0.5
array([[0. , 2. , 1. ],
[2.64575131, 1.41421356, 3.46410162]])
arr2 > arr1
array([[False, True, False],
[ True, False, True]])
Broadcasting
What happens when you try to perform an element-wise operation on two arrays of different shape?
If the shapes are compatible, NumPy treats the lower-dimensional array as if it were stretched to the shape of the higher-dimensional one so that the operation can take place.
This is called broadcasting.
Let’s look at an example.
foo = np.ones((6, 4))
foo
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
If we multiply it by \(5\), the scalar is treated as an array of the same shape as foo, with the value \(5\) “broadcast” to populate the entire array.
foo * 5
array([[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.],
[5., 5., 5., 5.]])
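To make the scalar case concrete, the expansion can be written out by hand with `np.broadcast_to`. This is a sketch for illustration; in ordinary code you would just write `foo * 5`, and the names `expanded` and `explicit` are introduced here.

```python
import numpy as np

foo = np.ones((6, 4))

# Expand the scalar to foo's shape explicitly.
# broadcast_to returns a read-only view, not a freshly filled array.
expanded = np.broadcast_to(5, foo.shape)

# Multiplying by the expanded array matches multiplying by the scalar.
explicit = foo * expanded
print(expanded.shape)
print(np.array_equal(explicit, foo * 5))
```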
We actually saw this already when we looked at slices.
If we want to multiply an array by a vector, the vector is broadcast to become a 2D array.
foo * np.array([5, 10, 6, 8])
array([[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.],
[ 5., 10., 6., 8.]])
Note that NumPy can’t always make the adjustment. Here, a vector of length \(2\) cannot be stretched across rows of length \(4\):
foo * np.array([5, 10])
ValueError: operands could not be broadcast together with shapes (6,4) (2,)
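NumPy compares shapes starting from the last dimension, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below uses a hypothetical vector `row_weights` with one value per row of `foo`: it fails as a plain vector but works once reshaped into a column.

```python
import numpy as np

foo = np.ones((6, 4))
row_weights = np.array([1, 2, 3, 4, 5, 6])  # hypothetical: one weight per row

# (6, 4) against (6,): the trailing dimensions 4 and 6 do not match.
try:
    foo * row_weights
    failed = False
except ValueError:
    failed = True

# Adding a trailing axis gives shape (6, 1), which is compatible with (6, 4).
col = row_weights[:, np.newaxis]
scaled = foo * col
print(failed, col.shape, scaled.shape)
```

Each row of `scaled` is filled with that row's weight, which is the row-wise counterpart of the column-wise example with `np.array([5, 10, 6, 8])`.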
Boolean Indexing
Another crucial topic in NumPy is boolean indexing.
In brief, you can pass a boolean array to the array indexer (i.e. the [] suffix) and it will return only those cells for which the boolean array is True.
This is a technique we will use frequently in Pandas and R.
Let’s assume that we have two related arrays:
names, which holds the name associated with each row, or observation, of a table.
data, which holds the values of each feature for those observations.
There are \(7\) observations and \(4\) features.
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
data = np.random.randn(7, 4)
data
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
[ 1.57167643, -0.50688096, 0.78120542, 0.22558685],
[-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
[-0.33496411, -0.25575133, -0.61952407, 0.88984652],
[-0.59001244, -2.23411426, -0.31123889, -0.86358338],
[ 1.23662366, -0.90041871, 0.63348956, -0.58677799],
[-3.09362317, 0.92042332, 0.53013723, 0.24224835]])
A comparison operation for an array returns an array of booleans.
Let’s see which names are 'Bob':
names == 'Bob'
array([ True, False, False, True, False, False, False])
Now, this boolean expression can be passed to the indexer of data to select the matching rows:
data[names == 'Bob']
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
[-0.33496411, -0.25575133, -0.61952407, 0.88984652]])
Along the second axis, we can use a slice or integer to select data.
data[names == 'Bob', 2:]
array([[-1.89069483, -0.9007478 ],
[-0.61952407, 0.88984652]])
data[names == 'Bob', 3]
array([-0.9007478 , 0.88984652])
If you know SQL, the selection with the slice 2: is like the query:
SELECT col3, col4 FROM data WHERE name = 'Bob'
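A boolean mask such as `names == 'Bob'` selects the same rows as the integer positions where the mask is True. The sketch below checks this with stand-in values for `data` (an assumption, so the result is reproducible) and uses `np.flatnonzero` to recover those positions.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)  # stand-in values, not the random data above

mask = names == 'Bob'

# Positions where the mask is True.
rows = np.flatnonzero(mask)

# Indexing by mask and indexing by those positions give the same rows.
by_mask = data[mask]
by_rows = data[rows]
print(rows)
print(np.array_equal(by_mask, by_rows))
```

The mask must have the same length as the axis it indexes, here the \(7\) rows of `data`.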
Negation
Here are some examples of negated boolean operations being applied.
bix = names != 'Bob'
bix
array([False, True, True, False, True, True, True])
data[bix]
array([[ 1.57167643, -0.50688096, 0.78120542, 0.22558685],
[-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
[-0.59001244, -2.23411426, -0.31123889, -0.86358338],
[ 1.23662366, -0.90041871, 0.63348956, -0.58677799],
[-3.09362317, 0.92042332, 0.53013723, 0.24224835]])
data[~bix] # Back to Bob
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
[-0.33496411, -0.25575133, -0.61952407, 0.88984652]])
data[~(names == 'Bob')]
array([[ 1.57167643, -0.50688096, 0.78120542, 0.22558685],
[-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
[-0.59001244, -2.23411426, -0.31123889, -0.86358338],
[ 1.23662366, -0.90041871, 0.63348956, -0.58677799],
[-3.09362317, 0.92042332, 0.53013723, 0.24224835]])
Note that we don’t use not but instead the tilde ~ sign to negate (flip) each value of a boolean array.
Nor do we use and and or; instead we use & and |.
Also, expressions joined by these operators must be in parentheses.
mask = (names == 'Bob') | (names == 'Will')
mask
data[mask]
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
[-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
[-0.33496411, -0.25575133, -0.61952407, 0.88984652],
[-0.59001244, -2.23411426, -0.31123889, -0.86358338]])
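The reason for avoiding and and or can be seen by trying them. In this sketch, `or` asks for a single truth value for a whole array, which NumPy refuses to give, while `|` combines the arrays element by element. The variables `raised` and `same` are introduced here for illustration.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# Python's `or` needs one truth value per operand; an array with
# several elements has no single truth value, so this raises ValueError.
try:
    (names == 'Bob') or (names == 'Will')
    raised = False
except ValueError:
    raised = True

# The element-wise operator works, and matches np.logical_or.
mask = (names == 'Bob') | (names == 'Will')
same = np.array_equal(mask, np.logical_or(names == 'Bob', names == 'Will'))
print(raised, same)
print(mask)
```

The parentheses matter because | binds more tightly than ==, so without them Python would try to evaluate 'Bob' | names first.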
We can also build the mask from the data itself, for example to set every negative value to \(0\):
data[data < 0] = 0
data
array([[0. , 0. , 0. , 0. ],
[1.57167643, 0. , 0.78120542, 0.22558685],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0.88984652],
[0. , 0. , 0. , 0. ],
[1.23662366, 0. , 0.63348956, 0. ],
[0. , 0.92042332, 0.53013723, 0.24224835]])
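Assigning through a mask changes the array in place. If the original values are still needed, one alternative is `np.where`, which builds a new array instead. The sketch below assumes a seeded generator (the seed is arbitrary) so that it is reproducible; it does not use the random values shown above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
data = rng.standard_normal((7, 4))
original = data.copy()

# Non-destructive: take 0 where the value is negative, else the value itself.
clipped = np.where(data < 0, 0, data)

# Destructive: overwrite the negative cells of data directly.
data[data < 0] = 0

print(np.array_equal(clipped, data))
print((original < 0).any(), (data < 0).any())
```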
And we can alter data with boolean indexing, just as we did with slices.
data[names != 'Joe'] = 7
data
array([[7. , 7. , 7. , 7. ],
[1.57167643, 0. , 0.78120542, 0.22558685],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[7. , 7. , 7. , 7. ],
[1.23662366, 0. , 0.63348956, 0. ],
[0. , 0.92042332, 0.53013723, 0.24224835]])
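One detail worth checking: reading with a boolean mask returns a copy, whereas assigning through a mask writes into the original array. A minimal sketch with a zero-filled stand-in for `data` (an assumption, to keep the result easy to verify):

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.zeros((7, 4))  # stand-in values

# Reading with a mask gives a copy, so editing it leaves data alone.
subset = data[names == 'Joe']
subset[:] = 1
untouched = not data.any()

# Assigning through the mask modifies data itself.
data[names == 'Joe'] = 1
print(untouched, data.sum())
```

Here three rows of four cells belong to Joe, so the final sum is 12.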