NB: NumPy Operations

Programming for Data Science

Element-wise Arithmetic

NumPy arrays can be transformed with arithmetic operations.

These are all element-wise operations.

Let’s start with a couple of \(2\)-D arrays.

import numpy as np
arr1 = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr1, arr2
(array([[1., 2., 3.],
        [4., 5., 6.]]),
 array([[ 0.,  4.,  1.],
        [ 7.,  2., 12.]]))

If we multiply these two arrays with *, NumPy multiplies each pair of elements that share the same index, or coordinate. Note that this is not matrix multiplication.

arr1 * arr2
array([[ 0.,  8.,  3.],
       [28., 10., 72.]])

You can think of it this way:

coordinate   arr1   arr2  arr1 * arr2 
0, 0         1.     0.    0.
0, 1         2.     4.    8.
0, 2         3.     1.    3.
...
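
To make the coordinate-by-coordinate picture concrete, here is a minimal sketch that builds the same product with an explicit loop. The `manual` array and the loop are for illustration only; in practice you would simply write `arr1 * arr2`.

```python
import numpy as np

arr1 = np.array([[1., 2., 3.], [4., 5., 6.]])
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

# Build the product one coordinate at a time (illustrative only).
manual = np.empty_like(arr1)
for i in range(arr1.shape[0]):
    for j in range(arr1.shape[1]):
        manual[i, j] = arr1[i, j] * arr2[i, j]

# The loop and the vectorized expression give the same array.
print(np.array_equal(manual, arr1 * arr2))
```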

Of course, this works for the other operations, too.

arr1 - arr2
array([[ 1., -2.,  2.],
       [-3.,  3., -6.]])
arr2 / arr1
array([[0.        , 2.        , 0.33333333],
       [1.75      , 0.4       , 2.        ]])
1 / arr1
array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])
arr1 ** arr2
array([[1.00000000e+00, 1.60000000e+01, 3.00000000e+00],
       [1.63840000e+04, 2.50000000e+01, 2.17678234e+09]])
arr2 ** 0.5
array([[0.        , 2.        , 1.        ],
       [2.64575131, 1.41421356, 3.46410162]])
arr2 > arr1
array([[False,  True, False],
       [ True, False,  True]])

Broadcasting

What happens when you try to perform an element-wise operation on two arrays of different shape?

When the shapes are compatible, NumPy stretches the lower-dimensional array across the missing dimensions so that the operation can take place.

This is called broadcasting.

Let’s look at an example.

foo = np.ones((6,4))
foo
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

If we multiply it by \(5\), the scalar behaves as if it were an array of the same shape as foo, with the value \(5\) “broadcast” to populate the entire array.

foo * 5
array([[5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.],
       [5., 5., 5., 5.]])

We actually saw this already when we looked at slices.

If we want to multiply a 2D array by a vector, the vector is broadcast across every row to become a 2D array.

foo * np.array([5, 10, 6, 8])
array([[ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.],
       [ 5., 10.,  6.,  8.]])
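
One way to see what the vector is stretched into is `np.broadcast_to`, which returns a read-only view with the target shape. This is a sketch for inspection only; the names `vec` and `expanded` are introduced here and are not part of the original example.

```python
import numpy as np

foo = np.ones((6, 4))
vec = np.array([5, 10, 6, 8])

# Show the vector as the (6, 4) array it behaves like during broadcasting.
expanded = np.broadcast_to(vec, foo.shape)

# Multiplying by the vector matches multiplying by the expanded view.
result = foo * vec
same = np.array_equal(result, foo * expanded)
print(expanded.shape, same)
```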

Note that NumPy can’t always make the adjustment. Here the vector’s length (2) matches neither the length of the array’s last dimension (4) nor 1:

foo * np.array([5, 10])
ValueError: operands could not be broadcast together with shapes (6,4) (2,) 
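
As a rough sketch of the rule at work: shapes are compared from the last dimension backwards, and each pair of lengths must be equal or one of them must be 1. The column vector `col` below is a hypothetical example, not from the original notebook; its shape (6, 1) is compatible with (6, 4) where the shape (2,) is not.

```python
import numpy as np

foo = np.ones((6, 4))

# Shape (2,) against (6, 4): 2 vs 4 in the last dimension, so it fails.
try:
    foo * np.array([5, 10])
    failed = False
except ValueError:
    failed = True

# Shape (6, 1) against (6, 4): 1 stretches to 4, and 6 matches 6.
col = np.arange(6).reshape(6, 1)
scaled = foo * col  # row i is filled with the value i
print(failed, scaled.shape)
```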

Boolean Indexing

Another crucial topic in NumPy is boolean indexing.

In brief, you can pass a boolean array to the array indexer (i.e. the [] suffix) and it will return only those elements at the positions where the boolean array is True.

This is a technique we will use frequently in Pandas and R.

Let’s assume that we have two related arrays:

  • names, which holds the name associated with each row, or observation, of a table.
  • data, which holds the values of each feature for every observation.

There are \(7\) observations and \(4\) features.

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
data = np.random.randn(7, 4)
data
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
       [ 1.57167643, -0.50688096,  0.78120542,  0.22558685],
       [-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
       [-0.33496411, -0.25575133, -0.61952407,  0.88984652],
       [-0.59001244, -2.23411426, -0.31123889, -0.86358338],
       [ 1.23662366, -0.90041871,  0.63348956, -0.58677799],
       [-3.09362317,  0.92042332,  0.53013723,  0.24224835]])

A comparison operation for an array returns an array of booleans.

Let’s see which names are 'Bob':

names == 'Bob'
array([ True, False, False,  True, False, False, False])

Now, this boolean expression can be passed to the indexer of data to select the matching rows:

data[names == 'Bob']
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
       [-0.33496411, -0.25575133, -0.61952407,  0.88984652]])
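
It may help to see which row positions the mask picks out. The sketch below uses `np.flatnonzero` to turn the mask into integer positions and checks that indexing by position gives the same rows. The seeded generator is an assumption made here so the example is reproducible; its values differ from the random data shown above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility
data = rng.standard_normal((7, 4))

mask = names == 'Bob'
rows = np.flatnonzero(mask)  # integer positions where mask is True

by_mask = data[mask]
by_position = data[rows]
print(rows, np.array_equal(by_mask, by_position))
```

The mask has one entry per row of data (length 7), which is why it lines up with the first axis.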

Along the second axis, we can use a slice or integer to select data.

data[names == 'Bob', 2:]
array([[-1.89069483, -0.9007478 ],
       [-0.61952407,  0.88984652]])
data[names == 'Bob', 3]
array([-0.9007478 ,  0.88984652])

If you know SQL, the first of these selections is like the query:

SELECT col3, col4 FROM data WHERE name = 'Bob'

Negation

Here are some examples of negated boolean operations being applied.

bix = names != 'Bob'
bix
array([False,  True,  True, False,  True,  True,  True])
data[bix]
array([[ 1.57167643, -0.50688096,  0.78120542,  0.22558685],
       [-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
       [-0.59001244, -2.23411426, -0.31123889, -0.86358338],
       [ 1.23662366, -0.90041871,  0.63348956, -0.58677799],
       [-3.09362317,  0.92042332,  0.53013723,  0.24224835]])
data[~bix] # Back to Bob
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
       [-0.33496411, -0.25575133, -0.61952407,  0.88984652]])
data[~(names == 'Bob')]
array([[ 1.57167643, -0.50688096,  0.78120542,  0.22558685],
       [-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
       [-0.59001244, -2.23411426, -0.31123889, -0.86358338],
       [ 1.23662366, -0.90041871,  0.63348956, -0.58677799],
       [-3.09362317,  0.92042332,  0.53013723,  0.24224835]])

Note that we don’t use Python’s not keyword; instead we use the tilde ~ to negate (flip) each value in a boolean array.

Nor do we use the keywords and and or; instead we use the operators & and |.

Also, expressions joined by these operators must be in parentheses, because & and | bind more tightly than comparisons such as ==.

mask = (names == 'Bob') | (names == 'Will')
mask
data[mask]
array([[-1.0441337 , -0.8075191 , -1.89069483, -0.9007478 ],
       [-1.45769989, -0.49824512, -1.2056539 , -0.43596557],
       [-0.33496411, -0.25575133, -0.61952407,  0.88984652],
       [-0.59001244, -2.23411426, -0.31123889, -0.86358338]])
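
To illustrate why the keywords are not used, here is a small sketch showing that `or` raises an error on arrays while `|` works element-wise. The `raised` flag is only there to make the outcome visible.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

# Python's `or` needs a single truth value for each whole array,
# which NumPy refuses to guess for arrays with more than one element.
try:
    (names == 'Bob') or (names == 'Will')
    raised = False
except ValueError:
    raised = True

# The element-wise operator, with each comparison parenthesized.
mask = (names == 'Bob') | (names == 'Will')
print(raised, mask)
```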

We can also assign to the elements selected by a boolean array. Here, every negative value is set to 0:

data[data < 0] = 0
data
array([[0.        , 0.        , 0.        , 0.        ],
       [1.57167643, 0.        , 0.78120542, 0.22558685],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.88984652],
       [0.        , 0.        , 0.        , 0.        ],
       [1.23662366, 0.        , 0.63348956, 0.        ],
       [0.        , 0.92042332, 0.53013723, 0.24224835]])
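
The assignment above changes data in place. If you want to keep the original array, one possible alternative (not used in the original notebook) is `np.where`, which builds a new array. The `sample` values below are hypothetical.

```python
import numpy as np

sample = np.array([[-1.5, 2.0], [0.5, -0.25]])  # hypothetical values

# Where the condition is True take 0, otherwise keep the original value.
clipped = np.where(sample < 0, 0, sample)

print(clipped)
print(sample)  # unchanged
```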

We can also assign to whole rows selected by a boolean array, just as we did with slices.

data[names != 'Joe'] = 7
data
array([[7.        , 7.        , 7.        , 7.        ],
       [1.57167643, 0.        , 0.78120542, 0.22558685],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [7.        , 7.        , 7.        , 7.        ],
       [1.23662366, 0.        , 0.63348956, 0.        ],
       [0.        , 0.92042332, 0.53013723, 0.24224835]])