NB: Data Structures

Programming for Data Science

In contrast to primitive data types,data structures
organize types into structures that have certain properties,
such as order, mutability, and an addressing scheme.

Python provides the following primary data structures:

Lists

A list is an ordered sequence of items.

Each element of a list is associated with an integer that represents the order in which the element appears.

Lists are indexed with brackets [].

List elements are accessed by providing their order number in the brackets.

Lists are mutable, meaning you can modify them after they have been created.

They can contain mixed types.

Constructors

They can be constructed in several ways:

list1 = []
list2 = list()
list3 = "some string".split()
numbers = [1,2,3,4] 

Indexing

Indexing refers to how each element in a list is identified and addressed.

Python has what is called zero-based indexing.

This means that for a list mylist,

  • mylist[0] references the first element.
  • mylist[1] references the second element, etc.

For any list of length \(N\):

  • mylist[:n] will return the first \(n\) elements from index \(0\) to \(n-1\).
  • mylist[-n:] will return the last \(n\) elements from index \(N - n\) to \(N - 1\).

Let’s llok at an example.

Here we access first element of numbers:

numbers
[1, 2, 3, 4]
numbers[0]
1

And here we accesst the last element:

numbers[-1]
4

Note that the elements of a list are values that can be treated like individual variables:

numbers[0] + numbers[3]
5
type(numbers), type(numbers[0])
(list, int)

Slicing

Slicing refers to the process of getting a sublist from a list.

To do this, we use a colon : like so:

numbers[0:2]
[1, 2]
numbers[1:3]
[2, 3]
numbers[2:]
[3, 4]

Multiply lists by a scalar

A scalar is a single value number.

numbers * 2
[1, 2, 3, 4, 1, 2, 3, 4]

Concatenate lists with +

numbers2 = [30, 40, 50]
numbers + numbers2
[1, 2, 3, 4, 30, 40, 50]

Lists can mix types

myList = ['coconuts', 777, 7.25, 'Sir Robin', 80.0, True]
myList
['coconuts', 777, 7.25, 'Sir Robin', 80.0, True]

What happens if we multiply a list with strings?

myList * 2
['coconuts',
 777,
 7.25,
 'Sir Robin',
 80.0,
 True,
 'coconuts',
 777,
 7.25,
 'Sir Robin',
 80.0,
 True]

Lists can be nested

names = ['Darrell', 'Clayton', ['Billie', 'Arthur'], 'Samantha']
  names[0]
'Darrell'
names[2]
['Billie', 'Arthur']
names[2][0]
'Billie'

Strings are lists

Actually, they are list-like.

Like strings, their elements (i.e. individual characters) can be accessed by indexing and slicing.

However, like tuples, they are immutable and cannot be altered.

Here are some functions applicable to strings because they are lists.

len() Length

This is built-in length funciton tells us how many characters in the string.

It also applys to any list-like object, including strings, lists, dicts, sets, and dataframes.

my_new_tring = 'This is a string'
len(my_new_tring)
16

Indexing

Since strings are sequences in Python, each character of the string has a unique position that can be indexed.

Indexes are indicated by suffixed brackets, e.g. foo[]

This displays the first character of the string.

my_new_tring[0]
'T'

This displays the last character. Negatives count backwords.

my_new_tring[-1]
'g'

Slicing

We can used the colon to ‘slice’ strings (and lists)

The first four characters, i.e. index positions \(0\) to \(3\).

my_new_tring[0:4]
'This'

The same thing; \(0\) is implied.

my_new_tring[:4]
'This'

The fifth character and onwards until the end of the string.

my_new_tring[4:]
' is a string'

Note that it is not possible to re-assign elements of a string.

This is because Python strings are immutable.

my_new_tring[0] = 't'
TypeError: 'str' object does not support item assignment

Dictionaries dict

Dictionaries are like lists, but instead of implied numeric indexes, each element is addressed by a name.

We say that dictionaries are made up of key/value pairs.

Key names are unique. If you re-use a key, you overwrite its value.

Keys are usually strings, but don’t have to be. They can be numbers or tuples or expressions that evaluate to one of these.

As with lists, elements are indexed using brackets [].

However, dictionaries are constructed with braces {}.

Constructors

A list may be constructed by putting a sequence of colon-delimitted key/value pairs within a pair of braces:

dict1 = {
    'a': 1,
    'b': 2,
    'c': 3
}

Or we may call the dict() function:

dict2 = dict(x=55, y=29, z=99)

Note the absence of quotes around keys.

dict2
{'x': 55, 'y': 29, 'z': 99}

Here we demonstrate the keys can be strings, numbers, or tuples:

dict3 = {'A': 'foo', 99: 'bar', (1,2): 'baz'}
dict3
{'A': 'foo', 99: 'bar', (1, 2): 'baz'}

Note that values can be anything, just like the elements of a list.

dict4 = {'a': 1, 'b': dict3}
dict4
{'a': 1, 'b': {'A': 'foo', 99: 'bar', (1, 2): 'baz'}}

Retrieving a value

Just use a key as the index.

phonelist = {'Tom':123, 'Bob':456, 'Sam':897}
phonelist['Bob']
456

Getting a list of keys, values, or both.

Use the .keys(), .values(), or .items() methods.

phonelist.keys()
dict_keys(['Tom', 'Bob', 'Sam'])
phonelist.values()
dict_values([123, 456, 897])
phonelist.items()
dict_items([('Tom', 123), ('Bob', 456), ('Sam', 897)])
phonelist
{'Tom': 123, 'Bob': 456, 'Sam': 897}

Tuples

A tuple is like a list but with one big difference: a tuple is an immutable object.

You can’t change a tuple once it’s created.

Like lists, a tuple can contain any number of elements of any datatype.

Elements are accessed with brackets [] but constructed with parentheses ().

Constructors

Tuples may be created with comma-separated values, with or without parenthesis.

letters = 'a', 'b', 'c', 'd'
letters
('a', 'b', 'c', 'd')
numbers = (1, 2, 3, 4)
numbers
(1, 2, 3, 4)
len(numbers)
4

A single valued tuple must include a comma ,.

Otherwise, Python will interpret it as a simple expression.

tuple0 = (29)
tuple0, type(tuple0)
(29, int)
tuple1 = (29,)
tuple1, type(tuple1)
((29,), tuple)

You can’t re-assign a value to a tuple element.

They are immutable.

tuple1[0] = 5
TypeError: 'tuple' object does not support item assignment

Common functions and methods to all sequences

Lists, distionaries, tuples, and other data structures are all members of a more general class of sequences.

They therefore share some functions and operations.

For example:

len()
in
+ 
*

len()

len(numbers), len(tuple1), len(dict1)
(4, 1, 3)

Membership with in

'Sam' in phonelist
True

Sets

A set is an unordered collection of unique objects.

They are subject to set operations.

Constructors

Like dictionaries, sets are constructed with braces {}, but unlike dictionaries elements are not key value pairs separated by a colon :.

peanuts = {'snoopy','snoopy','woodstock'}
peanuts
{'snoopy', 'woodstock'}

Note the set is “de-duped.”

Sets also don’t have an index. This will break:

peanuts[0]
TypeError: 'set' object is not subscriptable

You can check if a value is in the set using in:

'snoopy' in peanuts
True

Set operations

To combine two set sets is to get their union:

set1 = {'python','R'}
set2 = {'R','SQL'}
set1.union(set2)
{'R', 'SQL', 'python'}

Note that the + operator is not overloaded in this case.

This fails:

set1 + set2
TypeError: unsupported operand type(s) for +: 'set' and 'set'

To get the set intersection, do this:

set1.intersection(set2)
{'R'}

And the difference, i.e. removing from one set the shared elements in another set:

set1.difference(set2)
{'python'}
set2.difference(set1)
{'SQL'}

Interestingly, the - operator is overloaded to compute the difference:

set1 - set2
{'python'}
set2 - set1
{'SQL'}

Ranges

A range is a sequence of integers, from start to stop by step. - The start point is zero by default.
- The step is one by default.
- The stop point is NOT included.

Ranges can be assigned to a variable.

rng = range(5)

More often, ranges are used in iterations (loops), which we will cover later.

for rn in rng:
    print(rn)
0
1
2
3
4

Another range:

rangy = range(1, 11, 2)
for rn in rangy:
    print(rn)
1
3
5
7
9

Collections and defaultdict

Very often you will want to build a dictionary from some data source, and add keys as they appear. The default dict type in Python, however, requires that the key exists before you can mutate it. The defaultdict type in the collections module solves this problem.

Here’s an example where we try to count the number word instances in a paragraph.

source_data = """
Lorem Ipsum is simply dummy text of the printing and typesetting industry. 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and scrambled it to make a type 
specimen book. It has survived not only five centuries, but also the leap 
into electronic typesetting, remaining essentially unchanged. It was 
popularised in the 1960s with the release of Letraset sheets containing 
Lorem Ipsum passages, and more recently with desktop publishing software 
like Aldus PageMaker including versions of Lorem Ipsum.
"""[1:-1].split()
source_data[:10]
['Lorem',
 'Ipsum',
 'is',
 'simply',
 'dummy',
 'text',
 'of',
 'the',
 'printing',
 'and']

Try with dict

words = {}
for word in source_data:
    words[word] += 1
KeyError: 'Lorem'

Use try and except

This is how we “trap” errors.

We’ll cover this later in the course.

for word in source_data:
    try:
        words[word] += 1
    except KeyError:
        words[word] = 1
print(words)
{'Lorem': 8, 'Ipsum': 6, 'is': 2, 'simply': 2, 'dummy': 4, 'text': 4, 'of': 8, 'the': 12, 'printing': 2, 'and': 6, 'typesetting': 2, 'industry.': 2, 'has': 4, 'been': 2, "industry's": 2, 'standard': 2, 'ever': 2, 'since': 2, '1500s,': 2, 'when': 2, 'an': 2, 'unknown': 2, 'printer': 2, 'took': 2, 'a': 4, 'galley': 2, 'type': 4, 'scrambled': 2, 'it': 2, 'to': 2, 'make': 2, 'specimen': 2, 'book.': 2, 'It': 4, 'survived': 2, 'not': 2, 'only': 2, 'five': 2, 'centuries,': 2, 'but': 2, 'also': 2, 'leap': 2, 'into': 2, 'electronic': 2, 'typesetting,': 2, 'remaining': 2, 'essentially': 2, 'unchanged.': 2, 'was': 2, 'popularised': 2, 'in': 2, '1960s': 2, 'with': 4, 'release': 2, 'Letraset': 2, 'sheets': 2, 'containing': 2, 'passages,': 2, 'more': 2, 'recently': 2, 'desktop': 2, 'publishing': 2, 'software': 2, 'like': 2, 'Aldus': 2, 'PageMaker': 2, 'including': 2, 'versions': 2, 'Ipsum.': 2}

Use .get()

for word in source_data:
    words[word] = words.get(word, 0) + 1

Use collections.defaultdict

from collections import defaultdict
words2 = defaultdict(int) # Note the data type must be set
for word in source_data:
    words2[word] += 1
print(words2)
defaultdict(<class 'int'>, {'Lorem': 4, 'Ipsum': 3, 'is': 1, 'simply': 1, 'dummy': 2, 'text': 2, 'of': 4, 'the': 6, 'printing': 1, 'and': 3, 'typesetting': 1, 'industry.': 1, 'has': 2, 'been': 1, "industry's": 1, 'standard': 1, 'ever': 1, 'since': 1, '1500s,': 1, 'when': 1, 'an': 1, 'unknown': 1, 'printer': 1, 'took': 1, 'a': 2, 'galley': 1, 'type': 2, 'scrambled': 1, 'it': 1, 'to': 1, 'make': 1, 'specimen': 1, 'book.': 1, 'It': 2, 'survived': 1, 'not': 1, 'only': 1, 'five': 1, 'centuries,': 1, 'but': 1, 'also': 1, 'leap': 1, 'into': 1, 'electronic': 1, 'typesetting,': 1, 'remaining': 1, 'essentially': 1, 'unchanged.': 1, 'was': 1, 'popularised': 1, 'in': 1, '1960s': 1, 'with': 2, 'release': 1, 'Letraset': 1, 'sheets': 1, 'containing': 1, 'passages,': 1, 'more': 1, 'recently': 1, 'desktop': 1, 'publishing': 1, 'software': 1, 'like': 1, 'Aldus': 1, 'PageMaker': 1, 'including': 1, 'versions': 1, 'Ipsum.': 1})