= []
list1 = list()
list2 = "some string".split()
list3 = [1,2,3,4] numbers
NB: Data Structures
Programming for Data Science
In contrast to primitive data types,data structures
organize types into structures that have certain properties,
such as order, mutability, and an addressing scheme.
Python provides the following primary data structures:
- lists
- dictionaries
- tuples
- sets
- ranges
Lists
A list is an ordered sequence of items.
Each element of a list is associated with an integer that represents the order in which the element appears.
Lists are indexed with brackets []
.
List elements are accessed by providing their order number in the brackets.
Lists are mutable, meaning you can modify them after they have been created.
They can contain mixed types.
Constructors
They can be constructed in several ways:
Indexing
Indexing refers to how each element in a list is identified and addressed.
Python has what is called zero-based indexing.
This means that for a list mylist
,
mylist[0]
references the first element.mylist[1]
references the second element, etc.
For any list of length \(N\):
mylist[:n]
will return the first \(n\) elements from index \(0\) to \(n-1\).
mylist[-n:]
will return the last \(n\) elements from index \(N - n\) to \(N - 1\).
Let’s llok at an example.
Here we access first element of numbers
:
numbers
[1, 2, 3, 4]
0] numbers[
1
And here we accesst the last element:
-1] numbers[
4
Note that the elements of a list are values that can be treated like individual variables:
0] + numbers[3] numbers[
5
type(numbers), type(numbers[0])
(list, int)
Slicing
Slicing refers to the process of getting a sublist from a list.
To do this, we use a colon :
like so:
0:2] numbers[
[1, 2]
1:3] numbers[
[2, 3]
2:] numbers[
[3, 4]
Multiply lists by a scalar
A scalar is a single value number.
* 2 numbers
[1, 2, 3, 4, 1, 2, 3, 4]
Concatenate lists with +
= [30, 40, 50] numbers2
+ numbers2 numbers
[1, 2, 3, 4, 30, 40, 50]
Lists can mix types
= ['coconuts', 777, 7.25, 'Sir Robin', 80.0, True] myList
myList
['coconuts', 777, 7.25, 'Sir Robin', 80.0, True]
What happens if we multiply a list with strings?
* 2 myList
['coconuts',
777,
7.25,
'Sir Robin',
80.0,
True,
'coconuts',
777,
7.25,
'Sir Robin',
80.0,
True]
Lists can be nested
= ['Darrell', 'Clayton', ['Billie', 'Arthur'], 'Samantha'] names
0] names[
'Darrell'
2] names[
['Billie', 'Arthur']
2][0] names[
'Billie'
Strings are lists
Actually, they are list-like.
Like strings, their elements (i.e. individual characters) can be accessed by indexing and slicing.
However, like tuples, they are immutable and cannot be altered.
Here are some functions applicable to strings because they are lists.
len()
Length
This is built-in length funciton tells us how many characters in the string.
It also applys to any list-like object, including strings, lists, dicts, sets, and dataframes.
= 'This is a string' my_new_tring
len(my_new_tring)
16
Indexing
Since strings are sequences in Python, each character of the string has a unique position that can be indexed.
Indexes are indicated by suffixed brackets, e.g. foo[]
This displays the first character of the string.
0] my_new_tring[
'T'
This displays the last character. Negatives count backwords.
-1] my_new_tring[
'g'
Slicing
We can used the colon to ‘slice’ strings (and lists)
The first four characters, i.e. index positions \(0\) to \(3\).
0:4] my_new_tring[
'This'
The same thing; \(0\) is implied.
4] my_new_tring[:
'This'
The fifth character and onwards until the end of the string.
4:] my_new_tring[
' is a string'
Note that it is not possible to re-assign elements of a string.
This is because Python strings are immutable.
0] = 't' my_new_tring[
TypeError: 'str' object does not support item assignment
Dictionaries dict
Dictionaries are like lists, but instead of implied numeric indexes, each element is addressed by a name.
We say that dictionaries are made up of key/value pairs.
Key names are unique. If you re-use a key, you overwrite its value.
Keys are usually strings, but don’t have to be. They can be numbers or tuples or expressions that evaluate to one of these.
As with lists, elements are indexed using brackets []
.
However, dictionaries are constructed with braces {}
.
Constructors
A list may be constructed by putting a sequence of colon-delimitted key/value pairs within a pair of braces:
= {
dict1 'a': 1,
'b': 2,
'c': 3
}
Or we may call the dict()
function:
= dict(x=55, y=29, z=99) dict2
Note the absence of quotes around keys.
dict2
{'x': 55, 'y': 29, 'z': 99}
Here we demonstrate the keys can be strings, numbers, or tuples:
= {'A': 'foo', 99: 'bar', (1,2): 'baz'} dict3
dict3
{'A': 'foo', 99: 'bar', (1, 2): 'baz'}
Note that values can be anything, just like the elements of a list.
= {'a': 1, 'b': dict3} dict4
dict4
{'a': 1, 'b': {'A': 'foo', 99: 'bar', (1, 2): 'baz'}}
Retrieving a value
Just use a key as the index.
= {'Tom':123, 'Bob':456, 'Sam':897} phonelist
'Bob'] phonelist[
456
Getting a list of keys, values, or both.
Use the .keys()
, .values()
, or .items()
methods.
phonelist.keys()
dict_keys(['Tom', 'Bob', 'Sam'])
phonelist.values()
dict_values([123, 456, 897])
phonelist.items()
dict_items([('Tom', 123), ('Bob', 456), ('Sam', 897)])
phonelist
{'Tom': 123, 'Bob': 456, 'Sam': 897}
Tuples
A tuple is like a list but with one big difference: a tuple is an immutable object.
You can’t change a tuple once it’s created.
Like lists, a tuple can contain any number of elements of any datatype.
Elements are accessed with brackets []
but constructed with parentheses ()
.
Constructors
Tuples may be created with comma-separated values, with or without parenthesis.
= 'a', 'b', 'c', 'd' letters
letters
('a', 'b', 'c', 'd')
= (1, 2, 3, 4) numbers
numbers
(1, 2, 3, 4)
len(numbers)
4
A single valued tuple must include a comma ,
.
Otherwise, Python will interpret it as a simple expression.
= (29) tuple0
type(tuple0) tuple0,
(29, int)
= (29,) tuple1
type(tuple1) tuple1,
((29,), tuple)
You can’t re-assign a value to a tuple element.
They are immutable.
0] = 5 tuple1[
TypeError: 'tuple' object does not support item assignment
Common functions and methods to all sequences
Lists, distionaries, tuples, and other data structures are all members of a more general class of sequences.
They therefore share some functions and operations.
For example:
len()
in
+
*
len()
len(numbers), len(tuple1), len(dict1)
(4, 1, 3)
Membership with in
'Sam' in phonelist
True
Sets
A set
is an unordered collection of unique objects.
They are subject to set operations.
Constructors
Like dictionaries, sets are constructed with braces {}
, but unlike dictionaries elements are not key value pairs separated by a colon :
.
= {'snoopy','snoopy','woodstock'} peanuts
peanuts
{'snoopy', 'woodstock'}
Note the set is “de-duped.”
Sets also don’t have an index. This will break:
0] peanuts[
TypeError: 'set' object is not subscriptable
You can check if a value is in the set using in
:
'snoopy' in peanuts
True
Set operations
To combine two set sets is to get their union:
= {'python','R'}
set1 = {'R','SQL'} set2
set1.union(set2)
{'R', 'SQL', 'python'}
Note that the +
operator is not overloaded in this case.
This fails:
+ set2 set1
TypeError: unsupported operand type(s) for +: 'set' and 'set'
To get the set intersection, do this:
set1.intersection(set2)
{'R'}
And the difference, i.e. removing from one set the shared elements in another set:
set1.difference(set2)
{'python'}
set2.difference(set1)
{'SQL'}
Interestingly, the -
operator is overloaded to compute the difference:
- set2 set1
{'python'}
- set1 set2
{'SQL'}
Ranges
A range is a sequence of integers, from start
to stop
by step
. - The start
point is zero by default.
- The step
is one by default.
- The stop
point is NOT included.
Ranges can be assigned to a variable.
= range(5) rng
More often, ranges are used in iterations (loops), which we will cover later.
for rn in rng:
print(rn)
0
1
2
3
4
Another range:
= range(1, 11, 2)
rangy for rn in rangy:
print(rn)
1
3
5
7
9
Collections and defaultdict
Very often you will want to build a dictionary from some data source, and add keys as they appear. The default dict
type in Python, however, requires that the key exists before you can mutate it. The defaultdict
type in the collections
module solves this problem.
Here’s an example where we try to count the number word instances in a paragraph.
= """
source_data Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,
when an unknown printer took a galley of type and scrambled it to make a type
specimen book. It has survived not only five centuries, but also the leap
into electronic typesetting, remaining essentially unchanged. It was
popularised in the 1960s with the release of Letraset sheets containing
Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.
"""[1:-1].split()
10] source_data[:
['Lorem',
'Ipsum',
'is',
'simply',
'dummy',
'text',
'of',
'the',
'printing',
'and']
Try with dict
= {}
words for word in source_data:
+= 1 words[word]
KeyError: 'Lorem'
Use try
and except
This is how we “trap” errors.
We’ll cover this later in the course.
for word in source_data:
try:
+= 1
words[word] except KeyError:
= 1 words[word]
print(words)
{'Lorem': 8, 'Ipsum': 6, 'is': 2, 'simply': 2, 'dummy': 4, 'text': 4, 'of': 8, 'the': 12, 'printing': 2, 'and': 6, 'typesetting': 2, 'industry.': 2, 'has': 4, 'been': 2, "industry's": 2, 'standard': 2, 'ever': 2, 'since': 2, '1500s,': 2, 'when': 2, 'an': 2, 'unknown': 2, 'printer': 2, 'took': 2, 'a': 4, 'galley': 2, 'type': 4, 'scrambled': 2, 'it': 2, 'to': 2, 'make': 2, 'specimen': 2, 'book.': 2, 'It': 4, 'survived': 2, 'not': 2, 'only': 2, 'five': 2, 'centuries,': 2, 'but': 2, 'also': 2, 'leap': 2, 'into': 2, 'electronic': 2, 'typesetting,': 2, 'remaining': 2, 'essentially': 2, 'unchanged.': 2, 'was': 2, 'popularised': 2, 'in': 2, '1960s': 2, 'with': 4, 'release': 2, 'Letraset': 2, 'sheets': 2, 'containing': 2, 'passages,': 2, 'more': 2, 'recently': 2, 'desktop': 2, 'publishing': 2, 'software': 2, 'like': 2, 'Aldus': 2, 'PageMaker': 2, 'including': 2, 'versions': 2, 'Ipsum.': 2}
Use .get()
for word in source_data:
= words.get(word, 0) + 1 words[word]
Use collections.defaultdict
from collections import defaultdict
= defaultdict(int) # Note the data type must be set words2
for word in source_data:
+= 1 words2[word]
print(words2)
defaultdict(<class 'int'>, {'Lorem': 4, 'Ipsum': 3, 'is': 1, 'simply': 1, 'dummy': 2, 'text': 2, 'of': 4, 'the': 6, 'printing': 1, 'and': 3, 'typesetting': 1, 'industry.': 1, 'has': 2, 'been': 1, "industry's": 1, 'standard': 1, 'ever': 1, 'since': 1, '1500s,': 1, 'when': 1, 'an': 1, 'unknown': 1, 'printer': 1, 'took': 1, 'a': 2, 'galley': 1, 'type': 2, 'scrambled': 1, 'it': 1, 'to': 1, 'make': 1, 'specimen': 1, 'book.': 1, 'It': 2, 'survived': 1, 'not': 1, 'only': 1, 'five': 1, 'centuries,': 1, 'but': 1, 'also': 1, 'leap': 1, 'into': 1, 'electronic': 1, 'typesetting,': 1, 'remaining': 1, 'essentially': 1, 'unchanged.': 1, 'was': 1, 'popularised': 1, 'in': 1, '1960s': 1, 'with': 2, 'release': 1, 'Letraset': 1, 'sheets': 1, 'containing': 1, 'passages,': 1, 'more': 1, 'recently': 1, 'desktop': 1, 'publishing': 1, 'software': 1, 'like': 1, 'Aldus': 1, 'PageMaker': 1, 'including': 1, 'versions': 1, 'Ipsum.': 1})