NB: Iterables and Iterators

Programming for Data Science

Iterables and Iteration

We have seen that sequential data structures like lists and tuples have a natural affinity to loops.

Sequences imply loops and loops expect sequences.

In Python, this relationship is captured by the resonance between the words iteration and iterables.

Iterables are data structures that can be iterated over, meaning they can return their elements one at a time.

Examples of iterable objects include lists, tuples, sets, dictionaries, and strings.

Typically we iterate over iterables using for loops, as we saw when we reviewed control structures.

Lists

Iterating using for

First, let’s review iteration by means of a for loop.

tokens = ['living room', 'was', 'quite', 'large']
for tok in tokens:
    print(tok)
living room
was
quite
large

Iterators

Python provides a kind of object called an iterator, designed to make iteration (sequence processing) fast and memory efficient.

An iterator is an object that represents a stream of data drawn from an iterable.

It is used to step through an iterable by returning one element at a time while keeping track of its position. The iterable itself is not modified.

Iterating with Iterators

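An iterator works by handing out the next value at each step and then moving forward. It cannot go backward or restart.

This means that as you iterate you use up the iterator, not the underlying data: once the iterator reaches the end it is exhausted and yields nothing more, while the original iterable is left unchanged.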
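To check what is actually used up during iteration, the following minimal sketch consumes an iterator built from the tokens list and then inspects both the iterator and the list. The names token_it, consumed, and second_pass are illustrative only.

```python
tokens = ['living room', 'was', 'quite', 'large']
token_it = iter(tokens)

# Consume every element through the iterator
consumed = [tok for tok in token_it]

# The iterator is now exhausted, so a second pass yields nothing
second_pass = list(token_it)

print(consumed)     # the four tokens, in order
print(second_pass)  # []
print(tokens)       # the original list still holds all four tokens
```

To go through the data again, create a new iterator with iter(tokens).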

This is useful in situations where you want to save memory, because an iterator can produce values on demand instead of storing them all at once.
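As a rough illustration of the memory point, the sketch below compares the size of a fully built list with the size of an iterator over the same range of numbers. It relies on sys.getsizeof(), which reports only the size of the object itself, and exact byte counts vary by Python version, so treat the comparison as indicative rather than precise.

```python
import sys

n = 1_000_000

eager = list(range(n))   # every value is stored in memory at once
lazy = iter(range(n))    # values are produced one at a time on demand

eager_size = sys.getsizeof(eager)
lazy_size = sys.getsizeof(lazy)

print(eager_size > lazy_size)  # True: the iterator object stays small

# The lazy version can still be processed in full
total = sum(lazy)
```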

Many functions in Python return iterators (for example zip(), map(), and enumerate()), so it’s helpful to understand them even if you don’t create any yourself.
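For instance, the built-in functions map() and zip() both hand back iterators. The short sketch below, using illustrative variable names, confirms this with collections.abc.Iterator and pulls values out with next() and list().

```python
from collections.abc import Iterator

tokens = ['living room', 'was', 'quite', 'large']

lengths = map(len, tokens)     # lazy: no lengths computed yet
pairs = zip(tokens, lengths)   # also lazy

is_iterator = isinstance(pairs, Iterator)
print(is_iterator)             # True

first = next(pairs)            # compute just the first pair
print(first)                   # ('living room', 11)

rest = list(pairs)             # drain whatever is left
print(rest)                    # [('was', 3), ('quite', 5), ('large', 5)]
```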

Using iter() and next()

To use an iterator, you create one from an iterable using iter().

Then you use next() to get the next item from the iterator.

tokens = ['living room','was','quite','large']
myit = iter(tokens)
print(next(myit)) 
print(next(myit)) 
print(next(myit)) 
print(next(myit)) 
living room
was
quite
large

Calling next() when the iterator has reached the end of the list produces an exception:

next(myit)
StopIteration: 
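There are two common ways to deal with an exhausted iterator. The sketch below shows both: catching StopIteration explicitly, and passing a default value as the second argument to next(). The variable names status and fallback are illustrative.

```python
tokens = ['living room', 'was', 'quite', 'large']
myit = iter(tokens)

# Exhaust the iterator by calling next() once per element
for _ in tokens:
    next(myit)

# Option 1: catch the exception
try:
    next(myit)
    status = 'got a value'
except StopIteration:
    status = 'exhausted'
print(status)    # exhausted

# Option 2: supply a default, which is returned instead of raising
fallback = next(myit, None)
print(fallback)  # None
```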

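Note that for does this work for you: it calls iter() on the object it is given, then implicitly executes next() on each loop iteration, and stops when StopIteration is raised. Calling iter() on an iterator returns the iterator itself, so for also works on an iterator created by iter().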

myit = iter(tokens) # Create a fresh iterator; the previous one is exhausted
for next_it in myit:
    print(next_it)
living room
was
quite
large
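Roughly speaking, a for loop can be written out by hand with iter(), next(), and a StopIteration handler. The following is a simplified sketch of that equivalence, not the interpreter's actual implementation. The seen list is only there to record what the loop visited.

```python
tokens = ['living room', 'was', 'quite', 'large']

seen = []
it = iter(tokens)          # what for does first
while True:
    try:
        tok = next(it)     # what for does on each pass
    except StopIteration:  # how for knows when to stop
        break
    seen.append(tok)
    print(tok)
```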

Sequences and Collections

So far, we iterated over a list.

Now let’s look at sets, strings, tuples, dictionaries, and ranges.

Lists, tuples, and strings are sequences. Sequences are designed so that elements come out of them in the same order they were put in.

Sets and dictionaries are not sequences per se, since the order of their elements is not as important as their identity (set members) or their names (dictionary keys). They are collections, but not sequences.

Note that prior to Python 3.7, the order of elements in dictionaries was not guaranteed. Since 3.7, dictionaries preserve the order in which they were populated. Sets remain unordered: their iteration order is arbitrary and should not be relied on.
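A quick way to see the difference: a dictionary yields its keys in insertion order, whereas a set needs an explicit sorted() call if a predictable order is required. The semesters dictionary below is a made-up example for illustration.

```python
# Dictionaries remember insertion order (Python 3.7+)
semesters = {}
semesters['spring'] = 2
semesters['fall'] = 1
key_order = list(semesters)
print(key_order)      # ['spring', 'fall']

# Set iteration order should not be relied on;
# sorted() returns a list in a well-defined order
princesses = {'belle', 'cinderella', 'rapunzel'}
alphabetical = sorted(princesses)
print(alphabetical)   # ['belle', 'cinderella', 'rapunzel']
```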

Sets

Iterating using for:

princesses = {'belle', 'cinderella', 'rapunzel'}
for princess in princesses:
    print(princess)
rapunzel
belle
cinderella

Iterating using iter() and next():

princesses_i = iter(princesses)
print(next(princesses_i))
print(next(princesses_i))
print(next(princesses_i))
rapunzel
belle
cinderella
type(princesses_i)
set_iterator

Strings

Iterating using for:

str1 = 'data'
for my_char in str1:
    print(my_char)
d
a
t
a

Iterating using iter() and next():

str1_i = iter(str1)
print(next(str1_i))
print(next(str1_i))
print(next(str1_i))
print(next(str1_i))
d
a
t
a
type(str1_i)
str_ascii_iterator

Tuples

Iterating using for:

metrics = ('auc','recall','precision','support')
for met in metrics:
    print(met)
auc
recall
precision
support

Iterating using iter() and next():

metrics = ('auc','recall','precision','support')
metrics_i = iter(metrics)
print(next(metrics_i))
print(next(metrics_i))
print(next(metrics_i))
print(next(metrics_i))
auc
recall
precision
support
type(metrics_i)
tuple_iterator

Dictionaries

Iterating using for:

courses = {'fall': ['regression','python'], 'spring': ['capstone','pyspark','nlp']}
for k in courses:
    print(k)
fall
spring
for k in courses.keys():
    print(k)
fall
spring
for v in courses.values():
    print(v)
['regression', 'python']
['capstone', 'pyspark', 'nlp']
for k, v in courses.items():
    print(f"{k.upper()}:\t{', '.join(v)}")
FALL:   regression, python
SPRING: capstone, pyspark, nlp
for k in courses.keys():
    print(f"{k.upper()}:\t{', '.join(courses[k])}") # index into the dict with the key
FALL:   regression, python
SPRING: capstone, pyspark, nlp
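Dictionaries can also be stepped through with iter() and next(), as with the other types above. In this sketch, iter() on the dictionary itself walks the keys, while iter() on courses.items() walks key-value pairs. The names first_key and first_item are illustrative.

```python
courses = {'fall': ['regression', 'python'],
           'spring': ['capstone', 'pyspark', 'nlp']}

# Iterating a dict directly yields its keys
courses_i = iter(courses)
first_key = next(courses_i)
print(first_key)                 # fall
print(type(courses_i).__name__)  # dict_keyiterator

# Iterating the items view yields (key, value) tuples
items_i = iter(courses.items())
first_item = next(items_i)
print(first_item)                # ('fall', ['regression', 'python'])
```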

Ranges

Iterating using for:

for i in range(10):
    print(str(i+1).zfill(2), (i+1)**2 * '|')
01 |
02 ||||
03 |||||||||
04 ||||||||||||||||
05 |||||||||||||||||||||||||
06 ||||||||||||||||||||||||||||||||||||
07 |||||||||||||||||||||||||||||||||||||||||||||||||
08 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
09 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
10 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
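A range is an iterable but not itself an iterator, which is why it can be looped over repeatedly. The sketch below checks this with the abstract base classes in collections.abc and then uses iter() and next() on a small range.

```python
from collections.abc import Iterable, Iterator

r = range(3)
is_iterable = isinstance(r, Iterable)
is_iterator = isinstance(r, Iterator)
print(is_iterable, is_iterator)  # True False

# A range can be traversed more than once
first_pass = list(r)
second_pass = list(r)
print(first_pass == second_pass) # True

# iter() gives a one-shot iterator over the range
r_i = iter(r)
first = next(r_i)
print(first)                     # 0
print(type(r_i).__name__)        # range_iterator
```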

Get iteration number with enumerate()

Very often you will want to know which iteration number you are on in a loop.

This can be used to name files or dict keys, for example.

enumerate() will return the index and the element for each iteration. For a dictionary, the element is the key.

courses
{'fall': ['regression', 'python'], 'spring': ['capstone', 'pyspark', 'nlp']}
for i, semester in enumerate(courses):
    course_name = f"{str(i).zfill(2)}_{semester}:\t{'-'.join(courses[semester])}"
    print(course_name)
00_fall:    regression-python
01_spring:  capstone-pyspark-nlp
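If counting should begin at 1 rather than 0, enumerate() accepts a start argument, which avoids writing i+1 inside the loop. A small sketch, with labels as an illustrative name:

```python
courses = {'fall': ['regression', 'python'],
           'spring': ['capstone', 'pyspark', 'nlp']}

labels = []
for i, semester in enumerate(courses, start=1):  # count from 1
    labels.append(f"{str(i).zfill(2)}_{semester}")

print(labels)  # ['01_fall', '02_spring']
```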

Nested Loops

Iterations can be nested — this is very powerful.

This works well with nested data structures, like dictionaries within dictionaries.

This is basically how parsed JSON data is traversed, BTW.

Be careful, though – these can get deep and complicated.

for i, semester in enumerate(courses):
    print(f"{i+1}. {semester.upper()}:")
    for j, course in enumerate(courses[semester]):
        print(f"\t{i+1}.{j+1}. {course}")
1. FALL:
    1.1. regression
    1.2. python
2. SPRING:
    2.1. capstone
    2.2. pyspark
    2.3. nlp
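As a sketch of the JSON case, the example below parses a small JSON string into nested dictionaries with json.loads() and walks it with two nested loops. The JSON text and the credit values are invented purely for illustration.

```python
import json

# Hypothetical JSON text: semesters mapping to courses and credit counts
raw = '{"fall": {"regression": 3, "python": 3}, "spring": {"capstone": 6}}'
catalog = json.loads(raw)  # nested dicts

lines = []
for semester, offerings in catalog.items():    # outer level
    for course, credits in offerings.items():  # inner level
        lines.append(f"{semester}: {course} ({credits} credits)")

for line in lines:
    print(line)
```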

Use nested loops to get the Cartesian product.

die = range(1,7)
die_rolls = []
for face1 in die:
    for face2 in die:
        die_rolls.append((face1, face2))
print(die_rolls)
[(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)]
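The standard library offers an alternative to writing the nested loops by hand: itertools.product() returns an iterator over the Cartesian product. The sketch below builds the pairs both ways and compares them.

```python
from itertools import product

die = range(1, 7)

# product() is lazy, so wrap it in list() to materialize the pairs
die_rolls = list(product(die, repeat=2))

# Same pairs built with explicit nested loops, for comparison
nested = []
for face1 in die:
    for face2 in die:
        nested.append((face1, face2))

print(len(die_rolls))       # 36
print(die_rolls == nested)  # True
```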

Now get the frequency of die roll sums.

die_roll_sums = {}
for my_die_roll in die_rolls:
    my_die_roll_sum = sum(my_die_roll)
    die_roll_sums[my_die_roll_sum] = die_roll_sums.get(my_die_roll_sum, 0) + 1
for k, v in die_roll_sums.items():
    print(str(k).zfill(2), v, '|' * v)
02 1 |
03 2 ||
04 3 |||
05 4 ||||
06 5 |||||
07 6 ||||||
08 5 |||||
09 4 ||||
10 3 |||
11 2 ||
12 1 |
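The counting step can also be done with collections.Counter, which accepts any iterable and tallies its elements. This is an alternative sketch, not a replacement for the dictionary approach above. It recomputes the die rolls so that it runs on its own.

```python
from collections import Counter

die = range(1, 7)

# Feed the sums straight into Counter, one at a time
die_roll_sums = Counter(face1 + face2 for face1 in die for face2 in die)

for k in sorted(die_roll_sums):
    v = die_roll_sums[k]
    print(str(k).zfill(2), v, '|' * v)
```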