= "./sample_data_files/biostats.csv"
src_file_name = open(src_file_name, 'r') file_handle
NB: Basic File I/O
Programming for Data Science
Motivation
Another useful topic that we will introduce now concerns how to read and write data to and from sources external to your program.
External sources include the file system, the web, or an external database.
We call reading and writing data to and from files file I/O, where I/O stands for input/output.
Python provides a built-in function for file I/O: open().
We will discuss this along with a use case of importing structured data from a file.
This will motivate the use of more sophisticated tools, such as those we will encounter with NumPy and Pandas.
Open Files with open()
Python’s open() function allows you to read and write files from the file system, or, as we sometimes say, “from disk.”
A common use case is opening up a text file containing data you want to work with, such as a CSV file.
CSV means “Comma-Separated Values.”
A CSV file is a plain text file where each line contains a list of data items separated (i.e., delimited) by a comma (or another character, such as a tab).
Each row should have the same number of delimited items.
Often, but not always, the first line contains the names of the columns for the delimited data.
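For instance, a tiny CSV file with a header line might look like this (the data here are hypothetical):

name,age,city
Ada,36,London
Mel,42,Sydney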
Biostats Data
Let’s open a sample CSV file, biostats.csv.
Because we want to convert it into a Python data structure, we need to know something about how the source file is structured. Here are some basic facts about the file:
- It has some biometric statistics for a group of office workers.
- There are 18 records, each recording Name, Sex, Age, Height, and Weight.
- There is an initial header line.
This file was downloaded from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html and modified slightly.
Let’s open a downloaded version.
The open() function takes two parameters:
- The name of the file we want to open.
- The mode in which the file is opened. It defaults to 'r', which means open for reading in text mode. Other common values are 'w' for writing (truncating the file if it already exists), 'x' for creating and writing to a new file, and 'a' for appending. A short sketch of these modes follows this list.
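Here is a minimal sketch of the common modes (the output file names are hypothetical, and the writing modes will create or overwrite files on disk):

read_handle = open(src_file_name)         # same as open(src_file_name, 'r')
write_handle = open('results.csv', 'w')   # created or truncated for writing
append_handle = open('run.log', 'a')      # opened for appending

# close the handles when done with them
for handle in (read_handle, write_handle, append_handle):
    handle.close()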
This returns a file object, which is an iterator.
We sometimes call the file object a file “handle.”
To get the contents of the file into the program, we have a couple of options.
- .read() reads in the file as one long string.
- .readlines() reads in the file as a list of lines, split on the hard returns (\n) in the source file.

Let’s look at the first method:
file_as_big_string = file_handle.read()
file_as_big_string[:1000]
'"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"\n"Alex", "M", 41, 74, 170\n"Bert", "M", 42, 68, 166\n"Carl", "M", 32, 70, 155\n"Dave", "M", 39, 72, 167\n"Elly", "F", 30, 66, 124\n"Fran", "F", 33, 66, 115\n"Gwen", "F", 26, 64, 121\n"Hank", "M", 30, 71, 158\n"Ivan", "M", 53, 72, 175\n"Jake", "M", 32, 69, 143\n"Kate", "F", 47, 69, 139\n"Luke", "M", 34, 72, 163\n"Myra", "F", 23, 62, 98\n"Neil", "M", 36, 75, 160\n"Omar", "M", 38, 70, 145\n"Page", "F", 31, 67, 135\n"Quin", "M", 29, 71, 176\n"Ruth", "F", 28, 65, 131'
Since the file object is an iterator, it is now exhausted, and we can’t get the string from it again.
file_as_big_string = file_handle.read()
file_as_big_string[:1000]
''
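As an aside, if the handle has not been closed, you can rewind it with .seek() instead of reopening the file:

file_handle.seek(0)    # move the read cursor back to the start of the file
file_as_big_string = file_handle.read()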
Here, though, let’s create a new handle, read in the contents again, and then parse our string by newlines using .split("\n").
= open("./sample_data_files/biostats.csv", 'r')
file_handle = file_handle.read()
file_as_big_string "\n") file_as_big_string.split(
['"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"',
'"Alex", "M", 41, 74, 170',
'"Bert", "M", 42, 68, 166',
'"Carl", "M", 32, 70, 155',
'"Dave", "M", 39, 72, 167',
'"Elly", "F", 30, 66, 124',
'"Fran", "F", 33, 66, 115',
'"Gwen", "F", 26, 64, 121',
'"Hank", "M", 30, 71, 158',
'"Ivan", "M", 53, 72, 175',
'"Jake", "M", 32, 69, 143',
'"Kate", "F", 47, 69, 139',
'"Luke", "M", 34, 72, 163',
'"Myra", "F", 23, 62, 98',
'"Neil", "M", 36, 75, 160',
'"Omar", "M", 38, 70, 145',
'"Page", "F", 31, 67, 135',
'"Quin", "M", 29, 71, 176',
'"Ruth", "F", 28, 65, 131']
This creates a list of lines, each containing a string of delimited data.
A shortcut for this process is to call the .readlines() method, which returns a ready-made list of lines.
file_handle = open(src_file_name, 'r')
file_as_list_of_strings = file_handle.readlines()
file_as_list_of_strings
['"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"\n',
'"Alex", "M", 41, 74, 170\n',
'"Bert", "M", 42, 68, 166\n',
'"Carl", "M", 32, 70, 155\n',
'"Dave", "M", 39, 72, 167\n',
'"Elly", "F", 30, 66, 124\n',
'"Fran", "F", 33, 66, 115\n',
'"Gwen", "F", 26, 64, 121\n',
'"Hank", "M", 30, 71, 158\n',
'"Ivan", "M", 53, 72, 175\n',
'"Jake", "M", 32, 69, 143\n',
'"Kate", "F", 47, 69, 139\n',
'"Luke", "M", 34, 72, 163\n',
'"Myra", "F", 23, 62, 98\n',
'"Neil", "M", 36, 75, 160\n',
'"Omar", "M", 38, 70, 145\n',
'"Page", "F", 31, 67, 135\n',
'"Quin", "M", 29, 71, 176\n',
'"Ruth", "F", 28, 65, 131']
Note that the newlines are preserved in this case.
We can fix this by stripping each line with a comprehension as we read the file, like so:
file_handle = open(src_file_name, 'r')
file_as_list_of_strings = [line.rstrip() for line in file_handle.readlines()]
file_as_list_of_strings
['"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"',
'"Alex", "M", 41, 74, 170',
'"Bert", "M", 42, 68, 166',
'"Carl", "M", 32, 70, 155',
'"Dave", "M", 39, 72, 167',
'"Elly", "F", 30, 66, 124',
'"Fran", "F", 33, 66, 115',
'"Gwen", "F", 26, 64, 121',
'"Hank", "M", 30, 71, 158',
'"Ivan", "M", 53, 72, 175',
'"Jake", "M", 32, 69, 143',
'"Kate", "F", 47, 69, 139',
'"Luke", "M", 34, 72, 163',
'"Myra", "F", 23, 62, 98',
'"Neil", "M", 36, 75, 160',
'"Omar", "M", 38, 70, 145',
'"Page", "F", 31, 67, 135',
'"Quin", "M", 29, 71, 176',
'"Ruth", "F", 28, 65, 131']
File objects should be closed when you are done with them.
file_handle.close()
Using a with block
To automatically open and close the file I/O object, we can use a with block.
This automatically closes the file handle once the program moves out of the block.
with open(src_file_name, 'r') as infile:
    file_as_list = [line.rstrip() for line in infile.readlines()]
file_as_list
['"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"',
'"Alex", "M", 41, 74, 170',
'"Bert", "M", 42, 68, 166',
'"Carl", "M", 32, 70, 155',
'"Dave", "M", 39, 72, 167',
'"Elly", "F", 30, 66, 124',
'"Fran", "F", 33, 66, 115',
'"Gwen", "F", 26, 64, 121',
'"Hank", "M", 30, 71, 158',
'"Ivan", "M", 53, 72, 175',
'"Jake", "M", 32, 69, 143',
'"Kate", "F", 47, 69, 139',
'"Luke", "M", 34, 72, 163',
'"Myra", "F", 23, 62, 98',
'"Neil", "M", 36, 75, 160',
'"Omar", "M", 38, 70, 145',
'"Page", "F", 31, 67, 135',
'"Quin", "M", 29, 71, 176',
'"Ruth", "F", 28, 65, 131']
with blocks can be used with other handles, too, like database connections.
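For instance, here is a minimal sketch using the standard library’s sqlite3 module (the database file name is hypothetical; sqlite3’s own context manager only manages transactions, so we wrap the connection in contextlib.closing to guarantee it gets closed):

import sqlite3
from contextlib import closing

# the connection is closed automatically when the block exits
with closing(sqlite3.connect("example.db")) as conn:
    result = conn.execute("SELECT 1").fetchone()
print(result)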
Converting to a 2D list
Let’s convert our list of strings to a list of lists, where each element of the outer list is a row and each element of an inner list is a cell containing data.
The shared index positions of the cells in the rows can be thought of as columns.
list_2d = []
with open(src_file_name, 'r') as infile:
    for line in infile.readlines():
        row = line.rstrip().split(",")
        new_row = []
        for cell in row:
            new_row.append(cell.strip().replace('"', ''))
        list_2d.append(new_row)
list_2d[:5]
[['Name', 'Sex', 'Age', 'Height (in)', 'Weight (lbs)'],
['Alex', 'M', '41', '74', '170'],
['Bert', 'M', '42', '68', '166'],
['Carl', 'M', '32', '70', '155'],
['Dave', 'M', '39', '72', '167']]
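To see how the shared index positions act as columns, here is a quick sketch that pulls out the Age column (index position 2), skipping the header row:

ages = [row[2] for row in list_2d[1:]]   # index 2 holds the Age cell in each row
ages[:5]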
Using a Nested Comprehension
We can replace the entire code block above with a nested list comprehension.
Remember, you can put any expression into the first part of a comprehension, even another comprehension.
list_2d = [[cell.strip()
            for cell in line.rstrip().replace('"', '').split(",")]
           for line in open(src_file_name, 'r').readlines()]
list_2d
[['Name', 'Sex', 'Age', 'Height (in)', 'Weight (lbs)'],
['Alex', 'M', '41', '74', '170'],
['Bert', 'M', '42', '68', '166'],
['Carl', 'M', '32', '70', '155'],
['Dave', 'M', '39', '72', '167'],
['Elly', 'F', '30', '66', '124'],
['Fran', 'F', '33', '66', '115'],
['Gwen', 'F', '26', '64', '121'],
['Hank', 'M', '30', '71', '158'],
['Ivan', 'M', '53', '72', '175'],
['Jake', 'M', '32', '69', '143'],
['Kate', 'F', '47', '69', '139'],
['Luke', 'M', '34', '72', '163'],
['Myra', 'F', '23', '62', '98'],
['Neil', 'M', '36', '75', '160'],
['Omar', 'M', '38', '70', '145'],
['Page', 'F', '31', '67', '135'],
['Quin', 'M', '29', '71', '176'],
['Ruth', 'F', '28', '65', '131']]
Converting to Data Types
If we were to continue the process of importing this data set, we’d need to convert the data into their proper types.
We can see that there are three kinds of data in the set: strings, categories, and integers.
We’d have to write another segment of code to identify the data type of each column position and use casting functions, such as int(), to do this. Also, since basic Python doesn’t have a data type for categories, we’d have to think about how we handle M and F.
[['Name', 'Sex', 'Age', 'Height (in)', 'Weight (lbs)'],
['Alex', 'M', '41', '74', '170'],
['Bert', 'M', '42', '68', '166'],
['Carl', 'M', '32', '70', '155'],
['Dave', 'M', '39', '72', '167'],
['Elly', 'F', '30', '66', '124'],
['Fran', 'F', '33', '66', '115'],
['Gwen', 'F', '26', '64', '121'],
['Hank', 'M', '30', '71', '158'],
['Ivan', 'M', '53', '72', '175'],
['Jake', 'M', '32', '69', '143'],
['Kate', 'F', '47', '69', '139'],
['Luke', 'M', '34', '72', '163'],
['Myra', 'F', '23', '62', '98'],
['Neil', 'M', '36', '75', '160'],
['Omar', 'M', '38', '70', '145'],
['Page', 'F', '31', '67', '135'],
['Quin', 'M', '29', '71', '176'],
['Ruth', 'F', '28', '65', '131']]
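A minimal sketch of that conversion step, assuming the column order shown above, might look like this:

header = list_2d[0]
typed_rows = []
for name, sex, age, height, weight in list_2d[1:]:
    # cast the numeric columns; leave Name and Sex as strings
    typed_rows.append([name, sex, int(age), int(height), int(weight)])
typed_rows[:2]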
At this point, it might be better to find some tools that do all this automatically.
Observations
It is pretty easy to import and process CSV files this way, but you will encounter some issues if you use this as your default pattern for importing data.
Here are just a few:
Not all sources are well-formed. They may have delimiters that are complex to parse, and the data themselves may be hard to parse.
You have to keep the column names in a separate list or vector and then associate them with the data if and when necessary.
You have to convert each column vector into its appropriate data type yourself. Or, you have to create separate 2D arrays for each collection of columns with a common data type. This process also involves human inspection of the file, as opposed to having a program try to figure it out for you.
For these reasons, other tools such as NumPy and Pandas were created to make the work of a data scientist a bit easier and more productive.
Plus, they are faster!
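As a preview, here is a sketch of the same import done with Pandas (assuming Pandas is installed; skipinitialspace tells the parser to ignore the spaces after the commas in this file):

import pandas as pd

df = pd.read_csv(src_file_name, skipinitialspace=True)  # header and column types inferred
df.dtypes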