NB: Basic File I/O

Objectives

Open Files with open()

Let’s open a sample CSV file, biostats.csv.

  • This has some biometric statistics for a group of office workers.
  • There are 18 records, each recording Name, Sex, Age, Height, and Weight.
  • There is an initial header line.
  • This file was downloaded from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html and modified slightly.
src_file_name = "./sample_data_files/biostats.csv"

We call the open() function and pass it two parameters:

  • The name of the file we want to open.
  • The mode in which the file is opened. It defaults to r, which means open for reading in text mode.

Other common values for the mode are:

  • w for writing (truncating the file if it already exists)
  • x for creating and writing to a new file
  • a for appending

The call returns a file object whose type depends on the mode and through which the standard file operations, such as reading and writing, are performed. So, to read from the file you need to have opened it in mode r, and to write you need to have opened it in mode w.
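For example, here is a quick sketch of the write and append modes. The output path is hypothetical, just for illustration, and is not part of the sample data.

out_path = "./sample_data_files/scratch.txt"   # hypothetical scratch file, just for illustration
out_handle = open(out_path, 'w')               # 'w' truncates the file if it already exists
out_handle.write("first line\n")
out_handle.close()

out_handle = open(out_path, 'a')               # 'a' appends to the end instead
out_handle.write("second line\n")
out_handle.close()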

The file object is an iterator.

For more info, check out the Python docs or run open? from a code cell.

Note, we sometimes call the file object a file “handle.”
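Because the handle is an iterator, you can also loop over it directly, one line at a time; a minimal sketch:

fh = open(src_file_name, 'r')
for line in fh:              # each iteration yields one line, newline included
    print(line.rstrip())
fh.close()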

## open?
file_handle = open(src_file_name, 'r')

.read() reads in the file as one long string.

file_as_big_string = file_handle.read()
file_as_big_string[:1000]

Since the file object is an iterator, we can’t get the string again from the same object; a second .read() returns an empty string.

file_as_big_string = file_handle.read() # Try reading from the handle again
file_as_big_string[:1000] # Nothing there since the iterator is exhausted
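If you do want to reread from the same handle without reopening the file, one option (not used below) is to rewind it with .seek(0); a quick sketch:

file_handle.seek(0)                      # move the read position back to the start of the file
file_as_big_string = file_handle.read()  # now the full contents are available again
file_as_big_string[:100]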

So, let’s create a new handle, read in the contents again, and then parse our string by newlines using .split("\n").

file_handle = open("./sample_data_files/biostats.csv", 'r')
file_as_big_string = file_handle.read()
file_as_big_string.split("\n")
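As an aside, if the file ends with a newline, splitting on "\n" leaves a trailing empty string in the list. The built-in .splitlines() method drops the newline characters and avoids that:

file_as_big_string.splitlines()   # like split("\n"), but without a trailing empty string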

A short-cut to this process is to call the .readlines() method, which returns a pre-made list of lines.

Note that the newlines are preserved in this case.

file_handle = open(src_file_name, 'r')
file_as_list_of_strings = file_handle.readlines()
file_as_list_of_strings

File objects should be closed when you are done with them.

file_handle.close()

Use a with block

… to automatically open and close the file I/O object

There is a better way to handle objects that need to be closed.

Other examples of such objects are database handles.

A with block takes care of closing the file handle automatically when the block exits, even if an exception is raised.

with open(src_file_name, 'r') as infile:
    file_as_list = infile.readlines()
file_as_list
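For comparison, the with block above behaves roughly like this explicit try/finally pattern, which closes the handle whether or not the read raises an error:

infile = open(src_file_name, 'r')
try:
    file_as_list = infile.readlines()
finally:
    infile.close()   # runs even if the try block raised an exception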

Convert into a 2D list

Let’s convert our list of strings into a list of lists, where each inner list holds the cells of one row of the data table.

## %%time
list_2d = []
with open(src_file_name, 'r') as infile:
    for line in infile.readlines():
        row = line.rstrip().split(",") # Note the use of rstrip()
        list_2d.append(row)
list_2d

Note that we now have to do something with the column names and handle formatting and casting each cell.
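For example, here is a minimal sketch of that extra work, assuming the last three columns (Age, Height, Weight) hold integer values:

header, *data_rows = list_2d                    # separate the column names from the data rows
typed_rows = [row[:2] + [int(cell.strip().strip('"')) for cell in row[2:]]   # cast the numeric cells to int
              for row in data_rows]
typed_rows[:3]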

Using a list comprehension

We can replace the entire code block above with nested list comprehensions.

Remember, you can put any expression into the first part of a comprehension, even another comprehension.

list_2d = [[cell.strip() for cell in line.rstrip().replace('"', '').split(",")]
           for line in open(src_file_name, 'r').readlines()]  # note: this inline handle is never explicitly closed
list_2d

Converting to Numpy

import numpy as np

All elements of a NumPy array must have the same data type, and NumPy has no concept of column names, so we remove the header row from our data.

col_names = list_2d[0]
col_names
np_matrix = np.array(list_2d[1:])
np_matrix

Here we demonstrate slicing along both dimensions.

Array Slices

np_matrix[:2, :2]
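A couple more slices on the same array, one along each dimension:

np_matrix[0, :]    # the first row (the first data record)
np_matrix[:, 0]    # the first column (the Name values)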

Converting Data Types

Let’s try to convert the data types of the numeric columns from strings to integers. One thing we might do is the following:

np_matrix[:, 2:5].astype(int)

We see that the strings are converted to integers.

So, let’s try to save the conversion results to the original array:

np_matrix[:, 2:5] = np_matrix[:, 2:5].astype(int)
np_matrix

What happened? The array’s dtype is still a fixed-width string type, so the converted integers were cast right back to strings when they were assigned into it.
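One way around this, anticipated in the difficulties listed below, is to keep the converted numeric columns in a separate array with their own dtype; a minimal sketch:

numeric_cols = np_matrix[:, 2:5].astype(int)   # a separate int array for Age, Height, Weight
numeric_cols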

Some Difficulties

It is pretty easy to import CSV files this way, but there are many difficulties you are likely to encounter if you use this as your default pattern for importing data. Here are just a few:

  • Not all sources are well-formed. They may have delimiters that are complex to parse, and the data themselves may be hard to parse.
  • You have to keep the column names in a separate list or vector and then associate them with the data if and when necessary.
  • You have to convert each column vector into its appropriate data type yourself. Or, you have to create separate 2D arrays for each collection of columns with a common data type. This process also involves human inspection of the file, as opposed to having a program try to figure it out for you.

For these reasons, other tools such as Pandas were created to make the work of a data scientist a bit easier and more productive.
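As a quick taste, here is a minimal sketch of reading the same file with Pandas, which reads the header line and attempts a sensible data type for each column (the exact result depends on how the file is formatted):

import pandas as pd

df = pd.read_csv(src_file_name)   # parses the header line and infers a dtype per column
df.dtypes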