src_file_name = "./sample_data_files/biostats.csv"
NB: Basic File I/O
Objectives
- Demonstrate use of Python’s open() function
- Show pattern using loops, comprehensions, and string operations to import a CSV
- Show how to parse an imported CSV into a 2D list
- Show how to convert a 2D list into a 2D Numpy array
- Describe the difficulties associated with importing CSV files using basic Python
Open Files with open()
Let’s open a sample CSV file, biostats.csv.
- This has some biometric statistics for a group of office workers.
- There are 18 records, recording Name, Sex, Age, Height, Weight
- There is an initial header line.
- This file was downloaded from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html and modified slightly.
We call the open() function and pass it two parameters:
* The name of the file we want to open.
* The mode in which the file is opened. It defaults to r, which means open for reading in text mode. Other common values are:
  * w for writing (truncating the file if it already exists)
  * x for creating and writing to a new file (failing if the file already exists)
  * a for appending
The function returns a file object whose type depends on the mode and through which the standard file operations such as reading and writing are performed. So, to read from the file, you need to have specified mode r, and to write you need to have specified w (or x or a).
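The modes described above can be tried out with a minimal sketch like the following. The scratch file modes_demo.txt and its contents are hypothetical, created only for this illustration; they are not part of the notebook's sample data.

```python
import io
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# 'w' creates the file, or truncates it if it already exists.
f = open(demo_path, "w")
f.write("first line\n")
f.close()

# 'a' appends to the end of the file instead of truncating it.
f = open(demo_path, "a")
f.write("second line\n")
f.close()

# 'r' is the default mode: open for reading in text mode.
f = open(demo_path)
contents = f.read()
f.close()

# 'x' refuses to overwrite: it raises FileExistsError if the file exists.
try:
    open(demo_path, "x")
    x_failed = False
except FileExistsError:
    x_failed = True

# A handle opened with 'r' cannot be written to.
f = open(demo_path, "r")
try:
    f.write("not allowed")
    write_failed = False
except io.UnsupportedOperation:
    write_failed = True
f.close()
```

After this runs, contents holds both lines, and both x_failed and write_failed are True.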
The file object is an iterator.
For more info, check out the Python docs or run open? from a code cell.
Note, we sometimes call the file object a file “handle.”
## open?
file_handle = open(src_file_name, 'r')
.read() reads in the file as one long string.
file_as_big_string = file_handle.read()
file_as_big_string[:1000]
Since the file object is an iterator and we have already read to the end of it, calling .read() again on the same object returns an empty string.
file_as_big_string = file_handle.read() # Try reading from the handle again
file_as_big_string[:1000] # Nothing there since the iterator is exhausted
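The exhaustion behavior can be checked in isolation with a small sketch. The two-line file below is a hypothetical stand-in for biostats.csv. The sketch also shows .seek(0), which rewinds an existing handle; this is an alternative to opening a new handle, not something the notebook itself relies on.

```python
import os
import tempfile

# Hypothetical two-line file standing in for biostats.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "exhaust_demo.csv")
f = open(demo_path, "w")
f.write("Name,Age\nAlex,41\n")
f.close()

file_handle = open(demo_path, "r")
first_read = file_handle.read()    # the whole file as one string
second_read = file_handle.read()   # empty string: we are already at the end

file_handle.seek(0)                # move the read position back to the start
third_read = file_handle.read()    # the whole file again
file_handle.close()
```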
So, let’s create a new handle, read in the contents again, and then parse our string by newlines using .split("\n").
file_handle = open("./sample_data_files/biostats.csv", 'r')
file_as_big_string = file_handle.read()
file_as_big_string.split("\n")
A short-cut to this process is to call the .readlines() method, which returns a pre-made list of lines.
Note that the newlines are preserved in this case.
file_handle = open(src_file_name, 'r')
file_as_list_of_strings = file_handle.readlines()
file_as_list_of_strings
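To make the difference between the two approaches concrete, here is a minimal sketch on a hypothetical two-line file (not the real biostats.csv). It also iterates over the handle directly, which is possible because the file object is an iterator.

```python
import os
import tempfile

# Hypothetical two-line file standing in for biostats.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "lines_demo.csv")
f = open(demo_path, "w")
f.write("Name,Age\nAlex,41\n")
f.close()

# .read().split("\n") drops the newlines but leaves a trailing empty
# string when the file ends with a newline.
file_handle = open(demo_path, "r")
via_split = file_handle.read().split("\n")
file_handle.close()

# .readlines() keeps the newline at the end of each line.
file_handle = open(demo_path, "r")
via_readlines = file_handle.readlines()
file_handle.close()

# Iterating over the handle yields the same lines as .readlines().
file_handle = open(demo_path, "r")
via_iteration = [line for line in file_handle]
file_handle.close()
```

Here via_split is ['Name,Age', 'Alex,41', ''], while via_readlines and via_iteration are both ['Name,Age\n', 'Alex,41\n'].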
File objects should be closed when you are done with them.
file_handle.close()
Use a with block
… to automatically open and close the file I/O object
There is a better way to handle objects that need to be closed.
Other examples of such objects are database handles.
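A with block will automatically close the file handle when the block ends, even if an error occurs inside it. The open() call in the with statement still does the opening.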
with open(src_file_name, 'r') as infile:
    file_as_list = infile.readlines()
file_as_list
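One way to confirm that the handle really is closed is to inspect its .closed attribute. The sketch below uses a hypothetical scratch file, with_demo.csv, rather than the notebook's sample data, and simulates an error to show that the file is closed in that case too.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.csv")
with open(demo_path, "w") as outfile:
    outfile.write("Name,Age\nAlex,41\n")

with open(demo_path, "r") as infile:
    open_inside_block = not infile.closed   # still open inside the block
    file_as_list = infile.readlines()
closed_after_block = infile.closed          # closed once the block ends

# The file is closed even if the block exits because of an exception.
try:
    with open(demo_path, "r") as infile:
        raise ValueError("simulated failure while processing the file")
except ValueError:
    closed_after_error = infile.closed
```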
Convert into a 2D list
Let’s convert our list of strings to a list of lists, where the outer list holds the rows of the data table and each inner list holds the cells of one row.
## %%time
list_2d = []
with open(src_file_name, 'r') as infile:
    for line in infile.readlines():
        row = line.rstrip().split(",") # Note the use of rstrip()
        list_2d.append(row)
list_2d
Note that we now have to do something with the column names and handle formatting and casting each cell.
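As a rough sketch of what that extra work looks like, the following separates the header row and casts the numeric cells by hand. The column names come from the file description above; the two data rows are hypothetical values made up for illustration, and typed_rows and records are illustrative names.

```python
# Hypothetical 2D list of strings, shaped like the parsed CSV.
list_2d = [
    ["Name", "Sex", "Age", "Height", "Weight"],
    ["Alex", "M", "41", "74", "170"],
    ["Bert", "M", "42", "68", "166"],
]

# Split the header row from the data rows.
col_names = list_2d[0]
rows = list_2d[1:]

# Keep the first two cells as strings and cast the rest to int.
# We have to know which columns are numeric; nothing infers it for us.
typed_rows = [row[:2] + [int(cell) for cell in row[2:]] for row in rows]

# Re-associate each value with its column name when needed.
records = [dict(zip(col_names, row)) for row in typed_rows]
```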
Using a list comprehension
We can replace the entire code block above with nested list comprehensions.
Remember, you can put any expression into the first part of a comprehension, even another comprehension.
list_2d = [[cell.strip() for cell in line.rstrip().replace('"', '').split(",")]
           for line in open(src_file_name, 'r').readlines()]
list_2d
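The comprehension above never explicitly closes the file it opens. A possible variant, sketched here on a hypothetical two-line file with quoted fields, puts the same comprehension inside a with block and iterates over the handle directly.

```python
import os
import tempfile

# Hypothetical file with quoted fields, standing in for biostats.csv.
demo_path = os.path.join(tempfile.mkdtemp(), "comprehension_demo.csv")
with open(demo_path, "w") as outfile:
    outfile.write('"Name", "Age"\n"Alex", 41\n')

# Same nested comprehension, but the with block closes the file for us.
with open(demo_path, "r") as infile:
    list_2d = [[cell.strip() for cell in line.rstrip().replace('"', '').split(",")]
               for line in infile]

file_was_closed = infile.closed
```

With this input, list_2d is [['Name', 'Age'], ['Alex', '41']].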
Converting to Numpy
import numpy as np
All elements of a Numpy array must share one data type, and arrays have no concept of column names, so we remove the header row from our data.
col_names = list_2d[0]
col_names
np_matrix = np.array(list_2d[1:])
np_matrix
Array Slices
Here we demonstrate slicing along both dimensions.
np_matrix[:2, :2]
Converting Data Types
Let’s try to convert the data types of the numeric columns from strings to integers. One thing we might do is the following:
np_matrix[:, 2:5].astype(int)
We see that the strings are converted to integers.
So, let’s try to save the conversion results to the original array:
np_matrix[:, 2:5] = np_matrix[:, 2:5].astype(int)
np_matrix
What happened?
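A likely explanation is that the array keeps its original string data type, so the integers are cast back to strings when they are assigned into it. The sketch below illustrates this on a small hypothetical array (made-up rows in the same column layout) and stores the converted values in a separate array instead; the name numeric_part is illustrative.

```python
import numpy as np

# Hypothetical rows in the same layout as the data, all stored as strings.
np_matrix = np.array([
    ["Alex", "M", "41", "74", "170"],
    ["Bert", "M", "42", "68", "166"],
])
kind_before = np_matrix.dtype.kind        # 'U' means a Unicode string dtype

# The conversion itself works and yields a new integer array.
converted = np_matrix[:, 2:5].astype(int)

# Assigning it back casts the integers to strings again, because the
# target array's dtype cannot change in place.
np_matrix[:, 2:5] = converted
kind_after = np_matrix.dtype.kind         # still 'U'

# Keeping the numeric columns in their own array preserves the integers.
numeric_part = np_matrix[:, 2:5].astype(int)
```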
Some Difficulties
It is pretty easy to import CSV files this way, but there are many difficulties you are likely to encounter if you use this as your default pattern for importing data. Here are just a few:
- Not all sources are well-formed. They may have delimiters that are complex to parse, and the data themselves may be hard to parse.
- You have to keep the column names in a separate list or vector and then associate them with the data if and when necessary.
- You have to convert each column vector into its appropriate data type yourself. Or, you have to create separate 2D arrays for each collection of columns with a common data type. This process also involves human inspection of the file, as opposed to having a program try to figure it out for you.
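The first difficulty can be illustrated with a single hypothetical line whose quoted field contains a comma. Splitting on commas breaks that field apart. For comparison only, the sketch also parses the line with Python's standard csv module, which this notebook does not otherwise use.

```python
import csv
import io

# Hypothetical line: the first field is quoted and contains a comma.
line = '"Smith, Alex",M,41'

# Naive splitting treats the comma inside the quotes as a delimiter.
naive_cells = line.split(",")

# The csv module understands the quoting and keeps the field whole.
parsed_cells = next(csv.reader(io.StringIO(line)))
```

Here naive_cells has four elements, ['"Smith', ' Alex"', 'M', '41'], while parsed_cells has the intended three, ['Smith, Alex', 'M', '41'].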
For these reasons, other tools such as Pandas were created to make the work of a data scientist a bit easier and more productive.