Importing delimited text files¶

Importing delimited text files can be accomplished using a variety of modules in Python. This notebook will cover pure Python, the csv module, the NumPy module and Pandas. We'll use a sample data file, located in the data folder, that contains a short 3x3 table with a header row:

id,x,y
1,22,33
2,2.4,6.8
3,1.9,8.0

1. Pure Python¶

In pure Python we:

Open the datafile in read ('r') mode
Create an empty list to store the data
Loop over the rows in the file
Strip whitespace and newline (\n) characters from each row (.strip()), and split the row into a list based on the delimiter, in this case a comma (.split(',')).

The items in the resulting list of lists will always be strings, so if you need something else they will have to be converted later.

In [ ]:

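# Open the file in read mode; the with block closes it automatically
with open('./data/examp_data.txt', 'r') as datafile:
    data = []
    for row in datafile:
        # Drop surrounding whitespace/newline, then split on commas
        data.append(row.strip().split(','))
data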

Here, our output is a list of lists, which is a nifty data structure that allows us to get specific elements from our data using indices: the first index specifies the row and the second index specifies the column. Python indices start at 0, so the header row is data[0], the first data row is data[1], and the first column is index 0.

In [ ]:

#Get the value of the 3rd row (4th if we include the header row), and 2nd column
data[3][1]
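
The items above are still strings. A minimal sketch of one way to convert them to numbers later, assuming the header row should be kept separate and every other item can be read as a float (the names header, rows and numeric are just for illustration, and the table is typed in directly so the cell runs on its own):

```python
# The sample table from the top of this notebook, already parsed into strings
data = [['id', 'x', 'y'],
        ['1', '22', '33'],
        ['2', '2.4', '6.8'],
        ['3', '1.9', '8.0']]

header = data[0]   # column names stay as strings
rows = data[1:]    # everything after the header row

# Convert every remaining item to a float
numeric = [[float(item) for item in row] for row in rows]

print(header)
print(numeric[2][1])   # 3rd data row, 2nd column, now a float
```

Note that once the header is split off, the 3rd data row sits at index 2 rather than 3.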

► Now you try it:

Import the data in the example_data2.txt file (located in the data folder).
Print the value in the 4th row, 3rd column.

The ingested table should be:

[['A', 'B', 'C', 'D'],
 ['1.72', '3.84', '4.59', '1.36'],
 ['5.15', '6.43', '7.92', '6.26'],
 ['1.56', '8.03', '4.36', '5.10'],
 ['7.38', '1.20', '4.56', '3.49'],
 ['4.24', '7.69', '6.49', '5.28'],
 ['1.25', '9.64', '1.83', '6.84']]

Counting from the first data row (that is, not counting the header row), the value in the 4th row, 3rd column should be 4.56

In [ ]:

# Exercise 1:

2. CSV Module¶

The built-in csv module gives us a bit more control over delimited files. Here we see how it handles the parsing and the stripping of line feed characters for us. Beyond that, the csv module can handle a number of different formatting nuances.

In the csv module we:

Open the datafile in read ('r') mode
Create a reader object with that file, specifying the delimiter (the default is a comma, but it is explicitly specified here for clarity).
Create an empty list to store the data.
Loop over the rows in the reader, appending each row to the list.

The items in the resulting list of lists will always be strings, so if you need something else they will have to be converted later.

In [ ]:

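import csv
# newline='' lets the csv module handle line endings itself;
# the with block closes the file when the loop is done
with open('./data/examp_data.txt', 'r', newline='') as datafile:
    datareader = csv.reader(datafile, delimiter=',')
    data = []
    for row in datareader:
        data.append(row)
data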

In [ ]:

#As above, print the value in the 3rd row, 2nd column
data[3][1]
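
One of the formatting nuances the csv module handles is a quoted field that contains the delimiter. The sketch below is only an illustration: the two-line table is hypothetical (it is not one of the files in the data folder) and is held in memory with io.StringIO so the cell runs on its own. It compares the pure Python approach with csv.reader:

```python
import csv
import io

# Hypothetical two-line file held in memory; the quoted field contains a comma
text = 'name,notes\nsite1,"wet, cold"\n'

# Pure Python: split every line on commas, quoted or not
naive = [line.strip().split(',') for line in io.StringIO(text)]

# csv module: the reader understands the quoting
parsed = list(csv.reader(io.StringIO(text), delimiter=','))

print(naive[1])    # the quoted field is broken into two pieces
print(parsed[1])   # the quoted field is kept together
```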

► Now you try it:

Use the csv module to import the data in the example_data2.txt file (located in the data folder).
Print the value in the 4th row, 3rd column.

In [ ]:

#Exercise 2:

3. Using NumPy¶

Here we introduce NumPy, a powerful Python package for working with numeric arrays. Importing data with NumPy loads it into a NumPy array, a commonly used data structure for scientific programming.

Using NumPy we simply use the genfromtxt() function to directly import the data. genfromtxt() has a lot of options for controlling what gets imported and how. See the docs page for details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

By default, genfromtxt() tries to read every value as a float, so the text in the header row would come through as nan. We'll therefore often want to leave off the header row(s) (skip_header=1, where the number is how many lines to skip). We could keep the column names using the names=True argument, and handle columns with different data types using dtype=None, but that creates a structured array, and if we want to work with that type of data we're typically better off using Pandas (see below).

In [ ]:

import numpy
# Read the values as floats, skipping the single header line
data = numpy.genfromtxt('./data/examp_data.txt',
                        delimiter=',',
                        skip_header=1)
data

In [ ]:

#As above, print the value in the 3rd row, 2nd column
data[2,1]
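
To make the header handling described above concrete, here is a small sketch that reads the same sample table twice: once with the defaults, and once with names=True and dtype=None to get a structured array. The table is held in memory with io.StringIO so the cell runs on its own; the variable names raw and structured are just for illustration.

```python
import io
import numpy

# The sample table from the top of this notebook, held in memory
text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"

# Defaults: everything is read as a float, so the header row cannot be parsed
raw = numpy.genfromtxt(io.StringIO(text), delimiter=',')
print(raw.shape)   # four rows, including the unparsed header
print(raw[0])      # the header row

# names=True takes the field names from the header row;
# dtype=None asks NumPy to pick a type for each column
structured = numpy.genfromtxt(io.StringIO(text), delimiter=',',
                              names=True, dtype=None)
print(structured.dtype.names)
print(structured['x'])   # select a column by name
```

A structured array is indexed by field name rather than by a row/column pair, which is part of why Pandas is usually the more comfortable tool for this kind of table.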

► Now you try it:

Use NumPy to import the data in the example_data2.txt file (located in the data folder).
Print the value in the 4th row, 3rd column.

In [ ]:

#Exercise 3:

4. Using Pandas¶

Pandas is a powerful data management library that provides data structures and associated tools that are ideal for scientific computing tasks related to data. In particular, it provides a dataframe object that is much like R's dataframe and is designed to hold data with the standard structure of one row per record and one column per type of data (or field).

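In Pandas we just use the read_csv() function to import text files. It has a lot of options, but will do most things automatically, including using the first row as column names and inferring the data type of each column. The default delimiter is a comma; a different delimiter can be specified with the sep argument.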

In [ ]:

import pandas as pd
data = pd.read_csv('./data/examp_data.txt')
data
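
As a sketch of the delimiter and data type handling in read_csv(), the cell below reads a hypothetical tab-delimited copy of the sample table (held in memory with io.StringIO, not a file in the data folder) and prints the type Pandas picked for each column. The name table is just for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-delimited copy of the sample table, held in memory
text = "id\tx\ty\n1\t22\t33\n2\t2.4\t6.8\n3\t1.9\t8.0\n"

# sep tells read_csv() which delimiter to split on (the default is ',')
table = pd.read_csv(io.StringIO(text), sep='\t')

print(table.dtypes)   # one inferred type per column
table
```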

In Pandas, the iloc[] indexer (short for "integer location") allows us to extract data at a given row/column position, counting from 0.

In [ ]:

#As above, print the value in the 3rd row, 2nd column
data.iloc[2,1]

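Pandas can also retrieve values by the labels of the index and columns (as opposed to integer positions). For this, we use the loc[] indexer. The example below pulls data from the row with the index label 2 and the column "x". Because read_csv() assigned the default index 0, 1, 2, ..., the label 2 also happens to be the third row here: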

In [ ]:

data.loc[2,'x']
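
The difference between labels and positions is easier to see when the index is not the default 0, 1, 2. The sketch below assumes we want the id column to serve as the index (using the index_col argument of read_csv()); the table is held in memory with io.StringIO so the cell runs on its own, and the name df is just for illustration.

```python
import io
import pandas as pd

# The sample table from the top of this notebook, held in memory
text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"

# Use the id column as the row index, leaving x and y as the columns
df = pd.read_csv(io.StringIO(text), index_col='id')

by_label = df.loc[2, 'x']      # row whose id label is 2
by_position = df.iloc[2, 0]    # third row, first remaining column (x)

print(by_label)
print(by_position)
```

Here the two lookups return different values, because the label 2 refers to the second row while the position 2 refers to the third.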

Pandas dataframes do behave a bit differently than a lot of list-based structures in Python, but we'll learn how to work with them soon. If you just want to pull the core data out of a dataframe you can do this using the values attribute (an attribute is just a variable associated with an object).

In [ ]:

#Extract the data from a Pandas data frame into a numpy array
data.values

In [ ]:

#Extract a single value from a specified row/column
data.values[2][1]

► Now you try it:

Use Pandas to import the data in the example_data2.txt file (located in the data folder).
Print the value in the 4th row, 3rd column.

In [ ]:

#Exercise 4:
