Importing delimited text files can be accomplished using a variety of modules in Python. This notebook covers pure Python, the csv module, the NumPy module, and Pandas. We'll use a sample data file, located in the data folder, that contains a short 3x3 table with a header row:
id,x,y
1,22,33
2,2.4,6.8
3,1.9,8.0
In pure Python we open the file and loop over its rows, stripping the newline (\n) characters from each row (.strip()) and splitting the row into a list based on the delimiter, in this case a comma (.split(',')).
The items in the resulting list of lists will always be strings, so if you need something else they will have to be converted later.
datafile = open('./data/examp_data.txt', 'r')
data = []
for row in datafile:
    data.append(row.strip().split(','))
datafile.close()
data
Here, our output is a list of lists, a data structure that lets us get specific elements from our data using indices: the first index specifies the row (the header row sits at index 0, so the nth data row is at index n) and the second index specifies the column (counting from 0).
#Get the value of the 3rd row (4th if we include the header row), and 2nd column
data[3][1]
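Since every item comes back as a string, a conversion step is usually needed before doing arithmetic. The following is a minimal sketch of one way to do it, assuming every value below the header is numeric; the inline lines list is a stand-in for the sample file so the cell runs on its own.

```python
# Inline copy of the sample table so the sketch runs without the data folder
lines = ["id,x,y", "1,22,33", "2,2.4,6.8", "3,1.9,8.0"]

# Parse as above: strip the line ending, then split on commas
rows = [line.strip().split(",") for line in lines]

header = rows[0]  # column names stay as strings
# Convert every remaining item from str to float
values = [[float(item) for item in row] for row in rows[1:]]

print(header)        # ['id', 'x', 'y']
print(values[2][1])  # 1.9 (3rd data row, 2nd column)
```

Because the header was split off, the 3rd data row is now at index 2 rather than 3.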
► Now you try it:
Use pure Python to import the data in the example_data2.txt file (located in the data folder). The ingested table should be:
[['A', 'B', 'C', 'D'],
['1.72', '3.84', '4.59', '1.36'],
['5.15', '6.43', '7.92', '6.26'],
['1.56', '8.03', '4.36', '5.10'],
['7.38', '1.20', '4.56', '3.49'],
['4.24', '7.69', '6.49', '5.28'],
['1.25', '9.64', '1.83', '6.84']]
And thus the value in the 4th data row (not counting the header row), 3rd column should be '4.56'.
# Exercise 1:
The built-in csv module gives us a bit more control over delimited files. Here we see how it handles the parsing and the stripping of line-feed characters for us. Beyond that, the csv module can handle a number of other formatting nuances, such as quoted fields.
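In the csv module we open the file, create a reader object with csv.reader(), and loop over the reader, appending each parsed row to a list.
The items in the resulting list of lists will always be strings, so if you need something else they will have to be converted later.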
import csv
datafile = open('./data/examp_data.txt', 'r')
datareader = csv.reader(datafile, delimiter=',')
data = []
for row in datareader:
    data.append(row)
datafile.close()
data
#As above, print the value in the 3rd row, 2nd column
data[3][1]
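One formatting nuance where the csv module helps is a quoted field that contains the delimiter. The sketch below compares a plain .split(',') with csv.reader on a small made-up table (the names and values are hypothetical, not taken from the sample files), read from an in-memory string rather than from disk.

```python
import csv
import io

# Hypothetical data: the second field of the data row has a comma inside quotes
text = 'name,location\nsite1,"Logan, UT"\n'

# A plain split breaks the quoted field into two pieces
naive = [line.split(",") for line in text.splitlines()]

# csv.reader respects the quotes and keeps the field intact
parsed = list(csv.reader(io.StringIO(text), delimiter=","))

print(naive[1])   # ['site1', '"Logan', ' UT"']
print(parsed[1])  # ['site1', 'Logan, UT']
```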
► Now you try it:
Use the csv module to import the data in the example_data2.txt file (located in the data folder).
# Exercise 2:
NumPy
Here we introduce NumPy, a powerful Python package for working with numeric arrays. Importing the file with NumPy loads the data into a NumPy array, a commonly used data structure for scientific programming.
Using NumPy we simply use the genfromtxt() function to directly import the data. genfromtxt() has a lot of options for controlling what gets imported and how. See the docs page for details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
By default, genfromtxt() converts every value to a float, so a header row would come through as nan values, and we'll usually want to skip it (skip_header=True; the argument is the number of lines to skip, and True counts as 1). We could instead keep the header using the names=True argument, and handle columns with different data types using dtype=None, but that creates a structured array, and if we want to work with that type of data we're typically better off using Pandas (see below).
import numpy
data = numpy.genfromtxt('./data/examp_data.txt',
                        delimiter=',',
                        skip_header=True)
data
#As above, print the value in the 3rd row, 2nd column
data[2,1]
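To make the difference between skipping the header and keeping it with names=True concrete, here is a small sketch that reads the same table both ways. It assumes an in-memory copy of the sample data (via io.StringIO) instead of the file on disk.

```python
import io
import numpy

# In-memory copy of the sample table (stand-in for ./data/examp_data.txt)
text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"

# Skip the header: the result is a plain 2-D float array
plain = numpy.genfromtxt(io.StringIO(text), delimiter=",", skip_header=1)
print(plain.shape)   # (3, 3)
print(plain[2, 1])   # 1.9

# Keep the header as field names: the result is a 1-D structured array
named = numpy.genfromtxt(io.StringIO(text), delimiter=",", names=True)
print(named.dtype.names)  # ('id', 'x', 'y')
print(named["x"][2])      # 1.9
```

The structured array is indexed by field name first and row second, which is one reason a dataframe tends to be more convenient for labeled columns.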
► Now you try it:
Use NumPy to import the data in the example_data2.txt file (located in the data folder).
# Exercise 3:
Pandas
Pandas is a powerful data management library that provides data structures and associated tools well suited to scientific computing tasks. In particular, it provides a dataframe object that is much like R's dataframe and is designed to hold data with the standard structure of one row per record and one column per type of data (or field).
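In Pandas we just use the read_csv() function to import text files. It has a lot of options, but will do most things automatically, including using the first row as column names and detecting data types. The delimiter defaults to a comma; a different one can be set with the sep argument.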
import pandas as pd
data = pd.read_csv('./data/examp_data.txt')
data
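As a quick check on what read_csv() inferred, we can look at the dtypes of the resulting dataframe. The sketch below assumes an in-memory copy of the sample table, and also shows the sep argument on a semicolon-delimited variant made up for illustration.

```python
import io
import pandas as pd

# In-memory copy of the sample table (stand-in for ./data/examp_data.txt)
text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"

data = pd.read_csv(io.StringIO(text))
print(data.dtypes)  # id is read as an integer column, x and y as floats

# Hypothetical semicolon-delimited version of the same table
semi_text = text.replace(",", ";")
semi = pd.read_csv(io.StringIO(semi_text), sep=";")
print(semi.equals(data))  # True
```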
In Pandas, the iloc[] indexer (short for "integer location") allows us to extract data at a certain row/column position, counting from 0 and not counting the header row.
#As above, print the value in the 3rd row, 2nd column
data.iloc[2,1]
Pandas can also retrieve values by the index and column labels we provide (as opposed to positional row/column coordinates). For this, we use the loc[] indexer. The example below pulls data from the row with the index label 2 and the column "x":
data.loc[2,'x']
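With the default index, the labels 0, 1, 2 happen to match the row positions, so loc[] and iloc[] can look interchangeable. The sketch below, which assumes an in-memory copy of the sample table and uses set_index() to make the id column the index, shows where the two diverge.

```python
import io
import pandas as pd

# In-memory copy of the sample table (stand-in for ./data/examp_data.txt)
text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"
data = pd.read_csv(io.StringIO(text))

# Default index: label 2 is also position 2 (the 3rd row)
print(data.loc[2, "x"])   # 1.9
print(data.iloc[2, 1])    # 1.9

# Use the id column as the index: labels are now 1, 2, 3
by_id = data.set_index("id")
print(by_id.loc[2, "x"])  # 2.4 (the row labeled 2 is the 2nd row)
print(by_id.iloc[2, 0])   # 1.9 (still the 3rd row; x is now column 0)
```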
Pandas dataframes do behave a bit differently from a lot of list-based structures in Python, but we'll learn how to work with them soon. If you just want to pull the core data out of a dataframe, you can do this using the values member (a member is just a variable associated with an object).
#Extract the data from a Pandas data frame into a numpy array
data.values
#Extract a single value from a specified row/column
data.values[2][1]
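Recent versions of the Pandas documentation recommend the to_numpy() method over the values member for this purpose. A minimal sketch, again assuming an in-memory copy of the sample table:

```python
import io
import pandas as pd

# In-memory copy of the sample table (stand-in for ./data/examp_data.txt)
text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"
data = pd.read_csv(io.StringIO(text))

# Extract the underlying data as a NumPy array
array = data.to_numpy()
print(array.dtype)   # float64
print(array[2][1])   # 1.9
```

Note that a NumPy array holds a single data type, so the integer id column is converted to floats alongside x and y.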
► Now you try it:
Use Pandas to import the data in the example_data2.txt file (located in the data folder).
# Exercise 4: