#!/usr/bin/env python
# coding: utf-8

# ## Importing delimited text files
# Importing delimited text files can be accomplished using a variety of modules in Python. This notebook covers pure Python, the `csv` module, the `NumPy` module and `Pandas`. We'll use a sample data file, located in the `data` folder, that contains a short 3x3 table with a header row:
# ```
# id,x,y
# 1,22,33
# 2,2.4,6.8
# 3,1.9,8.0
# ```

# ---
# ## 1. Pure Python
#
# In pure Python we:
#
# 1. Open the datafile in read ('r') mode
# 2. Create an empty list to store the data
# 3. Loop over the rows in the file
# 4. For each row, strip whitespace and newline (`\n`) characters (`.strip()`) and split the row into a list on the delimiter, in this case a comma (`.split(',')`)
#
# The items in the resulting list of lists will always be strings, so if you need another type they will have to be converted later.

# In[ ]:


datafile = open('./data/examp_data.txt', 'r')
data = []
for row in datafile:
    # Remove the trailing newline, then split the row on commas
    data.append(row.strip().split(','))
datafile.close()
data


# Here, our output is a *list of lists*, a nifty data structure that lets us get specific elements from our data using indices: the first index specifies the row (the header row sits at index 0, so the n-th data row is at index n) and the second index specifies the column, counting from 0.

# In[ ]:


#Get the value of the 3rd row (4th if we include the header row), and 2nd column
data[3][1]


# ► Now you try it:
# * Import the data in the `example_data2.txt` file (located in the data folder).
# * Print the value in the 4th row, 3rd column.

# The ingested table should be:
# ```
# [['A', 'B', 'C', 'D'],
#  ['1.72', '3.84', '4.59', '1.36'],
#  ['5.15', '6.43', '7.92', '6.26'],
#  ['1.56', '8.03', '4.36', '5.10'],
#  ['7.38', '1.20', '4.56', '3.49'],
#  ['4.24', '7.69', '6.49', '5.28'],
#  ['1.25', '9.64', '1.83', '6.84']]
# ```
# And thus the value in the 4th row, 3rd column should be `4.56`

# In[ ]:


# Exercise 1:


# ---
# ## 2. CSV Module [link](https://pymotw.com/3/csv/)

# The built-in `csv` module gives us a bit more control over delimited files. Here it handles the parsing and the stripping of line-feed characters for us, and it can also cope with a number of formatting nuances, such as quoted fields.
#
# In the csv module we:
#
# 1. Open the datafile in read ('r') mode
# 2. Create a *reader* object with that file, specifying the delimiter (the default is a comma, but it is explicitly specified here for clarity).
# 3. Create an empty list to store the data.
# 4. Loop over the rows in the reader, appending each row to the list
#
# The items in the resulting list of lists will always be strings, so if you need another type they will have to be converted later.

# In[ ]:


import csv

# The `with` block closes the file for us; newline='' lets the csv module handle line endings itself
with open('./data/examp_data.txt', 'r', newline='') as datafile:
    datareader = csv.reader(datafile, delimiter=',')
    data = []
    for row in datareader:
        data.append(row)
data


# In[ ]:


#As above, print the value in the 3rd row, 2nd column
data[3][1]


# ► Now you try it:
# * Use the `csv` module to import the data in the `example_data2.txt` file (located in the data folder).
# * Print the value in the 4th row, 3rd column.

# In[ ]:


#Exercise 2:


# ---
# ## 3. Using [`NumPy`](http://www.numpy.org/)
#
# Here we introduce **NumPy**, a powerful Python package for working with numeric arrays. Importing the data with NumPy gives us a NumPy array, a commonly used data structure for scientific programming.
#
# Using NumPy we simply call the `genfromtxt()` function to import the data directly. `genfromtxt()` has a lot of options for controlling what gets imported and how. See the docs page for details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
#
# By default `genfromtxt()` converts every field to a float, so we'll often want to leave off the header row(s) (`skip_header=1`); otherwise the header would be read as a row of `nan` values. We could keep the column names using the `names=True` argument, and even import columns with different data types (`dtype=None`), but that creates a structured array, and if we want to work with that type of data we're typically better off using pandas (see below).

# In[ ]:


import numpy

# skip_header is the number of lines to skip at the top of the file
data = numpy.genfromtxt('./data/examp_data.txt', delimiter=',', skip_header=1)
data


# In[ ]:


#As above, print the value in the 3rd row, 2nd column (the header was skipped, so the 3rd row is at index 2)
data[2,1]


# ► Now you try it:
# * Use `NumPy` to import the data in the `example_data2.txt` file (located in the data folder).
# * Print the value in the 4th row, 3rd column.

# In[ ]:


#Exercise 3:


# ---
# ## 4. Using [`Pandas`](https://pandas.pydata.org/)
#
# Pandas is a powerful data management library that provides data structures and associated tools well suited to scientific computing tasks involving data. In particular, it provides a dataframe object that is much like R's dataframe and is designed to hold data with the standard structure of one row per record and one column per type of data (or field).
#
# In Pandas we just use the `read_csv()` function to import text files. It has a lot of options, but its defaults handle most files automatically: it assumes a comma delimiter, uses the first row as the column names, and infers the data type of each column.

# In[ ]:


import pandas as pd
data = pd.read_csv('./data/examp_data.txt')
data


# In Pandas, the `iloc[]` indexer (short for "integer location") allows us to extract data at a certain row/column position, counting from 0.

# In[ ]:


#As above, print the value in the 3rd row, 2nd column
data.iloc[2,1]


# Pandas can also retrieve values by the index and column labels we provide (as opposed to row/column positions). For this, we use the `loc[]` indexer. The example below pulls data from the row with the index label "2" and the column "x":

# In[ ]:


data.loc[2,'x']


# Pandas dataframes do behave a bit differently than a lot of list-based structures in Python, but we'll learn how to work with them soon. If you just want to pull the core data out of a dataframe, you can do this using the `values` attribute (an attribute is just a variable associated with an object).

# In[ ]:


#Extract the data from a Pandas data frame into a numpy array
data.values


# In[ ]:


#Extract a single value from a specified row/column
data.values[2][1]


# ► Now you try it:
# * Use `Pandas` to import the data in the `example_data2.txt` file (located in the data folder).
# * Print the value in the 4th row, 3rd column.

# In[ ]:


#Exercise 4:
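
# ---
# The pure Python and `csv` sections note that every imported item is a string and must be converted later if another type is needed. Below is a minimal sketch of one way to do that conversion with the `csv` module. It is an illustration, not part of the original exercises: the helper name `read_numeric_table` is hypothetical, and the table is read from an in-memory copy of the sample data (via `io.StringIO`) rather than from the `data` folder, so the cell runs on its own. It also assumes every non-header value is numeric.

```python
import csv
import io

# In-memory copy of the sample table shown at the top of this notebook
# (an assumption made so the sketch does not depend on the data folder).
sample_text = "id,x,y\n1,22,33\n2,2.4,6.8\n3,1.9,8.0\n"


def read_numeric_table(handle, delimiter=','):
    """Hypothetical helper: return (header, rows) with each data value as a float."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)  # the first row holds the column names
    # Convert every remaining string to a float, row by row
    rows = [[float(value) for value in row] for row in reader]
    return header, rows


header, rows = read_numeric_table(io.StringIO(sample_text))
print(header)
print(rows[2][1])  # 3rd data row, 2nd column
```

# Because the header is split off before the conversion, the 3rd data row sits at index 2 here, matching the NumPy and Pandas examples rather than the `data[3][1]` indexing used with the raw list of lists.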