#!/usr/bin/env python
# coding: utf-8

# (data-import)=
# # Data Import
#
# ## Introduction
#
# In this chapter, we're going to introduce working with your own data. Here, you'll learn how to read plain-text rectangular files into Python. We'll only scratch the surface of data import, but many of the principles will translate to other forms of data. We'll finish with a few pointers to opening other types of data.
#
# ### Prerequisites
#
# You will need to have the **pandas** package installed. To check that you have it, and to import **pandas** into your session, run

# In[ ]:


import pandas as pd


# If this command fails, you don't have **pandas** installed. Open up the terminal in Visual Studio Code (Terminal -> New Terminal) and type in `conda install pandas`. Note that once **pandas** is installed, the convention is to import it into your Python session under the name `pd` by putting `import pandas as pd` at the top of your script.

# ## Getting Started
#
# There's a huge range of input and output formats available in **pandas**: Stata (.dta), Excel (.xls, .xlsx), CSV, TSV, big data formats (HDF5, parquet), JSON, SAS, SPSS, SQL, and more; there's a [full list](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) of formats available in the documentation.
#
# ![From the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg)
#
# While **pandas** has a huge number of ways to read data in and load it into your Python session, here we'll focus on the humble plain-text table file; for example, CSV (comma separated values) and TSV (tab separated values).
#
# ### Reading data from a file
#
# All of the power needed to open plain-text table files is contained in a single function, `pd.read_csv`. It takes numerous arguments, but the two most important are the (unnamed) first one, which gives the path to the data, and `sep=` (a keyword argument) that tells **pandas** whether to expect values to be separated by commas, tabs, or another character. If you omit `sep=`, **pandas** will guess the separator for you. To see the full set of arguments, run `help(pd.read_csv)`.
#
# Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like (using the terminal):

# In[ ]:


get_ipython().system(' cat data/students.csv')


# Note that this is a CSV file, so the values are separated by commas. Now let's load this into a **pandas** data frame in Python:

# In[ ]:


students = pd.read_csv("data/students.csv")
students


# If you want to download these data to try them for yourself, head to [this link](https://github.com/aeturrell/python4DS/blob/main/data/students.csv), right-click on 'Raw', and select "Save Link As...". Save the data in a directory called 'data' with the name 'students.csv'. This directory should sit within your active Visual Studio Code folder. You can check which folder you're currently in by running the following code:
#
# ```python
# import os
#
# os.getcwd()  # get current working directory (cwd)
# ```
#
# Say this comes back with 'python4DS'; then your downloaded data should be in 'python4DS/data/students.csv'.

# The first argument to `read_csv` was the path to the data, and **pandas** guessed that this file uses commas as the separator.
#
# `read_csv` automatically creates a new *index* (which is just the position of each row) and takes the top line of data as the *header* or *column names*.
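# You can see both of these defaults for yourself. Nothing new is happening in the quick check below; it just inspects the `students` data frame we loaded above:

# In[ ]:


print(students.index)    # the automatically created row index: 0 to 5
print(students.columns)  # the column names, taken from the file's first line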
# But you may wish to tweak this behaviour in several ways (we'll put all three of the options below together in a short sketch at the end of this section).
#
# 1. Sometimes there are a few lines of metadata at the top of the file. You can use `skiprows=n` to skip the first `n` lines, eg `pd.read_csv("data/students.csv", skiprows=2)`.
#
# 2. The data might not have column names. You can pass a list to the `names=` keyword argument to tell `read_csv()` what column names to use. For example, `pd.read_csv("data/students.csv", names=range(5))` would put the numbers 0 to 4 in as the column names.
#
# 3. You may wish to change which column is used as the index. The default behaviour is to create an index, but for these data we can see that there's already an ID column we could use. To do this, use the `index_col=` argument, for example `pd.read_csv("data/students.csv", index_col=0)`.
#
# This is all you need to know to read ~75% of CSV files that you'll encounter in practice. Reading tab separated files and fixed width files is done with the same function.

# ### First Steps
#
# Let's take another look at the `students` data.
#
# Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. For example, the column names in the `students` file we read in are formatted in non-standard ways.
#
# You might consider renaming them one by one with `.rename`, or you might use a convenience function from another package to clean them and turn them all into snake case at once. We will make use of the **skimpy** package to do this. **skimpy** is a smaller package that isn't available to install via conda; instead, install it by running `pip install skimpy` in the terminal.
#
# From **skimpy**, we will use the `clean_columns` function; this takes in a data frame and returns a data frame with variable names converted to snake case.

# In[ ]:


from skimpy import clean_columns

students = clean_columns(students)
students


# Another common task after reading in data is to consider variable types. In the `favourite_food` column, there are a bunch of food items and then the value `NaN`, which has been read in as a floating point number rather than as a missing string. We can solve this by casting that column to explicitly be composed of strings:

# In[ ]:


students["favourite_food"] = students["favourite_food"].astype("string")
students


# Similarly, `"age"` has mixed data types: string and integer! Let's map the "five" entry into the number five:

# In[ ]:


students["age"] = students["age"].replace("five", 5)
students["age"]


# In a moment, we will turn this into a column of integers too.

# Another example where the data type is wrong is `meal_plan`. This is a categorical variable with a known set of possible values. **pandas** has a special data type for these:

# In[ ]:


students["meal_plan"] = students["meal_plan"].astype("category")
students["meal_plan"]


# Note that the values in the `meal_plan` variable have stayed exactly the same, but the type of the variable has changed from object to category.
#
# It is a bit tedious to have to go through columns one by one, setting types with single-line assignments. An alternative is to pass a dictionary that maps column names to types, as follows:

# In[ ]:


students = students.astype({"student_id": "int", "full_name": "string", "age": "int"})
students.info()


# ### Exercises
#
# 1. What function would you use to read a file where fields were separated with "|"?
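# As promised, here's a minimal sketch that puts the three `read_csv` tweaks from earlier in this section together. The little CSV below is made up purely for illustration, and `io.StringIO` stands in for a path to a file on disk, so there's nothing to download:

# In[ ]:


import io

# Two lines of metadata, then headerless rows of data: ID, name, age.
raw_csv = "Survey of students\nCollected in 2023\n1,Sunil,7\n2,Barclay,5\n"
pd.read_csv(
    io.StringIO(raw_csv),  # stands in for a file path
    skiprows=2,  # skip the two metadata lines
    names=["id", "name", "age"],  # supply our own column names
    index_col=0,  # use the first column (id) as the index
)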
# ## Reading data from multiple files
#
# Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March.
#
# With `pd.read_csv()` you can read these data in one by one and then stack them on top of each other in a single data frame using the `pd.concat` function. This looks like:

# In[ ]:


list_of_dataframes = [
    pd.read_csv(x)
    for x in ["data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv"]
]
sales_files = pd.concat(list_of_dataframes)
sales_files


# If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the **glob** package (which is built into Python) to find the files for you by matching a pattern in the file names. Note that there may be other CSV files in the `data/` directory, so here we specify `"*-sales.csv"` to ensure we only get those files whose names include the word "sales". Here, `"*"` plays the role of a wildcard: it represents any series of characters.

# In[ ]:


import glob

list_of_csvs = glob.glob("data/*-sales.csv")
print("List of csvs is:")
print(list_of_csvs, "\n")
sales_files = pd.concat([pd.read_csv(x) for x in list_of_csvs])
sales_files


# ## Writing to a file
#
# Just as the typical pattern for reading files is `pd.read_FILETYPE`, where FILETYPE can be, for example, csv, all of the ways of writing **pandas** data frames to disk have the pattern `DATAFRAME.to_FILETYPE`. So to write our sales data to a CSV file, the code would be `sales_files.to_csv(FILEPATH)`, where FILEPATH is the location plus name of the file you want to write to.
#
# Let's see an example of writing data to file using the students data whose data types we did such good work on setting earlier:

# In[ ]:


students.to_csv("data/students-clean.csv")


# Now let's read it back in and check the info on data types:

# In[ ]:


pd.read_csv("data/students-clean.csv").info()


# Notice anything? We lost a lot of the nice data type work we did! While **pandas** guessed that some columns are integers, we lost the string and categorical variables. The reason for this is that plain-text files are not able to carry any contextual information (though **pandas** will guess some column data types).
#
# If you want to save data in a file and have it remember the data types, you'll need to use a different data format. For temporary storage, we recommend using the *feather* format, as it is very fast and interoperable with other programming languages. Interoperability is a good reason to avoid language-specific file formats such as Stata's .dta, R's .rds, and Python's .pickle.
#
# Note that the feather format has an additional dependency in the form of a package called **pyarrow**. To install it, run `pip install pyarrow` in a terminal window.
#
# Here's an example of writing to a feather file:

# In[ ]:


students.to_feather("data/students-clean.feather")


# Now let's re-open that feather file and take a look at the info attached to it:

# In[ ]:


pd.read_feather("data/students-clean.feather").info()


# Saving to this format preserved our data type information.
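# Feather isn't the only **pyarrow**-backed option: the *parquet* format (which appeared in the list of big data formats at the start of this chapter) also preserves data types and is widely used for longer-term storage. As a minimal sketch, the round trip below writes to an in-memory buffer; `io.BytesIO` stands in for a file path, so no file is left on disk:

# In[ ]:


import io

buffer = io.BytesIO()  # an in-memory stand-in for a file on disk
students.to_parquet(buffer)
buffer.seek(0)  # rewind the buffer so it can be read from the start
pd.read_parquet(buffer).info()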
#
# ### Reading and Writing Other Data Formats
#
# The image at the start of this chapter gives you a sense of what other formats are available, but you can find a comprehensive list over at the [official **pandas** documentation on input and output](https://pandas.pydata.org/docs/user_guide/io.html). The same naming pattern holds throughout: `pd.read_FILETYPE` to read and `DATAFRAME.to_FILETYPE` to write; there's one final sketch of this pattern below.
#
# Let's tidy up by deleting the files that we created in this chapter:

# In[ ]:


import os

# Remove the temporary files written by this chapter's examples.
os.remove("data/students-clean.csv")
os.remove("data/students-clean.feather")
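# As a final illustration of that pattern, here's a minimal sketch of a JSON round trip. It needs no files on disk at all: `to_json()` returns a string when no path is given, and `io.StringIO` stands in for a file when reading it back. Note that, like CSV, JSON is plain text, so data types such as category don't survive the trip:

# In[ ]:


import io

json_text = students.to_json()  # no path given, so this returns a JSON string
pd.read_json(io.StringIO(json_text)).info()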