#!/usr/bin/env python
# coding: utf-8

# # Handling data with Python

# **Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))
# **Last updated:** June 2020
# **Conda Environment:** `LearnPythonForResearch`
# **Python version:** Python 3.7
# **Pandas version:** Pandas 1.0.4
# **License:** MIT License
# **Note:** Some features (like the ToC) will only work if you run it locally, use Binder, or use nbviewer by clicking this link:
# https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb

# # *Introduction*

# Getting your data ready for analysis (i.e. "data wrangling") is usually the most time-consuming part of a project. For data wrangling tasks I recommend `Pandas` and `Numpy`.
#
# What is `Pandas`?
#
# > Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
#
# In other words, Pandas introduces a data structure (i.e. the `dataframe`) that represents data as a table with columns and rows. Combining Python with `Pandas` yields a very powerful toolkit with which you can process any type of dataset.

# ## Format of this notebook

# The `Pandas` library is massive and it is continuously expanding in functionality.
# It is, therefore, impossible to be comprehensive and cover everything in just one notebook.
#
# The goal of this notebook is to cover the basic functionality that I expect you to encounter in an average project.
#
# I have based parts of this notebook on a PyCon 2015 tutorial/talk by Brandon Rhodes. If you want to know more, I highly recommend watching his talk and checking out the accompanying GitHub page
# (**updated note:** some parts of the talk use outdated syntax, but it is still a great video):
#
# https://www.youtube.com/watch?v=5JnMutdy6Fw
# https://github.com/brandon-rhodes/pycon-pandas-tutorial
#
# Another great resource is the `Python Data Science Handbook` by Jake VanderPlas, which you can read for free on his website:
# https://jakevdp.github.io/PythonDataScienceHandbook/

# # *Table of Contents*

# * [Import pandas](#import-pandas)
# * [Create a dataframe](#create-dataframe)
# * [Manipulate dataframe](#manipulate-dataframe)
# * [Rename columns](#rename-columns)
# * [View a dataframe using qgrid](#qgrid)
# * [View (parts) of a dataframe using Pandas](#view-dataframe)
# * [Dealing with datatypes](#datatypes)
# * [Handling missing values](#missing-values)
# * [Work with data in the dataframe](#work-with-data)
# * [Combining dataframes](#combining-dataframes)
# * [Group-by operations](#groupby)
# * [Reshaping and Pivot Tables](#reshaping-pivot)
# * [Dealing with dates](#dates)
# * [Dealing with strings](#strings)

# ## Import Pandas [(to top)](#toc)

# In[1]:

import os
import pandas as pd
import numpy as np

# *Note:* it is usually a good idea to also import `numpy` when you use `pandas`, as their functionality is quite intertwined.

# Pandas frequently throws a warning about chained assignment. I prefer to disable that warning.
# You can read more about it here: [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)

# In[2]:

pd.options.mode.chained_assignment = None  # default='warn'
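# To make the warning concrete, below is a minimal sketch (not part of the original notebook) of the pattern that the chained assignment warning is about. The `tmp` dataframe is a made-up example:

# In[ ]:

tmp = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Chained assignment: the double indexing may operate on a copy,
# so the assignment can silently fail (this is what the warning flags).
tmp[tmp['a'] > 1]['b'] = 0

# Recommended alternative: a single .loc call with a row condition and a column label.
tmp.loc[tmp['a'] > 1, 'b'] = 0
tmp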
# ### Parameters

# Path to our data

# In[3]:

data_path = os.path.join(os.getcwd(), 'example_data')

# ## Create a dataframe [(to top)](#toc)

# We can create a dataframe in many ways. Below are a couple of situations:

# ### 1) Load file from drive into Pandas

# For details on opening files such as Excel, CSV, Stata, SAS, HDF see the `1_opening_files` notebook.

# In[4]:

df_auto = pd.read_csv(os.path.join(data_path, 'auto_df.csv'), sep=';', index_col=0)

# ### 2) Create new dataframe and pass data to it

# We can pass many different types of data to the `pd.DataFrame()` constructor.

# In[5]:

d = {'col1': [1,2,3,4], 'col2': [5,6,7,8]}
df = pd.DataFrame(data=d)
df

# In[6]:

d = [(1, 2, 3, 4), (5, 6, 7, 8)]
df = pd.DataFrame(data=d, columns=['col1', 'col2', 'col3', 'col4'])
df

# ### 3) Create dataframe from a dictionary

# We can also directly convert a dictionary to a dataframe:

# In[7]:

d = {'row1': [1,2,3,4], 'row2': [5,6,7,8]}
df = pd.DataFrame.from_dict(d, orient='index', columns=['col1', 'col2', 'col3', 'col4'])
df

# ## Manipulate dataframe [(to top)](#toc)

# ### Add column

# In[8]:

df['col5'] = [10, 10]
df

# ### Add row

# In[9]:

df.loc['row3'] = [11, 12, 13, 14, 15]
df

# ### Transpose the dataframe

# In[10]:

df.T

# ### Remove column

# In[11]:

df = df.drop('col5', axis=1)
df

# ### Remove row

# In[12]:

df = df.drop('row1', axis=0)
df

# ### Set index

# In[13]:

df

# In[14]:

df.set_index('col1')

# `Pandas` also allows a multi-index:

# In[15]:

df.set_index('col1', append=True)

# ### Reset index

# We can convert the index to a regular column using `reset_index()`

# In[16]:

df.reset_index()

# ## Rename columns [(to top)](#toc)

# We can either manipulate `df.columns` directly or use `df.rename()`

# In[17]:

df.columns = ['column1', 'column2', 'column3', 'column4']
df

# In[18]:

df.rename(columns={'column1' : 'col1', 'column2' : 'col2'})

# **Note:** The above creates a copy; it does not modify the dataframe in place!
# We need to use either the `inplace=True` argument or assign the result:

# In[19]:

df = df.rename(columns={'column1' : 'col1', 'column2' : 'col2'})
# or: df.rename(columns={'column1' : 'col1', 'column2' : 'col2'}, inplace=True)

# ## View (parts) of a dataframe using `Pandas` [(to top)](#toc)

# Navigating your way around a dataframe is an important skill. The ability to sub-select parts of a dataframe is important for inspection purposes, analysis, exporting, and much more.

# ### View entire dataframe

# *Note:* Pandas will only show the top and bottom parts if the dataframe is large.

# In[20]:

df_auto

# **Tip:** if you use JupyterLab you can create a new window with the notebook output like so:
#
# ![image.png](attachment:3aff5b53-5087-4c8e-b18b-e6583f5c52cc.png)

# ### Get top or bottom of dataframe

# In[21]:

df_auto.head(3)

# In[22]:

df_auto.tail(3)

# ### Get a random sample of X rows

# In[23]:

X = 5
df_auto.sample(X)

# ### Select column(s) based on name

# *Note:* the below returns a pandas `Series` object, which is different from a pandas `DataFrame` object!
# You can tell by the way that it looks when shown.

# In[24]:

df_auto['make'].head(3)

# If the column name has no whitespace you can also use a dot followed by the column name:
# **Warning:** you can't change / overwrite a column using the dot syntax, only do that with the \[...\] syntax.

# In[25]:

df_auto.make.head(3)

# **If you want multiple columns you need to use double brackets:**
# *Note:* you are passing a standard Python list of column names to the dataframe.

# In[26]:

df_auto[['make', 'price', 'mpg']].head(10)
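# Two built-in helpers can also select columns without typing out every name: `select_dtypes()` picks columns by datatype and `filter()` picks them by name pattern. A small optional sketch (not part of the original notebook):

# In[ ]:

# All numeric columns
df_auto.select_dtypes(include='number').head(3)

# All columns whose name starts with 'm' (e.g. make, mpg)
df_auto.filter(regex='^m').head(3)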
# ### Select row based on index value

# In[27]:

df = df_auto[['make', 'price', 'mpg', 'trunk', 'headroom']].set_index('make')

# In[28]:

df.loc['Buick Riviera']

# *Note:* notice the appearance: this returned a `pandas.Series` object, not a `pandas.DataFrame` object.

# ### Select row based on index position

# In[29]:

df.iloc[2:5]

# **You can also include columns based on their column (!) index position:**

# In[30]:

df.iloc[0:3, 1:3]

# *Note:* In the example above the first `0:3` selects the first 3 rows, the second `1:3` selects the 2nd and 3rd columns.

# ### Select based on condition

# In many cases you want to filter rows based on a condition. You can do this in Pandas by putting the condition inside square brackets.
#
# It is worth explaining the intuition behind this method as a lot of people find it confusing:
#
# 1. You request Pandas to filter a dataframe by putting a condition between square brackets: `df[ `*condition*` ]`
# 2. The `condition` is a sequence of `True` or `False` values for each row (so the length of the `condition` always has to match the number of rows in the dataframe!)
# 3. In Pandas you can generate a `True` or `False` value for each row by simply writing a boolean expression on the whole column.
# 4. Pandas will then only show those rows where the value is `True`
#
# In more practical terms:
#
# `df_auto['price'] < 3800` will evaluate each row of `df_auto['price']` and return, for that row, whether the condition is `True` or `False`:
#
# ```
# 0    False
# 1    False
# 2     True
# 3    False
# 4    False
# 5    False
# ```
#
# By putting that condition in square brackets `df_auto[ df_auto['price'] < 3800 ]` pandas will first generate a sequence of `True` / `False` values and then only display the rows for which the value is `True`.

# In[31]:

df_auto[ df_auto['price'] < 3800 ]

# We can also combine multiple conditions by wrapping each condition in parentheses and chaining them with:
#
# * For an **AND** statement use: `&`
# * For an **OR** statement use: `|`

# In[32]:

df_auto[ (df_auto['price'] < 3800) & (df_auto['foreign'] == 'Foreign') ]

# **Note:** this will return a new dataframe that isn't automatically assigned to a new variable.
# If we want to keep it as a separate dataframe we have to assign it like so:

# In[33]:

df_auto_small = df_auto[(df_auto.price < 3800) & (df_auto.foreign == 'Foreign')]
df_auto_small
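# Besides `&` and `|` there are a couple of other condition helpers that come up a lot in practice.
# Below is a small optional sketch (not part of the original notebook); the make names and price cut-offs are just illustrative values:

# In[ ]:

# .isin() checks membership in a list of values
df_auto[df_auto['make'].isin(['Buick Riviera', 'Honda Civic'])]

# ~ negates a condition (here: all rows that are NOT foreign)
df_auto[~(df_auto['foreign'] == 'Foreign')].head(3)

# .between() is shorthand for a combined >= and <= check
df_auto[df_auto['price'].between(3800, 5000)].head(3)

# .query() lets you write the whole condition as a string
df_auto.query('price < 3800 and foreign == "Foreign"')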
# ### Sort dataframe

# In[34]:

df_auto.sort_values(by=['headroom', 'trunk'], inplace=True)
## Note: `inplace=True` will overwrite the existing dataframe after sorting
df_auto.head()

# ## View a dataframe using `qgrid` [(to top)](#toc)

# If you are used to RStudio or Stata you might be looking for an interactive dataframe viewer. Python / Jupyter does not come with such a feature by default. The primary workflow to inspect parts of the data would be to write quick snippets of code to pull up the data you'd like to see (such as illustrated above).
#
# This can be cumbersome in the beginning; a neat workaround is to use a package called `qgrid` to quickly inspect your data.
#
# This is the GitHub page for `qgrid`: https://github.com/quantopian/qgrid
#
# The first time you run it you do have to run the following command in the terminal / command line:
# 1. `jupyter labextension install @jupyter-widgets/jupyterlab-manager qgrid2`
#
# *Note 1:* you can create a terminal window from within JupyterLab and run it from there (click the plus in the top left and click terminal)
# *Note 2:* you might have to click on the puzzle icon in the left sidebar and click "enable"

# **First make sure you import the `show_grid` function**

# In[35]:

from qgrid import show_grid

# **You can inspect a Dataframe as follows:**

# In[36]:

show_grid(df_auto, grid_options={'editable' : False})

# **Several things to note about qgrid:**
# - I strongly recommend never editing values directly inside qgrid. It is best to include `grid_options={'editable' : False}` when using it to avoid this.
# - Saving a notebook with active `qgrids` in them dramatically increases the file size. Generally it is best to remove the qgrid as soon as you are done.
# - Opening very big dataframes using `show_grid()` might cause stability issues.
# - These `qgrids` will only display locally; if you are viewing this notebook on GitHub you will not see the actual `qgrid`.
# - There are a bunch of options you can use with `show_grid()`; however, I strongly discourage you from modifying a dataframe using the qgrid toolbars.

# ## Dealing with datatypes [(to top)](#toc)

# It is important to pay attention to the datatypes contained in a column. A lot of errors that you will encounter relate to wrong datatypes (e.g. because of data errors).

# **Remark:** you can enter any Python object into a dataframe, but that doesn't mean it is a good idea. My recommendation is to mainly include `int`, `float`, `date`, or `str` data types in a dataframe. In other cases you are probably better off using native Python data structures and not a dataframe.

# In[37]:

pd.DataFrame([({'TestKey' : 'TestVal'}, lambda x: x + 2)], columns=['col1', 'col2'])

# ### Show current datatypes:

# In[38]:

df_auto.dtypes

# **Important note about strings**
# Prior to Pandas 1.0, strings would be stored in a dataframe as an `object` datatype. However, the `object` dtype is not specific to just strings and is also assigned to any other dtype that isn't recognized by Pandas.
#
# This is why Pandas 1.0 introduced a new datatype for strings called `string`. ([Link with details](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html))
#
# As of June 2020, the `pandas.StringDtype` is still considered experimental, so I will not use it explicitly in the demos below.

# ### Convert datatypes

# We can convert the datatype of a column in two ways:
#
# 1. Use the built-in Pandas functions to convert the column in one go
# 2. Loop over the values and convert them individually

# *1) Use Pandas functions*

# If you want to convert a column to `string`, I recommend using `.astype(str)`:

# In[39]:

df_auto['length'] = df_auto['length'].astype(str)
df_auto['length'].dtypes

# If you want to convert a column to `numeric`, I recommend using `pd.to_numeric()`:

# In[40]:

df_auto['length'] = pd.to_numeric(df_auto['length'])
df_auto['length'].dtypes

# *2) Convert values individually*

# Using the built-in Pandas functions is usually sufficient; however, sometimes you want to have more control over the conversion.

# In[41]:

df_auto['length'].apply(lambda x: str(x)).dtypes

# Note: `'O'` stands for 'object'

# In[42]:

df_auto['length'].apply(lambda x: int(x)).dtypes
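# *A note on conversion errors:* by default `pd.to_numeric()` will raise an error if it encounters a value it cannot parse.
# The sketch below is not part of the original notebook (the `messy` series is made up); it shows the `errors='coerce'` argument, which turns unparseable values into `np.nan` so you can deal with them using the missing-value tools discussed below.

# In[ ]:

messy = pd.Series(['12', '15.5', 'unknown', '20'])

# 'unknown' cannot be converted, so it becomes NaN instead of raising an error
pd.to_numeric(messy, errors='coerce')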
# The section `dealing with dates` will discuss how to convert a column with `dates`.

# *Honorable mention:*
#
# Pandas 1.0 introduced a new feature (`convert_dtypes()`) that can attempt to resolve dtype conflicts for you. See the example below:

# In[43]:

df_auto.convert_dtypes()

# ## Handling missing values [(to top)](#toc)

# Dealing with missing values is easy in Pandas, as long as you are careful to define them as `np.nan` (and **not** a string value like 'np.nan')
# http://pandas.pydata.org/pandas-docs/stable/missing_data.html

# ### Add some missing values

# *Note:* We define a missing value as `np.nan` so we can consistently select them!

# In[44]:

df_auto.loc['Example_Car'] = [np.nan for x in range(0, len(df_auto.columns))]
df_auto.loc['Example_Bike'] = [np.nan for x in range(0, len(df_auto.columns))]

# In[45]:

df_auto.loc[['Example_Car', 'Example_Bike']]

# ### Select missing or non-missing values

# Always use `pd.isnull()` or `pd.notnull()` when conditioning on missing values. This is the most reliable.
# `df_auto.make == np.nan` will **not** work consistently.

# In[46]:

df_auto[pd.isnull(df_auto.make)]

# In[47]:

df_auto[pd.notnull(df_auto.make)].head()

# ### Fill missing values

# To fill missing values with something we can use `.fillna()`

# In[48]:

df = df_auto.fillna('Missing')
df.loc[['Example_Car', 'Example_Bike']]

# ### Drop rows with missing values

# To drop missing values we can use `.dropna()`

# In[49]:

df_auto['make'].tail(3)

# In[50]:

df_auto = df_auto.dropna(axis=0)
df_auto['make'].tail(3)

# *Note 1:* `axis=0` indicates that we want to drop rows and not columns.
# *Note 2:* the `subset=[...]` parameter can be used in `dropna()` to only look for missing values in certain columns.
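# Two related helpers worth knowing; a quick sketch that is not part of the original notebook:
# `df.isnull().sum()` counts the missing values per column (a handy diagnostic on freshly loaded data),
# and `dropna(subset=[...])` only drops rows that are missing in the listed column(s).

# In[ ]:

# Count missing values per column (all zeros here, because we just dropped the example rows)
df_auto.isnull().sum()

# Only drop rows where 'make' is missing, ignoring missing values in other columns
df_auto.dropna(subset=['make']).tail(3)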
# ## Work with data in the dataframe [(to top)](#toc)

# ### Combine columns (and output it to a new column)

# *Remember:* You can select a column using:
# 1. `df_auto['price']`
# 2. `df_auto.price` --> but this one only works if there are no spaces in the column name

# In[51]:

df_auto['price_trunk_ratio'] = df_auto.price / df_auto.trunk
df_auto[['price', 'trunk', 'price_trunk_ratio']].head()

# ### Generate a new column by iterating over the dataframe per row

# There are multiple ways to iterate over rows.
# They mainly differ in their trade-off between ease of use, readability, and performance.
#
# I will show the three main possibilities.
#
# For the sake of demonstration, let's say our goal is to achieve the following:
# > If the car is a foreign brand, multiply the price by 1.5

# **Option 1: use `.apply()` with `lambda`**

# *Note:* `lambda` is a so-called anonymous function.

# In[52]:

logic = lambda x: x.price * 1.5 if x.foreign == 'Foreign' else x.price
df_auto['new_price'] = df_auto.apply(logic, axis=1)
df_auto[['make', 'price', 'foreign', 'new_price']].head()

# *Note:* make sure to include the `axis=1` argument; this tells Pandas to iterate over the rows and not the columns.

# **Option 2: use `.apply()` with a function**

# In the example above we used an anonymous `lambda` function.
# For more complex processing it is possible to use a defined function and call it in `.apply()`
#
# **Personal note:** This is often my preferred method as it is the most flexible and a lot easier to read.

# In[53]:

def new_price_function(row):
    if row.foreign == 'Foreign':
        return row.price * 1.5
    else:
        return row.price

# In[54]:

df_auto['new_price'] = df_auto.apply(new_price_function, axis=1)
df_auto[['make', 'price', 'foreign', 'new_price']].head()

# *Note:* make sure to include the `axis=1` argument; this tells Pandas to iterate over the rows and not the columns.

# **Option 3: loop over the dataframe row by row**

# *Note:* Using `.apply()` is generally considered best practice over using `iterrows()`. I'd only use `iterrows()` if you have to.

# In[55]:

new_list = []
for index, row in df_auto.iterrows():
    if row.foreign == 'Foreign':
        new_list.append(row.price * 1.5)
    else:
        new_list.append(row.price)

df_auto['new_price'] = new_list
df_auto[['make', 'price', 'foreign', 'new_price']].head()

# ## Combining dataframes [(to top)](#toc)

# You can combine dataframes in three ways:
#
# 1. Merge
# 2. Join
# 3. Append

# I will demonstrate that using the following two datasets:

# In[56]:

df_auto_p1 = df_auto[['make', 'price', 'mpg']]
df_auto_p2 = df_auto[['make', 'headroom', 'trunk']]

# In[57]:

df_auto_p1.head(3)

# In[58]:

df_auto_p2.head(3)

# ### 1) Merge datasets

# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

# The `.merge()` function is one of my personal favorites; it is really easy to use.

# In[59]:

merged_auto = pd.merge(df_auto_p1, df_auto_p2, how='left', on='make')
merged_auto.head(3)

# ### 2) Join datasets on index

# If both dataframes share the same index you can also join on the index. This can be faster compared to `.merge`.

# In[60]:

df_auto_p1.set_index('make', inplace=True)
df_auto_p2.set_index('make', inplace=True)

# In[61]:

joined_auto = df_auto_p1.join(df_auto_p2)
joined_auto.reset_index().head(3)

# ### 3) Append data to the dataframe

# See http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects
#
# *Note:* There is also a shortcut function called `.append()`

# In[62]:

df_auto_i1 = df_auto.iloc[0:3]
df_auto_i2 = df_auto.iloc[3:6]

# In[63]:

df_auto_i1

# In[64]:

df_auto_i2

# Using the higher level function `concat()`:

# In[65]:

pd.concat([df_auto_i1, df_auto_i2])

# Using the shortcut function `append()`:

# In[66]:

df_auto_i1.append(df_auto_i2)
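# Circling back to `.merge()`: two optional arguments that help catch merge problems early are `indicator=True`, which adds a `_merge` column showing where each row came from, and `validate`, which raises an error if the merge relationship is not what you expect. A small sketch (not part of the original notebook; we reset the index because `make` was set as the index above):

# In[ ]:

check_merge = pd.merge(df_auto_p1.reset_index(), df_auto_p2.reset_index(),
                       how='left', on='make',
                       indicator=True, validate='one_to_one')
check_merge['_merge'].value_counts()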
# ## Group-by operations [(to top)](#toc)

# Often you want to perform an operation within a group; in Pandas you achieve this by using `.groupby()`.
#
# Pandas `.groupby()` is a process involving one or more of the following steps (paraphrasing from the docs):
# 1. **Splitting** the data into groups based on some criteria
# 2. **Applying** a function to each group independently
# 3. **Combining** the results into a data structure
#
# For the full documentation see: http://pandas.pydata.org/pandas-docs/stable/groupby.html

# ### Split the dataframe by creating a group object:

# Step 1 is to create a `group` object that specifies the groups that we want to create.
# After creating a `group` object we can apply operations to it.

# In[67]:

col_list = ['price', 'mpg', 'headroom', 'trunk', 'weight', 'length']
grouped = df_auto[col_list + ['foreign']].groupby(['foreign'])

# *Note:* it is also possible to group by multiple columns (e.g. `df_auto.groupby(['foreign', 'trunk'])`)

# ### Applying example 1) Compute mean summary statistic

# In[68]:

grouped.mean()

# ### Applying example 2) Retrieve particular group:

# In[69]:

grouped.get_group('Domestic').head()

# ### Applying example 3) Iterate over the groups in the `group` object

# By iterating over each group you get a lot of flexibility as you can do anything you want with each group.
#
# It is worth noting that each group is a dataframe object.

# In[70]:

for name, group in grouped:
    print(name)
    print(group.head())

# #### It is also possible to use the `.apply()` function on `group` objects:
#
# Using a `lambda df: ....` with `.apply()` is a nice way to iterate over subsets of the data.
#
# For example, let's say we want to get the cheapest car within each "trunk" size category:

# In[71]:

df_auto.groupby('trunk').apply(lambda df: df.sort_values('price').iloc[0]).head()[['trunk', 'make', 'price']]

# ### Create new dataframe with one row for each group

# If you want to aggregate each group to one row in a new dataframe you have many options; below are a couple of examples:

# ### 1) `grouped.sum()` and `grouped.mean()`

# In[72]:

grouped.sum()

# In[73]:

grouped.mean()

# ### 2) `grouped.count()` and `grouped.size()`

# In[74]:

grouped.count()

# In[75]:

grouped.size()

# ### 3) `grouped.first()` and `grouped.last()`

# In[76]:

grouped.first()

# In[77]:

grouped.last()

# **4) You can also use the `.agg()` function to specify which operations to perform for each column**

# In[78]:

grouped.agg({'price' : 'first',
             'mpg' : ['mean', 'median'],
             'trunk' : ['mean', (lambda x: 100 * np.mean(x))]},
            )

# ## Add a column to the original dataframe with a value from the groupby operation

# Frequently it is useful to generate a new column in the existing dataframe based on a groupby operation.
# This operation is generally called a `transform` operation.
#
# For example, let's say we want to group the data by trunk size and we want to create a new column which has the weight as a ratio of the group mean.

# ### Method 1: using `transform`

# In[79]:

df_auto['weight_ratio'] = df_auto.groupby('trunk')['weight'].transform(lambda x: x / x.mean())
df_auto[['trunk', 'weight', 'weight_ratio']].head()

# ### Method 2: manually using `.merge()`

# The `transform` method can sometimes become hard to use; in those cases you might have to resort to just doing it manually using a `.merge` operation.

# In[80]:

mean_per_group = df_auto.groupby('trunk').apply(lambda df: df['weight'].mean()).reset_index()
mean_per_group = mean_per_group.rename(columns={0 : 'weight_group_mean'})
print('Note: this new DF only has {} rows!'.format(len(mean_per_group)))
mean_per_group.head(2)

# In[81]:

tmp_df = pd.merge(df_auto, mean_per_group, on='trunk', how='left')[['trunk', 'weight', 'weight_group_mean']]
tmp_df['weight_ratio'] = tmp_df['weight'] / tmp_df['weight_group_mean']
tmp_df.head()
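# *A third, simpler option (a sketch, not part of the original notebook):* `.transform()` also accepts the name of a built-in aggregation such as `'mean'`, which is often easier to read (and faster) than writing a lambda. This reproduces the `weight_ratio` column from Method 1 in two steps:

# In[ ]:

group_mean = df_auto.groupby('trunk')['weight'].transform('mean')
df_auto['weight_ratio_alt'] = df_auto['weight'] / group_mean
df_auto[['trunk', 'weight', 'weight_ratio', 'weight_ratio_alt']].head()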
# ## And a lot of other groupby operations!

# There are many, many more things you can do with Pandas `.groupby()`; too many to show here.
# Feel free to check out the comprehensive documentation:
# https://pandas.pydata.org/pandas-docs/stable/groupby.html

# ## Reshaping and Pivot Tables [(to top)](#toc)

# Pandas includes a variety of tools that allow you to reshape your DataFrame.

# ### Create some sample data:

# In[82]:

tuples = [('bar', 'one', 1, 10),
          ('bar', 'two', 2, 11),
          ('bar', 'three', 3, 12),
          ('baz', 'one', 4, 13),
          ('baz', 'two', 5, 14),
          ('baz', 'three', 6, 15),
          ('foo', 'one', 7, 16),
          ('foo', 'two', 8, 17),
          ('foo', 'three', 9, 18)
         ]
df = pd.DataFrame(tuples)
df.columns = ['first', 'second', 'A', 'B']

# In[83]:

df

# ### Example 1) Create a pivot table (i.e. long to wide)

# Using the `pivot()` function:
# http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects

# In[84]:

df.pivot(index='first', columns='second', values='A')

# Using the `pd.pivot_table()` function:
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

# In[85]:

pd.pivot_table(df, values=['A', 'B'], index='first', columns='second')

# *Note 1:* the above illustrates that Pandas essentially has two indexes: the usual 'row index' but also a 'column index'
#
# Pandas also has an "unpivot" function called `pandas.melt` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html)

# In[86]:

wide_df = pd.DataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)], columns=['A', 'B', 'C'])
wide_df

# In[87]:

pd.melt(wide_df, id_vars=['A'], value_vars=['B', 'C'])
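# One practical difference between the two pivot functions (a sketch, not part of the original notebook; the `dup_df` data is made up): `pivot()` requires every index/column combination to be unique, while `pivot_table()` aggregates duplicates using the `aggfunc` argument (the default is `'mean'`).

# In[ ]:

dup_df = pd.DataFrame({'first': ['bar', 'bar', 'baz'],
                       'second': ['one', 'one', 'one'],
                       'A': [1, 3, 5]})

# dup_df.pivot(index='first', columns='second', values='A')  # would raise an error: duplicate entries

pd.pivot_table(dup_df, values='A', index='first', columns='second', aggfunc='sum')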
# ### Example 2: Stack and Unstack

# `Stack` and `Unstack` are higher level operators to reshape a dataframe based on a multi-level index.
# From the documentation:
# > stack: "pivot" a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.
# > unstack: inverse operation from stack: "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
#
# http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
#
# In other words:
# **Stack** --> move the data "down"
# **Unstack** --> move the data "up"
#
# These operations are primarily useful when moving complex data from wide to long or long to wide format.

# ### Stack

# In[88]:

pivot_df = pd.pivot_table(df, values=['A', 'B'], index='first', columns='second')
pivot_df

# In[89]:

pivot_df.stack(level=['second'])

# *Note:* We could also just use `pivot_df.stack()` as it will by default choose the 'last' level of the index.

# ### Unstack

# In[90]:

df.set_index(['first', 'second'], inplace=True)

# In[91]:

df

# In[92]:

df.unstack(level=['first'])

# In[93]:

df.unstack(level=['second'])

# ## Dealing with dates [(to top)](#toc)

# Pandas has a lot of functionality to deal with time series data:
# http://pandas.pydata.org/pandas-docs/stable/timeseries.html
#
# A nice overview from the documentation:

# | Class        | My interpretation                                 | Example               | How to create:                    |
# |--------------|---------------------------------------------------|-----------------------|-----------------------------------|
# | `Timestamp`  | Specific date and time                            | `2017-01-01 03:00:00` | `to_datetime`, `date_range`       |
# | `Timedelta`  | An absolute time duration                         | `5 hours`             | `to_timedelta`, `timedelta_range` |
# | `Period`     | Regular intervals of time                         | `daily`               | `Period`, `period_range`          |
# | `DateOffset` | Similar to `timedelta` but follows calendar rules | `same time next day`  | `DateOffset`                      |

# ### Make an example DF

# In[94]:

df_ad = df_auto.copy()[['make', 'price']].head(5)

# ### 1) Create a `date` column with the right dtype

# Let's say we load in the data and all the dates are imported as strings:

# In[95]:

df_ad['date'] = ['01-01-2018', '30-05-2020', '10-10-2019', '05-04-2013', '05-07-2015']

# In[96]:

df_ad['date'].dtypes

# Let's convert the dates to a date dtype:

# In[97]:

df_ad['date'] = pd.to_datetime(df_ad['date'], dayfirst=True)
df_ad.dtypes

# In[98]:

df_ad['date']

# ### 2) Select observations that fall in a certain range

# We can use `pd.Timestamp` or `pd.to_datetime` to create a datetime object that we can use in a condition:

# In[99]:

pd.Timestamp('2011-02-01')

# *Note:* it is usually a good idea to explicitly include the format, to avoid unexpected behavior

# In[100]:

pd.to_datetime('01-02-2011', format='%d-%m-%Y')

# Select all observations with a date later than Jan 1st 2016:

# In[101]:

df_ad[df_ad.date > pd.to_datetime('01-01-2016', format='%d-%m-%Y')]

# We can also use `isin()` with a `date_range` object to get all observations in a range:

# In[102]:

start_date = pd.to_datetime('01-01-2015', format='%d-%m-%Y')
end_date = pd.to_datetime('31-12-2018', format='%d-%m-%Y')
df_ad[df_ad.date.isin(pd.date_range(start_date, end_date, freq='D'))]

# ### Date ranges:

# You can use `date_range` to create an array of datetime objects:

# In[103]:

pd.date_range('1/1/2011', periods=len(df_auto.index), freq='D')

# In[104]:

start_date = pd.to_datetime('01-01-2015', format='%d-%m-%Y')
end_date = pd.to_datetime('31-12-2018', format='%d-%m-%Y')
pd.date_range(start_date, end_date, freq='D')
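# Related to the `Timedelta` class in the overview table above: subtracting two dates yields a `Timedelta`, and the `.dt.days` attribute extracts the number of days. A small sketch (not part of the original notebook; the reference date is purely illustrative):

# In[ ]:

# Number of days between each date and January 1st 2015
(df_ad['date'] - pd.Timestamp('2015-01-01')).dt.days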
# ### 3) Select components of the dates

# You can access special date operations by using the `.dt.` accessor.
# See: https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dt-accessors

# In[105]:

df_ad['day'] = df_ad['date'].dt.day
df_ad.head()

# In[106]:

df_ad['year'] = df_ad['date'].dt.year
df_ad.head()

# ### 4) Manipulate (offset) the date

# See: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
#
# Add 1 year to each date:

# In[107]:

df_ad['new_date'] = df_ad.date.apply(lambda x: x + pd.DateOffset(years=1))
df_ad.head()

# Add 2 business days to each date:

# In[108]:

df_ad['new_date'] = df_ad.date.apply(lambda x: x + 2 * pd.offsets.BDay())
df_ad.head()

# ## Dealing with strings [(to top)](#toc)

# Pandas has a variety of built-in operations that you can use to modify string values:
# https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str
#
# You can access the functions by using the `.str.` accessor.

# **Example:** Get the length (i.e. number of characters) for each make:

# In[109]:

df_auto['make_len'] = df_auto['make'].str.len()
df_auto[['make', 'make_len']]

# **Example:** Convert the make to all uppercase

# In[110]:

df_auto['make_upper'] = df_auto['make'].str.upper()
df_auto[['make', 'make_upper']]

# **Example:** Replace all spaces with `-`

# In[111]:

df_auto['make_no_ws'] = df_auto['make'].str.replace(' ', '-')
df_auto[['make', 'make_no_ws']]
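# A few other `.str.` methods that come up a lot; a final sketch that is not part of the original notebook (the 'Buick' filter and the `brand` column are just illustrative):

# In[ ]:

# Boolean matching with .contains()
df_auto[df_auto['make'].str.contains('Buick')][['make']].head(3)

# Grab the first word of the make with .split() and the .str[0] indexer
df_auto['brand'] = df_auto['make'].str.split(' ').str[0]

# The same thing using a regular expression capture group with .extract()
df_auto['brand_re'] = df_auto['make'].str.extract(r'^(\w+)', expand=False)

df_auto[['make', 'brand', 'brand_re']].head()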