Recall that we previously talked about pandas Series. A DataFrame is a collection of columns, each of which behaves like an individual pandas Series; the result resembles tabular data in Microsoft Excel. Keep in mind that a DataFrame has both a row index and a column index. Let's construct a DataFrame object. A common way is to use a dict of equal-length lists or NumPy arrays. Each column can then be accessed as a Series object.
import pandas as pd
states = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']
year = [2000, 2001, 2002, 2001, 2002]
pop = [1.5, 1.7, 3.6, 2.4, 2.9]
data = {'States': states, 'Year': year, 'Population': pop}
dataframe = pd.DataFrame(data)
dataframe #May need print(dataframe) when running outside a notebook
 | Population | States | Year |
---|---|---|---|
0 | 1.5 | Ohio | 2000 |
1 | 1.7 | Ohio | 2001 |
2 | 3.6 | Ohio | 2002 |
3 | 2.4 | Nevada | 2001 |
4 | 2.9 | Nevada | 2002 |
#Specify columns, index
dataframe2 = pd.DataFrame(data, columns = ['States', 'Population', 'Year'],
index = ['one', 'two', 'three', 'four', 'five'])
dataframe2
 | States | Population | Year |
---|---|---|---|
one | Ohio | 1.5 | 2000 |
two | Ohio | 1.7 | 2001 |
three | Ohio | 3.6 | 2002 |
four | Nevada | 2.4 | 2001 |
five | Nevada | 2.9 | 2002 |
Selecting a single column returns a Series object. Use the syntax below:
dataframe2['Population']
#dataframe2[['Population']] <- double brackets return a DataFrame object instead
one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
Name: Population, dtype: float64
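To see the difference between the single- and double-bracket forms, a quick check (the frame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Rebuild dataframe2 so this snippet is self-contained
data = {'States': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'Population': [1.5, 1.7, 3.6, 2.4, 2.9],
        'Year': [2000, 2001, 2002, 2001, 2002]}
dataframe2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

print(type(dataframe2['Population']))    # single brackets -> pandas Series
print(type(dataframe2[['Population']]))  # double brackets -> pandas DataFrame
```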
With this in mind, we can create new columns in a similar manner:
dataframe2['West'] = [False, False, False, True, True]
dataframe2
 | States | Population | Year | West |
---|---|---|---|---|
one | Ohio | 1.5 | 2000 | False |
two | Ohio | 1.7 | 2001 | False |
three | Ohio | 3.6 | 2002 | False |
four | Nevada | 2.4 | 2001 | True |
five | Nevada | 2.9 | 2002 | True |
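Rather than typing the booleans by hand, the West column could also be derived from the States column with a vectorized comparison (a sketch on a rebuilt copy of the frame):

```python
import pandas as pd

dataframe2 = pd.DataFrame({'States': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                           'Population': [1.5, 1.7, 3.6, 2.4, 2.9],
                           'Year': [2000, 2001, 2002, 2001, 2002]},
                          index=['one', 'two', 'three', 'four', 'five'])

# Elementwise comparison returns a boolean Series aligned on the index
dataframe2['West'] = dataframe2['States'] == 'Nevada'
print(dataframe2)
```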
Sometimes we may want to retrieve one or more columns and assign them to a variable. Much of the time, we select pieces of the DataFrame and analyze them during exploratory data analysis, the phase before machine learning.
salaries = pd.read_csv("Position_Salaries.csv")
salaries.columns #print the column names
Index(['Position', 'Level', 'Salary'], dtype='object')
Retrieve the level and salary columns
salaries[['Level', 'Salary']]
 | Level | Salary |
---|---|---|
0 | 1 | 45000 |
1 | 2 | 50000 |
2 | 3 | 60000 |
3 | 4 | 80000 |
4 | 5 | 110000 |
5 | 6 | 150000 |
6 | 7 | 200000 |
7 | 8 | 300000 |
8 | 9 | 500000 |
9 | 10 | 1000000 |
Print first five rows
salaries[:5]
 | Position | Level | Salary |
---|---|---|---|
0 | Business Analyst | 1 | 45000 |
1 | Junior Consultant | 2 | 50000 |
2 | Senior Consultant | 3 | 60000 |
3 | Manager | 4 | 80000 |
4 | Country Manager | 5 | 110000 |
salaries[salaries['Salary'] > 100000]
 | Position | Level | Salary |
---|---|---|---|
4 | Country Manager | 5 | 110000 |
5 | Region Manager | 6 | 150000 |
6 | Partner | 7 | 200000 |
7 | Senior Partner | 8 | 300000 |
8 | C-level | 9 | 500000 |
9 | CEO | 10 | 1000000 |
salaries['Salary'][salaries['Salary'] > 100000]
4     110000
5     150000
6     200000
7     300000
8     500000
9    1000000
Name: Salary, dtype: int64
salaries.iloc[0] #prints first row, returns series object
Position    Business Analyst
Level                      1
Salary                 45000
Name: 0, dtype: object
salaries.iloc[:, [0, 1]] #all rows, first two columns (positions 0 and 1)
 | Position | Level |
---|---|---|
0 | Business Analyst | 1 |
1 | Junior Consultant | 2 |
2 | Senior Consultant | 3 |
3 | Manager | 4 |
4 | Country Manager | 5 |
5 | Region Manager | 6 |
6 | Partner | 7 |
7 | Senior Partner | 8 |
8 | C-level | 9 |
9 | CEO | 10 |
salaries.iloc[[4, 5, 6], :] #rows at positions 4, 5, and 6, all columns
 | Position | Level | Salary |
---|---|---|---|
4 | Country Manager | 5 | 110000 |
5 | Region Manager | 6 | 150000 |
6 | Partner | 7 | 200000 |
#A slightly more complicated example:
import numpy as np
salaries.iloc[np.r_[1:3, 6:8, 8], np.r_[0:1, 2:3]] #np.r_ concatenates slices; slice stops are exclusive, while the bare 8 adds that single position
 | Position | Salary |
---|---|---|
1 | Junior Consultant | 50000 |
2 | Senior Consultant | 60000 |
6 | Partner | 200000 |
7 | Senior Partner | 300000 |
8 | C-level | 500000 |
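np.r_ simply translates slice notation into one concatenated index array, which is why it pairs well with iloc. A quick check of what the expressions above actually produce:

```python
import numpy as np

# np.r_ concatenates slices (and bare integers) into a single array
print(np.r_[1:3, 6:8, 8])  # slices 1:3 and 6:8 plus the single position 8
print(np.r_[0:1, 2:3])     # positions 0 and 2
```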
salaries.iloc[4, 1] = 88888 # Change individual value
salaries
 | Position | Level | Salary |
---|---|---|---|
0 | Business Analyst | 1 | 45000 |
1 | Junior Consultant | 2 | 50000 |
2 | Senior Consultant | 3 | 60000 |
3 | Manager | 4 | 80000 |
4 | Country Manager | 88888 | 110000 |
5 | Region Manager | 6 | 150000 |
6 | Partner | 7 | 200000 |
7 | Senior Partner | 8 | 300000 |
8 | C-level | 9 | 500000 |
9 | CEO | 10 | 1000000 |
loc is fairly similar; instead of integer positions, it uses row/column labels. For this specific example, let's change the index of the DataFrame to the positions:
salaries.set_index(['Position'], inplace = True)
salaries
Position | Level | Salary |
---|---|---|
Business Analyst | 1 | 45000 |
Junior Consultant | 2 | 50000 |
Senior Consultant | 3 | 60000 |
Manager | 4 | 80000 |
Country Manager | 88888 | 110000 |
Region Manager | 6 | 150000 |
Partner | 7 | 200000 |
Senior Partner | 8 | 300000 |
C-level | 9 | 500000 |
CEO | 10 | 1000000 |
salaries.loc[['Business Analyst', 'Junior Consultant'], :] #prints first two rows, all columns
Position | Level | Salary |
---|---|---|
Business Analyst | 1 | 45000 |
Junior Consultant | 2 | 50000 |
salaries.loc[:, ['Level', 'Salary']] #prints all rows, both remaining columns
Position | Level | Salary |
---|---|---|
Business Analyst | 1 | 45000 |
Junior Consultant | 2 | 50000 |
Senior Consultant | 3 | 60000 |
Manager | 4 | 80000 |
Country Manager | 88888 | 110000 |
Region Manager | 6 | 150000 |
Partner | 7 | 200000 |
Senior Partner | 8 | 300000 |
C-level | 9 | 500000 |
CEO | 10 | 1000000 |
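One difference worth noting: unlike iloc, label slices with loc include both endpoints. A sketch on a small frame mirroring the salaries data above:

```python
import pandas as pd

# Rebuild a slice of the salaries data with Position as the index
salaries = pd.DataFrame({'Level': [1, 2, 3, 4],
                         'Salary': [45000, 50000, 60000, 80000]},
                        index=['Business Analyst', 'Junior Consultant',
                               'Senior Consultant', 'Manager'])
salaries.index.name = 'Position'

# Label slicing with loc is inclusive of the stop label
subset = salaries.loc['Business Analyst':'Senior Consultant', 'Salary']
print(subset)  # 'Senior Consultant' is included
```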
We use the 'any' function to determine whether any values in the Salary column are less than $10,000.
(salaries.Salary < 10000).values.any()
False
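The companion methods .all() and .sum() work the same way on a boolean mask: .all() checks whether every value passes, and .sum() counts how many do (True counts as 1). A sketch using the salary values from the table above:

```python
import pandas as pd

salary = pd.Series([45000, 50000, 60000, 80000, 110000,
                    150000, 200000, 300000, 500000, 1000000])

print((salary < 10000).any())   # False: no salary is below 10,000
print((salary > 100000).all())  # False: some salaries are 100,000 or less
print((salary > 100000).sum())  # 6: number of salaries above 100,000
```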
# Let's reset the index and multiply the salaries by 2.
salaries.reset_index(inplace = True)
salaries['Salary'] = salaries['Salary'] * 2
salaries
 | Position | Level | Salary |
---|---|---|---|
0 | Business Analyst | 1 | 90000 |
1 | Junior Consultant | 2 | 100000 |
2 | Senior Consultant | 3 | 120000 |
3 | Manager | 4 | 160000 |
4 | Country Manager | 88888 | 220000 |
5 | Region Manager | 6 | 300000 |
6 | Partner | 7 | 400000 |
7 | Senior Partner | 8 | 600000 |
8 | C-level | 9 | 1000000 |
9 | CEO | 10 | 2000000 |
You can use other operations when manipulating individual columns.
We will be importing the wine dataset. This dataset contains null (missing) values: 5 columns and 22 rows contain some missing data.
wine = pd.read_csv("winequality-red.csv")
wine.isnull().any(axis = 0) #True for each column containing missing values; chain .sum() instead of .any() to count them
fixed acidity           False
volatile acidity        False
citric acid              True
residual sugar          False
chlorides                True
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                       True
sulphates                True
alcohol                  True
quality                 False
dtype: bool
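As the comment above suggests, chaining .sum() instead of .any() counts the missing values rather than just flagging them. A minimal sketch on a toy frame (the wine CSV itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing values
toy = pd.DataFrame({'pH': [3.51, np.nan, 3.26],
                    'alcohol': [9.4, 9.8, np.nan],
                    'quality': [5, 5, 6]})

print(toy.isnull().sum())        # number of missing values per column
print(toy.isnull().any(axis=1))  # which rows contain any missing value
```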
#Check the first few rows of the data frame.
wine.head()
 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
#wine.dropna(axis = 0, inplace = True) #drop rows containing any missing value
#wine.dropna(axis = 1, inplace = True) #drop columns containing any missing value
Sometimes, we may want to fill in the missing data with a specific value. Keep in mind that dropping null values or replacing them with a particular value may ultimately affect the accuracy of your model, so it's important to choose a strategy appropriate to the context of the problem. Replacing missing values this way is referred to as data imputation. Here is an article on some common approaches: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
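As one example of the strategies the article describes, filling each numeric column with its own mean is a common baseline (a sketch on a toy frame; whether it is appropriate depends on the problem):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092],
                    'sulphates': [0.56, 0.68, np.nan]})

# fillna accepts a Series of per-column values; mean() skips NaN by default
toy = toy.fillna(toy.mean())
print(toy)
```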
#Fill all missing data with zeros
wine.fillna(0, inplace = True)
wine.isnull().any(axis=0)
fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool
We can fill columns with missing data by specific values corresponding to their column.
wine = pd.read_csv("winequality-red.csv") #re-read the data so the missing values are restored
wine.fillna({wine.columns[2]: 0, wine.columns[4]: 2}, inplace = True) #fill only columns 2 and 4 (citric acid, chlorides); wine.columns gives the list of column names
wine.isnull().any(axis = 0)
fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                       True
sulphates                True
alcohol                  True
quality                 False
dtype: bool
wine.head()
 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |