Recall that we previously talked about pandas Series. A DataFrame is a collection of columns, each of which behaves like an individual pandas Series; the result resembles tabular data in Microsoft Excel. Keep in mind that a DataFrame has both a row index and a column index. Let's construct a DataFrame object. A common way is to use a dict of equal-length lists or NumPy arrays. Each column can then be accessed as a Series object.
import pandas as pd
states = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']
year = [2000, 2001, 2002, 2001, 2002]
pop = [1.5, 1.7, 3.6, 2.4, 2.9]
data = {'States': states, 'Year': year, 'Population': pop}
dataframe = pd.DataFrame(data)
dataframe #May need print(dataframe) when running outside a notebook
 | Population | States | Year |
---|---|---|---|
0 | 1.5 | Ohio | 2000 |
1 | 1.7 | Ohio | 2001 |
2 | 3.6 | Ohio | 2002 |
3 | 2.4 | Nevada | 2001 |
4 | 2.9 | Nevada | 2002 |
#Specify columns, index
dataframe2 = pd.DataFrame(data, columns = ['States', 'Population', 'Year'],
index = ['one', 'two', 'three', 'four', 'five'])
dataframe2
 | States | Population | Year |
---|---|---|---|
one | Ohio | 1.5 | 2000 |
two | Ohio | 1.7 | 2001 |
three | Ohio | 3.6 | 2002 |
four | Nevada | 2.4 | 2001 |
five | Nevada | 2.9 | 2002 |
Selecting a single column returns a Series object. Use the syntax below:
dataframe2['Population']
#dataframe2[['Population']] <- double brackets return a DataFrame object instead
one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
Name: Population, dtype: float64
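To see the difference between the single- and double-bracket forms, a quick check (the frame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

# Rebuild dataframe2 so this snippet is self-contained
data = {'States': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'Population': [1.5, 1.7, 3.6, 2.4, 2.9],
        'Year': [2000, 2001, 2002, 2001, 2002]}
dataframe2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

print(type(dataframe2['Population']))    # single brackets -> pandas Series
print(type(dataframe2[['Population']]))  # double brackets -> pandas DataFrame
```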
With this in mind, we can create new columns in a similar manner:
dataframe2['West'] = [False, False, False, True, True]
dataframe2
 | States | Population | Year | West |
---|---|---|---|---|
one | Ohio | 1.5 | 2000 | False |
two | Ohio | 1.7 | 2001 | False |
three | Ohio | 3.6 | 2002 | False |
four | Nevada | 2.4 | 2001 | True |
five | Nevada | 2.9 | 2002 | True |
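Rather than typing the booleans by hand, the West column could also be derived from the States column with a vectorized comparison (a sketch on a rebuilt copy of the frame):

```python
import pandas as pd

dataframe2 = pd.DataFrame({'States': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                           'Population': [1.5, 1.7, 3.6, 2.4, 2.9],
                           'Year': [2000, 2001, 2002, 2001, 2002]},
                          index=['one', 'two', 'three', 'four', 'five'])

# Elementwise comparison returns a boolean Series aligned on the index
dataframe2['West'] = dataframe2['States'] == 'Nevada'
print(dataframe2)
```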
Sometimes we may want to retrieve one or more columns and assign them to a variable. Much of the time, we select pieces of the DataFrame and analyze them during exploratory data analysis, the phase before machine learning.
salaries = pd.read_csv("Position_Salaries.csv")
salaries.columns #print the column names
Index(['Position', 'Level', 'Salary'], dtype='object')
Retrieve the level and salary columns
salaries[['Level', 'Salary']]
 | Level | Salary |
---|---|---|
0 | 1 | 45000 |
1 | 2 | 50000 |
2 | 3 | 60000 |
3 | 4 | 80000 |
4 | 5 | 110000 |
5 | 6 | 150000 |
6 | 7 | 200000 |
7 | 8 | 300000 |
8 | 9 | 500000 |
9 | 10 | 1000000 |
Print first five rows
salaries[:5]
 | Position | Level | Salary |
---|---|---|---|
0 | Business Analyst | 1 | 45000 |
1 | Junior Consultant | 2 | 50000 |
2 | Senior Consultant | 3 | 60000 |
3 | Manager | 4 | 80000 |
4 | Country Manager | 5 | 110000 |
salaries[salaries['Salary'] > 100000]
 | Position | Level | Salary |
---|---|---|---|
4 | Country Manager | 5 | 110000 |
5 | Region Manager | 6 | 150000 |
6 | Partner | 7 | 200000 |
7 | Senior Partner | 8 | 300000 |
8 | C-level | 9 | 500000 |
9 | CEO | 10 | 1000000 |
salaries['Salary'][salaries['Salary'] > 100000]
4     110000
5     150000
6     200000
7     300000
8     500000
9    1000000
Name: Salary, dtype: int64
salaries.iloc[0] #prints first row, returns series object
Position    Business Analyst
Level                      1
Salary                 45000
Name: 0, dtype: object
salaries.iloc[:, [0, 1]] #all rows, first two columns (positions 0 and 1)
 | Position | Level |
---|---|---|
0 | Business Analyst | 1 |
1 | Junior Consultant | 2 |
2 | Senior Consultant | 3 |
3 | Manager | 4 |
4 | Country Manager | 5 |
5 | Region Manager | 6 |
6 | Partner | 7 |
7 | Senior Partner | 8 |
8 | C-level | 9 |
9 | CEO | 10 |
salaries.iloc[[4, 5, 6], :] #rows at positions 4, 5, and 6, all columns
 | Position | Level | Salary |
---|---|---|---|
4 | Country Manager | 5 | 110000 |
5 | Region Manager | 6 | 150000 |
6 | Partner | 7 | 200000 |
#A slightly more complicated example:
import numpy as np
salaries.iloc[np.r_[1:3, 6:8, 8], np.r_[0:1, 2:3]] #np.r_ concatenates slices; slice stops are exclusive, while the bare 8 adds that single position
 | Position | Salary |
---|---|---|
1 | Junior Consultant | 50000 |
2 | Senior Consultant | 60000 |
6 | Partner | 200000 |
7 | Senior Partner | 300000 |
8 | C-level | 500000 |
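np.r_ simply translates slice notation into one concatenated index array, which is why it pairs well with iloc. A quick check of what the expressions above actually produce:

```python
import numpy as np

# np.r_ concatenates slices (and bare integers) into a single array
print(np.r_[1:3, 6:8, 8])  # slices 1:3 and 6:8 plus the single position 8
print(np.r_[0:1, 2:3])     # positions 0 and 2
```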
salaries.iloc[4, 1] = 88888 # Change individual value
salaries
 | Position | Level | Salary |
---|---|---|---|
0 | Business Analyst | 1 | 45000 |
1 | Junior Consultant | 2 | 50000 |
2 | Senior Consultant | 3 | 60000 |
3 | Manager | 4 | 80000 |
4 | Country Manager | 88888 | 110000 |
5 | Region Manager | 6 | 150000 |
6 | Partner | 7 | 200000 |
7 | Senior Partner | 8 | 300000 |
8 | C-level | 9 | 500000 |
9 | CEO | 10 | 1000000 |
loc is fairly similar; instead of integer positions, it uses row/column labels. For this specific example, let's change the index of the DataFrame to the positions:
salaries.set_index(['Position'], inplace = True)
salaries
Position | Level | Salary |
---|---|---|
Business Analyst | 1 | 45000 |
Junior Consultant | 2 | 50000 |
Senior Consultant | 3 | 60000 |
Manager | 4 | 80000 |
Country Manager | 88888 | 110000 |
Region Manager | 6 | 150000 |
Partner | 7 | 200000 |
Senior Partner | 8 | 300000 |
C-level | 9 | 500000 |
CEO | 10 | 1000000 |
salaries.loc[['Business Analyst', 'Junior Consultant'], :] #prints first two rows, all columns
Position | Level | Salary |
---|---|---|
Business Analyst | 1 | 45000 |
Junior Consultant | 2 | 50000 |
salaries.loc[:, ['Level', 'Salary']] #prints all rows, both remaining columns
Position | Level | Salary |
---|---|---|
Business Analyst | 1 | 45000 |
Junior Consultant | 2 | 50000 |
Senior Consultant | 3 | 60000 |
Manager | 4 | 80000 |
Country Manager | 88888 | 110000 |
Region Manager | 6 | 150000 |
Partner | 7 | 200000 |
Senior Partner | 8 | 300000 |
C-level | 9 | 500000 |
CEO | 10 | 1000000 |
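One difference worth noting: unlike iloc, label slices with loc include both endpoints. A sketch on a small frame mirroring the salaries data above:

```python
import pandas as pd

# Rebuild a slice of the salaries data with Position as the index
salaries = pd.DataFrame({'Level': [1, 2, 3, 4],
                         'Salary': [45000, 50000, 60000, 80000]},
                        index=['Business Analyst', 'Junior Consultant',
                               'Senior Consultant', 'Manager'])
salaries.index.name = 'Position'

# Label slicing with loc is inclusive of the stop label
subset = salaries.loc['Business Analyst':'Senior Consultant', 'Salary']
print(subset)  # 'Senior Consultant' is included
```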
We use the 'any' function to determine whether any values in the Salary column are less than $10,000.
(salaries.Salary < 10000).values.any()
False
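The companion methods .all() and .sum() work the same way on a boolean mask: .all() checks whether every value passes, and .sum() counts how many do (True counts as 1). A sketch using the salary values from the table above:

```python
import pandas as pd

salary = pd.Series([45000, 50000, 60000, 80000, 110000,
                    150000, 200000, 300000, 500000, 1000000])

print((salary < 10000).any())   # False: no salary is below 10,000
print((salary > 100000).all())  # False: some salaries are 100,000 or less
print((salary > 100000).sum())  # 6: number of salaries above 100,000
```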
# Let's reset the index and multiply the salaries by 2.
salaries.reset_index(inplace = True)
salaries['Salary'] = salaries['Salary'] * 2
salaries
 | Position | Level | Salary |
---|---|---|---|
0 | Business Analyst | 1 | 90000 |
1 | Junior Consultant | 2 | 100000 |
2 | Senior Consultant | 3 | 120000 |
3 | Manager | 4 | 160000 |
4 | Country Manager | 88888 | 220000 |
5 | Region Manager | 6 | 300000 |
6 | Partner | 7 | 400000 |
7 | Senior Partner | 8 | 600000 |
8 | C-level | 9 | 1000000 |
9 | CEO | 10 | 2000000 |
You can use other operations when manipulating individual columns.
We will be importing the wine dataset. This dataset contains null (missing) values: 5 columns and 22 rows contain some missing data.
wine = pd.read_csv("winequality-red.csv")
wine.isnull().any(axis = 0) #True for each column containing missing values; chain .sum() instead of .any() to count them
fixed acidity           False
volatile acidity        False
citric acid              True
residual sugar          False
chlorides                True
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                       True
sulphates                True
alcohol                  True
quality                 False
dtype: bool
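As the comment above suggests, chaining .sum() instead of .any() counts the missing values rather than just flagging them. A minimal sketch on a toy frame (the wine CSV itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing values
toy = pd.DataFrame({'pH': [3.51, np.nan, 3.26],
                    'alcohol': [9.4, 9.8, np.nan],
                    'quality': [5, 5, 6]})

print(toy.isnull().sum())        # number of missing values per column
print(toy.isnull().any(axis=1))  # which rows contain any missing value
```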
#Check the first few rows of the data frame.
wine.head()
 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
#wine.dropna(axis = 0, inplace = True) #drop rows containing any missing value
#wine.dropna(axis = 1, inplace = True) #drop columns containing any missing value
Sometimes, we may want to fill in the missing data with a specific value. Keep in mind that dropping null values or replacing them with a particular value may ultimately affect the accuracy of your model, so it's important to choose a strategy appropriate to the context of the problem. Replacing missing values this way is referred to as data imputation. Here is an article on some common approaches: https://www.theanalysisfactor.com/seven-ways-to-make-up-data-common-methods-to-imputing-missing-data/
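As one example of the strategies the article describes, filling each numeric column with its own mean is a common baseline (a sketch on a toy frame; whether it is appropriate depends on the problem):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'chlorides': [0.076, np.nan, 0.092],
                    'sulphates': [0.56, 0.68, np.nan]})

# fillna accepts a Series of per-column values; mean() skips NaN by default
toy = toy.fillna(toy.mean())
print(toy)
```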
#Fill all missing data with zeros
wine.fillna(0, inplace = True)
wine.isnull().any(axis=0)
fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool
We can fill columns with missing data by specific values corresponding to their column.
wine = pd.read_csv("winequality-red.csv") #re-read the data so the missing values are restored
wine.fillna({wine.columns[2]: 0, wine.columns[4]: 2}, inplace = True) #fill only columns 2 and 4 (citric acid, chlorides); wine.columns gives the list of column names
wine.isnull().any(axis = 0)
fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                       True
sulphates                True
alcohol                  True
quality                 False
dtype: bool
wine.head()
 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |