In this blog post, I will show the different ways to select subsets of data in Pandas using [], .loc, .iloc, .at, and .iat. I will be using the wine quality dataset hosted on the UCI website. This data record 11 chemical properties (such as the concentrations of sugar, citric acid, alcohol, pH etc.) of thousands of red and white wines from northern Portugal, as well as the quality of the wines, recorded on a scale from 1 to 10. We will only look at the data for red wine.
import pandas as pd
wine_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
wine_df.head()
wine_df.columns = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide','density','pH','sulphates', 'alcohol', 'quality' ]
wine_df['fixed_acidity'].head()
wine_df.fixed_acidity.head()
wine_four = wine_df[['fixed_acidity', 'volatile_acidity','citric_acid', 'residual_sugar']]
cols = ['fixed_acidity', 'volatile_acidity','citric_acid', 'residual_sugar']
wine_list_four = wine_four[cols]
wine_list_four.head()
wine_df.get_dtype_counts()
wine_df.select_dtypes(include = ['float']).head()
wine_df.select_dtypes(include=['number']).head()
wine_df.filter(like='acid').head()
wine_df.rename(columns={'pH':'pH_5', 'quality': 'quality_6' }, inplace=True)
wine_df.filter(regex='\d').head()
wine_df.head()
wine_df.filter(items=['fixed_acidity', 'volatile_acidity']).head()
wine_df.head()
wine_df.columns
group_1 = ['pH_5','sulphates','alcohol', 'quality_6']
group_2 =['chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density']
group_3 = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar']
new_cols = group_1+group_2+group_3
wine_df.columns
set(wine_df.columns) == set(new_cols)
wine_df_2 = wine_df[new_cols]
wine_df_2.head()
wine_df.nunique()
wine_df = wine_df.drop_duplicates(subset='density')
wine_df.shape #436 unique values in the density column
wine_df.set_index('density',inplace=True)
wine_df.head()
wine_df.iloc[2]
wine_df.loc[0.9968]
wine_df.iloc[[1, 4, 7]]
rows = [0.9968, 0.9964, 0.9943]
wine_df.loc[rows]
wine_df.iloc[1:4]
Slice notation also works with the .loc indexer and is inclusive of the last label:
first = 0.9970
last = 0.9959
wine_df.loc[first:last]
wine_df.head()
wine_df.iloc[:, [3,4,6]].head()
wine_df.loc[:,['residual_sugar','chlorides','total_sulfur_dioxide']].head()
wine_df.iloc[[10,14], [7, 9]]
rows = [0.9969, 0.9962]
columns = ['pH_5', 'alcohol']
wine_df.loc[rows,columns]
wine_df.loc[0.9970, 'chlorides']
wine_df.iloc[2, 4]
wine_df.head(10)
wine_df.iloc[8:2:-2, 2]
first = 0.9974
second = 0.9964
wine_df.loc[first:second:-2, 'citric_acid']
wine_df.iloc[:5]
wine_df.iloc[:5, :]
col_start = wine_df.columns.get_loc('volatile_acidity')
col_end = wine_df.columns.get_loc('volatile_acidity')+2
col_start, col_end
wine_df.iloc[:4, col_start:col_end]
row_start = wine_df.index[4]
row_end = wine_df.index[7]
wine_df.loc[row_start:row_end, 'volatile_acidity' : 'chlorides']
density_val = 0.9980
wine_df.loc[density_val, 'citric_acid']
wine_df.at[density_val, 'citric_acid']
The %timeit magic command to find the difference in speed:
timeit wine_df.loc[density_val, 'citric_acid']
timeit wine_df.at[density_val, 'citric_acid']
row_num = wine_df.index.get_loc(density_val)
col_num = wine_df.columns.get_loc('citric_acid')
row_num,col_num
timeit wine_df.iloc[row_num, col_num]
timeit wine_df.iat[row_num, col_num]