Tips for Selecting Columns in a DataFrame¶

Notebook to accompany this post.

In [1]:

import pandas as pd
import numpy as np

In [2]:

df = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true'
)

Build a mapping list so we can see the index of all the columns

In [3]:

col_mapping = [f"{c[0]}:{c[1]}" for c in enumerate(df.columns)]

In [4]:

col_mapping

Out[4]:

['0:X',
 '1:Y',
 '2:Unique Squirrel ID',
 '3:Hectare',
 '4:Shift',
 '5:Date',
 '6:Hectare Squirrel Number',
 '7:Age',
 '8:Primary Fur Color',
 '9:Highlight Fur Color',
 '10:Combination of Primary and Highlight Color',
 '11:Color notes',
 '12:Location',
 '13:Above Ground Sighter Measurement',
 '14:Specific Location',
 '15:Running',
 '16:Chasing',
 '17:Climbing',
 '18:Eating',
 '19:Foraging',
 '20:Other Activities',
 '21:Kuks',
 '22:Quaas',
 '23:Moans',
 '24:Tail flags',
 '25:Tail twitches',
 '26:Approaches',
 '27:Indifferent',
 '28:Runs from',
 '29:Other Interactions',
 '30:Lat/Long',
 '31:Zip Codes',
 '32:Community Districts',
 '33:Borough Boundaries',
 '34:City Council Districts',
 '35:Police Precincts']

We can also build a dictionary

In [5]:

col_mapping_dict = {c[0]:c[1] for c in enumerate(df.columns)}

In [6]:

col_mapping_dict

Out[6]:

{0: 'X',
 1: 'Y',
 2: 'Unique Squirrel ID',
 3: 'Hectare',
 4: 'Shift',
 5: 'Date',
 6: 'Hectare Squirrel Number',
 7: 'Age',
 8: 'Primary Fur Color',
 9: 'Highlight Fur Color',
 10: 'Combination of Primary and Highlight Color',
 11: 'Color notes',
 12: 'Location',
 13: 'Above Ground Sighter Measurement',
 14: 'Specific Location',
 15: 'Running',
 16: 'Chasing',
 17: 'Climbing',
 18: 'Eating',
 19: 'Foraging',
 20: 'Other Activities',
 21: 'Kuks',
 22: 'Quaas',
 23: 'Moans',
 24: 'Tail flags',
 25: 'Tail twitches',
 26: 'Approaches',
 27: 'Indifferent',
 28: 'Runs from',
 29: 'Other Interactions',
 30: 'Lat/Long',
 31: 'Zip Codes',
 32: 'Community Districts',
 33: 'Borough Boundaries',
 34: 'City Council Districts',
 35: 'Police Precincts'}

Use iloc to select just the second column (Unique Squirrel ID)

In [7]:

df.iloc[:, 2]

Out[7]:

0       37F-PM-1014-03
1       37E-PM-1006-03
2        2E-AM-1010-03
3        5D-PM-1018-05
4       39B-AM-1018-01
             ...      
3018    30B-AM-1007-04
3019    19A-PM-1013-05
3020    22D-PM-1012-07
3021    29B-PM-1010-02
3022     5E-PM-1012-01
Name: Unique Squirrel ID, Length: 3023, dtype: object

Pass a list of integers to select multiple columns by index

In [8]:

df.iloc[:, [0,1,2]]

Out[8]:

	X	Y	Unique Squirrel ID
0	-73.956134	40.794082	37F-PM-1014-03
1	-73.957044	40.794851	37E-PM-1006-03
2	-73.976831	40.766718	2E-AM-1010-03
3	-73.975725	40.769703	5D-PM-1018-05
4	-73.959313	40.797533	39B-AM-1018-01
...	...	...	...
3018	-73.963943	40.790868	30B-AM-1007-04
3019	-73.970402	40.782560	19A-PM-1013-05
3020	-73.966587	40.783678	22D-PM-1012-07
3021	-73.963994	40.789915	29B-PM-1010-02
3022	-73.975479	40.769640	5E-PM-1012-01

3023 rows × 3 columns

We can also pass a slice object to select a range of columns

In [9]:

df.iloc[:, 0:3]

Out[9]:

	X	Y	Unique Squirrel ID
0	-73.956134	40.794082	37F-PM-1014-03
1	-73.957044	40.794851	37E-PM-1006-03
2	-73.976831	40.766718	2E-AM-1010-03
3	-73.975725	40.769703	5D-PM-1018-05
4	-73.959313	40.797533	39B-AM-1018-01
...	...	...	...
3018	-73.963943	40.790868	30B-AM-1007-04
3019	-73.970402	40.782560	19A-PM-1013-05
3020	-73.966587	40.783678	22D-PM-1012-07
3021	-73.963994	40.789915	29B-PM-1010-02
3022	-73.975479	40.769640	5E-PM-1012-01

3023 rows × 3 columns

If we want to combine the list and slice notation, we need to use nump.r_ to process the data into an appropriate format.

In [10]:

np.r_[0:3,15:19,24,25]

Out[10]:

array([ 0,  1,  2, 15, 16, 17, 18, 24, 25])

We can pass the output of np.r_ to .iloc to use multiple selection approaches

In [11]:

df.iloc[:, np.r_[0:3,15:19,24,25]]

Out[11]:

	X	Y	Unique Squirrel ID	Running	Chasing	Climbing	Eating	Tail flags	Tail twitches
0	-73.956134	40.794082	37F-PM-1014-03	False	False	False	False	False	False
1	-73.957044	40.794851	37E-PM-1006-03	True	False	False	False	False	False
2	-73.976831	40.766718	2E-AM-1010-03	False	False	True	False	False	False
3	-73.975725	40.769703	5D-PM-1018-05	False	False	True	False	False	False
4	-73.959313	40.797533	39B-AM-1018-01	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...
3018	-73.963943	40.790868	30B-AM-1007-04	False	False	False	True	False	False
3019	-73.970402	40.782560	19A-PM-1013-05	False	False	False	False	False	False
3020	-73.966587	40.783678	22D-PM-1012-07	False	False	False	True	False	False
3021	-73.963994	40.789915	29B-PM-1010-02	False	False	False	True	False	False
3022	-73.975479	40.769640	5E-PM-1012-01	False	False	False	True	False	False

3023 rows × 9 columns

We can use the same notation when reading in a csv as well

In [12]:

df_2 = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true',
    usecols=np.r_[1,2,5:8,15:25],
)

In [13]:

df_2.head()

Out[13]:

	Y	Unique Squirrel ID	Date	Hectare Squirrel Number	Age	Running	Chasing	Climbing	Eating	Foraging	Other Activities	Kuks	Quaas	Moans	Tail flags
0	40.794082	37F-PM-1014-03	10142018	3	NaN	False	False	False	False	False	NaN	False	False	False	False
1	40.794851	37E-PM-1006-03	10062018	3	Adult	True	False	False	False	False	NaN	False	False	False	False
2	40.766718	2E-AM-1010-03	10102018	3	Adult	False	False	True	False	False	NaN	False	False	False	False
3	40.769703	5D-PM-1018-05	10182018	5	Juvenile	False	False	True	False	False	NaN	False	False	False	False
4	40.797533	39B-AM-1018-01	10182018	1	NaN	False	False	False	False	False	unknown	True	False	False	False

We can also select columns using a boolean array

In [14]:

run_cols = df.columns.str.contains('run', case=False)
run_cols

Out[14]:

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False, False, False, False, False, False, False])

In [15]:

df.iloc[:, run_cols].head()

Out[15]:

	Running	Runs from
0	False	False
1	True	True
2	False	False
3	False	True
4	False	False

A lambda function can be useful for combining into 1 line.

In [16]:

df.iloc[:, lambda df:df.columns.str.contains('run', case=False)].head()

Out[16]:

	Running	Runs from
0	False	False
1	True	True
2	False	False
3	False	True
4	False	False

A more complex example

In [17]:

df.iloc[:, lambda df: df.columns.str.contains('district|precinct|boundaries',
                                              case=False)].head()

Out[17]:

	Community Districts	Borough Boundaries	City Council Districts	Police Precincts
0	19	4	19	13
1	19	4	19	13
2	19	4	19	13
3	19	4	19	13
4	19	4	19	13

Combining index and boolean arrays

In [18]:

location_cols = df.columns.str.contains('district|precinct|boundaries',
                                        case=False)
location_cols

Out[18]:

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True,  True,  True,  True])

In [19]:

location_indices = [i for i, col in enumerate(location_cols) if col]
location_indices

Out[19]:

[32, 33, 34, 35]

In [20]:

df.iloc[:, np.r_[0:3,location_indices]].head()

Out[20]:

	X	Y	Unique Squirrel ID	Community Districts	Borough Boundaries	City Council Districts	Police Precincts
0	-73.956134	40.794082	37F-PM-1014-03	19	4	19	13
1	-73.957044	40.794851	37E-PM-1006-03	19	4	19	13
2	-73.976831	40.766718	2E-AM-1010-03	19	4	19	13
3	-73.975725	40.769703	5D-PM-1018-05	19	4	19	13
4	-73.959313	40.797533	39B-AM-1018-01	19	4	19	13

In [ ]: