import pandas as pd
import numpy as np
df = pd.read_csv(
'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true'
)
Build a mapping list so we can see the index of all the columns
col_mapping = [f"{c[0]}:{c[1]}" for c in enumerate(df.columns)]
col_mapping
['0:X', '1:Y', '2:Unique Squirrel ID', '3:Hectare', '4:Shift', '5:Date', '6:Hectare Squirrel Number', '7:Age', '8:Primary Fur Color', '9:Highlight Fur Color', '10:Combination of Primary and Highlight Color', '11:Color notes', '12:Location', '13:Above Ground Sighter Measurement', '14:Specific Location', '15:Running', '16:Chasing', '17:Climbing', '18:Eating', '19:Foraging', '20:Other Activities', '21:Kuks', '22:Quaas', '23:Moans', '24:Tail flags', '25:Tail twitches', '26:Approaches', '27:Indifferent', '28:Runs from', '29:Other Interactions', '30:Lat/Long', '31:Zip Codes', '32:Community Districts', '33:Borough Boundaries', '34:City Council Districts', '35:Police Precincts']
We can also build a dictionary
col_mapping_dict = {c[0]:c[1] for c in enumerate(df.columns)}
col_mapping_dict
{0: 'X', 1: 'Y', 2: 'Unique Squirrel ID', 3: 'Hectare', 4: 'Shift', 5: 'Date', 6: 'Hectare Squirrel Number', 7: 'Age', 8: 'Primary Fur Color', 9: 'Highlight Fur Color', 10: 'Combination of Primary and Highlight Color', 11: 'Color notes', 12: 'Location', 13: 'Above Ground Sighter Measurement', 14: 'Specific Location', 15: 'Running', 16: 'Chasing', 17: 'Climbing', 18: 'Eating', 19: 'Foraging', 20: 'Other Activities', 21: 'Kuks', 22: 'Quaas', 23: 'Moans', 24: 'Tail flags', 25: 'Tail twitches', 26: 'Approaches', 27: 'Indifferent', 28: 'Runs from', 29: 'Other Interactions', 30: 'Lat/Long', 31: 'Zip Codes', 32: 'Community Districts', 33: 'Borough Boundaries', 34: 'City Council Districts', 35: 'Police Precincts'}
Use iloc to select just the second column (Unique Squirrel ID)
df.iloc[:, 2]
0 37F-PM-1014-03 1 37E-PM-1006-03 2 2E-AM-1010-03 3 5D-PM-1018-05 4 39B-AM-1018-01 ... 3018 30B-AM-1007-04 3019 19A-PM-1013-05 3020 22D-PM-1012-07 3021 29B-PM-1010-02 3022 5E-PM-1012-01 Name: Unique Squirrel ID, Length: 3023, dtype: object
Pass a list of integers to select multiple columns by index
df.iloc[:, [0,1,2]]
X | Y | Unique Squirrel ID | |
---|---|---|---|
0 | -73.956134 | 40.794082 | 37F-PM-1014-03 |
1 | -73.957044 | 40.794851 | 37E-PM-1006-03 |
2 | -73.976831 | 40.766718 | 2E-AM-1010-03 |
3 | -73.975725 | 40.769703 | 5D-PM-1018-05 |
4 | -73.959313 | 40.797533 | 39B-AM-1018-01 |
... | ... | ... | ... |
3018 | -73.963943 | 40.790868 | 30B-AM-1007-04 |
3019 | -73.970402 | 40.782560 | 19A-PM-1013-05 |
3020 | -73.966587 | 40.783678 | 22D-PM-1012-07 |
3021 | -73.963994 | 40.789915 | 29B-PM-1010-02 |
3022 | -73.975479 | 40.769640 | 5E-PM-1012-01 |
3023 rows × 3 columns
We can also pass a slice object to select a range of columns
df.iloc[:, 0:3]
X | Y | Unique Squirrel ID | |
---|---|---|---|
0 | -73.956134 | 40.794082 | 37F-PM-1014-03 |
1 | -73.957044 | 40.794851 | 37E-PM-1006-03 |
2 | -73.976831 | 40.766718 | 2E-AM-1010-03 |
3 | -73.975725 | 40.769703 | 5D-PM-1018-05 |
4 | -73.959313 | 40.797533 | 39B-AM-1018-01 |
... | ... | ... | ... |
3018 | -73.963943 | 40.790868 | 30B-AM-1007-04 |
3019 | -73.970402 | 40.782560 | 19A-PM-1013-05 |
3020 | -73.966587 | 40.783678 | 22D-PM-1012-07 |
3021 | -73.963994 | 40.789915 | 29B-PM-1010-02 |
3022 | -73.975479 | 40.769640 | 5E-PM-1012-01 |
3023 rows × 3 columns
If we want to combine the list and slice notation, we need to use nump.r_ to process the data into an appropriate format.
np.r_[0:3,15:19,24,25]
array([ 0, 1, 2, 15, 16, 17, 18, 24, 25])
We can pass the output of np.r_ to .iloc to use multiple selection approaches
df.iloc[:, np.r_[0:3,15:19,24,25]]
X | Y | Unique Squirrel ID | Running | Chasing | Climbing | Eating | Tail flags | Tail twitches | |
---|---|---|---|---|---|---|---|---|---|
0 | -73.956134 | 40.794082 | 37F-PM-1014-03 | False | False | False | False | False | False |
1 | -73.957044 | 40.794851 | 37E-PM-1006-03 | True | False | False | False | False | False |
2 | -73.976831 | 40.766718 | 2E-AM-1010-03 | False | False | True | False | False | False |
3 | -73.975725 | 40.769703 | 5D-PM-1018-05 | False | False | True | False | False | False |
4 | -73.959313 | 40.797533 | 39B-AM-1018-01 | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3018 | -73.963943 | 40.790868 | 30B-AM-1007-04 | False | False | False | True | False | False |
3019 | -73.970402 | 40.782560 | 19A-PM-1013-05 | False | False | False | False | False | False |
3020 | -73.966587 | 40.783678 | 22D-PM-1012-07 | False | False | False | True | False | False |
3021 | -73.963994 | 40.789915 | 29B-PM-1010-02 | False | False | False | True | False | False |
3022 | -73.975479 | 40.769640 | 5E-PM-1012-01 | False | False | False | True | False | False |
3023 rows × 9 columns
We can use the same notation when reading in a csv as well
df_2 = pd.read_csv(
'https://data.cityofnewyork.us/api/views/vfnx-vebw/rows.csv?accessType=DOWNLOAD&bom=true&format=true',
usecols=np.r_[1,2,5:8,15:25],
)
df_2.head()
Y | Unique Squirrel ID | Date | Hectare Squirrel Number | Age | Running | Chasing | Climbing | Eating | Foraging | Other Activities | Kuks | Quaas | Moans | Tail flags | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.794082 | 37F-PM-1014-03 | 10142018 | 3 | NaN | False | False | False | False | False | NaN | False | False | False | False |
1 | 40.794851 | 37E-PM-1006-03 | 10062018 | 3 | Adult | True | False | False | False | False | NaN | False | False | False | False |
2 | 40.766718 | 2E-AM-1010-03 | 10102018 | 3 | Adult | False | False | True | False | False | NaN | False | False | False | False |
3 | 40.769703 | 5D-PM-1018-05 | 10182018 | 5 | Juvenile | False | False | True | False | False | NaN | False | False | False | False |
4 | 40.797533 | 39B-AM-1018-01 | 10182018 | 1 | NaN | False | False | False | False | False | unknown | True | False | False | False |
We can also select columns using a boolean array
run_cols = df.columns.str.contains('run', case=False)
run_cols
array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False])
df.iloc[:, run_cols].head()
Running | Runs from | |
---|---|---|
0 | False | False |
1 | True | True |
2 | False | False |
3 | False | True |
4 | False | False |
A lambda function can be useful for combining into 1 line.
df.iloc[:, lambda df:df.columns.str.contains('run', case=False)].head()
Running | Runs from | |
---|---|---|
0 | False | False |
1 | True | True |
2 | False | False |
3 | False | True |
4 | False | False |
A more complex example
df.iloc[:, lambda df: df.columns.str.contains('district|precinct|boundaries',
case=False)].head()
Community Districts | Borough Boundaries | City Council Districts | Police Precincts | |
---|---|---|---|---|
0 | 19 | 4 | 19 | 13 |
1 | 19 | 4 | 19 | 13 |
2 | 19 | 4 | 19 | 13 |
3 | 19 | 4 | 19 | 13 |
4 | 19 | 4 | 19 | 13 |
Combining index and boolean arrays
location_cols = df.columns.str.contains('district|precinct|boundaries',
case=False)
location_cols
array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True])
location_indices = [i for i, col in enumerate(location_cols) if col]
location_indices
[32, 33, 34, 35]
df.iloc[:, np.r_[0:3,location_indices]].head()
X | Y | Unique Squirrel ID | Community Districts | Borough Boundaries | City Council Districts | Police Precincts | |
---|---|---|---|---|---|---|---|
0 | -73.956134 | 40.794082 | 37F-PM-1014-03 | 19 | 4 | 19 | 13 |
1 | -73.957044 | 40.794851 | 37E-PM-1006-03 | 19 | 4 | 19 | 13 |
2 | -73.976831 | 40.766718 | 2E-AM-1010-03 | 19 | 4 | 19 | 13 |
3 | -73.975725 | 40.769703 | 5D-PM-1018-05 | 19 | 4 | 19 | 13 |
4 | -73.959313 | 40.797533 | 39B-AM-1018-01 | 19 | 4 | 19 | 13 |