Mixing .loc indexer with Boolean Operations
Altering DataFrame values using the .loc indexer
Conditionally Apply Values Based on Another Column Value
SAS users tend to think of indexing SAS data sets as a means to improve query performance. Another use case for using SAS indices is in performing non-sequential reads for table look-ups.
Indexing for DataFrames is used to provide direct access to data. Many analytical techniques take advantage of indexes. Indexing is also used as key values to selecting and subseting data.
It took me a bit of time to understand how indexing for Series and DataFrames actually work.
Many of the working examples I found, while mostly useful, used synthetic data which is typically ordered neatly. Here we examine issues like overlapping date ranges or index values in non-alphabetical or in non-sequential order and so on.
That is why you will find some errors in the examples below. By examining the pits I have fallen into, hopefully you can avoid them.
import numpy as np
import pandas as pd
from numpy.random import randn
from pandas import Series, DataFrame, Index
Consider the creation of the DataFrame df in the cell below.
df = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3.],
['b', 'warm', 'medium', 4, 5, 7, 9],
['c', 'hot', 'fast', 9, 4, np.nan, 6],
['d', 'cool', None, np.nan, np.nan, 17, 89],
['e', 'cool', 'medium', 16, 44, 21, 13],
['f', 'cold', 'slow', np.nan, 29, 33, 17]])
In the 'df' DataFrame created above we did not specify a row index or column names resulting in the RangeIndex object used for row labels. Default column labels are created as well using another RangeIndex object.
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
1 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
3 | d | cool | None | NaN | NaN | 17.0 | 89.0 |
4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 |
The .index attribute returns the DataFrame's index structure. We did not explicitly set an index. As a result a default index object is created. The RangeIndex object has a start position at 0 and the end position set to len(df) - 1.
Return the row index.
df.index
RangeIndex(start=0, stop=6, step=1)
Return the column labels. Since no labels were specificed, a RangeIndex object is used to identify columns.
df.columns
RangeIndex(start=0, stop=7, step=1)
For observations SAS uses the automatic variable _N_ in the Data Step and FIRSTOBS and OBS in PROC step for its row indexing.
SAS also has a similar construct (SAS variable list) allowing the use of column name 'ranges' (col1-colN) described here.
The SAS example below creates a data set with same data used to create the DataFrame df in cell #2 above.
/******************************************************/
/* c05_create_analog_df_dataframe.sas */
/******************************************************/
11 data df;
12 infile cards dlm=',';
13
14 input id $
15 col1 $
16 col2 $
17 col3
18 col4
19 col5
20 col6 ;
21
22 datalines;
23 a, cold, slow, ., 2, 6, 3
24 b, warm, medium, 4, 5, 7, 9
25 c, hot, fast, 9, 4, ., 6
26 d, cool, , ., ., 17, 89
27 e, cool, medium, 16, 44, 21, 13
28 f, cold, slow, . ,29, 33, 17
29 ;;;;
The Data Step with the SET options NOBS= is an example of implicit indexing used by SAS. The END= parameter on the SET statement is initialized to 0 and set to one when the last observation is read. The automatic variable _N_ can be used as the observation (row) index.
/******************************************************/
/* c05_sas_row_index1.sas */
/******************************************************/
31 data _null_;
32 set df nobs=obs end=end_of_file;
33
34 put _n_ = ;
35
36 if end_of_file then
37 put 'Data set contains: ' obs ' observations' /;
_N_=1
_N_=2
_N_=3
_N_=4
_N_=5
_N_=6
Data set contains: 6 observations
Using the DataFrame indices we can select specific rows and columns. DataFrames provide indexers to accomplish these tasks. They are:
1. .iloc() method which is mainly an integer-based method
2. .loc() method used to select ranges by labels (either column or row)
3. .ix() method which supports a combination of the loc() and iloc() methods
We also illsutrate altering values in a DataFrame using the .loc(). This is equivalent to the SAS update method.
The .iloc indexer uses an integer-based method for locating row and column positions. It is a very robust indexer, however, it is limited by the fact that humans are better at remembering labels than numbers. This is analogous to locating an observations by _n_ or locating a SAS variable using 'colN' from a variable list.
The syntax for the .iloc indexer is:
df.iloc[row selection, column selection]
A comma (,) is used to separate the request of rows for the request for columns. A colon (:) is used to request a range of cells.
The absence of a either a column or row selection is an implicit request for all columns or rows, respectively.
Details for the .iloc indexer are located here.
.iloc[0] is useful since it returns the first scalar from a Series of the first column from a DataFrame.
df.iloc[0]
0 a 1 cold 2 slow 3 NaN 4 2 5 6 6 3 Name: 0, dtype: object
The point= option on the SET statement behaves similarly to return the first row in the data set. Note the SET statement inside a DO loop and the STOP statement. STOP is needed because the POINT= option indicates a non-sequnetial access pattern and thus the end of data set indicator is not available.
/******************************************************/
/* c05_sas_row_index2.sas */
/******************************************************/
52 data _null_;
53
54 do iloc = 1 to 1;
55 set df point=iloc;
56
57 put _all_ ;
58 end;
59 stop;
_N_=1 _ERROR_=0 iloc=1 id=a col1=cold col2=slow col3=. col4=2 col5=6 col6=3
In the example below, you might expect three rows returned, rather than two. The range request for .iloc[] includes the start of the range and does not include the last item in the range value.
df.iloc[2:4]
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
3 | d | cool | None | NaN | NaN | 17.0 | 89.0 |
The SAS analog example for cell #6 is below.
/******************************************************/
/* c05_sas_row_index3.sas */
/******************************************************/
94 data _null_ ;
95
96 do iloc = 3 to 4;
97 set df point=iloc;
98 put _all_ ;
99 end;
100 stop;
_N_=1 _ERROR_=0 iloc=3 id=c col1=hot col2=fast col3=9 col4=4 col5=. col6=6
_N_=1 _ERROR_=0 iloc=4 id=d col1=cool col2= col3=. col4=. col5=17 col6=89
Similar to the indexer for string slicing, the index position
iloc[0]
returns the first row in a DataFrame and
iloc[-1]
returns the last row in the DataFrame. This is analogous to the END= option for the SET statement (assuming a sequential access pattern).
The .iloc indexer is mainly used to locate first or last row in the DataFrame.
df.iloc[-1]
0 f 1 cold 2 slow 3 NaN 4 29 5 33 6 17 Name: 5, dtype: object
The .iloc indexer in cell #8 below returns rows 2 and 3 using (2:4) as the row selector and columns 0 to 6 using (0:6) as the column selctor.
df.iloc[2:4, 0:6]
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
2 | c | hot | fast | 9.0 | 4.0 | NaN |
3 | d | cool | None | NaN | NaN | 17.0 |
The analog SAS program for returning the same sub-set is below. FIRSTOBS=3 OBS=4 is the equivalent row selector and keep = id -- col5 is the equivalent column selector.
/******************************************************/
/* c05_firstobs_3_obs_4.sas */
/******************************************************/
60 data df;
61 set df(keep = id -- col5
62 firstobs=3 obs=4);
63 put _all_ ;
_N_=1 _ERROR_=0 id=c col1=hot col2=fast col3=9 col4=4 col5=.
_N_=2 _ERROR_=0 id=d col1=cool col2= col3=. col4=. col5=17
The .iloc idexer illustrating multi-row and multi-column requests. Note the double square brackets ([]) syntax.
df.iloc[[1,3,5], [2, 4, 6]]
2 | 4 | 6 | |
---|---|---|---|
1 | medium | 5.0 | 9.0 |
3 | None | NaN | 89.0 |
5 | slow | 29.0 | 17.0 |
The .loc indexer is similar to .iloc and allows access to rows and columns by labels. A good analogy is a cell reference in Excel, eg. C:31.
The syntax for the .loc indexer is:
df.loc[row selection, column selection]
For both the row and column selection, a comma (,) is used to request a list of multiple cells. A colon (:) is used to request a range of cells.
Similiar to the .iloc indexer you can select combinations of rows and columns. The doc details are located here.
Consider the DataFrame df2 created below in cell #10. It contains the new columns 'id' and 'date'.
df2 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3., '08/01/16'],
['b', 'warm', 'medium', 4, 5, 7, 9, '03/15/16'],
['c', 'hot', 'fast', 9, 4, np.nan, 6, '04/30/16'],
['d', 'None', 'fast', np.nan, np.nan, 17, 89, '05/31/16'],
['e', 'cool', 'medium', 16, 44, 21, 13, '07/04/16'],
['f', 'cold', 'slow', np.nan, 29, 33, 17, '01/01/16']],
columns=['id', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'date'])
Executing just the name for the DataFrame is the equivalent of:
print(df2)
The print() method for a DataFrame returns the output without the cell outlines, however.
df2
id | col1 | col2 | col3 | col4 | col5 | col6 | date | |
---|---|---|---|---|---|---|---|---|
0 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 08/01/16 |
1 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
3 | d | None | fast | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 07/04/16 |
5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 01/01/16 |
In SAS the construction of an index for a data set creates an external file used in a deterministic fashion by SAS. In contrast, the construction of a panda index physically alters either the DataFrame, or a copy of it, depending on argument values to the set_index method. The doc is located here .
Start by setting the index to 'id', to access rows by a row value or a range of rows values. By default, the column is dropped when it becomes the index. You may not want this behavior, in which case you set the drop= argument value to False.
drop=False
This is useful if you have multiple occasions setting and resetting the index. Otherwise, a re-read of the DataFrame is required. Below, use the set_index() method to set the index to the 'id' column.
df2.set_index('id', inplace=True, drop=False)
The .set_index attribute execution is silent when the inplace=argument value is utilized. Validate the index using the .index attribute.
df2.index
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object', name='id')
To reset the index, use the reset_index() method.
df2 = df2.reset_index(drop=True)
In order to have the remainder of the examples for the .loc indexer to work, we set the index again.
df2.set_index('id', inplace=True, drop=False)
Return the row labeled 'e'.
df2.loc['e', ]
id e col1 cool col2 medium col3 16 col4 44 col5 21 col6 13 date 07/04/16 Name: e, dtype: object
Return rows in the range of 'b' to 'f' inclusive. 'b':'f' denotes a row range. The absence of a column request is an implicit request for all of them.
df2.loc['b':'f' ,]
id | col1 | col2 | col3 | col4 | col5 | col6 | date | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
b | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
c | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
d | d | None | fast | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
e | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 07/04/16 |
f | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 01/01/16 |
Return the rows between range of 'd' to 'f' inclusive. 'col6' and 'col2' is a request for columns by label.
df2.loc['d':'f',['col6','col2']]
col6 | col2 | |
---|---|---|
id | ||
d | 89.0 | fast |
e | 13.0 | medium |
f | 17.0 | slow |
Return the 'date' column by label.
df2.loc[: , 'date']
id a 08/01/16 b 03/15/16 c 04/30/16 d 05/31/16 e 07/04/16 f 01/01/16 Name: date, dtype: object
Change the DataFrame df2 index from 'id' to 'date'. The inplace=True argument does not make a copy of the DataFrame.
df2.set_index('date', inplace=True)
Validate the index.
df2.index
Index(['08/01/16', '03/15/16', '04/30/16', '05/31/16', '07/04/16', '01/01/16'], dtype='object', name='date')
Request a row by label.
df2.loc['05/31/16']
id d col1 None col2 fast col3 NaN col4 NaN col5 17 col6 89 Name: 05/31/16, dtype: object
Request arbitrary rows. Notice the double square braces ([])
df2.loc[['03/15/16', '07/04/16']]
id | col1 | col2 | col3 | col4 | col5 | col6 | |
---|---|---|---|---|---|---|---|
date | |||||||
03/15/16 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
07/04/16 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
Request a range of rows.
df2.loc['04/30/16':'07/04/16',['col2','col1']]
col2 | col1 | |
---|---|---|
date | ||
04/30/16 | fast | hot |
05/31/16 | fast | None |
07/04/16 | medium | cool |
The SAS program below is equivalent to cell #25 above. It uses character values representing dates.
/******************************************************/
/* c05_select_dates_01Feb16_and_31Jul16.sas */
/******************************************************/
25 data df2(keep = col2 col1);
26 set df(where=(date between '04/30/16' and '07/04/16'));
27 put _all_;
_N_=1 _ERROR_=0 id=c col1=hot col2=fast col3=9 col4=4 col5=. col6=6 date=04/30/16
_N_=2 _ERROR_=0 id=d col1=cool col2= col3=. col4=. col5=17 col6=89 date=05/31/16
_N_=3 _ERROR_=0 id=e col1=cool col2=medium col3=16 col4=44 col5=21 col6=13 date=07/04/16
In cell #26 below we hit a snag. The issue begins with cell #21 above using the set_index attribute for the 'df2' DataFrame. Examine cell #22 above to observe how the df2['date'] column dtype is 'object'.
This indicates we are working with string literals and not datetime objects. Cells #24 and #25 work because these specific labels are values found in the df2['date'] index.
Cell #26, below does not work, since the range request contains the value '07/31/16 as the range end-point which is not a value in the index.
The remedy, shown below in cell #29 is to use the pd.to_datetime() method to convert the df2['date'] string values into a datetime object. The obvious analogy for SAS users is converting a string variable to a numeric variable which has an associated date format.
Converting string literals to datetime values is covered in the section, String Literal Mapped to datetime timestamp located here.
df2.loc['01/01/16':'07/31/16']
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2909 try: -> 2910 return self._searchsorted_monotonic(label, side) 2911 except ValueError: C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in _searchsorted_monotonic(self, label, side) 2875 -> 2876 raise ValueError('index must be monotonic increasing or decreasing') 2877 ValueError: index must be monotonic increasing or decreasing During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-26-e06fa69430df> in <module>() ----> 1 df2.loc['01/01/16':'07/31/16'] C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key) 1294 return self._getitem_tuple(key) 1295 else: -> 1296 return self._getitem_axis(key, axis=0) 1297 1298 def _getitem_axis(self, key, axis=0): C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis) 1435 if isinstance(key, slice): 1436 self._has_valid_type(key, axis) -> 1437 return self._get_slice_axis(key, axis=axis) 1438 elif is_bool_indexer(key): 1439 return self._getbool_axis(key, axis=axis) C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis) 1316 labels = obj._get_axis(axis) 1317 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, -> 1318 slice_obj.step, kind=self.name) 1319 1320 if isinstance(indexer, slice): C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_indexer(self, start, end, step, kind) 2783 """ 2784 start_slice, end_slice = self.slice_locs(start, end, step=step, -> 2785 kind=kind) 2786 2787 # return a slice C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_locs(self, start, end, step, kind) 2968 end_slice = None 2969 if end is not None: -> 2970 end_slice = self.get_slice_bound(end, 'right', kind) 2971 if end_slice is None: 2972 end_slice = len(self) C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2911 except ValueError: 2912 # raise the original KeyError -> 2913 raise err 2914 2915 if isinstance(slc, np.ndarray): C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2905 # we need to look up the label 2906 try: -> 2907 slc = self.get_loc(label) 2908 except KeyError as err: 2909 try: C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_loc(self, key, method, tolerance) 1945 return self._engine.get_loc(key) 1946 except KeyError: -> 1947 return self._engine.get_loc(self._maybe_cast_indexer(key)) 1948 1949 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)() pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)() pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)() pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)() KeyError: '07/31/16'
Reset the index for the DataFrame 'df2' to the default RangeIndex object.
df2.reset_index(inplace=True)
Validate the index.
df2.index
RangeIndex(start=0, stop=6, step=1)
Cast the df2['date'] column from dtype='object' (strings) to dtype=datetime.
df2['date'] = pd.to_datetime(df2.date)
Set the df2['date'] column as the index.
df2.set_index('date', inplace=True)
Validate the index. Observe the dytpe is now datetime64--a datetime stamp. See Chapter 8--Date, Time, and Timedelta Objects for more details on datetime arithmetic, shifting time intervals, and determining durations.
df2.index
DatetimeIndex(['2016-08-01', '2016-03-15', '2016-04-30', '2016-05-31', '2016-07-04', '2016-01-01'], dtype='datetime64[ns]', name='date', freq=None)
With the 'ds2['date'] column values converted to a datetime values, re-do the statement in cell #26 above.
df2.loc['02/01/16':'07/31/16']
id | col1 | col2 | col3 | col4 | col5 | col6 | |
---|---|---|---|---|---|---|---|
date | |||||||
2016-03-15 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2016-04-30 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
2016-05-31 | d | None | fast | NaN | NaN | 17.0 | 89.0 |
2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
The SAS example below illustrates a similiar set of steps:
1. Read the original 'date' variable which is character and rename it to 'str_date'
2. Use the input funtion to 'read' the 'str_date' values and assign them to the 'date' variable using the mmddyy10. informat
3. Print the date value without any formatting showing it is now a SAS datetime value
4. Print the SAS datetime value using the mmddyy10. date format
/******************************************************/
/* c05_alter_string_to_datetime_variable.sas */
/******************************************************/
4 data df2(drop = str_date);
5 set df(rename=(date=str_date));
6 date=input(str_date,mmddyy10.);
7
8 if _n_ = 1 then put date= /
9 date mmddyy10.;
date=20667
08/01/2016
This approach works by creating either a Series or array of boolean values (True or False). This Series or array is then used by the .loc indexer to return all of the values that evaluate to True. Using the DataFrame df2 created in the cell above, consider the following.
We want to return all rows where 'col2' is not equal to 'fast. This is expressed as:
df2['col2'] != 'fast'
A Series is returned with the True/False values not equal to 'fast' for fd2['col2'] shown below. The df2['date'] column is returned since it remains as the index for the DataFrame. The second print() method displays this object as being derived from the class: Series.
print(df2['col2'] != 'fast')
print(type(df2['col2'] != 'fast'))
date 2016-08-01 True 2016-03-15 True 2016-04-30 False 2016-05-31 False 2016-07-04 True 2016-01-01 True Name: col2, dtype: bool <class 'pandas.core.series.Series'>
Passing the boolean Series:
df2['col2'] != 'fast'
to the .loc indexer to retrieve those rows with a boolean value of True. Also request 'col1' and 'col2', which a request by label.
df2.loc[df2['col2'] != 'fast', 'col1':'col2']
col1 | col2 | |
---|---|---|
date | ||
2016-08-01 | cold | slow |
2016-03-15 | warm | medium |
2016-07-04 | cool | medium |
2016-01-01 | cold | slow |
You can combine any number of boolean operation together. Boolean value comparison operators are documented here.
df2.loc[(df2.col3 >= 9) & (df2.col1 == 'cool'), ]
id | col1 | col2 | col3 | col4 | col5 | col6 | |
---|---|---|---|---|---|---|---|
date | |||||||
2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
The .isin() method returns a boolean vector similar to the behavior described in cell #34 above. In this example, the .isin list of elements evaluates True if the listed elements are found by the .loc indexer in 'col6'.
df2.loc[df2.col6.isin([6, 9, 13])]
id | col1 | col2 | col3 | col4 | col5 | col6 | |
---|---|---|---|---|---|---|---|
date | |||||||
2016-03-15 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2016-04-30 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
So far, the .loc indexers have resulted in an output stream. All of the indexers can be used to sub-set a DataFrame using assignment syntax shown in the cell #33 below.
df3 = df2.loc[df2.col6.isin([6, 9, 13])]
df3
id | col1 | col2 | col3 | col4 | col5 | col6 | |
---|---|---|---|---|---|---|---|
date | |||||||
2016-03-15 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2016-04-30 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
2016-07-04 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
Generally, in these types of sub-setting operations, the shape of the extracted DataFrame will be smaller than the input DataFrame.
To return a DataFrame of the same shape as the original, use the where() method. Details for the pandas where() method is described here.
print('Shape for df2 is', df2.shape)
print('Shape for df3 is', df3.shape)
Shape for df2 is (6, 7) Shape for df3 is (3, 7)
The example SAS program below uses the WHERE IN (list) syntax to subset a data set analogous to the example in cell #32 above.
/******************************************************/
/* c05_where_in.sas */
/******************************************************/
NOTE: Data set "WORK.df2" has 6 observation(s) and 8 variable(s)
27 data df3;
28 set df2(where=(col6 in (6 9 13)));
NOTE: 3 observations were read from "WORK.df2"
NOTE: Data set "WORK.df3" has 3 observation(s) and 8 variable(s)
Notice how the SAS variable count and DataFrame column count differ by 1. That is because the DataFrame .shape() method does not include the index as part of its column count. By reseting the index, the SAS variable count and DataFrame columns count agree.
df3.reset_index(inplace=True)
print('Shape for df3 is', df3.shape)
Shape for df3 is (3, 8)
The .loc indexer can also be used to do an in-place update of values.
Find the values for df2['col2'] column before updating.
df2.loc[: , 'col2']
date 2016-08-01 slow 2016-03-15 medium 2016-04-30 fast 2016-05-31 fast 2016-07-04 medium 2016-01-01 slow Name: col2, dtype: object
df2.loc[df2['col6'] > 50, "col2"] = "VERY FAST"
The values for df2['col2'] column after the update.
df2.loc[: , 'col2']
date 2016-08-01 slow 2016-03-15 medium 2016-04-30 fast 2016-05-31 VERY FAST 2016-07-04 medium 2016-01-01 slow Name: col2, dtype: object
The .ix indexer combines characteristics of the .loc and .iloc indexers. This means you can select rows and columns by labels and integers.
The syntax for the .ix indexer is:
df.ix[row selection, column selection]
For both the row and column selection, a comma (,) is used to request a list of multiple cells. A colon (:) is used to request a range of cells.
Similar to the .loc indexer you can select combinations of rows and columns.
The .ix indexer is sometimes tricky to use. A good rule of thumb is if you are indexing using labels or indexing using integers, use the .loc and .iloc to avoid unexpected results. The documentation details are found here.
Consider the creation of the DataFrame 'df4' constructed in the cell below. It is similar to DataFrame df2 created in cell #10 above. The differences are the addition of another column and columns being identified with labels as well as integers.
df4 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3., 17, '08/01/16'],
['b', 'warm', 'medium', 4, 5, 7, 9, 21, '03/15/16'],
['c', 'hot', 'fast', 9, 4, np.nan, 6, 10, '04/30/16'],
['d', 'None', 'fast', np.nan, np.nan, 17, 89, 44, '05/31/16'],
['e', 'cool', 'medium', 16, 44, 21, 13, 99, '07/04/16'],
['f', 'cold', 'slow', np.nan, 29, 33, 17, 11,'01/01/16']],
columns=['id', 'col1', 'col2', 'col3', 'col4', 4, 5, 6, 'date'])
df4
id | col1 | col2 | col3 | col4 | 4 | 5 | 6 | date | |
---|---|---|---|---|---|---|---|---|---|
0 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 17 | 08/01/16 |
1 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 21 | 03/15/16 |
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 10 | 04/30/16 |
3 | d | None | fast | NaN | NaN | 17.0 | 89.0 | 44 | 05/31/16 |
4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 99 | 07/04/16 |
5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 11 | 01/01/16 |
Set the index to column df4['id'].
df4.set_index('id', inplace=True)
The .ix indexer allows slicing by labels and integer poitions along the index. And look closely at the results since the request for columns is based on the integer-position of the column. The column requestor of 6 is not a column label, but its position.
df4.ix['b':'e', 6:8]
6 | date | |
---|---|---|
id | ||
b | 21 | 03/15/16 |
c | 10 | 04/30/16 |
d | 44 | 05/31/16 |
e | 99 | 07/04/16 |
.iloc uses the integer position in the index and only accepts integers
.loc uses the labels in the index
.ix generally behaves like the .loc
Finally, to appreciate the differences, consider the following DataFrame. Also notice the un-Pythonic style of using a semi-colon at the end of the DataFrame definition.
df5 = pd.DataFrame([['a', 'cold','slow', np.nan, 2., 6., 3.],
['b', 'warm', 'medium', 4, 5, 7, 9],
['c', 'hot', 'fast', 9, 4, np.nan, 6],
['d', 'cool', None, np.nan, np.nan, 17, 89],
['e', 'cool', 'medium', 16, 44, 21, 13],
['f', 'cold', 'slow', np.nan, 29, 33, 17]],
index = [6, 8, 2, 3, 4, 5,]);
df5
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
3 | d | cool | None | NaN | NaN | 17.0 | 89.0 |
4 | e | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 |
5 | f | cold | slow | NaN | 29.0 | 33.0 | 17.0 |
The .iloc indexer returns the first two rows since it looks at positions.
df5.iloc[:2]
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
The .loc indexer returns 3 rows since it looks at the labels.
df5.loc[:2]
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
The .ix indexer returns the same number of rows as the .loc indexer since its behavior is to first use labels before looking by position. Looking by position with an integer-based index can lead to unexpected results. This illustrated in cell #49 below.
df5.ix[:2]
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
8 | b | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 |
2 | c | hot | fast | 9.0 | 4.0 | NaN | 6.0 |
For the next two examples, review the DataFrame df5 index structure in the cell below.
df5.index
Int64Index([6, 8, 2, 3, 4, 5], dtype='int64')
The .iloc example in the cell below returns the first row. That's because it is looking by position.
df5.iloc[:1]
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
6 | a | cold | slow | NaN | 2.0 | 6.0 | 3.0 |
The .ix example in the cell below raises a KeyError since 1 is not found in the index.
df5.ix[:1]
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2909 try: -> 2910 return self._searchsorted_monotonic(label, side) 2911 except ValueError: C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in _searchsorted_monotonic(self, label, side) 2875 -> 2876 raise ValueError('index must be monotonic increasing or decreasing') 2877 ValueError: index must be monotonic increasing or decreasing During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-53-bf1757e6e2b5> in <module>() ----> 1 df5.ix[:1] C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key) 76 else: 77 key = com._apply_if_callable(key, self.obj) ---> 78 return self._getitem_axis(key, axis=0) 79 80 def _get_label(self, label, axis=0): C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis) 990 labels = self.obj._get_axis(axis) 991 if isinstance(key, slice): --> 992 return self._get_slice_axis(key, axis=axis) 993 elif (is_list_like_indexer(key) and 994 not (isinstance(key, tuple) and C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis) 1235 if not need_slice(slice_obj): 1236 return obj -> 1237 indexer = self._convert_slice_indexer(slice_obj, axis) 1238 1239 if isinstance(indexer, slice): C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_slice_indexer(self, key, axis) 190 # if we are accessing via lowered dim, use the last dim 191 ax = self.obj._get_axis(min(axis, self.ndim - 1)) --> 192 return ax._convert_slice_indexer(key, kind=self.name) 193 194 def _has_valid_setitem_indexer(self, indexer): C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in _convert_slice_indexer(self, key, kind) 1077 else: 1078 try: -> 1079 indexer = self.slice_indexer(start, stop, step, kind=kind) 1080 except Exception: 1081 if is_index_slice: C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_indexer(self, start, end, step, kind) 2783 """ 2784 start_slice, end_slice = self.slice_locs(start, end, step=step, -> 2785 kind=kind) 2786 2787 # return a slice C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_locs(self, start, end, step, kind) 2968 end_slice = None 2969 if end is not None: -> 2970 end_slice = self.get_slice_bound(end, 'right', kind) 2971 if end_slice is None: 2972 end_slice = len(self) C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2911 except ValueError: 2912 # raise the original KeyError -> 2913 raise err 2914 2915 if isinstance(slc, np.ndarray): C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2905 # we need to look up the label 2906 try: -> 2907 slc = self.get_loc(label) 2908 except KeyError as err: 2909 try: C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_loc(self, key, method, tolerance) 1945 return self._engine.get_loc(key) 1946 except KeyError: -> 1947 return self._engine.get_loc(self._maybe_cast_indexer(key)) 1948 1949 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)() pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)() pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6610)() pandas\hashtable.pyx in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6554)() KeyError: 1
So far, so good. We have a basic understanding of how indexes can be established, utilized, and reset. We can use the .iloc, .loc, and .ix indexers to retrieve subsets of columns and rows. But what about real-world scenarios where data is rarely, if ever tidy?
The synthetic examples above work (except the intentional errors of course) since they rely on constructing the DataFrames in an orderly manner, like having 'id' columns in alphabetical order or dates in chronological order.
Consider the DataFrame 'df5' created below. It is similar to DataFrame 'df2' in cell #10 above with the exception of the df5['id'] column containing non-unique values.
df5 = pd.DataFrame([['b', 'cold','slow', np.nan, 2., 6., 3., '01/01/16'],
['c', 'warm', 'medium', 4, 5, 7, 9, '03/15/16'],
['a', 'hot', 'fast', 9, 4, np.nan, 6, '04/30/16'],
['d', 'cool', None, np.nan, np.nan, 17, 89, '05/31/16'],
['c', 'cool', 'medium', 16, 44, 21, 13, '07/04/16'],
['e', 'cold', 'slow', np.nan, 29, 33, 17, '08/30/16']],
columns=['id', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'date']);
df5
id | col1 | col2 | col3 | col4 | col5 | col6 | date | |
---|---|---|---|---|---|---|---|---|
0 | b | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 01/01/16 |
1 | c | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
2 | a | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
3 | d | cool | None | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
4 | c | cool | medium | 16.0 | 44.0 | 21.0 | 13.0 | 07/04/16 |
5 | e | cold | slow | NaN | 29.0 | 33.0 | 17.0 | 08/30/16 |
Set the index for DataFrame df5 to the df5['id'] column.
df5.set_index('id', inplace=True)
Validate the index for DataFrame df5.
df5.index
Index(['b', 'c', 'a', 'd', 'c', 'e'], dtype='object', name='id')
We can use the .loc indexer to request the rows in the range of 'b' through 'd'.
df5.loc['b':'d', :]
col1 | col2 | col3 | col4 | col5 | col6 | date | |
---|---|---|---|---|---|---|---|
id | |||||||
b | cold | slow | NaN | 2.0 | 6.0 | 3.0 | 01/01/16 |
c | warm | medium | 4.0 | 5.0 | 7.0 | 9.0 | 03/15/16 |
a | hot | fast | 9.0 | 4.0 | NaN | 6.0 | 04/30/16 |
d | cool | None | NaN | NaN | 17.0 | 89.0 | 05/31/16 |
If you look closely at the results from the example above, you will find the first occurence of the row 'id' label 'c' was returned, but not the second row labeled 'c'. 'id' label 'c' is obviously non-unique. And that is only part of the issue. Consider futher the use of a non-unique label for the row range selection in the example below.
This issue is described in sparse prose here.
If we should want the row label range of 'b' to 'c' with all the columns we raise the error:
"Cannot get right slice bound for non-unique label: 'c'"
df5.loc['b':'c', :]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-58-64fc99b408b0> in <module>() ----> 1 df5.loc['b':'c', :] C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key) 1292 1293 if type(key) is tuple: -> 1294 return self._getitem_tuple(key) 1295 else: 1296 return self._getitem_axis(key, axis=0) C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup) 802 continue 803 --> 804 retval = getattr(retval, self.name)._getitem_axis(key, axis=i) 805 806 return retval C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis) 1435 if isinstance(key, slice): 1436 self._has_valid_type(key, axis) -> 1437 return self._get_slice_axis(key, axis=axis) 1438 elif is_bool_indexer(key): 1439 return self._getbool_axis(key, axis=axis) C:\Users\randy\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_slice_axis(self, slice_obj, axis) 1316 labels = obj._get_axis(axis) 1317 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, -> 1318 slice_obj.step, kind=self.name) 1319 1320 if isinstance(indexer, slice): C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_indexer(self, start, end, step, kind) 2783 """ 2784 start_slice, end_slice = self.slice_locs(start, end, step=step, -> 2785 kind=kind) 2786 2787 # return a slice C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in slice_locs(self, start, end, step, kind) 2968 end_slice = None 2969 if end is not None: -> 2970 end_slice = self.get_slice_bound(end, 'right', kind) 2971 if end_slice is None: 2972 end_slice = len(self) C:\Users\randy\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_slice_bound(self, label, side, kind) 2922 if isinstance(slc, np.ndarray): 2923 raise KeyError("Cannot get %s slice bound for non-unique " -> 2924 "label: %r" % (side, original_label)) 2925 2926 if isinstance(slc, slice): KeyError: "Cannot get right slice bound for non-unique label: 'c'"
What is going on here is the values in 'id' column are non-unique. To enable detection of this condition the attribute .index.is_montonic_increasing and .index_montonic_decreasing return a boolean to test for this non-uniqueness property.
Applied to the df5 DataFrame created in cell #54 above returns False.
df5.index.is_monotonic_increasing
False
The .is_monotonic_increasing attribute applied to the first DataFrame created above, 'df' returns true.
df.index.is_monotonic_increasing
True
While not spelled out in any documentation I found, the moral of the story is when using indices with non-unique values, be wary.
Indexing and Selecting Data pandas 0.19.0 doucmentation
Selecting DataFrame rows and columns using iloc, loc, and ix in Pandas by Shane Lynn
Indexing, Slicing and Subsetting DataFrames in Python by Chris Friedline.
Non-monotonic indexes require exact matches pandas 0.19.0 documentation
Useful pandas Snippets by Becky Swegler, Computers are for People.