This notebook is part of my Python data science curriculum
import numpy as np
import pandas as pd
In this table, the columns indicate that type of the index. Indexes behave differently depending on whether the values are on monotonic (M), non-monotonic order (NM), or contain duplicates (D), so multiple cases have been included.
The rows indicate the arguments to the bracket operator. Slices can behave differently depending on whether the endpoints are present in the index or not. [] indicates both endpoints are included; [) indicates the right endpoint of the slice is not present in the index. When the index is nonmonotonic, slices always return rows in the order they are present in the Series - they never skip around based on the values of the index.
Int M | Int NM | Float M | Float NM | Str M | Str NM | Str D | Tup M | Tup NM | |
---|---|---|---|---|---|---|---|---|---|
single str | - | - | - | - | Idx | Idx | Idx*1 | - | - |
str list | - | - | - | - | Idx {} | Idx {} | Idx {} | - | - |
str slc [] | - | - | - | - | Idx [] | Idx [] | Idx [] | - | - |
str slc [) | - | - | - | - | Idx [) | - | Idx [) | - | - |
single int | Idx | Idx | Idx | Idx | Row | Row | - | Row | Row |
int list | Idx {} | Idx {} | Idx {} | Idx {} | Row {} | Row {} | Row {} | Row {} | Row {} |
int slc [] | Row [) | Row [) | Idx [] | Idx [] | Row [) | Row [) | Row [) | Row [) | Row [) |
int slc [) | Row [) | Row [) | Idx [) | - | Row [) | Row [) | Row [) | Row [) | Row [) |
single float | Idx | Idx | Idx | Idx | - | - | - | - | - |
float list | Idx {} | Idx {} | Idx {} | Idx {} | - | - | - | - | - |
float slc [] | - | - | Idx [] | Idx [] | - | - | - | - | - |
float slc [) | - | - | Idx [) | - | - | - | - | - | - |
single tuple | - | - | - | - | - | - | - | Idx | Idx |
tuple slc [] | - | - | - | - | - | - | - | Idx [] | Idx [] |
tuple slc [) | - | - | - | - | - | - | - | - | - |
[) indicates that the right endpoint is not included
[] indicates that the right endpoint is included
- indicates that combination doesn't work and you get an exception
*1 Returns a scalar if the key matched one value, or a Series if it matched more than one.
Slices always return a Series (possibly empty or with a single element).
Single indexes always return a scalar.
# If you do not provide an index, a RangeIndex is created for you:
sn = pd.Series([0,11,22,33,44])
sn.index
RangeIndex(start=0, stop=5, step=1)
# You can also explicitly create a RangeIndex starting somewhere other than zero:
sr = pd.Series([0,11,22,33,44], index=pd.RangeIndex(10,15))
sr.index
# As far as I can tell, a RangeIndex is functionally equivalent to an IntegerIndex,
# but there might be corner cases I haven't found yet!
RangeIndex(start=10, stop=15, step=1)
# Strings get dtype object
scm = pd.Series([0,11,22,33,44], index=['a','b','c','d','e'])
scn = pd.Series([0,11,22,33,44], index=['a','b','c','f','e'])
scd = pd.Series([0,11,22,33,44], index=['a','b','c','c','e'])
scm.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
sim = pd.Series([0,11,22,33,44], index=[10,11,12,14,13])
sin = pd.Series([0,11,22,33,44], index=[10,11,12,15,13])
sid = pd.Series([0,11,22,33,44], index=[10,11,12,12,13])
sim.index
Int64Index([10, 11, 12, 14, 13], dtype='int64')
sfm = pd.Series([0,11,22,33,44], index=[10.0,11,12,13,14])
sfn = pd.Series([0,11,22,33,44], index=[10.0,11,12,15,14])
sfm.index
Float64Index([10.0, 11.0, 12.0, 13.0, 14.0], dtype='float64')
# Finally, let's create one with complex objects (e.g. tuples) for an index:
stm = pd.Series([0,11,22,33,44], index=[(10,10), (11,11), (12,12), (14,14), (13,13)])
stn = pd.Series([0,11,22,33,44], index=[(10,10), (11,11), (12,12), (15,15), (13,13)])
stm
(10, 10) 0 (11, 11) 11 (12, 12) 22 (14, 14) 33 (13, 13) 44 dtype: int64
# With an integer index, a single element uses the index
sim[12]
22
# but a slice uses the row numbers:
sim[2:3]
12 22 dtype: int64
# if your index were a float instead of int, the slice _would_ use the index,
# and it would include both endpoints:
sfn[12:14]
12.0 22 15.0 33 14.0 44 dtype: int64
# If you have an integer index, using a single float works and uses the index:
sim[11.0]
11
# but a float slice is an error:
try:
sim[11.0:12.0]
except Exception as e:
print(repr(e))
TypeError("cannot do slice indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [11.0] of <class 'float'>")
# With an Ojbect dtype that's actually characters, you can slice with values that are not
# in the index, if and only if the index is monotonic:
print(scm['e':'q'])
try:
scn['e':'q']
except Exception as e:
print(repr(e))
e 44 dtype: int64 KeyError('q')
# but with other objects this doesn't work, even if comparison is defined - only elements
# present in the index can be used even if the index is monotonic:
print((10,10) < (12,13))
print(stm[(10,10):(11,11)])
try:
stm[(10,10):(17,17)]
except Exception as e:
print(repr(e))
True (10, 10) 0 (11, 11) 11 dtype: int64 KeyError((17, 17))
# Beyond the chart above, there are _even more_ cases. Consider what happens when we have a DUPLICATE label - now
# we can't index by row anymore
scd = pd.Series([0,11,22,33,44], index=['a','b','c','c','e'])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) ~/.pyenv/versions/anaconda3-5.3.0/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key) 3117 return self._engine.get_value(s, k, -> 3118 tz=getattr(series.dtype, 'tz', None)) 3119 except KeyError as e1: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates() TypeError: '<' not supported between instances of 'str' and 'int' During handling of the above exception, another exception occurred: IndexError Traceback (most recent call last) <ipython-input-82-ba3497aaf88f> in <module>() 2 3 scd = pd.Series([0,11,22,33,44], index=['a','b','c','c','e']) ----> 4 scd[1] ~/.pyenv/versions/anaconda3-5.3.0/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key) 765 key = com._apply_if_callable(key, self) 766 try: --> 767 result = self.index.get_value(self, key) 768 769 if not is_scalar(result): ~/.pyenv/versions/anaconda3-5.3.0/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key) 3136 # python 3 3137 if is_scalar(key): # pragma: no cover -> 3138 raise IndexError(key) 3139 raise InvalidIndexError(key) 3140 IndexError: 1
As an alternative, always use .loc (by Index value) and .iloc(by integer starting at 0). You will still have to remember that monotonic and nonmonotonic indices behave differently when you try to slice with an endpoint that is not present, but most of the other heuristics are simplified away.
Note that these are not functions - they expose a special attribute which you then access with [] as usual, so you can get slices.
sin[1:2]
11 11 dtype: int64
sin.iloc[1:2]
11 11 dtype: int64
sin.loc[11:12]
11 11 12 22 dtype: int64
sfn[11:12]
11.0 11 12.0 22 dtype: int64
sfn.iloc[1:2]
11.0 11 dtype: int64
sfn.loc[11:12]
11.0 11 12.0 22 dtype: int64
# If you don't supply column or row names, you get an implicit RangeIndex:
dn = pd.DataFrame(np.arange(0,6))
print(dn.columns)
print(dn.index)
RangeIndex(start=0, stop=1, step=1) RangeIndex(start=0, stop=6, step=1)
# If you construct a dataframe from a list, the elements are taken as ROWS, not columns:
pd.DataFrame([np.arange(0,3), np.arange(1,4)])
0 | 1 | 2 | |
---|---|---|---|
0 | 0 | 1 | 2 |
1 | 1 | 2 | 3 |
# To construct it from columns, pass them as a dict with names:
pd.DataFrame({'x1':np.arange(0,3), 'x2':np.arange(1,4)})
x1 | x2 | |
---|---|---|
0 | 0 | 1 |
1 | 1 | 2 |
2 | 2 | 3 |
# The only way I know of to construct it from columns without supplying names is to go via a
# Numpy 2d array:
pd.DataFrame(np.hstack([np.arange(0,3).reshape(3,1), np.arange(1,4).reshape(3,1)]))
0 | 1 | |
---|---|---|
0 | 0 | 1 |
1 | 1 | 2 |
2 | 2 | 3 |
# If you make two dataframes and concatenate them, note that both columns are
# called 0:
pd.concat([pd.DataFrame(np.arange(0,3)),pd.DataFrame(np.arange(1,4))], axis=1)
0 | 0 | |
---|---|---|
0 | 0 | 1 |
1 | 1 | 2 |
2 | 2 | 3 |
# In practice, columns will essentially always have names, so we consider only two cases here,
# with and without named rows.
dn = pd.DataFrame({'w':np.arange(0,66,11),'v':np.arange(11,77,11)})
dn
w | v | |
---|---|---|
0 | 0 | 11 |
1 | 11 | 22 |
2 | 22 | 33 |
3 | 33 | 44 |
4 | 44 | 55 |
5 | 55 | 66 |
ds = dn.copy()
ds.index=['a','b','c','d','e','f']
ds
w | v | |
---|---|---|
a | 0 | 11 |
b | 11 | 22 |
c | 22 | 33 |
d | 33 | 44 |
e | 44 | 55 |
f | 55 | 66 |
# We also need one with an integer index, where the integers don't start at 0.
di = dn.copy()
di.index=[10,11,12,13,14,15]
di
w | v | |
---|---|---|
10 | 0 | 11 |
11 | 11 | 22 |
12 | 22 | 33 |
13 | 33 | 44 |
14 | 44 | 55 |
15 | 55 | 66 |
In this table, we need to record whether it tries to index the rows or columns, whether it tries to use the Index (named) or Row (integers starting at 0) naming, and whether it returns the result as a DataFrame (2d) or Series (1d).
Row Idx | Arg | Axis | Lookup | Result | Right Endpt |
---|---|---|---|---|---|
any | scalar | Col | Index | Series | |
any | list | Col | Index | DF | |
str | str slc | Row | Index | DF | incl |
str | int slc | Row | Row | DF | excl |
int | int slc | Row | Row | DF | excl |
Since ix is such a disaster, we will consider only loc and iloc. Any time you use a slice (even if it has 1 element) or a list, that dimension is preserved; any time you use a scalar, that dimension is dropped.
# This returns a single number
ds.loc['b','v']
22
# This returns a 1x1 data frame
ds.loc['b':'b','w':'w']
w | |
---|---|
b | 11 |
# As does this
ds.loc[['b'],['w']]
w | |
---|---|
b | 11 |
# Since the column names are monotonically decreasing, I can refer to a value which isn't present as part of the slice:
ds.loc['b':'b','z':'v']
w | v | |
---|---|---|
b | 11 | 22 |
# If I want to drop one dimension and keep the other, I can do so like this. When a dimension is dropped, I always
# get back a Series, which will print as a column. Note the metadata:
ds.loc['b','w':'v']
w 11 v 22 Name: b, dtype: int64
# All the same ideas apply with iloc
ds.iloc[1,1]
22
# Note that iloc integer slices exclude the right endpoint:
ds.iloc[1:2,1:2]
v | |
---|---|
b | 22 |
ds.iloc[[1],[1]]
v | |
---|---|
b | 22 |
# The only trick part is when you want to index rows by name and columns by number or vice versa.
# Some approaches are shown here:
# http://pandas-docs.github.io/pandas-docs-travis/indexing.html#ix-indexer-is-deprecated
# but I find those approaches expose a disgusting amount of implementation detail, and are just as unusable in their
# own way as .ix was.
#
# I prefer to simply index in two steps:
ds.iloc[[1],:].loc[:,['w']]
w | |
---|---|
b | 11 |
# The difference in time is probably not important for most purposes
%timeit ds.iloc[[1],:].loc[:,['w']]
%timeit ds.iloc[[1], ds.columns.get_indexer(['w'])]
%timeit ds.ix[[1],['w']]
1.38 ms ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) 1.06 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
/home/terran/.pyenv/versions/anaconda3-5.3.0/lib/python3.7/site-packages/ipykernel_launcher.py:1: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated """Entry point for launching an IPython kernel.
1.02 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Alternatively, we could convert the numbers to names and use .loc:
%timeit ds.loc[[ds.index[1]], ['w']]
1.5 ms ± 64.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)