# To install this library in Jupyter notebook
import sys
!{sys.executable} -m pip install pandas --quiet
import pandas as pd
pd.__version__ , pd.__path__
('1.3.4', ['/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas'])
A Series is a one-dimensional labeled array capable of holding a sequence of values of any data type (integers, floating point numbers, strings, Python objects, etc.), which by default has integer data labels starting from zero. You can imagine a Pandas Series as a column in a spreadsheet or in a Pandas Dataframe object.
pd.Series() method
pd.Series(data, index, dtype, name)
- data: can be a Python list, Python dictionary, NumPy array, or a scalar value.
- index: if you do not pass the index argument, it defaults to a RangeIndex running from 0 to n-1, equivalent to np.arange(n). Index labels must be hashable (for example numbers or strings), and when data is a list or array the index must have the same length as data. Non-unique index values are allowed. The index is used for three purposes: identification, selection, and alignment.
- dtype: optionally, you can assign any valid NumPy data type to the series object (np.sctypes listed them before NumPy 2.0, which removed that attribute). If not specified, the dtype is inferred from data.
- name: optionally, you can assign a name to a series, which becomes an attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.
import pandas as pd
import numpy as np
list1 = ['Arif', 'Rauf', 'Maaz', '','Hadeed'] # note the empty string
# When index is not provided, it creates an index for the data starting from zero and with a step size of one.
s = pd.Series(data=list1)
print(s)
print(type(s))
0      Arif
1      Rauf
2      Maaz
3
4    Hadeed
dtype: object
<class 'pandas.core.series.Series'>
Observe that output is shown in two columns - the index is on the left and the data value is on the right. If we do not explicitly specify an index for the data values while creating a series, then by default indices range from 0 through N – 1. Here N is the number of data elements.
You can explicitly specify the index for a Series object. The labels can be any hashable values, such as integers, floats, or strings, and the index must be the same size as the list of values. Otherwise, a ValueError is raised.
list1 = ['Arif', 'Rauf', 'Maaz', 'Hadeed']
indices = ['MS01', 'MS02', '', 'MS02'] # non-unique index values are allowed and you can have empty string as index
s = pd.Series(data=list1, index=indices)
print(s)
print(type(s))
MS01      Arif
MS02      Rauf
          Maaz
MS02    Hadeed
dtype: object
<class 'pandas.core.series.Series'>
s['MS01']
'Arif'
Also note that non-unique indices are allowed
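The two rules above (duplicate labels are allowed, but the index length must match the data) can be checked with a short sketch. The variable names `single`, `repeated` and `length_error` are made up for illustration; the data mirrors the example above.

```python
import pandas as pd

names = ['Arif', 'Rauf', 'Maaz', 'Hadeed']
labels = ['MS01', 'MS02', '', 'MS02']   # 'MS02' appears twice
s = pd.Series(data=names, index=labels)

# A label that occurs once returns a single value
single = s['MS01']
# A label that occurs more than once returns a Series with every match
repeated = s['MS02']
print(single)
print(repeated)

# An index shorter than the data is rejected with a ValueError
try:
    pd.Series(data=names, index=['MS01', 'MS02'])
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```

So a lookup on a duplicated label does not fail; it simply returns more than one row.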
list1 = ['Arif', 'Rauf', 'Maaz', 'Hadeed']
indices = [2.1, 2.2, 2.3, 2.4]
s = pd.Series(data=list1, index=indices)
print(s)
print(type(s))
2.1      Arif
2.2      Rauf
2.3      Maaz
2.4    Hadeed
dtype: object
<class 'pandas.core.series.Series'>
You can create a series with NaN values using np.nan, which is the IEEE 754 floating-point representation of Not a Number. NaN values can act as a placeholder for any missing numerical values in the array.
list1 = [1, 2.7, np.nan, 54]
s = pd.Series(data=list1)
print(s)
print(type(s))
0     1.0
1     2.7
2     NaN
3    54.0
dtype: float64
<class 'pandas.core.series.Series'>
Also note that the dtype of the series object is inferred from the data as float64.
You can use the dtype argument to specify a datatype for the series object.
list1 = [27, 33, 19]
s = pd.Series(data=list1, dtype=np.uint8)
print(s)
print(type(s))
0    27
1    33
2    19
dtype: uint8
<class 'pandas.core.series.Series'>
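To see inference and an explicit dtype side by side, here is a minimal sketch. The names `ints`, `with_nan` and `small` are illustrative only; the values reuse the list above.

```python
import numpy as np
import pandas as pd

# Whole numbers only: pandas infers a 64-bit integer dtype
ints = pd.Series([27, 33, 19])

# A single NaN forces a floating-point dtype, because NaN is a float
with_nan = pd.Series([27, 33, np.nan])

# An explicit dtype overrides inference and can save memory
small = pd.Series([27, 33, 19], dtype=np.uint8)

print(ints.dtype, ints.nbytes)
print(with_nan.dtype)
print(small.dtype, small.nbytes)
```

The uint8 version stores one byte per value instead of eight, which is the main reason to pass dtype explicitly when the value range is known.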
Optionally, you can assign a name to a series, which becomes an attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.
list1 = ['Arif', 'Rauf', '', 'Hadeed']
indices = ['MS01', 'MS02', 'MS03', 'MS04']
s = pd.Series(data=list1, index=indices, name='myseries1')
print(s)
print(type(s))
MS01      Arif
MS02      Rauf
MS03
MS04    Hadeed
Name: myseries1, dtype: object
<class 'pandas.core.series.Series'>
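The claim that the name turns into a column name can be verified with a small sketch. `to_frame()` and `rename()` are standard Series methods; the variable names `df` and `renamed` and the new name 'students' are made up for this illustration.

```python
import pandas as pd

s = pd.Series(['Arif', 'Rauf'], index=['MS01', 'MS02'], name='myseries1')

# Converting the Series to a DataFrame uses the Series name as the column name
df = s.to_frame()
print(df.columns)

# rename() with a scalar returns a new Series with a different name;
# the original Series keeps its name
renamed = s.rename('students')
print(s.name, renamed.name)
```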
s = pd.Series(data = np.arange(4))
print(s)
print(type(s))
arr1 = np.array([22.3,33.6, 98, 44])
s = pd.Series(data=arr1, dtype='float64')
print(s)
print(type(s))
0    22.3
1    33.6
2    98.0
3    44.0
dtype: float64
<class 'pandas.core.series.Series'>
my_dict = {
'name':"Arif",
'gender':"Male",
'Role':"Teacher",
'subject':"Data Science"}
s = pd.Series(data=my_dict)
print(s)
print(type(s))
name               Arif
gender             Male
Role            Teacher
subject    Data Science
dtype: object
<class 'pandas.core.series.Series'>
When you create a series from a dictionary, the keys automatically become the index and the values become the data.
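When a dictionary is combined with an explicit index argument, the index selects and orders the keys rather than having to match the length. A minimal sketch, where the extra label 'city' is a made-up key that is deliberately absent from the dictionary:

```python
import pandas as pd

my_dict = {'name': 'Arif', 'gender': 'Male', 'Role': 'Teacher'}

# The index picks keys from the dictionary in the order given;
# a label with no matching key ('city') gets a missing value
s = pd.Series(data=my_dict, index=['Role', 'name', 'city'])
print(s)
```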
s = pd.Series(data=25)
print(s)
print(type(s))
0    25
dtype: int64
<class 'pandas.core.series.Series'>
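A scalar can also be paired with an index, in which case the value is repeated for every label. A short sketch with made-up labels 'a', 'b' and 'c':

```python
import pandas as pd

# The scalar 25 is repeated once per index label
s = pd.Series(data=25, index=['a', 'b', 'c'])
print(s)
```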
# Pass at least `dtype` when creating an empty Series; otherwise pandas 1.x emits the DeprecationWarning shown below
s=pd.Series()
print(s)
print(type(s))
Series([], dtype: float64) <class 'pandas.core.series.Series'>
/var/folders/1t/g3ylw8h50cjdqmk5d6jh1qmm0000gn/T/ipykernel_20298/938514528.py:2: DeprecationWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. s=pd.Series()
Accessing the attributes of a Series using . notation
my_dict = {0:"Rauf", 1:np.nan, 2:"Maaz", 3:"Hadeed", 4:"Mujahid", 5:"Mohid", 6:"Jamil"}
s = pd.Series(my_dict, name="myseries1")
s
0       Rauf
1        NaN
2       Maaz
3     Hadeed
4    Mujahid
5      Mohid
6      Jamil
Name: myseries1, dtype: object
# The `name` attribute returns the name of the series object
s.name
'myseries1'
# The `index` attribute returns the Index object holding the labels and their datatype
s.index
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')
# The `values` attribute returns the underlying values as a NumPy array, along with its datatype
s.values
array(['Rauf', nan, 'Maaz', 'Hadeed', 'Mujahid', 'Mohid', 'Jamil'], dtype=object)
# The `dtype` attribute returns the data type of the underlying data
s.dtype
dtype('O')
# The `shape` attribute returns a tuple with the shape of the underlying data
s.shape
(7,)
# The `nbytes` attribute returns the number of bytes used by the underlying data (each object reference takes 8 bytes on a 64-bit build, so 7 x 8 = 56)
s.nbytes
56
# The `size` attribute returns the number of elements in the underlying data
s.size
7
# The `ndim` attribute returns the number of dimensions of the underlying data
s.ndim
1
# The `hasnans` attribute returns True if there are NaN values in the data
s.hasnans
True
- When the index is unique, pandas uses a hash table to map each key to its value, so a lookup takes O(1) time.
- When the index is non-unique but sorted, pandas uses binary search, which takes logarithmic time, O(log N).
- When the index is non-unique and unsorted, a lookup takes linear time, O(N), because pandas has to check every key in the index.
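Which of these three cases applies can be read off the index itself. The sketch below uses small made-up series and only inspects the index properties `is_unique` and `is_monotonic_increasing`; it does not measure lookup time.

```python
import pandas as pd

unique = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
sorted_dup = pd.Series([1, 2, 3], index=['a', 'b', 'b'])
unsorted_dup = pd.Series([1, 2, 3], index=['b', 'a', 'b'])

for label, ser in [('unique', unique),
                   ('sorted with duplicates', sorted_dup),
                   ('unsorted with duplicates', unsorted_dup)]:
    # is_unique: no repeated labels
    # is_monotonic_increasing: labels are in non-decreasing order
    print(label, ser.index.is_unique, ser.index.is_monotonic_increasing)
```

If lookups on a large series with duplicate labels are slow, sorting it first with `sort_index()` is a cheap thing to try.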
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
s = pd.Series(data=list1)
print(s)
print(s.index)
0 Rauf 1 Arif 2 Maaz 3 Hadeed 4 Mujahid dtype: object RangeIndex(start=0, stop=5, step=1)
The index attribute of the series object shows that the index for this series runs from 0 through 4 with a step of 1.
Let us change the index of this series object to some random integers by assigning an array of random integers to its index attribute.
arr1 = np.random.randint(low = 100, high = 200, size = 5)
s.index = arr1
print(s)
print(s.index)
113 Rauf 152 Arif 176 Maaz 191 Hadeed 179 Mujahid dtype: object Int64Index([113, 152, 176, 191, 179], dtype='int64')
s.index = [1,4,2,6.3,9]
print(s)
print(s.index)
1.0 Rauf 4.0 Arif 2.0 Maaz 6.3 Hadeed 9.0 Mujahid dtype: object Float64Index([1.0, 4.0, 2.0, 6.3, 9.0], dtype='float64')
Changing index of a series to a list of strings
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
s = pd.Series(data=list1)
print(s)
print(s.index)
0 Rauf 1 Arif 2 Maaz 3 Hadeed 4 Mujahid dtype: object RangeIndex(start=0, stop=5, step=1)
indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s.index = indices
print(s)
print(s.index)
num1 Rauf num2 Arif num3 Maaz num4 Hadeed num5 Mujahid dtype: object Index(['num1', 'num2', 'num3', 'num4', 'num5'], dtype='object')
The elements of a Series object can be accessed in three ways:
- The s[] operator, specifying the index (integer/label).
- The s.loc[] method, specifying the index (integer/label).
- The s.iloc[] method, specifying the position (an integer value from 0 to length-1). It also supports negative indexing, so the last element can be accessed with an index of -1.
Identification using Integer Indices or by Position
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = [5, 10, 15, 20, 25]
s = pd.Series(data=list1, index=indices)
s
5 Rauf 10 Arif 15 Maaz 20 Hadeed 25 Mujahid dtype: object
# Pass an index label to the subscript operator
s[25]
# With integer labels, the subscript operator does not work on position
#s[0] # raises a KeyError because the label 0 does not exist
'Mujahid'
# Pass an index label to the loc method
s.loc[20]
# The loc method does not work on position
#s.loc[0] # raises a KeyError because the label 0 does not exist
'Hadeed'
# The iloc method is position based, so passing the label 20 raises an IndexError (there are only 5 positions)
#s.iloc[20]
# The iloc method is passed a position, not an index label
s.iloc[3]
'Hadeed'
Fancy Indexing
# Can access multiple values by specifying a list of indices
s[[20, 5]]
20 Hadeed 5 Rauf dtype: object
# Can access multiple values by specifying a list of indices
s.loc[[20, 5]]
20 Hadeed 5 Rauf dtype: object
# Can access multiple values by specifying list of positions
s.iloc[[3, 0]]
20 Hadeed 5 Rauf dtype: object
Negative indexing works only with iloc here, because s[] and s.loc[] look up -1 as a label
#s[-1]
#s.loc[-1]
s.iloc[-1]
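The difference between the label lookup and the positional lookup for -1 can be made visible by catching the error. A minimal sketch using the same data; the names `last` and `loc_error` are illustrative.

```python
import pandas as pd

s = pd.Series(['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid'],
              index=[5, 10, 15, 20, 25])

# iloc counts positions, so -1 means the last element
last = s.iloc[-1]
print(last)

# loc looks for the label -1, which is not in the index
try:
    s.loc[-1]
    loc_error = None
except KeyError as exc:
    loc_error = exc
print(type(loc_error).__name__)
```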
Identification using String Indices or by Position
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(data=list1, index=indices)
s
num1 Rauf num2 Arif num3 Maaz num4 Hadeed num5 Mujahid dtype: object
# Give index to subscript operator (which in this case is a string or label)
s['num1']
'Rauf'
# With string labels, s[] also accepts a position (deprecated since pandas 2.1; prefer s.iloc[2])
s[2]
'Maaz'
# Give index to loc method (which in this case is a string or label)
s.loc['num1']
'Rauf'
# loc does not fall back to position the way [] did above
#s.loc[0]
# The iloc method is position based, so it raises an error if you pass it a string label
#s.iloc['num1']
# but it works fine if you pass an integer specifying the position
s.iloc[0]
'Rauf'
s.iloc[-1]
Fancy Indexing
# Access multiple values by passing a list of labels to the subscript operator
s[['num3', 'num1']]
# Access multiple values by passing a list of labels to the loc method
s.loc[['num3', 'num1']]
# The iloc method is position based, so it raises an error if you pass it string labels
#s.iloc[['num3', 'num1']]
# but it works fine if you pass a list of integer positions
s.iloc[[2,0]]
A series can be sliced using the : symbol, which returns a subset of the series object (values with their corresponding indices).
A slice object has three arguments, [[start]:[stop][:step]], and all are optional.
The slice object can be used in three ways to slice a Pandas Series object:
- The s[] operator, specifying the index (integer/label).
- The s.loc[] method, specifying the index (integer/label).
- The s.iloc[] method, specifying the position (an integer value from 0 to length-1). It also supports negative indexing, so the last element can be accessed with an index of -1.
Keep the following points in mind:
- The stop argument is NOT inclusive for s[] when the slice uses integers, which s[] treats as positions, while it is inclusive when the slice uses string labels.
- The stop argument is inclusive for s.loc[] for both integer and label indices.
- The stop argument is NOT inclusive for s.iloc[], which is position based.
Note: In the pandas 1.x release used here, slicing a series gives you a view of the original object, which is similar to a shallow copy. So if you modify an element in the original series object, the change is also visible in the sliced series object. With Copy-on-Write enabled (optional in pandas 2.x), the two objects no longer affect each other.
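If an independent subset is what you want, calling `.copy()` on the slice avoids any dependence on view behaviour. A minimal sketch with made-up data; the name `subset` is illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# copy() gives the slice its own data, independent of the original
subset = s.iloc[1:3].copy()
subset.iloc[0] = 100

print(subset)
print(s)   # the original series still holds 2 at label 'b'
```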
Selection/Filtering/Subsetting of Series object having Integer indices
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = [5, 10, 15, 20, 25]
s = pd.Series(data=list1, index=indices)
s
5        Rauf
10       Arif
15       Maaz
20     Hadeed
25    Mujahid
dtype: object
s[5:15]
Series([], dtype: object)
# The subscript operator considers the slice object as positional index and not as the actual indices
# (if we have integer indices)
# The `stop` argument is NOT inclusive for `s[]` for integer indices
s[1:4]
10 Arif 15 Maaz 20 Hadeed dtype: object
#The loc[] method considers the slice object as actual indices and not as positional indices
# The stop argument is inclusive for `s.loc[]` for both integer and label indices
s.loc[5:15]
5 Rauf 10 Arif 15 Maaz dtype: object
# The iloc[] method considers the slice object as positional index and not as the actual indices
# The `stop` argument is NOT inclusive for `s.iloc[]` being position based
s.iloc[1:4]
10 Arif 15 Maaz 20 Hadeed dtype: object
Selection/Filtering/Subsetting of Series object having String Indices
list1 = ['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid']
indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(data=list1, index=indices)
s
num1 Rauf num2 Arif num3 Maaz num4 Hadeed num5 Mujahid dtype: object
s[0:2]
num1 Rauf num2 Arif dtype: object
# With an integer slice, the subscript operator treats the bounds as positions, so `stop` is NOT inclusive.
# With a string slice, the subscript operator treats the bounds as index labels,
# and the `stop` label is inclusive.
s['num2':'num4']
num2 Arif num3 Maaz num4 Hadeed dtype: object
# The `stop` argument is inclusive for `s[]` for string indices, while it is NOT inclusive for integer indices.
s[0:2]
#The loc[] method considers the slice object as actual indices and not as positional indices
# The stop argument is inclusive for `s.loc[]` for both integer and label indices
s.loc['num2':'num4']
# The iloc[] method considers the slice object as positional index and not as the actual indices
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num2': 'num4']
# however will work fine if you pass an integer values (specifying positions) in the slice operator
# Moreover the stop index is not inclusive
s.iloc[1:4]
Understanding Step with Series object having String Indices
s
# The step works fine with string indices as well
s['num2':'num5':1]
s['num2':'num5':2]
s['num5':'num3':-1]
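The results of the step slices above are not shown, so here is a sketch of what to expect. It uses `s.loc[]`, which slices by label like `s[]` does for string labels; the names `every_other` and `backwards` are made up.

```python
import pandas as pd

s = pd.Series(['Rauf', 'Arif', 'Maaz', 'Hadeed', 'Mujahid'],
              index=['num1', 'num2', 'num3', 'num4', 'num5'])

# Step of 2: start at 'num2' and take every second label up to 'num5'
every_other = s.loc['num2':'num5':2]
print(every_other)

# Negative step: walk backwards from 'num5' down to 'num3' (inclusive)
backwards = s.loc['num5':'num3':-1]
print(backwards)
```

Note that with a step of 2 the stop label 'num5' is not part of the result, simply because the stride skips over it.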
Example 1: Adding two series object with same integer indices
list1 = [1,3,5,7,9]
list2 = [2,4,6,8,10]
s1 = pd.Series(data=list1)
s2 = pd.Series(data=list2)
print(s1)
print(s1.index)
0    1
1    3
2    5
3    7
4    9
dtype: int64
RangeIndex(start=0, stop=5, step=1)
print(s2)
print(s2.index)
0     2
1     4
2     6
3     8
4    10
dtype: int64
RangeIndex(start=0, stop=5, step=1)
s3 = s1 + s2
print(s3)
print(s3.index)
0     3
1     7
2    11
3    15
4    19
dtype: int64
RangeIndex(start=0, stop=5, step=1)
Example 2: Adding two series object having different integer indices
list1 = [6,9,7,5]
index1 = [0,1,2,3]
list2 = [8,6,2,1]
index2 = [0,2,3,5]
s1 = pd.Series(data=list1, index=index1)
s2 = pd.Series(data=list2, index=index2)
print(s1)
print(s1.index)
0 6 1 9 2 7 3 5 dtype: int64 Int64Index([0, 1, 2, 3], dtype='int64')
print(s2)
print(s2.index)
0 8 2 6 3 2 5 1 dtype: int64 Int64Index([0, 2, 3, 5], dtype='int64')
s3 = s1 + s2
print(s3)
print(s3.index)
0 14.0 1 NaN 2 13.0 3 7.0 5 NaN dtype: float64 Int64Index([0, 1, 2, 3, 5], dtype='int64')
Problem: When you perform mathematical operations on series with mismatched indices, any label that is present in only one of the series produces NaN in the result by default.
Solution: Instead of using the operators (+, -, *, /), an explicit call to s.add(), s.sub(), s.mul() and s.div() is preferred. Their fill_value argument substitutes a specific value for a label that is missing from one of the series, so the result holds a concrete number in place of NaN.
s1.add(s2, fill_value=0) # Compare it with above result
0 14.0 1 9.0 2 13.0 3 7.0 5 1.0 dtype: float64
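The same fill_value idea carries over to the other arithmetic methods. A minimal sketch reusing the two series from this example; choosing 0 for subtraction and 1 for multiplication is an assumption about what a sensible neutral value is, not a rule.

```python
import pandas as pd

s1 = pd.Series([6, 9, 7, 5], index=[0, 1, 2, 3])
s2 = pd.Series([8, 6, 2, 1], index=[0, 2, 3, 5])

# Missing labels are treated as 0 before subtracting
difference = s1.sub(s2, fill_value=0)
print(difference)

# Missing labels are treated as 1 before multiplying
product = s1.mul(s2, fill_value=1)
print(product)
```

fill_value only applies when a label is missing from one side; a label missing from both sides would still give NaN.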
Example 3: Adding two series object having different string indices
list1 = [6,9,7,5, 2]
labels1 = ['num1', 'num2', 'num3', 'num4', 'num5']
list2 = [8,6,2,3,6]
labels2 = ['num1', 'num2', 'num3', 'num8', 'num5']
s1 = pd.Series(data=list1, index=labels1)
s2 = pd.Series(data=list2, index=labels2)
print(s1)
print(s1.index)
num1 6 num2 9 num3 7 num4 5 num5 2 dtype: int64 Index(['num1', 'num2', 'num3', 'num4', 'num5'], dtype='object')
print(s2)
print(s2.index)
num1 8 num2 6 num3 2 num8 3 num5 6 dtype: int64 Index(['num1', 'num2', 'num3', 'num8', 'num5'], dtype='object')
# Let us use the `add()` method
#s1+s2
s3 = s1.add(s2, fill_value=5)
#s3 = s1.add(s2)
print(s3)
print(s3.index)
num1 14.0 num2 15.0 num3 9.0 num4 10.0 num5 8.0 num8 8.0 dtype: float64 Index(['num1', 'num2', 'num3', 'num4', 'num5', 'num8'], dtype='object')
My dear students, please make time to practice the following topics related to Series:
- s.reset_index() method for completely resetting the index.
- s.pop(index) is passed an index label; it returns the data item at that label and removes it from the series.
- s.drop(indexes) is passed one label or a list of labels and returns a new series without those data items. The original series remains unchanged unless the inplace=True argument is passed.
- s1.append(s2, ignore_index=False, verify_integrity=False) is used to concatenate two series and return the concatenated series; the original series remain unchanged. Series.append was removed in pandas 2.0, where pd.concat([s1, s2]) does the same job.
- s1.update(s2) is used to modify the series s1 in place using the values from the passed series.
We will discuss these while studying the Pandas Dataframe object, InshaAllah.
- In a series object we can define our own labeled index to access elements; the labels can be numbers or strings. NumPy array elements are accessed only by their integer position.
- In a series object the index labels can be in any order, including descending. In NumPy arrays, the indexing always starts with zero for the first element and is fixed.
- Arithmetic on series with misaligned indices aligns on labels and produces NaN for any label missing from either series. NumPy arrays are combined by position under the broadcasting rules, with no label alignment, so no NaN is introduced for missing labels, and an operation on arrays whose shapes cannot be broadcast fails with an error.
- A Series requires more memory than a NumPy array holding the same data, because it also stores an index.
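The alignment point is the easiest one to see in code. A minimal sketch with made-up arrays of different lengths; the names `numpy_error` and `aligned` are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([10, 20])

# NumPy combines by position; shapes (3,) and (2,) cannot be broadcast
try:
    a + b
    numpy_error = None
except ValueError as exc:
    numpy_error = str(exc)
print(numpy_error)

# pandas aligns on the index labels 0, 1, 2; label 2 is missing from the
# second series, so the result holds NaN there
aligned = pd.Series(a) + pd.Series(b)
print(aligned)
```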