The Python and Data Analysis Library (pandas)¶

This is a beginner guide looking at one of the most common Python libraries. It can be viewed in the browser; however, the Interactive Python Notebook file can also be opened in JupyterLab. The Python and Data Analysis Library (which is abbreviated using lower case pandas) is used to organize and manipulate data in a dataframe (which can be thought of as the equivalent of an Excel spreadsheet).

Prerequisites¶

Before looking at any Python library it is recommended to familiarize yourself with the Python programming language and object oriented programming concepts. Details about these, in addition to Python installation instructions, are available in my prerequisite guide below.

It is also recommended to familiarize yourself with the Numeric Python library numpy. I have an additional JupyterLab notebook covering the numpy library.

Importing the pandas Library¶

The pandas library is one of the most commonly used Python libraries, particularly in the field of data science, and as a consequence it is commonly imported using a two-letter abbreviation, similar to the numpy library.

In [1]:
import numpy as np
import pandas as pd


NumPy Arrays¶

Before starting with pandas it is worth having a quick recap of how we encode numeric data as a list and as a numpy array.

Lists¶

Numeric data can be expressed in a list:

In [2]:
x=[0,1,2,3,4]

In [3]:
x

Out[3]:
[0, 1, 2, 3, 4]
In [4]:
y1=[0,2,4,6,8]

In [5]:
y1

Out[5]:
[0, 2, 4, 6, 8]
In [6]:
y2=[0,1.5,3,4.5,6.0]

In [7]:
y2

Out[7]:
[0, 1.5, 3, 4.5, 6.0]

Although the data is numeric, the object is a list and therefore the attributes and methods relate to common list operations and not common numeric operations.

Attributes and methods available from an object can be seen by typing in the object name followed by a dot . and then pressing tab ↹.

A method can be thought of as a function that belongs to the object. For example the list method append will take the original list and append an additional item at the end of the list. As methods are functions they must be called with parenthesis. The parenthesis can contain the positional input arguments (which must always be defined in order at the beginning) and keyword input arguments (which can optionally be assigned and otherwise use their default values). To get details about the input arguments press shift ⇧ and tab ↹.

Here we see the positional input argument object.

Note also that the list method append has no return value and performs an inplace update of the original list.

Let's assign the positional input argument object to 4. It will be appended to the end of the list y1. An inplace update of y1 will occur with this new element.

In [8]:
y1.append(4)

In [9]:
y1

Out[9]:
[0, 2, 4, 6, 8, 4]

This additional element can be popped off by using the list method pop. Once again to get details about the input arguments press shift ⇧ and tab ↹. The list method pop has a single keyword input argument index=-1. This means that the keyword input argument index has a default value of -1. So if we don't change it, we will pop the last value (the -1 value) off from the list.

Note this method performs an inplace update of the list but also has a return value which is the item popped.

In [10]:
y1.pop()

Out[10]:
4
In [11]:
y1

Out[11]:
[0, 2, 4, 6, 8]
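
The list method pop can also be given an index to remove an element from elsewhere in the list. A minimal sketch (note that in practice the argument is positional-only, so it is passed without the index= name):

```python
y=[0,2,4,6,8]
first=y.pop(0)   # pop the element at index 0 instead of the default -1
print(first)     # 0
print(y)         # [2, 4, 6, 8]
```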

The + operator performs the operation outlined in the special method __add__. If we press shift ⇧ and tab ↹, the string form states it is a "list object" method. For a list this special method is set up to perform concatenation.

So for two equally sized lists, we get a new list that has twice as many elements.

In [12]:
y1+y2

Out[12]:
[0, 2, 4, 6, 8, 0, 1.5, 3, 4.5, 6.0]

1D Arrays¶

When dealing with exclusively numeric data, a numpy array is more powerful. The attributes and methods are more specialized for numeric operations. For example the operator + will perform element by element addition of equally sized numpy arrays, as opposed to concatenating the two equally sized arrays together. The function np.array can be used to convert a list into an array.

In [13]:
x=np.array(x)

In [14]:
x

Out[14]:
array([0, 1, 2, 3, 4])
In [15]:
y1=np.array(y1)

In [16]:
y1

Out[16]:
array([0, 2, 4, 6, 8])
In [17]:
y2=np.array(y2)

In [18]:
y2

Out[18]:
array([0. , 1.5, 3. , 4.5, 6. ])

The attributes and methods available to the new object y1 can now be examined by typing in the object name followed by a dot . and then pressing tab ↹.

We see the list of attributes and methods is different for the object y1 now that it is a numpy array and, as expected, is more numeric focused. This object has both attributes and methods. An attribute can be thought of as an object that is referenced with respect to another object (i.e. it is itself an instance of a class). Typically it is a property of the object; for example the attribute shape gives the dimensions. An attribute has no input arguments and is therefore accessed without parenthesis. Pressing shift ⇧ and tab ↹ will do nothing as the attribute has no input arguments.

In [19]:
y1.shape

Out[19]:
(5,)

Note the shape of the numpy array only specifies a single dimension. This means that the numpy array is neither a row nor a column but can instead be viewed as either, as convenient.

The + operator performs the operation outlined in the special method __add__. If we press shift ⇧ and tab ↹, the string form states it is a "ndarray object" method. For a ndarray this special method is set up to perform element by element addition.

In [20]:
y1+y2

Out[20]:
array([ 0. ,  3.5,  7. , 10.5, 14. ])

2D Arrays (Columns or Rows)¶

The numpy array above is neither a column nor a row; however, there are times we need to specify it as either a column or a row for a specific numeric operation, for example concatenation. It can be explicitly reshaped using the ndarray method reshape. The positional input argument is newshape, a tuple corresponding to the new dimensions. If -1 is specified for one of the dimensions, all remaining elements will be applied to that dimension.

In [21]:
x.reshape

Out[21]:
<function ndarray.reshape>

In [22]:
x=x.reshape((-1,1))

In [23]:
x

Out[23]:
array([[0],
[1],
[2],
[3],
[4]])
In [24]:
y1=y1.reshape((-1,1))

In [25]:
y1

Out[25]:
array([[0],
[2],
[4],
[6],
[8]])
In [26]:
y2=y2.reshape((-1,1))

In [27]:
y2

Out[27]:
array([[0. ],
[1.5],
[3. ],
[4.5],
[6. ]])

2D Arrays (Matrices)¶

The column vectors can be concatenated to create a matrix. The keyword input argument axis for a matrix can be 0, corresponding to the rows, or 1, corresponding to the columns. Recall that when working with numpy arrays, any new dimensions are always added to the front of the shape tuple.

(ncols,)=array1d.shape

(nrows,ncols)=array2d.shape

(npages,nrows,ncols)=array3d.shape

(nframes,npages,nrows,ncols)=array4d.shape

...

This means for a 1d array axis=0 corresponds to the columns, for a 2d array axis=0 corresponds to the rows, for a 3d array axis=0 corresponds to the pages and for a 4d array axis=0 corresponds to the frames. More generally the axis can be specified using its negative index, where axis=-1 always corresponds to the columns and axis=-2 always corresponds to the rows.
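
A minimal sketch of these axis conventions using numpy sums:

```python
import numpy as np

a1=np.array([0,2,4,6,8])   # 1d array, shape (5,)
a2=a1.reshape((-1,1))      # 2d column vector, shape (5,1)

print(a1.sum(axis=0))      # 20, for a 1d array axis=0 spans the elements
print(a2.sum(axis=0))      # [20], for a 2d array axis=0 spans the rows
print(a2.sum(axis=-2))     # [20], axis=-2 always corresponds to the rows
print(a2.sum(axis=-1))     # [0 2 4 6 8], axis=-1 always corresponds to the columns
```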

We will move onto dataframes which have data in the form of a 2d array. Therefore when axis is used in a dataframe axis=0 corresponds to rows and axis=1 corresponds to columns.

In [28]:
matrix=np.concatenate([x,y1,y2],axis=-1)

In [29]:
matrix

Out[29]:
array([[0. , 0. , 0. ],
[1. , 2. , 1.5],
[2. , 4. , 3. ],
[3. , 6. , 4.5],
[4. , 8. , 6. ]])

The Series, DataFrame and Index Classes¶

The pandas Series Class¶

Note when the matrix was created that the names of the vectors x, y1 and y2 got lost and were instead replaced by the name of the matrix. Instead of creating numpy arrays, which only have data and a variable name, we can explicitly create an object which can possess a name. We will start by creating columns in the form of a pandas series. Note when we are instantiating a class, i.e. creating an instance or an object, we need to call the class using CamelCaseCapitalization. This notation is used in the Python programming language to distinguish class creation from functions.

Classes, like functions, use parenthesis and contain positional and keyword input arguments. Once again we can access these by use of shift ⇧ and tab ↹.

In this case all the input arguments are keyword input arguments but the keyword input data can also be used as a positional input argument. We can provide the data in the form of a list and provide the name as a string.

The datatype will be inferred from the contents of the input. For example x and y1 are assigned the datatypes int64 and int32 from the list and numpy array respectively. y2 is a float as the list has numbers which include a decimal point.

Note that the output has a default integer index 0,1,2,3 and 4. Each value in a series is paired with its respective index value.

In [30]:
x=[0,1,2,3,4]
x=pd.Series(data=x,name='x')

In [31]:
x

Out[31]:
0    0
1    1
2    2
3    3
4    4
Name: x, dtype: int64
In [32]:
y1=[0,2,4,6,8]
y1=np.array(y1)
y1=pd.Series(data=y1,name='y1')

In [33]:
y1

Out[33]:
0    0
1    2
2    4
3    6
4    8
Name: y1, dtype: int32
In [34]:
y2=[0,1.5,3,4.5,6.0]
y2=np.array([0,1.5,3,4.5,6.0])
y2=pd.Series(data=y2,name='y2')

In [35]:
y2

Out[35]:
0    0.0
1    1.5
2    3.0
3    4.5
4    6.0
Name: y2, dtype: float64

The attributes and methods available to the new object y1 can now be examined. Once again by typing in the object name followed by a dot . and then pressing tab ↹.

We can use the attribute name for example to return the name of the series as a string.

In [36]:
y1.name

Out[36]:
'y1'

The pandas Index Class¶

Note each of the series x, y1 and y2 have the same default numeric integer index with the values 0,1,2,3 and 4. This can be changed when creating the series to a custom list of strings.

In [37]:
x=[0,1,2,3,4]
rnames=['r0','r1','r2','r3','r4']
x=pd.Series(data=x,name='x',index=rnames)

In [38]:
x

Out[38]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64

Instead of using a separate list for data and index, data can be supplied directly as a dictionary. Since we already have these as separate lists, let's zip the two lists together to make a dictionary.

In [39]:
idx=dict(zip(rnames,x))

In [40]:
idx

Out[40]:
{'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3, 'r4': 4}

Now let's create a series directly from the dictionary.

In [41]:
x=pd.Series(data=idx,name='x')

In [42]:
x

Out[42]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64

If the index names are labelled appropriately (i.e. in line with Python variable naming rules) they will be accessible as attributes of the series. Accessing the attribute returns the value.

In [43]:
x.r0

Out[43]:
0

Otherwise the value can be accessed using square brackets analogous to indexing in a dictionary.

In [44]:
x['r0']

Out[44]:
0

Also analogous to a dictionary, where the keys and values can be accessed, the index and values of a series can be accessed as attributes.

In [45]:
x.index

Out[45]:
Index(['r0', 'r1', 'r2', 'r3', 'r4'], dtype='object')
In [46]:
x.values

Out[46]:
array([0, 1, 2, 3, 4], dtype=int64)

Notice the outputs of these are a pandas Index class and a numpy array respectively. Let's now have a look at instantiating an index class directly. Recall when instantiating a class we need to use CamelCaseCapitalization.

We can access the list of input arguments by use of shift ⇧ and tab ↹. Here we see that the index class is very similar to the series class and virtually all the keyword input arguments are the same. The index class doesn't have the index keyword input argument as the index of an index is always going to consist of numeric integers.

In [47]:
rnames=['r0','r1','r2','r3','r4']
idx=pd.Index(data=rnames)

In [48]:
idx

Out[48]:
Index(['r0', 'r1', 'r2', 'r3', 'r4'], dtype='object')

Let's now update the index of x, y1 and y2

In [49]:
rnames=['r0','r1','r2','r3','r4']
x=[0,1,2,3,4]
y1=[0,2,4,6,8]
y2=[0,1.5,3,4.5,6.0]

x=pd.Series(data=x,name='x',index=rnames)
y1=pd.Series(data=y1,name='y1',index=rnames)
y2=pd.Series(data=y2,name='y2',index=rnames)

In [50]:
x

Out[50]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64
In [51]:
y1

Out[51]:
r0    0
r1    2
r2    4
r3    6
r4    8
Name: y1, dtype: int64
In [52]:
y2

Out[52]:
r0    0.0
r1    1.5
r2    3.0
r3    4.5
r4    6.0
Name: y2, dtype: float64

The pandas DataFrame Class¶

We have seen an index class, which we essentially use to create row names, and a series class, which is a single column that has a number of index-value pairs analogous to a dictionary. We can now look at creating a data structure that has multiple columns with a shared index, called a dataframe. Recall again when instantiating a class we need to use CamelCaseCapitalization.

We can access the list of input arguments by use of shift ⇧ and tab ↹. Here we see that the dataframe class is very similar to the series class. We have the keyword input arguments data and index. The data supplied to a dataframe is more complicated as it is 2D as opposed to 1D, but the index keyword input argument works the same way. The keyword input argument columns is present instead of name and is used to name each of the columns.

Let's have a look at creating a dataframe class using one of the previously created series.

In [53]:
df=pd.DataFrame(data=x,columns=[x.name])

In [54]:
df

Out[54]:
x
r0 0
r1 1
r2 2
r3 3
r4 4

This dataframe has only one column named x which can be accessed as an attribute.

In [55]:
df.x

Out[55]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64

This attribute is a series and recall that we can access an index in a series as an attribute.

In [56]:
df.x.r0

Out[56]:
0

We can add a new column to the dataframe by indexing into the dataframe using square brackets and assigning to a new column name.

In [57]:
df['y1']=y1

In [58]:
df

Out[58]:
x y1
r0 0 0
r1 1 2
r2 2 4
r3 3 6
r4 4 8

Doing so will create the dataframe attribute y1 which can now be accessed.

In [59]:
df.y1

Out[59]:
r0    0
r1    2
r2    4
r3    6
r4    8
Name: y1, dtype: int64

Note that the attribute is only created after the addition of a new column and therefore an attribute can only be accessed if the column already exists. Attempting to create a new column using attribute assignment will not work as expected: pandas sets a plain object attribute instead of adding a column, so square brackets must be used to create new columns.
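
A small sketch of this caveat, using a throwaway dataframe (the names here are hypothetical):

```python
import pandas as pd

df2=pd.DataFrame({'x':[0,1,2]})
df2.y=[0,2,4]              # sets a plain Python attribute; pandas emits a UserWarning
print('y' in df2.columns)  # False, no column was created
df2['y']=[0,2,4]           # square brackets must be used to create the column
print('y' in df2.columns)  # True
```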

In [60]:
df['y2']=y2

In [61]:
df

Out[61]:
x y1 y2
r0 0 0 0.0
r1 1 2 1.5
r2 2 4 3.0
r3 3 6 4.5
r4 4 8 6.0

The dataframe can also be created from a dictionary. In this form the keyword input argument data contains both the column names and the column data, therefore the keyword input argument columns is not used. The dictionary is set up so that the keys are the column names and the values are the data, present as either a list or a pandas series.

In [62]:
colnamedata={x.name:x,y1.name:y1,y2.name:y2}
df=pd.DataFrame(data=colnamedata)

In [63]:
df

Out[63]:
x y1 y2
r0 0 0 0.0
r1 1 2 1.5
r2 2 4 3.0
r3 3 6 4.5
r4 4 8 6.0

Finally, it is more common to create a dataframe from a numpy matrix and assign both the column names and the index (row) names as lists of strings.

In [64]:
matrix

Out[64]:
array([[0. , 0. , 0. ],
[1. , 2. , 1.5],
[2. , 4. , 3. ],
[3. , 6. , 4.5],
[4. , 8. , 6. ]])
In [65]:
cnames=['x','y1','y2']

In [66]:
rnames=['r0','r1','r2','r3','r4']

In [67]:
df=pd.DataFrame(data=matrix,
columns=cnames,
index=rnames)

In [68]:
df

Out[68]:
x y1 y2
r0 0.0 0.0 0.0
r1 1.0 2.0 1.5
r2 2.0 4.0 3.0
r3 3.0 6.0 4.5
r4 4.0 8.0 6.0

Examining the dataframe above, you've likely noticed an analogy with a spreadsheet. In this section we will look at how data is stored in a spreadsheet and read it into a dataframe.

CSV is an abbreviation for Comma Separated Values. When opened in Excel, CSV files look like the following.

When opened in Notepad++ (with hidden punctuation options displayed) they look like the following.

The comma separated file uses a comma as a delimiter (separator) as an instruction to move to the next column. This is analogous to the comma being used as a delimiter (separator) in a list, e.g. l1=[1,2,3,4], where the comma acts as a separator for each element in the list.

To move onto the next line we use the return key [↵].

In Notepad++ the hidden punctuation symbols can be viewed. These hidden punctuation symbols originally corresponded to typewriter commands. CR is an abbreviation for Carriage Return and is analogous to moving the typewriter's ink carriage from the end of a line on the right hand side back to the start on the left hand side of the page. LF is an abbreviation for Line Feed and would instruct the typewriter to shift the piece of paper up by the height of a line, so the typewriter is ready to begin typing on the next line.
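
The raw text of such a file can be sketched directly in Python, using io.StringIO as a stand-in for a file on disk:

```python
import io
import pandas as pd

# Commas move to the next column; CR LF ('\r\n') moves to the next line.
csv_text='a,b\r\n1,2\r\n2,4\r\n3,6\r\n'
df_sketch=pd.read_csv(io.StringIO(csv_text))
print(df_sketch)
```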

To access a file using only the file name, the file must be saved in the same folder as the JupyterLab Notebook. Otherwise the full file path must be specified.

In [69]:
csv_file=pd.read_csv('CSV1-NoHeaders.csv')


Note that Python uses 0th order indexing and Excel uses 1st order indexing. The 1st row of the Excel file was taken to be the column names instead of being read as the row at the 0th index.

In [70]:
csv_file

Out[70]:
1 2
0 2 4
1 3 6
2 4 8
3 5 10
4 6 12
5 7 14
6 8 16
7 9 18
8 10 20
9 11 22
10 12 24

We can override this default behaviour by use of the keyword input argument header, changing it from its default value 'infer' to None.

In [71]:
csv_file=pd.read_csv('CSV1-NoHeaders.csv',
header=None)

In [72]:
csv_file

Out[72]:
0 1
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

We can use the keyword input argument names to name the columns while we read the csv file.

In [73]:
cnames=['c0','c1']

In [74]:
csv_file=pd.read_csv('CSV1-NoHeaders.csv',
names=cnames)

In [75]:
csv_file

Out[75]:
c0 c1
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

We can also look at CSV file with headers.

In this case the default value 'infer' of the keyword input argument header can be used and the column names will be taken from row 1 (Excel 1st order indexing) of the CSV file.

In [76]:
csv_file=pd.read_csv('CSV2-Headers.csv')

In [77]:
csv_file

Out[77]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

The next CSV file contains comment rows that we want to skip: rows 2 and 3 (Excel 1st order indexing), which are rows 1 and 2 in Python. We can use the keyword input argument skiprows and provide a list of rows to skip.

In [78]:
csv_file=pd.read_csv('CSV3-Headers-Comments.csv',
skiprows=[1,2])

In [79]:
csv_file

Out[79]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

If the file is very large, we may not want to import it all and may instead want to examine a handful of rows as a preview. This can be done using the nrows keyword input argument. In this case we will assign it to 5.

In [80]:
csv_file=pd.read_csv('CSV3-Headers-Comments.csv',
skiprows=[1,2],
nrows=5)

In [81]:
csv_file

Out[81]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10

We can also select which columns we want using the keyword input argument usecols. In this case we only want the 2nd column with header b (Excel 1st order indexing), which is column 1 in Python (0th order indexing).

In [82]:
csv_file=pd.read_csv('CSV3-Headers-Comments.csv',
skiprows=[1,2],
nrows=5,
usecols=[1])

In [83]:
csv_file

Out[83]:
b
0 2
1 4
2 6
3 8
4 10

The CSV file may already have index names.

In [84]:
csv_file=pd.read_csv('CSV4-Headers-Index.csv')

In [85]:
csv_file

Out[85]:
Unnamed: 0 a b
0 r0 1 2
1 r1 2 4
2 r2 3 6
3 r3 4 8
4 r4 5 10
5 r5 6 12
6 r6 7 14
7 r7 8 16
8 r8 9 18
9 r9 10 20
10 r10 11 22
11 r11 12 24

In this case Column A (Excel) is displayed as an additional column and, as it wasn't assigned a name, it has automatically been referred to as "Unnamed: 0". We can use the keyword input argument index_col to assign column 0 (Excel column A) as the index.

In [86]:
csv_file=pd.read_csv('CSV4-Headers-Index.csv',
index_col=0)

In [87]:
csv_file

Out[87]:
a b
r0 1 2
r1 2 4
r2 3 6
r3 4 8
r4 5 10
r5 6 12
r6 7 14
r7 8 16
r8 9 18
r9 10 20
r10 11 22
r11 12 24

The function read_csv can be used with both CSV and TXT files. Let's first examine a TXT file in Excel, where it looks almost identical with the exception of the file extension.

Now let's have a look at it in Notepad++. Here we see a different delimiter, which displays as an arrow and denotes a tab ↹. To type a tab in Python we use '\t'.

If we try to read this file using the default keyword input arguments of read_csv:

In [88]:
txt_file=pd.read_csv('TXT-Headers.txt')

In [89]:
txt_file

Out[89]:
a\tb
0 1\t2
1 2\t4
2 3\t6
3 4\t8
4 5\t10
5 6\t12
6 7\t14
7 8\t16
8 9\t18
9 10\t20
10 11\t22
11 12\t24

We see we get only one column that has the datatype of object (usually meaning strings) and the \t is included in the column as part of the string.

In [90]:
txt_file.dtypes

Out[90]:
a\tb    object
dtype: object

This is because the read_csv function is not using \t as a delimiter and is instead looking for a , by default. We can override this default behaviour by using the keyword input argument delimiter (or its alias sep) and assigning it to '\t'.

In [91]:
txt_file=pd.read_csv('TXT-Headers.txt',
sep='\t')


txt_file

At first glance an XLSX file may look very similar to a CSV or TXT file when examined in Excel.

However, unlike a CSV or TXT file, which is just a single sheet of raw data, an Excel file may have multiple sheets and may also be formatted using different fonts, colors and font sizes. This collectively makes it a more complicated structure.

When opened in Notepad++ this complexity shows making the file hard to understand.

Fortunately we are not usually interested in all of these formatting options when reading data into a dataframe, and the function read_excel can be used for XLSX files and shares many of the keyword input arguments of read_csv.

In [92]:
xlsx_file=pd.read_excel('XLSX-Headers.xlsx')

In [93]:
xlsx_file

Out[93]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

In a multiple sheet workbook, the keyword input argument sheet_name can be used to specify the sheet. This keyword input argument can take the sheet name, in this case "Sheet1", as a string, or it can take an index. When using an index we use zero-order indexing in Python, so Sheet1 (1st order Excel indexing) will be index 0.

In [94]:
xlsx_file=pd.read_excel('XLSX-Headers.xlsx',
sheet_name='Sheet1')

In [95]:
xlsx_file

Out[95]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

Handling Missing Data¶

Unfortunately when importing data we can run into the situation where some of the data is missing. For example cell B3 has a division by 0 error, row 8 is missing, and cells A11 and B12 are missing.

In [96]:
xlsx_file=pd.read_excel('XLSX2-Headers-MissingData.xlsx')

In [97]:
xlsx_file

Out[97]:
a b
0 1.0 2.0
1 2.0 NaN
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 NaN NaN
7 8.0 16.0
8 9.0 18.0
9 NaN 20.0
10 11.0 NaN
11 12.0 24.0

The missing data displays as NaN, an abbreviation for Not a Number. The presence of NaN values can distort the data, especially when used in numerical calculations.

Checking for NaN Values¶

Let's look at how we can handle NaN values. Firstly we can use the pandas function isna to get a Boolean table that lets us know which cells contain missing data. Alternatively we can use the pandas function notna to let us know which cells contain good data.

In [98]:
pd.isna(xlsx_file)

Out[98]:
a b
0 False False
1 False True
2 False False
3 False False
4 False False
5 False False
6 True True
7 False False
8 False False
9 True False
10 False True
11 False False
In [99]:
pd.notna(xlsx_file)

Out[99]:
a b
0 True True
1 True False
2 True True
3 True True
4 True True
5 True True
6 False False
7 True True
8 True True
9 False True
10 True False
11 True True

If we assign the output to an object name, we can see that the new object is also a dataframe.

In [100]:
na_values=pd.isna(xlsx_file)

In [101]:
type(na_values)

Out[101]:
pandas.core.frame.DataFrame

Therefore we can use dataframe methods and attributes on it. Boolean data can be used in mathematical calculations and True corresponds to 1 while False corresponds to 0.

Let's use the method sum. To find out more details about the input arguments we can once again press shift ⇧ and tab ↹. We see that it is equivalent to the numpy method sum. The default value for the keyword input argument axis is None; when left as None the sum is performed down the index (rows). The keyword input argument can be set to 0 to work down the rows and 1 to work across the columns. We will leave all keyword input arguments at their default values.

In [102]:
na_values.sum()

Out[102]:
a    2
b    3
dtype: int64

We can combine the above in one line: the output of the pandas function pd.isna(xlsx_file) is a dataframe and .sum() is a dataframe method. This notation is a bit more concise, but pressing shift ⇧ and tab ↹ won't yield details on the sum method's input arguments like it would if the output were first assigned to a new object name and the method called from that object name.

In [103]:
pd.isna(xlsx_file).sum()

Out[103]:
a    2
b    3
dtype: int64
In [104]:
pd.notna(xlsx_file).sum()

Out[104]:
a    10
b     9
dtype: int64

The summation of NaN and not NaN values gives 12 values for each column as expected.
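
As a sanity check of this bookkeeping, sketched with a small stand-in dataframe rather than the file above:

```python
import numpy as np
import pandas as pd

small=pd.DataFrame({'a':[1.0,np.nan,3.0,np.nan],'b':[np.nan,2.0,3.0,4.0]})
total=pd.isna(small).sum()+pd.notna(small).sum()
print(total)   # each column sums to its full length, here 4
```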

Dropping NaN Values¶

We can use the dataframe method dropna to drop NaN values. Let's have a look at the input arguments using shift ⇧ and tab ↹.

The default keyword input arguments axis and how are set to 0 and 'any' respectively. This means dropna will act on rows and remove any row that has a NaN value.

In [105]:
xlsx_file.dropna()

Out[105]:
a b
0 1.0 2.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
7 8.0 16.0
8 9.0 18.0
11 12.0 24.0

We can update the keyword input argument how to be 'all' instead and in this case it will only remove rows where all the values are NaN values.

In [106]:
xlsx_file.dropna(how='all')

Out[106]:
a b
0 1.0 2.0
1 2.0 NaN
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
7 8.0 16.0
8 9.0 18.0
9 NaN 20.0
10 11.0 NaN
11 12.0 24.0

We can also use the keyword input argument subset, to restrict the search to only specific columns, in this case only column 'b'.

In [107]:
xlsx_file.dropna(how='any',subset=['b'])

Out[107]:
a b
0 1.0 2.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
7 8.0 16.0
8 9.0 18.0
9 NaN 20.0
11 12.0 24.0

Filling Values¶

We can also use the method fillna to replace the NaN with another value. We can use the keyword input argument value to replace all the blanks with a static value such as 0.

In [108]:
xlsx_file.fillna(value=0)

Out[108]:
a b
0 1.0 2.0
1 2.0 0.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 0.0 0.0
7 8.0 16.0
8 9.0 18.0
9 0.0 20.0
10 11.0 0.0
11 12.0 24.0

We can also use a dictionary to apply a different value to each column. This may be useful if the different columns have a different scale and we wish to use the values to denote saturation values for instance. The dictionary uses the column names as keys and the values are the desired fill value for each column's NaN values.

In [109]:
fill_dict={'a':0,'b':99}
xlsx_file.fillna(value=fill_dict)

Out[109]:
a b
0 1.0 2.0
1 2.0 99.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 0.0 99.0
7 8.0 16.0
8 9.0 18.0
9 0.0 20.0
10 11.0 99.0
11 12.0 24.0

The keyword input argument method can alternatively be used to fill in the NaN values algorithmically. It can be assigned to forward fill 'ffill' or backfill 'bfill': each missing value takes the previous non-missing value going forward down the array, or the next non-missing value going backward up the array.

In [110]:
xlsx_file.fillna(method='ffill')

Out[110]:
a b
0 1.0 2.0
1 2.0 2.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 6.0 12.0
7 8.0 16.0
8 9.0 18.0
9 9.0 20.0
10 11.0 20.0
11 12.0 24.0
In [111]:
xlsx_file.fillna(method='bfill')

Out[111]:
a b
0 1.0 2.0
1 2.0 6.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 8.0 16.0
7 8.0 16.0
8 9.0 18.0
9 11.0 20.0
10 11.0 24.0
11 12.0 24.0

Interpolation¶

We can instead use the dataframe method interpolate to interpolate across each column, provided the values are ordered, and fill in the missing values from the interpolated function. Let's have a look at the input arguments using shift ⇧ and tab ↹. Once again there is a keyword input argument method, which defaults to 'linear', meaning linear interpolation will be used.

Because the raw data was precisely linear, this performs a perfect interpolation as expected. We can update the keyword input argument method to 'quadratic' or 'cubic' for more complicated data.

In [112]:
xlsx_file.interpolate()

Out[112]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
In [113]:
xlsx_file.interpolate(method='quadratic')

Out[113]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
In [114]:
xlsx_file.interpolate(method='cubic')

Out[114]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
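
On the linear data above all three methods agree, so as a sketch of where they differ, consider hypothetical data lying on y=x**2 with one gap (the 'quadratic' method requires scipy to be installed):

```python
import numpy as np
import pandas as pd

s=pd.Series([0.0,1.0,4.0,np.nan,16.0,25.0])   # y=x**2 with the x=3 value missing
print(s.interpolate())                        # linear fill gives (4+16)/2 = 10
print(s.interpolate(method='quadratic'))      # recovers approximately 9
```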

In all the cases above the output was just printed to the console. To update the stored data, we can reassign the output to the object name xlsx_file.

In [115]:
xlsx_file=xlsx_file.interpolate()

In [116]:
xlsx_file

Out[116]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0

Writing Data to a File¶

Let's create a few additional dataframes from xlsx_file.

In [117]:
df=xlsx_file

In [118]:
df

Out[118]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
In [119]:
df2=2*df

In [120]:
df2

Out[120]:
a b
0 2.0 4.0
1 4.0 8.0
2 6.0 12.0
3 8.0 16.0
4 10.0 20.0
5 12.0 24.0
6 14.0 28.0
7 16.0 32.0
8 18.0 36.0
9 20.0 40.0
10 22.0 44.0
11 24.0 48.0
In [121]:
df3=3*df

In [122]:
df3

Out[122]:
a b
0 3.0 6.0
1 6.0 12.0
2 9.0 18.0
3 12.0 24.0
4 15.0 30.0
5 18.0 36.0
6 21.0 42.0
7 24.0 48.0
8 27.0 54.0
9 30.0 60.0
10 33.0 66.0
11 36.0 72.0

We can use the dataframe methods to_csv and to_excel to save the data to a file.

Writing a DataFrame to an Excel File¶

Let's look at writing to an Excel file first using the method to_excel. We can press ⇧ and tab ↹ to get details about the input arguments.

There are a large number of keyword input arguments. An ExcelWriter object needs to be created to write multiple sheets to a file. To write a single sheet, all we need to do is specify the name of the file. The sheet_name will use the default string 'Sheet1'.

In [123]:
df.to_excel('df_export_xlsx.xlsx')


The file can be found in the same folder as the script file.

Note the xlsx file has an index and a header which were created automatically from the dataframe. Setting the keyword input arguments index and header to None (or False) will remove these.

In [124]:
df.to_excel('df_export_xlsx2.xlsx',header=None,index=None)


Additional blank rows or columns can be added to the excel file by using the keyword input arguments startrow and startcol respectively. For example let's add 3 blank rows and 2 blank columns.

In [125]:
df.to_excel('df_export_xlsx3.xlsx',startrow=3,startcol=2)


Writing DataFrames to an Excel File with Multiple Sheets¶

Writing a multiple sheet file is slightly more complicated. Using the procedure above with different sheet names will simply overwrite the file each time as opposed to appending a sheet. Instead we must create a pandas ExcelWriter class. Recall when instantiating a class we need to use CamelCaseCapitalization.

The positional input argument path (which also works as a keyword input argument) must be assigned to a file name.

In [126]:
writer=pd.ExcelWriter(path='df_export_xlsx4_multisheet.xlsx')


Now we can use the dataframe method to_excel to write each dataframe to an excel sheet. Instead of specifying the file name we specify the writer object.

In [127]:
df.to_excel(writer,sheet_name='df')
df2.to_excel(writer,sheet_name='df2')
df3.to_excel(writer,sheet_name='df3')


The excel file will not be created until the writer's method save is used. (In newer versions of pandas, save has been removed; use writer.close() or open the writer in a with statement instead.) We can access the writer's methods by typing in the writer name followed by a dot . and then pressing tab ↹.

To look at the input arguments we can use shift ⇧ and tab ↹. In this case we see there are none, so we call the method without specifying any.

In [128]:
writer.save()


We see that the multi-sheet Excel file is created.

Writing a DataFrame to a CSV File¶

The to_csv dataframe method can be used to write to a csv or txt file. Its input arguments can be viewed by pressing shift ⇧ and tab ↹. CSV files are always a single sheet so there is no writer option, just the positional input argument path_or_buf. Some of the other keyword input arguments shown here are shared with the dataframe method to_excel; however most of them relate to formatting and default to CSV conventions.

In [129]:
df.to_csv('df_export_csv.csv')


We can view this in Notepad++ to see that it is a CSV file as expected.
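We can also check the round trip programmatically. The sketch below (using a small hypothetical dataframe and file name) writes a dataframe out with to_csv and reads it back with pd.read_csv, using the keyword input argument index_col=0 so the first CSV column is treated as the index rather than as data.

```python
import pandas as pd

# a small hypothetical dataframe for illustration
df_small = pd.DataFrame({'a': [2.0, 4.0], 'b': [4.0, 8.0]})
df_small.to_csv('df_roundtrip.csv')

# index_col=0 treats the first CSV column as the index
df_back = pd.read_csv('df_roundtrip.csv', index_col=0)
print(df_back.equals(df_small))  # True
```

Without index_col=0 the saved index would come back as an ordinary data column named 'Unnamed: 0'.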

Writing a DataFrame to a TXT File¶

The method to_csv can also be used to save a txt file.

In [130]:
df.to_csv('df_export_txt.txt')


The default keyword arguments are set up for CSV files, so the text file has the wrong delimiter: a comma as opposed to a tab ↹.

We can use the keyword input argument sep to specify a tab as a delimiter.

In [131]:
df.to_csv('df_export_txt2.txt',sep='\t')


Indexing¶

Let's create two basic dataframes, one with both columns labelled and one with both columns and indexes labelled.

In [132]:
from numpy import random
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)

In [133]:
df

Out[133]:
c0 c1 c2 c3 c4
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 10
3 1 6 7 7 8
4 1 5 9 8 9
5 4 3 0 3 5
6 0 2 3 8 1
7 3 3 3 7 0
8 1 9 9 0 10
9 4 7 3 2 7
10 2 0 0 4 5
11 5 6 8 4 1
In [134]:
df2

Out[134]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Indexing a Column¶

We have seen before that we can select a column using attribute indexing and square bracket indexing.

In [135]:
df.c0

Out[135]:
0     5
1     9
2     7
3     1
4     1
5     4
6     0
7     3
8     1
9     4
10    2
11    5
Name: c0, dtype: int32
In [136]:
df['c0']

Out[136]:
0     5
1     9
2     7
3     1
4     1
5     4
6     0
7     3
8     1
9     4
10    2
11    5
Name: c0, dtype: int32

Then we can select a cell from the column by using square bracket indexing with the row index.

In [137]:
df['c0'][1]

Out[137]:
9
In [138]:
df.c0[1]

Out[138]:
9

For the dataframe where the indexes are labelled, attribute indexing can also be used

In [139]:
df2.c0.r1

Out[139]:
9
In [140]:
df2['c0'].r1

Out[140]:
9

Indexing a Row¶

Let's now look at selecting a row directly. We can use the attribute loc (location) or iloc (integer location). loc selects by label (and also works with the default numeric index), whereas iloc always selects by integer position, whether or not the index is labelled.

In [141]:
df2.loc

Out[141]:
<pandas.core.indexing._LocIndexer at 0x1e32660f9f0>

When these are used, we get an instance of an Indexer object. We can index into this object using the row names.

In [142]:
df2.loc['r1']

Out[142]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: r1, dtype: int32

Note that the output is an instance of the Series class and is hence displayed as a column as opposed to a row. This can be seen if we assign the object to a variable name.

In [143]:
row1=df2.loc['r1']

In [144]:
type(row1)

Out[144]:
pandas.core.series.Series

Therefore we can access the series attributes and methods by typing in the object name followed by a dot . and then pressing tab ↹.

Because this is a series, we can index into it using the series index names (which in this case are the original dataframes column names).

In [145]:
row1.c0

Out[145]:
9
In [146]:
df2.loc['r1'].c0

Out[146]:
9
In [147]:
df2.loc['r1']['c0']

Out[147]:
9

For df there are no index strings, so loc works with the numeric index labels.

In [148]:
df.loc[1]

Out[148]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: 1, dtype: int32

The integer location attribute iloc will work with both df and df2 using integer positions.

In [149]:
df.iloc[1]

Out[149]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: 1, dtype: int32
In [150]:
df2.iloc[1]

Out[150]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: r1, dtype: int32

Integer location iloc can also be used to select multiple rows by using slices (similar to slicing a list). The output will be a dataframe.

In [151]:
df.iloc[3:5]

Out[151]:
c0 c1 c2 c3 c4
3 1 6 7 7 8
4 1 5 9 8 9

The slice 3:5 ("3 up to 5") starts at 3 and goes up to but excluding 5 in steps of 1 and therefore displays indexes 3 and 4.

Because the output is a dataframe a column can be indexed from it.

In [152]:
df.iloc[3:5].c0

Out[152]:
3    1
4    1
Name: c0, dtype: int32
In [153]:
df.iloc[3:5]['c0']

Out[153]:
3    1
4    1
Name: c0, dtype: int32
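Slicing is not limited to rows. The sketch below (recreating df with numpy's random module, which produces the same values with the same seed) shows that iloc accepts a row slice and a column slice in a single pair of square brackets.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(low=0, high=11, size=(12, 5)),
                  columns=['c' + str(i) for i in range(5)])

# rows 3 and 4, columns c0 and c1, in one iloc call
sub = df.iloc[3:5, 0:2]
print(sub.shape)  # (2, 2)
```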

Both loc and iloc can select a cell or a selection of cells using a single set of square brackets containing the row and the column.

In [154]:
df2.loc['r1','c0']

Out[154]:
9
In [155]:
df2.iloc[1,0]

Out[155]:
9

Indexing a Cell¶

There are also the attributes at and integer at (iat) which can be used to directly select a single cell; these are the preferred attributes to use when updating the value of a cell.

In [156]:
df2.at['r1','c0']

Out[156]:
9
In [157]:
df2

Out[157]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1
In [158]:
df2.at['r1','c0']=11

In [159]:
df2

Out[159]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 11 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1
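iat is the integer-position counterpart of at. A minimal sketch with a small hypothetical dataframe:

```python
import pandas as pd

df_small = pd.DataFrame({'c0': [5, 9], 'c1': [0, 3]}, index=['r0', 'r1'])

# iat selects a single cell by integer position rather than by label
print(df_small.iat[1, 0])  # 9

# like at, it can also update the cell in place
df_small.iat[1, 0] = 11
print(df_small.at['r1', 'c0'])  # 11
```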

Adding a New Column or Row¶

Let's have a look at the dataframe

In [160]:
df2

Out[160]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 11 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

We can create a new column from a series or array that has a length that matches the number of rows in the dataframe. We can index a new column name by use of square brackets.

In [161]:
df2['c5']=random.randint(low=0,high=11,size=12)

In [162]:
df2

Out[162]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6

Once the column has been created, it can be accessed by using the column name as an attribute.

In [163]:
df2.c5

Out[163]:
r0      4
r1      9
r2     10
r3     10
r4      8
r5      1
r6      1
r7      7
r8      9
r9      9
r10     3
r11     6
Name: c5, dtype: int32

Note that this attribute is created as a result of the column being created. We can only access attributes that already exist. Attribute dot notation cannot be used to create a new column: assigning to an attribute that is not an existing column sets an ordinary Python attribute (pandas issues a warning) rather than adding a column, so square bracket indexing must be used to create one.

A new row can be added by use of the attributes loc or iloc.

In [164]:
df2.loc['r12']=random.randint(low=0,high=11,size=6)

In [165]:
df2

Out[165]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6
r12 7 2 0 3 5 9

The Append Method¶

Alternatively the append method may be used to append a row to a dataframe. The append method can append a pandas Series or dictionary to the dataframe. To get details about the input arguments press shift ⇧ and tab ↹.

The positional input argument other is the row to be appended. When the keyword input argument ignore_index is set to True, other can be either a dictionary or a pandas Series; all the existing index labels will be ignored, giving a numeric index instead. The keys in the dictionary must correspond to the column names and the values correspond to the values to be added for the new row.

In [166]:
df2

Out[166]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6
r12 7 2 0 3 5 9
In [167]:
new_row={'c0':9,'c1':9,'c2':9,'c3':9,'c4':9,'c5':9}

In [168]:
df2.append(new_row,ignore_index=True)

Out[168]:
c0 c1 c2 c3 c4 c5
0 5 0 3 3 7 4
1 11 3 5 2 4 9
2 7 6 8 8 10 10
3 1 6 7 7 8 10
4 1 5 9 8 9 8
5 4 3 0 3 5 1
6 0 2 3 8 1 1
7 3 3 3 7 0 7
8 1 9 9 0 10 9
9 4 7 3 2 7 9
10 2 0 0 4 5 3
11 5 6 8 4 1 6
12 7 2 0 3 5 9
13 9 9 9 9 9 9

When ignore_index is set to False, only a series with a name can be appended; the name becomes the new row's index label and all the previous index names in the dataframe are unaltered.

In [169]:
new_series=pd.Series(data={'c0':9,'c1':9,'c2':9,'c3':9,'c4':9,'c5':9},name='r13')

In [170]:
new_series

Out[170]:
c0    9
c1    9
c2    9
c3    9
c4    9
c5    9
Name: r13, dtype: int64

Although the series displays as a column, it will be appended as a row.

In [171]:
df2.append(new_series)

Out[171]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6
r12 7 2 0 3 5 9
r13 9 9 9 9 9 9
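Note that the append method was deprecated in pandas 1.4 and removed in pandas 2.0. In newer versions the same result can be achieved with the function pd.concat; a minimal sketch with a small hypothetical dataframe:

```python
import pandas as pd

df_small = pd.DataFrame({'c0': [5, 9], 'c1': [0, 3]}, index=['r0', 'r1'])
new_series = pd.Series(data={'c0': 9, 'c1': 9}, name='r13')

# turn the named series into a one-row dataframe and concatenate along the rows
df_new = pd.concat([df_small, new_series.to_frame().T])
print(df_new.index.tolist())  # ['r0', 'r1', 'r13']
```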

Deleting a Column or a Row¶

The dataframe method drop can be used to delete a column or row. It can be used with a labels keyword input argument and an axis keyword input argument, where labels is the name of the row or column and axis works like many of the numpy functions: axis=0 acts on rows and axis=1 acts on columns. If used this way, labels and axis need to match. Alternatively it is simpler to use the index and columns keyword input arguments as they are a bit more intuitive.

Deleting a Column¶

To delete a column we can use the keyword input argument labels to specify a list of column name strings or a single column name string. We also need to use axis=1 to work on columns.

In [172]:
df.drop(labels=['c0','c1'],axis=1)

Out[172]:
c2 c3 c4
0 3 3 7
1 5 2 4
2 8 8 10
3 7 7 8
4 9 8 9
5 0 3 5
6 3 8 1
7 3 7 0
8 9 0 10
9 3 2 7
10 0 4 5
11 8 4 1

Deleting a Row¶

To delete a row we can use the keyword input argument labels to specify a list of row name strings (if labelled) or integer indexes (if unlabelled), or a single row name string or index. We also need to use axis=0 to work on rows. Here df has no row labels so the integer index 10 is used.

In [173]:
df.drop(labels=10,axis=0)

Out[173]:
c0 c1 c2 c3 c4
0 5 0 3 3 7
1 11 3 5 2 4
2 7 6 8 8 10
3 1 6 7 7 8
4 1 5 9 8 9
5 4 3 0 3 5
6 0 2 3 8 1
7 3 3 3 7 0
8 1 9 9 0 10
9 4 7 3 2 7
11 5 6 8 4 1

Deleting Columns and Rows¶

The above syntax can delete only columns or rows. The keyword input arguments index and columns are more intuitive and can be used to delete both simultaneously.

In [174]:
df.drop(index=10,columns=['c0','c2'])

Out[174]:
c1 c3 c4
0 0 3 7
1 3 2 4
2 6 8 10
3 6 7 8
4 5 8 9
5 3 3 5
6 2 8 1
7 3 7 0
8 9 0 10
9 7 2 7
11 6 4 1

So far the output has only been printed to the console and the dataframe itself has not been updated. To perform an inplace update we can use the keyword input argument inplace and assign it to True.

In [175]:
df2.drop(index='r10',columns=['c0','c2'],inplace=True)

In [176]:
df2

Out[176]:
c1 c3 c4 c5
r0 0 3 7 4
r1 3 2 4 9
r2 6 8 10 10
r3 6 7 8 10
r4 5 8 9 8
r5 3 3 5 1
r6 2 8 1 1
r7 3 7 0 7
r8 9 0 10 9
r9 7 2 7 9
r11 6 4 1 6
r12 2 3 5 9

Alternatively we can reassign the output to the dataframes name.

In [177]:
df2=df2.drop(index=['r1','r9'],columns=['c1','c3'])

In [178]:
df2

Out[178]:
c4 c5
r0 7 4
r2 10 10
r3 8 10
r4 9 8
r5 5 1
r6 1 1
r7 0 7
r8 10 9
r11 1 6
r12 5 9

When the keyword input argument inplace is set to True, an inplace update to the dataframe will occur and the method will therefore have no return value, i.e. nothing is printed to the cell. Care should be taken to avoid using both the keyword input argument inplace and reassignment of the object name, as this will assign a NoneType instead of an updated dataframe.

In [179]:
df2=df2.drop(index=['r3','r4'],inplace=True)

In [180]:
df2


We can see that we get nothing when we ask for the output of df2. If we check its type we can see it is NoneType.

In [181]:
type(df2)

Out[181]:
NoneType

Reordering Columns or Rows¶

Let's first restore our dataframes.

In [182]:
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)

In [183]:
df2

Out[183]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Reordering Columns¶

We can reorder the columns in a dataframe by indexing into it with a reordered list of column strings. For example, if we want c0 to be at the end we would use:

In [184]:
newcolorder=['c1','c2','c3','c4','c0']

In [185]:
df2[newcolorder]

Out[185]:
c1 c2 c3 c4 c0
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

To update df2, we can reassign the object name df2 to the output.

In [186]:
df2=df2[newcolorder]

In [187]:
df2

Out[187]:
c1 c2 c3 c4 c0
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

Reordering Rows¶

To reorder rows, we can use the attribute loc or iloc.

In [188]:
newroworder=['r1','r2','r3','r4','r5','r6','r7','r8','r9','r10','r11','r0']

In [189]:
df2.loc[newroworder]

Out[189]:
c1 c2 c3 c4 c0
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5
r0 0 3 3 7 5

Note we can also use this approach to drop rows by making only a partial selection.

In [190]:
newrowselorder=['r1','r2','r3','r4','r0']

In [191]:
df2.loc[newrowselorder]

Out[191]:
c1 c2 c3 c4 c0
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r0 0 3 3 7 5

Relabelling Columns or Rows¶

Relabelling Columns¶

Column names can be accessed by use of the columns attribute.

In [192]:
df2.columns

Out[192]:
Index(['c1', 'c2', 'c3', 'c4', 'c0'], dtype='object')

The columns can be renamed by assigning the columns attribute to a new list, or Index, of equal length to the number of existing columns.

In [193]:
new_cols=['c_1','c_2','c_3','c_4','c_0']

In [194]:
df2.columns=new_cols


In [195]:
df2

Out[195]:
c_1 c_2 c_3 c_4 c_0
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

The above requires a list of all the columns. It is possible to rename some columns without specifying them all using the dataframe method rename. To get more details about the method's input arguments, press shift ⇧ and tab ↹.

We can use the keyword columns and assign it to a dictionary where the dictionary keys consist of the old column names and the dictionary values correspond to the new column names. In this case we wish to rename 'c_0' to 'y'.

In [196]:
col_rename={'c_0':'y'}
df2.rename(columns=col_rename)

Out[196]:
c_1 c_2 c_3 c_4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

To apply the changes we can either reassign the output to df2 or alternatively use the keyword input argument inplace, setting it to True. Recall that using the keyword input argument inplace will perform an inplace update of the dataframe and the method will return nothing. Therefore setting inplace to True and reassigning the output to the dataframe name will lead to a NoneType object as opposed to the updated dataframe.

In [197]:
df2.rename(columns=col_rename,inplace=True)


We can also use the str accessor on the dataframe attribute columns to access a number of string methods that act on the column names.

In [198]:
df2.columns.str

Out[198]:
<pandas.core.strings.accessor.StringMethods at 0x1e3265c8ac0>

We can type in this attribute followed by a dot . and then access the list of string attributes and methods by pressing tab ↹.

We can use the str method capitalize to capitalize the first letter of each column name. To get details about the input arguments press shift ⇧ and tab ↹.

In this case the method capitalize has no input arguments.

In [199]:
df2.columns.str.capitalize()

Out[199]:
Index(['C_1', 'C_2', 'C_3', 'C_4', 'Y'], dtype='object')

The output displays in the cell. To update the columns the columns attribute of the dataframe can be reassigned.

In [200]:
df2.columns=df2.columns.str.capitalize()


The dataframe now has capitalized column names.

In [201]:
df2

Out[201]:
C_1 C_2 C_3 C_4 Y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

The str methods upper and lower will set every letter in the column name to be upper and lower case respectively.

In [202]:
df2.columns.str.upper()

Out[202]:
Index(['C_1', 'C_2', 'C_3', 'C_4', 'Y'], dtype='object')
In [203]:
df2.columns.str.lower()

Out[203]:
Index(['c_1', 'c_2', 'c_3', 'c_4', 'y'], dtype='object')

Reassigning the dataframe attribute to the method's output:

In [204]:
df2.columns=df2.columns.str.lower()


Looking at the output of the dataframe

In [205]:
df2

Out[205]:
c_1 c_2 c_3 c_4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

Sometimes the str method replace is useful if column names are read from a file and contain spaces or underscores that we wish to remove. To get details about the input arguments press shift ⇧ and tab ↹.

The positional input arguments pat and repl are abbreviations for pattern and replacement. These can also be explicitly expressed as keyword input arguments. We can look at replacing c_ with c.

In [206]:
df2.columns.str.replace(pat='c_',repl='c')

Out[206]:
Index(['c1', 'c2', 'c3', 'c4', 'y'], dtype='object')
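The pattern here is a literal string, but replace can also interpret pat as a regular expression via the keyword input argument regex (whose default has changed between pandas versions). Passing regex=False makes the literal intent explicit; a small sketch:

```python
import pandas as pd

cols = pd.Index(['c_1', 'c_2', 'c_3', 'c_4', 'y'])

# regex=False treats pat as a plain string rather than a regular expression
print(cols.str.replace('c_', 'c', regex=False).tolist())  # ['c1', 'c2', 'c3', 'c4', 'y']
```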


To apply the changes we can assign the output to the dataframe columns attribute.

In [207]:
df2.columns=df2.columns.str.replace(pat='c_',repl='c')


Let's see the updated dataframe.

In [208]:
df2

Out[208]:
c1 c2 c3 c4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

Relabelling Rows¶

The dataframe row names or index can be accessed using the attribute index.

In [209]:
df2.index

Out[209]:
Index(['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10',
'r11'],
dtype='object')

As both the attributes columns and index are instances of the Index class, the same attributes and methods can be accessed from them.

In [210]:
type(df2.index)

Out[210]:
pandas.core.indexes.base.Index
In [211]:
type(df2.columns)

Out[211]:
pandas.core.indexes.base.Index

This can be reassigned to a new list of string names of equal length as seen earlier with columns.

In [212]:
new_rows=['r_'+str(r) for r in range(12)]
new_rows

Out[212]:
['r_0',
'r_1',
'r_2',
'r_3',
'r_4',
'r_5',
'r_6',
'r_7',
'r_8',
'r_9',
'r_10',
'r_11']
In [213]:
df2.index=new_rows


Viewing the dataframe, we can see the changes have been made.

In [214]:
df2

Out[214]:
c1 c2 c3 c4 y
r_0 0 3 3 7 5
r_1 3 5 2 4 9
r_2 6 8 8 10 7
r_3 6 7 7 8 1
r_4 5 9 8 9 1
r_5 3 0 3 5 4
r_6 2 3 8 1 0
r_7 3 3 7 0 3
r_8 9 9 0 10 1
r_9 7 3 2 7 4
r_10 0 0 4 5 2
r_11 6 8 4 1 5

We can also rename select row names using a dictionary and the dataframe method rename with the keyword input argument index.

In [215]:
row_rename={'r_11':'x'}
df2=df2.rename(index=row_rename)


Viewing the dataframe, once again we can see that the changes have been made.

In [216]:
df2

Out[216]:
c1 c2 c3 c4 y
r_0 0 3 3 7 5
r_1 3 5 2 4 9
r_2 6 8 8 10 7
r_3 6 7 7 8 1
r_4 5 9 8 9 1
r_5 3 0 3 5 4
r_6 2 3 8 1 0
r_7 3 3 7 0 3
r_8 9 9 0 10 1
r_9 7 3 2 7 4
r_10 0 0 4 5 2
x 6 8 4 1 5

str methods such as replace can once again be accessed from instances of the Index class, allowing one to replace a repeating pattern.

In [217]:
df2.index=df2.index.str.replace(pat='r_',repl='r')

In [218]:
df2

Out[218]:
c1 c2 c3 c4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
x 6 8 4 1 5

Sorting Data¶

Let's recreate our dataframes and have a look at how we can sort data.

In [219]:
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)

In [220]:
df2

Out[220]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

We can use the dataframe method sort_values to sort a dataframe. To get details about the input arguments we can press shift ⇧ and tab ↹.

The positional input argument by (which also acts as a keyword input argument) can be a list of str column names or row names. The keyword input argument axis can either be 0 (sorting the rows within a column) or 1 (sorting the columns within a row); both by and axis must match. The keyword input argument ascending has a default value of True meaning the data will be sorted from lowest to highest; if it is assigned to False it will instead sort in descending order. The keyword input argument inplace can be set to True to apply inplace changes.

Sorting by Rows within a Column¶

axis has a default value of 0 meaning the method sort_values will default to sorting all the rows within a column. The code below will sort all the rows within column c0.

In [221]:
df2.sort_values(by=['c0'])

Out[221]:
c0 c1 c2 c3 c4
r6 0 2 3 8 1
r3 1 6 7 7 8
r4 1 5 9 8 9
r8 1 9 9 0 10
r10 2 0 0 4 5
r7 3 3 3 7 0
r5 4 3 0 3 5
r9 4 7 3 2 7
r0 5 0 3 3 7
r11 5 6 8 4 1
r2 7 6 8 8 10
r1 9 3 5 2 4

We see that c0 has some duplicate values (r3, r4 and r8 all contain 1) and these have been sorted according to their original row order. We can instead break ties using column c1, and if there are duplicate values in both c0 and c1 we can then sort by c2. To do this we specify a list containing the 0th, 1st and 2nd columns to sort the values by.

In [222]:
df2.sort_values(by=['c0','c1','c2'])

Out[222]:
c0 c1 c2 c3 c4
r6 0 2 3 8 1
r4 1 5 9 8 9
r3 1 6 7 7 8
r8 1 9 9 0 10
r10 2 0 0 4 5
r7 3 3 3 7 0
r5 4 3 0 3 5
r9 4 7 3 2 7
r0 5 0 3 3 7
r11 5 6 8 4 1
r2 7 6 8 8 10
r1 9 3 5 2 4

Alternatively if we wanted to sort in descending order we can set the keyword argument ascending to be False.

In [223]:
df2.sort_values(by=['c0','c1','c2'],ascending=False)

Out[223]:
c0 c1 c2 c3 c4
r1 9 3 5 2 4
r2 7 6 8 8 10
r11 5 6 8 4 1
r0 5 0 3 3 7
r9 4 7 3 2 7
r5 4 3 0 3 5
r7 3 3 3 7 0
r10 2 0 0 4 5
r8 1 9 9 0 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r6 0 2 3 8 1
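ascending can also be assigned to a list of bool values, one per entry in by, to mix sort directions. A minimal sketch with a small hypothetical dataframe:

```python
import pandas as pd

df_small = pd.DataFrame({'c0': [1, 1, 2], 'c1': [5, 9, 0]},
                        index=['r3', 'r8', 'r10'])

# sort c0 ascending, then break ties with c1 descending
out = df_small.sort_values(by=['c0', 'c1'], ascending=[True, False])
print(out.index.tolist())  # ['r8', 'r3', 'r10']
```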

To apply the changes we can either reassign the output to df2 or set the inplace keyword input argument to True. Recall that when the keyword input argument inplace is True the method will not return a value and therefore if the output is reassigned to the variable name and the keyword input argument inplace is assigned to True, NoneType will be assigned to the object name.

In [224]:
df2.sort_values(by=['c0','c1','c2'],ascending=False,inplace=True)


We can see the changes are made

In [225]:
df2

Out[225]:
c0 c1 c2 c3 c4
r1 9 3 5 2 4
r2 7 6 8 8 10
r11 5 6 8 4 1
r0 5 0 3 3 7
r9 4 7 3 2 7
r5 4 3 0 3 5
r7 3 3 3 7 0
r10 2 0 0 4 5
r8 1 9 9 0 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r6 0 2 3 8 1

The dataframe method sort_index can be used to sort the rows by the index name. By default the keyword input argument axis is set to 0 meaning it acts on the row names (index). The keyword input arguments ascending and inplace are also present in this method.

Note that the str in the index are sorted alphabetically ('r0', 'r1', 'r10', 'r11', 'r2', ...) and not naturally sorted ('r0', 'r1', 'r2', ..., 'r11').

In [226]:
df2.sort_index()

Out[226]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r10 2 0 0 4 5
r11 5 6 8 4 1
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7

Natural sorting is unfortunately not natively supported within pandas but can be carried out using the third-party library natsort.

To use the third-party library natsort it must be installed. Open the Anaconda Powershell prompt and type in.

conda install natsort

Let's look at the dataframe.

In [227]:
df2

Out[227]:
c0 c1 c2 c3 c4
r1 9 3 5 2 4
r2 7 6 8 8 10
r11 5 6 8 4 1
r0 5 0 3 3 7
r9 4 7 3 2 7
r5 4 3 0 3 5
r7 3 3 3 7 0
r10 2 0 0 4 5
r8 1 9 9 0 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r6 0 2 3 8 1

We can import natsorted which we can use to naturally reorder the row names in the index.

In [228]:
from natsort import natsorted
natsorted(df2.index)

Out[228]:
['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10', 'r11']

Once we have them we can use them to select rows from the dataframe using the loc attribute. This gives the dataframe in natural order.

In [229]:
df2.loc[natsorted(df2.index)]

Out[229]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Recall that an index can be thought of as being in the form of a dictionary, where the keys are numeric positions and the values are the row names.

In [230]:
df2.index.values

Out[230]:
array(['r1', 'r2', 'r11', 'r0', 'r9', 'r5', 'r7', 'r10', 'r8', 'r3', 'r4',
'r6'], dtype=object)

The key 3 for instance yields the str value 'r0'.

In [231]:
df2.index[3]

Out[231]:
'r0'

We can use index_natsorted to get the ordered numeric keys corresponding to each of the row name values.

In [232]:
from natsort import index_natsorted
index_natsorted(df2.index)

Out[232]:
[3, 0, 1, 9, 10, 5, 11, 6, 8, 4, 7, 2]

Once we have them we can use them to select rows from the dataframe using the iloc attribute. This gives the dataframe in natural order.

In [233]:
df2.iloc[index_natsorted(df2.index)]

Out[233]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Going further we can use the numpy function argsort.

In [234]:
np.argsort(index_natsorted(df2.index))

Out[234]:
array([ 1,  2, 11,  0,  9,  5,  7, 10,  8,  3,  4,  6], dtype=int64)

Comparing with the dataframe attribute index, we can see this returns, for each row in its current position, its rank in the naturally sorted order as a numeric int value.

In [235]:
df2.index

Out[235]:
Index(['r1', 'r2', 'r11', 'r0', 'r9', 'r5', 'r7', 'r10', 'r8', 'r3', 'r4',
'r6'],
dtype='object')

Returning to the dataframe method sort_index. We can use the keyword input argument key and assign it to a lambda function with a single dummy argument x and the expression f.

In [236]:
f=np.argsort(index_natsorted(df2.index))
df2.sort_index(key= lambda x: f)

Out[236]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

In the line above, no matter what is supplied for the dummy argument x, the same output displays; x does not appear within the expression, so the expression is independent of the value of x.

In [237]:
la1=lambda x: f
la1(1)

Out[237]:
array([ 1,  2, 11,  0,  9,  5,  7, 10,  8,  3,  4,  6], dtype=int64)
In [238]:
la1('a')

Out[238]:
array([ 1,  2, 11,  0,  9,  5,  7, 10,  8,  3,  4,  6], dtype=int64)
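For simple labels like 'r0' to 'r11' the dummy argument can be put to use, avoiding natsort and the precomputed f entirely: the key function receives the index itself, so we can strip the leading 'r' and sort by the remaining number. This is a sketch that assumes every label is a single letter followed by digits.

```python
import pandas as pd

df_small = pd.DataFrame({'c0': [9, 2, 5]}, index=['r1', 'r10', 'r0'])

# the key function receives the index; sort by its numeric part
out = df_small.sort_index(key=lambda idx: idx.str[1:].astype(int))
print(out.index.tolist())  # ['r0', 'r1', 'r10']
```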

The dataframe method sort_index will then use these numeric keys as opposed to the str values of the index to sort the dataframe.

To apply the changes inplace, the keyword argument inplace can be set to True.

In [239]:
df2.sort_index(key= lambda x: f,inplace=True)


Sorting by Columns within a Row¶

Let's have a look at the dataframe.

In [240]:
df2

Out[240]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

axis has a default value of 0 meaning the dataframe method sort_values will default to sorting all the rows within a column. To sort all the columns within a row, axis should be set to 1.

In [241]:
df2.sort_values(by=['r0','r1','r2'],axis=1)

Out[241]:
c1 c3 c2 c0 c4
r0 0 3 3 5 7
r1 3 2 5 9 4
r2 6 8 8 7 10
r3 6 7 7 1 8
r4 5 8 9 1 9
r5 3 3 0 4 5
r6 2 8 3 0 1
r7 3 7 3 3 0
r8 9 0 9 1 10
r9 7 2 3 4 7
r10 0 4 0 2 5
r11 6 4 8 5 1

The dataframe method sort_index can also be used with the keyword input argument axis set to 1 to sort the dataframe using the column names.

In [242]:
df2.sort_index(axis=1)

Out[242]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

DataTypes¶

Let's recreate our dataframe.

In [243]:
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)


The dataframe method head can be used to preview the top n rows in a dataframe. To get details about the input arguments press shift ⇧ and tab ↹. The dataframe method head has a keyword input argument n which has a default value of 5.

We see the top 5 rows (0 to 5 in steps of 1 inclusive of the lower bound and exclusive of the upper bound).

In [244]:
df2.head()

Out[244]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9

Alternatively we can use the dataframe method tail to see the last n rows in a dataframe.

In [245]:
df2.tail()

Out[245]:
c0 c1 c2 c3 c4
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

We can use the dataframe method describe to get statistical details about each column in a dataframe. To get details about the input arguments press shift ⇧ and tab ↹. We will leave them all as default.

In [246]:
df.describe()

Out[246]:
c0 c1 c2 c3 c4
count 12.000000 12.000000 12.000000 12.000000 12.000000
mean 3.500000 4.166667 4.833333 4.666667 5.583333
std 2.713602 2.790677 3.298301 2.806918 3.528026
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 2.750000 3.000000 2.750000 3.250000
50% 3.500000 4.000000 4.000000 4.000000 6.000000
75% 5.000000 6.000000 8.000000 7.250000 8.250000
max 9.000000 9.000000 9.000000 8.000000 10.000000

In a dataframe each column (which is an instance of a pandas series class) has an assigned datatype. This can be checked by using the dataframe attribute dtypes.

In [247]:
df2.dtypes

Out[247]:
c0    int32
c1    int32
c2    int32
c3    int32
c4    int32
dtype: object

We see that all the columns are int32 (the default integer type on Windows; on other platforms this may be int64), which is expected as the columns were automatically created using the randint function.

Be careful when working with different datatypes as you may get unexpected results. For example, let's attempt to reassign the cell 'r0', 'c1' (in the integer series 'c1') to the float 21.2.

In [248]:
df2.at['r0','c1']=21.2


We get the int 21 when we read back the cell. This is because the pandas series 'c1' uses int32, so when a float is assigned to a value in this series, anything following the decimal point is truncated.

In [249]:
df2.at['r0','c1']

Out[249]:
21

We can look at a series directly and use the series dtype to return the dtype of a single column.

In [250]:
df2.c1.dtype

Out[250]:
dtype('int32')

We can then use the method astype to change the dtype. Its first positional input argument (which also acts as a keyword input argument) dtype selects the new dtype. We can assign it to the str 'float', 'bool' or 'str'. Alternatively, as all of these (including 'int') are classes, we can assign the class directly, i.e. int, float, bool or str (without quotations).

Note that a decimal point followed by a 0 displays to emphasize that each value is a float.

In [251]:
df2.c1.astype(dtype='float')

Out[251]:
r0     21.0
r1      3.0
r2      6.0
r3      6.0
r4      5.0
r5      3.0
r6      2.0
r7      3.0
r8      9.0
r9      7.0
r10     0.0
r11     6.0
Name: c1, dtype: float64

When converting to bool, any non-zero value becomes True and any 0 value becomes False.

In [252]:
df2.c1.astype(dtype='bool')

Out[252]:
r0      True
r1      True
r2      True
r3      True
r4      True
r5      True
r6      True
r7      True
r8      True
r9      True
r10    False
r11     True
Name: c1, dtype: bool

When converting to str there is no immediate difference in the output to the console; the dtype however will be object.

In [253]:
df2.c1.astype(dtype='str')

Out[253]:
r0     21
r1      3
r2      6
r3      6
r4      5
r5      3
r6      2
r7      3
r8      9
r9      7
r10     0
r11     6
Name: c1, dtype: object

To perform an inplace update, we need to reassign the output to the pandas series. For this we can use attribute indexing or square bracket indexing to reference an existing series.

In [254]:
df2.c1=df2.c1.astype(dtype='float')


Recall however if we want to add a new series to the dataframe we have to use square bracket indexing.

In [255]:
df2['c5']=df2.c1.astype(dtype='float')


Let's return the original series back to an int datatype.

In [256]:
df2.c1=df2.c1.astype(dtype='int')


Let's now have a look at the dataframe.

In [257]:
df2

Out[257]:
c0 c1 c2 c3 c4 c5
r0 5 21 3 3 7 21.0
r1 9 3 5 2 4 3.0
r2 7 6 8 8 10 6.0
r3 1 6 7 7 8 6.0
r4 1 5 9 8 9 5.0
r5 4 3 0 3 5 3.0
r6 0 2 3 8 1 2.0
r7 3 3 3 7 0 3.0
r8 1 9 9 0 10 9.0
r9 4 7 3 2 7 7.0
r10 2 0 0 4 5 0.0
r11 5 6 8 4 1 6.0

If we now update the cells.

In [258]:
df2.at['r0','c1']=21.2
df2.at['r0','c5']=21.2


We see that the column c1 is an int and c5 is a float.

In [259]:
df2

Out[259]:
c0 c1 c2 c3 c4 c5
r0 5 21 3 3 7 21.2
r1 9 3 5 2 4 3.0
r2 7 6 8 8 10 6.0
r3 1 6 7 7 8 6.0
r4 1 5 9 8 9 5.0
r5 4 3 0 3 5 3.0
r6 0 2 3 8 1 2.0
r7 3 3 3 7 0 3.0
r8 1 9 9 0 10 9.0
r9 4 7 3 2 7 7.0
r10 2 0 0 4 5 0.0
r11 5 6 8 4 1 6.0

A row can be selected from the dataframe (as a pandas series) using the attribute loc. The dtype of the pandas series will be the most complicated datatype. In this case since one of the elements in the series is a float the datatype of the pandas series will therefore be float.

In [260]:
df2.loc['r0']

Out[260]:
c0     5.0
c1    21.0
c2     3.0
c3     3.0
c4     7.0
c5    21.2
Name: r0, dtype: float64

If an additional panda series of the datatype str is added to the dataframe.

In [261]:
df2['c5']=df2.c1.astype(dtype='str')


The row (which is taken as a pandas series) will take the most complicated datatype which is the str and therefore the datatype will display as an object.

In [262]:
df2.loc['r0']

Out[262]:
c0     5
c1    21
c2     3
c3     3
c4     7
c5    21
Name: r0, dtype: object

Additional datatypes are available such as the category datatype for working with categorical data and the datetime or timedelta datatypes when working with time or time differences respectively.

Category DataType¶

Let's create a dataframe emulating a user selecting and throwing a die from a box of randomly colored dice. Each die has 6 numbers and 3 colors are available.

In [263]:
random.seed(0)
dice=random.choice((1,2,3,4,5,6),size=12)
color=random.choice(('red','green','blue'),size=12)
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=dice,columns=['dice'],
index=rnames)
df['color']=color

In [264]:
df

Out[264]:
dice color
r0 5 red
r1 6 red
r2 1 blue
r3 4 green
r4 4 blue
r5 4 blue
r6 2 red
r7 4 green
r8 6 green
r9 3 green
r10 5 green
r11 1 red

Let's use the dataframe attribute dtypes to look at the datatypes of each column. The datatypes are int and object (str).

In [265]:
df.dtypes

Out[265]:
dice      int32
color    object
dtype: object

Setting the DataType to Category¶

To use the category datatype we need to assign the keyword input argument dtype to the str 'category'.

In [266]:
df.dice.astype(dtype='category')

Out[266]:
r0     5
r1     6
r2     1
r3     4
r4     4
r5     4
r6     2
r7     4
r8     6
r9     3
r10    5
r11    1
Name: dice, dtype: category
Categories (6, int64): [1, 2, 3, 4, 5, 6]

Note that unlike int, float, bool and str used earlier, the category datatype is not a separate class in core Python, so using category without quotations will give a NameError.
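
The same dtype can also be specified via the pandas class pd.CategoricalDtype, which additionally lets us state the categories up front. A minimal sketch (the values and index are assumed for illustration):

```python
import pandas as pd

# pd.CategoricalDtype is the class behind the 'category' dtype; an instance
# of it can be passed to astype in place of the str 'category'.
s = pd.Series([5, 6, 1, 4], index=['r0', 'r1', 'r2', 'r3'])
cat_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5, 6])
s2 = s.astype(cat_type)
print(s2.dtype)                 # category
print(list(s2.cat.categories))  # [1, 2, 3, 4, 5, 6]
```

Note that all 6 dice categories are retained, even though some were never rolled in the data.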

We can reassign both the series dice and color to the new datatype.

In [267]:
df.dice=df.dice.astype(dtype='category')

In [268]:
df.color=df.color.astype(dtype='category')


Now if we check the datatypes we can see that both series are of the type category.

In [269]:
df.dtypes

Out[269]:
dice     category
color    category
dtype: object

A category pandas series will have the attribute cat which can be used to access category attributes and methods.

In [270]:
df.color.cat

Out[270]:
<pandas.core.arrays.categorical.CategoricalAccessor object at 0x000001E3266D7490>

To get a list of these attributes and methods, type in an additional dot . followed by a tab ↹.

Categories¶

The categories attribute will display the categories.

In [271]:
df.color.cat.categories

Out[271]:
Index(['blue', 'green', 'red'], dtype='object')

The categories are by default alphabetically ordered. Custom ordering can be applied; whether it has been can be checked using the attribute ordered, which returns False in this case as no custom ordering has been applied.

In [272]:
df.color.cat.ordered

Out[272]:
False

The attribute codes assigns an integer to each category. In this case blue is 0, green is 1 and red is 2 as the categories are, by default, alphabetically ordered.

In [273]:
df.color.cat.codes

Out[273]:
r0     2
r1     2
r2     0
r3     1
r4     0
r5     0
r6     2
r7     1
r8     1
r9     1
r10    1
r11    2
dtype: int8
In [274]:
df.color

Out[274]:
r0       red
r1       red
r2      blue
r3     green
r4      blue
r5      blue
r6       red
r7     green
r8     green
r9     green
r10    green
r11      red
Name: color, dtype: category
Categories (3, object): ['blue', 'green', 'red']
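
The mapping works in both directions: given the codes and the category labels, the class method pd.Categorical.from_codes rebuilds the values. A short sketch (codes taken from the first four rows above):

```python
import pandas as pd

# Rebuild categorical values from integer codes plus the category labels.
codes = [2, 2, 0, 1]             # red, red, blue, green
cats = ['blue', 'green', 'red']
rebuilt = pd.Categorical.from_codes(codes, categories=cats)
print(list(rebuilt))  # ['red', 'red', 'blue', 'green']
```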

We can index into a cell that belongs to a pandas category series to get the assigned category.

In [275]:
df.color.r0

Out[275]:
'red'

And we can replace it with an existing category using a str format.

In [276]:
df.color.r0='green'


By default we cannot replace it with a new category, for example if we try adding the value 'yellow' we get a ValueError.

This is set up this way by default to prevent users from accidentally inputting several variations of what should be a single category, for example through inconsistent capitalization, or by omitting a space or replacing it with a hyphen or dash. These are common human errors when dealing with categorical data and the likes of Excel spreadsheets.
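
A minimal sketch of the rejected assignment (the series values are assumed; recent pandas versions raise TypeError rather than ValueError here, so both are caught):

```python
import pandas as pd

s = pd.Series(['red', 'green', 'blue'], dtype='category')
try:
    s[0] = 'yellow'  # 'yellow' is not an existing category
except (ValueError, TypeError) as e:
    print('assignment rejected:', e)
print(s[0])  # still 'red'
```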

Let's have a look at the existing categories.

In [277]:
df.color.cat.categories

Out[277]:
Index(['blue', 'green', 'red'], dtype='object')

Let's use the cat method add_categories to add the category 'yellow'. Type in the method and then press shift ⇧ and tab ↹ to get details about the input arguments. We see that we can specify either a single category or a list of categories.

In [278]:
df.color.cat.add_categories('yellow')

Out[278]:
r0     green
r1       red
r2      blue
r3     green
r4      blue
r5      blue
r6       red
r7     green
r8     green
r9     green
r10    green
r11      red
Name: color, dtype: category
Categories (4, object): ['blue', 'green', 'red', 'yellow']

To apply an inplace update we will need to reassign the output to the series df.color

In [279]:
df.color=df.color.cat.add_categories('yellow')


Now we can select the cell within the column color and the row r0 and update it to 'yellow'.

In [280]:
df.color.r0='yellow'

In [281]:
df.color

Out[281]:
r0     yellow
r1        red
r2       blue
r3      green
r4       blue
r5       blue
r6        red
r7      green
r8      green
r9      green
r10     green
r11       red
Name: color, dtype: category
Categories (4, object): ['blue', 'green', 'red', 'yellow']

Renaming Categories¶

Let's have a look at our categories.

In [282]:
df.color.cat.categories

Out[282]:
Index(['blue', 'green', 'red', 'yellow'], dtype='object')

We can rename them, for example to 1 letter characters using the cat method rename_categories. To get details about the input arguments type in the method and press shift ⇧ and tab ↹.

We can see that we can use a list or a dict of the new values. Let's have a look using a list.

In [283]:
newcats=['b','g','r','y']

In [284]:
df.color.cat.rename_categories(newcats)

Out[284]:
r0     y
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

When using a list, the values need to be specified in the correct order and the list needs to match the length of the existing categories, otherwise we get a ValueError.

newcats=['b','g','r']

df.color.cat.rename_categories(newcats)

A dictionary is more flexible, for example if we want to rename just the 'y' category.

In [285]:
newcatd={'yellow':'y'}

In [286]:
df.color.cat.rename_categories(newcatd)

Out[286]:
r0         y
r1       red
r2      blue
r3     green
r4      blue
r5      blue
r6       red
r7     green
r8     green
r9     green
r10    green
r11      red
Name: color, dtype: category
Categories (4, object): ['blue', 'green', 'red', 'y']

We can create a dictionary of the old categories and new categories by use of the zip function.

In [287]:
oldcats=df.color.cat.categories

In [288]:
newcatsd=dict(zip(oldcats,newcats))

In [289]:
newcatsd

Out[289]:
{'blue': 'b', 'green': 'g', 'red': 'r', 'yellow': 'y'}
In [290]:
df.color.cat.rename_categories(newcatsd)

Out[290]:
r0     y
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

Merging Categories¶

Although a dictionary can have multiple keys with the same value.

In [291]:
newcatsd['yellow']='g'

In [292]:
newcatsd

Out[292]:
{'blue': 'b', 'green': 'g', 'red': 'r', 'yellow': 'g'}

We get an error if we try to use the dictionary to merge two old categories into one new category.
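
A sketch of the failure (values assumed): mapping 'y' onto the existing name 'g' would leave duplicate categories, which pandas rejects.

```python
import pandas as pd

s = pd.Series(['b', 'g', 'r', 'y'], dtype='category')
try:
    # 'g' already exists, so the renamed categories would not be unique
    s.cat.rename_categories({'y': 'g'})
except ValueError as e:
    print('rejected:', e)
```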

Fortunately this is easy enough to do by indexing and reassigning a value. Let's first update the pandas series to have single letter values.

In [293]:
newcatsd['yellow']='y'

In [294]:
df.color=df.color.cat.rename_categories(newcatsd)

In [295]:
df.color

Out[295]:
r0     y
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

We can index into the series by using the condition df.color=='y'.

In [296]:
df.color[df.color=='y']

Out[296]:
r0    y
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

We can reassign this selection to an existing category.

In [297]:
df.color[df.color=='y']='g'


If we look at this series we see that all the values that were 'y' are now 'g'.

In [298]:
df.color

Out[298]:
r0     g
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

4 categories are listed. We can remove unused categories using the method remove_unused_categories. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see the only input argument is deprecated, so we can call it without any.

In [299]:
df.color.cat.remove_unused_categories()

Out[299]:
r0     g
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (3, object): ['b', 'g', 'r']
In [300]:
df.color=df.color.cat.remove_unused_categories()


Ordering Categories¶

The categories are ordered alphabetically. In some cases this may not be desirable. Let's make a new pandas series called grade; recall we have to index using square brackets to create a new series. We will copy the values in df.color and then rename the categories to 'low', 'med' and 'high'.

In [301]:
df['grade']=df.color

In [302]:
df.grade=df.grade.cat.rename_categories({'g':'low','b':'med','r':'high'})

In [303]:
df.grade

Out[303]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['med', 'low', 'high']

If we have a look at the categories we see that they were originally alphabetically ordered from 'blue', 'green' and 'red'. In this case the order 'med', 'low' and 'high' does not make sense.

In [304]:
df.grade.cat.categories

Out[304]:
Index(['med', 'low', 'high'], dtype='object')

We can fix this using the cat method reorder_categories. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see the input argument can either be index-like or a list.

In [305]:
new_order=['low','med','high']

In [306]:
df.grade.cat.reorder_categories(new_order)

Out[306]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['low', 'med', 'high']
In [307]:
new_order_index=pd.Index(new_order)

In [308]:
df.grade.cat.reorder_categories(new_order_index)

Out[308]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['low', 'med', 'high']

Let's reassign the output to the pd series df.grade.

In [309]:
df.grade=df.grade.cat.reorder_categories(new_order_index)


Although we have reordered the names of the categories, the categories themselves aren't flagged as ordered. When the series displays, the categories are separated by a comma, indicating they are unordered.

In [310]:
df.grade.cat.categories

Out[310]:
Index(['low', 'med', 'high'], dtype='object')

We can use the cat method as_ordered to set the series as ordered. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see we don't need to provide any.

Note the < sign is in place of the , indicating the categories are ordered.

In [311]:
df.grade.cat.as_ordered()

Out[311]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['low' < 'med' < 'high']
In [312]:
df.grade=df.grade.cat.as_ordered()


Creating Categories from Numeric Data¶

Let's have a look at our dataframe.

In [313]:
df

Out[313]:
dice color grade
r0 5 g low
r1 6 r high
r2 1 b med
r3 4 g low
r4 4 b med
r5 4 b med
r6 2 r high
r7 4 g low
r8 6 g low
r9 3 g low
r10 5 g low
r11 1 r high

Let's drop the series df.grade and instead create a new grade series based on the numeric data in dice.

In [314]:
df=df.drop(columns='grade')

In [315]:
df

Out[315]:
dice color
r0 5 g
r1 6 r
r2 1 b
r3 4 g
r4 4 b
r5 4 b
r6 2 r
r7 4 g
r8 6 g
r9 3 g
r10 5 g
r11 1 r

We want to create 3 grades from the numeric data in dice. Because we have n grades, we need n+1 boundaries. Each bin is exclusive of the lower bound and inclusive of the upper bound, i.e. (0, 2] excludes 0 and includes 2, (2, 4] excludes 2 and includes 4, and (4, 6] excludes 4 and includes 6. Note the categories are separated by < indicating they are ordered.

In [316]:
bins=[0,2,4,6]
pd.cut(df.dice,bins)

Out[316]:
r0     (4, 6]
r1     (4, 6]
r2     (0, 2]
r3     (2, 4]
r4     (2, 4]
r5     (2, 4]
r6     (0, 2]
r7     (2, 4]
r8     (4, 6]
r9     (2, 4]
r10    (4, 6]
r11    (0, 2]
Name: dice, dtype: category
Categories (3, interval[int64]): [(0, 2] < (2, 4] < (4, 6]]

We can assign the output to a new series using square brackets.

In [317]:
df['grade']=pd.cut(df.dice,bins)

In [318]:
df

Out[318]:
dice color grade
r0 5 g (4, 6]
r1 6 r (4, 6]
r2 1 b (0, 2]
r3 4 g (2, 4]
r4 4 b (2, 4]
r5 4 b (2, 4]
r6 2 r (0, 2]
r7 4 g (2, 4]
r8 6 g (4, 6]
r9 3 g (2, 4]
r10 5 g (4, 6]
r11 1 r (0, 2]
Let's rename the interval categories to 'low', 'med' and 'high', reusing the zip approach from earlier to build the rename dictionary and reassigning the output.

In [319]:
oldcats=df.grade.cat.categories
newcats=['low','med','high']
renamecats=dict(zip(oldcats,newcats))
df.grade=df.grade.cat.rename_categories(renamecats)

In [320]:
df

Out[320]:
dice color grade
r0 5 g high
r1 6 r high
r2 1 b low
r3 4 g med
r4 4 b med
r5 4 b med
r6 2 r low
r7 4 g med
r8 6 g high
r9 3 g med
r10 5 g high
r11 1 r low
In [321]:
df.grade

Out[321]:
r0     high
r1     high
r2      low
r3      med
r4      med
r5      med
r6      low
r7      med
r8     high
r9      med
r10    high
r11     low
Name: grade, dtype: category
Categories (3, object): ['low' < 'med' < 'high']

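As an aside, the rename step can be avoided: pd.cut has a labels keyword input argument that names the bins directly and keeps the ordering. A minimal sketch with assumed dice values:

```python
import pandas as pd

dice = pd.Series([5, 6, 1, 4, 2])
graded = pd.cut(dice, bins=[0, 2, 4, 6], labels=['low', 'med', 'high'])
print(list(graded))        # ['high', 'high', 'low', 'med', 'low']
print(graded.cat.ordered)  # True: cut produces ordered categories
```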

Grouping by Category¶

We can create a groupby object using the dataframe method groupby. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. The default value for the keyword input argument axis is 0, meaning the method groups rows within each column. The keyword input argument by (which also acts as a positional input argument) can therefore be used to select a column.

In [323]:
df.groupby(by='color')

Out[323]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E3266F53A0>

The output tells us this is an object assigned to a specific location in memory. To inspect it in more detail we will assign it to a new object name:

In [324]:
gbo=df.groupby(by='color')


If we type in the groupby object name followed by a dot . and then a tab ↹, a list of attributes and methods will display:

Let's select the method count. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see there are no input arguments.

In [325]:
gbo.count()

Out[325]:
dice grade
color
b 3 3
g 6 6
r 3 3

Let's compare this with the dataframe.

In [326]:
df

Out[326]:
dice color grade
r0 5 g high
r1 6 r high
r2 1 b low
r3 4 g med
r4 4 b med
r5 4 b med
r6 2 r low
r7 4 g med
r8 6 g high
r9 3 g med
r10 5 g high
r11 1 r low

In the series df.color the value 'b' shows in 'r2', 'r4' and 'r5' i.e. 3 times.

In the series df.color the value 'g' shows in 'r0', 'r3', 'r7', 'r8', 'r9' and 'r10' i.e. 6 times.

In the series df.color the value 'r' shows in 'r1', 'r6' and 'r11' i.e. 3 times.

In [327]:
gbo.count()

Out[327]:
dice grade
color
b 3 3
g 6 6
r 3 3

This can be combined into a single line.

In [328]:
df.groupby(by='color').count()

Out[328]:
dice grade
color
b 3 3
g 6 6
r 3 3

Although the above is a bit more concise, there is a major drawback when it comes to looking up attributes and methods. Typing a dot . and then tab ↹ after the first method is input doesn't display the list of available attributes and methods. Moreover typing in shift ⇧ and tab ↹ does not give details about the groupby method count.

The output can also be assigned to an object name and is an instance of the DataFrame class.

In [329]:
colorcount=df.groupby(by='color').count()

In [330]:
type(colorcount)

Out[330]:
pandas.core.frame.DataFrame

We can use the groupby method to group by the series color and then select another column as an attribute to perform groupby calculations on. We need a numeric series to perform numerical operations on. Let's return the series dice to the type int (recall it was set to category).

In [331]:
df.dice=df.dice.astype(int)

In [332]:
df.groupby(by='color').dice.sum()

Out[332]:
color
b     9
g    27
r     9
Name: dice, dtype: int64

Leaving the series as category and trying the above gives a TypeError.
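
Several statistics can also be computed per group in one call with the groupby method agg, which accepts a list of function names. A sketch with assumed values:

```python
import pandas as pd

df = pd.DataFrame({'color': ['b', 'g', 'b', 'g'],
                   'dice':  [1, 4, 4, 4]})
# One row per group, one column per aggregation function
summary = df.groupby(by='color').dice.agg(['sum', 'mean', 'count'])
print(summary)
```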

Date and Time¶

When working with time we generally use a datetime time stamp, and the numpy library has support for these (numpy was imported at the start of this notebook). If we type in the class name followed by shift ⇧ and tab ↹ we get the docstring for the class's init signature; unfortunately this is lacking. In practice we supply a str of the time stamp. For simplicity let's look at a year.

In [333]:
np.datetime64('2019')

Out[333]:
numpy.datetime64('2019')

We can also assign it to the object name date:

In [334]:
date=np.datetime64('2019')


Fortunately the pandas series handles this datatype much better and we can view the datatype within the variable explorer properly. We can easily convert this value into a series:

In [335]:
date=pd.Series(date)

In [336]:
date

Out[336]:
0   2019-01-01
dtype: datetime64[ns]

We see the display shows not only the year but also the month and day, i.e. it is of the format:

'YYYY-MM-DD'

We can also include a time after this.

'YYYY-MM-DD hh:mm:ss'

We can create another date:

In [337]:
date2=pd.Series(np.datetime64('2021-02-05 20:27:30'))

In [338]:
date2

Out[338]:
0   2021-02-05 20:27:30
dtype: datetime64[ns]

We now have two different time stamps and can now calculate the time difference between them:

In [339]:
date2-date

Out[339]:
0   766 days 20:27:30
dtype: timedelta64[ns]

This returns a slightly different object, a timedelta64 object i.e. an instance of the timedelta64 class.

Note the units years and months are omitted as they are inconsistent units. Most years have 365 days but a leap year has 366 days, and months range from 28-29 days in February to 31 days in January. Therefore it is more accurate to state time in days.

We can assign this output to a pandas series:

In [340]:
timediff=pd.Series(date2-date)


We can also create a timedelta64 object directly from the numpy library. If we type in the class name followed by shift ⇧ and tab ↹ we get the docstring for the class's init signature; unfortunately this is also lacking.

The first positional input argument is the value and the second is the unit. For example:

1 Day:

In [341]:
np.timedelta64(1,'D')

Out[341]:
numpy.timedelta64(1,'D')

1 hour:

In [342]:
np.timedelta64(1,'h')

Out[342]:
numpy.timedelta64(1,'h')

1 minute:

In [343]:
np.timedelta64(1,'m')

Out[343]:
numpy.timedelta64(1,'m')

1 second:

In [344]:
np.timedelta64(1,'s')

Out[344]:
numpy.timedelta64(1,'s')

We can combine these units together:

In [345]:
np.timedelta64(1,'D')+np.timedelta64(1,'s')

Out[345]:
numpy.timedelta64(86401,'s')

We can use the numpy function arange to create an array of numbers. If we type in the function name followed by shift ⇧ and tab ↹ we get the docstring for the function and see the keyword input arguments start, stop and step.

In [346]:
a=np.arange(start=0,stop=6,step=1)


This creates an array starting at the value 0 and going up to (but not including) 6 in steps of 1:

In [347]:
a

Out[347]:
array([0, 1, 2, 3, 4, 5])

We can also do this with time stamps or time differences:

In [348]:
times=np.arange(start=np.timedelta64(0,'s'),stop=np.timedelta64(6,'s'),step=np.timedelta64(1,'s'))

In [349]:
times

Out[349]:
array([0, 1, 2, 3, 4, 5], dtype='timedelta64[s]')

Let's convert this into a pandas series.

In [350]:
times=pd.Series(times)

In [351]:
times

Out[351]:
0   0 days 00:00:00
1   0 days 00:00:01
2   0 days 00:00:02
3   0 days 00:00:03
4   0 days 00:00:04
5   0 days 00:00:05
dtype: timedelta64[ns]

The start and stop can instead be specified as a timestamp.

In [352]:
dates=np.arange(start=np.datetime64('2021-01-01'),stop=np.datetime64('2021-01-06'),step=np.timedelta64(1,'D'))
dates=pd.Series(dates)

In [353]:
dates

Out[353]:
0   2021-01-01
1   2021-01-02
2   2021-01-03
3   2021-01-04
4   2021-01-05
dtype: datetime64[ns]
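
pandas also has its own helper for building date sequences, pd.date_range, and datetime series expose a dt accessor for the individual components. A brief sketch:

```python
import pandas as pd

# Five consecutive days starting 2021-01-01, as a pandas series
dates = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='D'))
print(dates.dt.day.tolist())    # [1, 2, 3, 4, 5]
print(dates.dt.month.tolist())  # [1, 1, 1, 1, 1]
```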

We can create a dataframe from these series using:

In [354]:
df=pd.DataFrame(data=a)
df['times']=times
df['dates']=dates

In [355]:
df

Out[355]:
0 times dates
0 0 0 days 00:00:00 2021-01-01
1 1 0 days 00:00:01 2021-01-02
2 2 0 days 00:00:02 2021-01-03
3 3 0 days 00:00:03 2021-01-04
4 4 0 days 00:00:04 2021-01-05
5 5 0 days 00:00:05 NaT

Note that the last value of dates is NaT (Not a Time) and is due to this series only having 5 values so no value was assigned for the last index.
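
NaT behaves like any other missing value, so the usual missing-data methods apply. A minimal sketch with assumed values:

```python
import pandas as pd

# pd.to_datetime converts None into NaT
dates = pd.Series(pd.to_datetime(['2021-01-01', None]))
print(dates.isna().tolist())  # [False, True]
print(len(dates.dropna()))    # 1
```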

Once familiar with the Python and Data Analysis Library it is recommended to move on to the matplotlib Python plotting library. I have an additional JupyterLab notebook covering the matplotlib library here.

The Python Plotting Library