The Python and Data Analysis Library (pandas)

This is a beginner guide to one of the most common Python libraries. It can be viewed in the browser; however, the interactive Python notebook file can also be opened in JupyterLab. The Python and Data Analysis Library (abbreviated with the lower case pandas) is used to organize and manipulate data in a dataframe (which can be thought of as the equivalent of an Excel spreadsheet).

Prerequisites

Before looking at any Python library it is recommended to familiarize yourself with the Python programming language and object-oriented programming concepts. Details about these, in addition to Python installation instructions, are available in my prerequisite guide below.

It is also recommended to familiarize yourself with the Numeric Python library numpy. I have an additional JupyterLab notebook covering the numpy library.

Importing the pandas Library

The pandas library is one of the most commonly used Python libraries, particularly in the field of data science, and as a consequence it is commonly imported using a two-letter abbreviation, similar to the numpy library.

In [1]:
import numpy as np
import pandas as pd

NumPy Arrays

Before starting with pandas it is worth having a quick recap of how we encode numeric data as a list and as a numpy array.

Lists

Numeric data can be expressed in a list:

In [2]:
x=[0,1,2,3,4]
In [3]:
x
Out[3]:
[0, 1, 2, 3, 4]
In [4]:
y1=[0,2,4,6,8]
In [5]:
y1
Out[5]:
[0, 2, 4, 6, 8]
In [6]:
y2=[0,1.5,3,4.5,6.0]
In [7]:
y2
Out[7]:
[0, 1.5, 3, 4.5, 6.0]

Although the data is numeric, the object is a list and therefore the attributes and methods relate to common list operations and not common numeric operations.

Attributes and methods available from an object can be seen by typing in the object name followed by a dot . and then pressing tab ↹.

list_attributes_methods
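Outside of a notebook, a similar listing can be produced programmatically with the built-in dir function. A minimal sketch (names beginning with an underscore are filtered out, as tab completion only shows the public names):

```python
x = [0, 1, 2, 3, 4]
# dir lists every attribute and method name of an object; tab completion
# shows only the public ones (those without a leading underscore)
public = [name for name in dir(x) if not name.startswith('_')]
print(public)
# ['append', 'clear', 'copy', 'count', 'extend', 'index',
#  'insert', 'pop', 'remove', 'reverse', 'sort']
```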

A method can be thought of as a function that belongs to the object. For example the list method append will take the original list and append an additional item at the end of the list. As methods are functions they must be called with parentheses. The parentheses can contain positional input arguments (which must always be supplied in order at the beginning) or keyword input arguments (which can optionally be assigned and otherwise use their default values). To get details about the input arguments press shift ⇧ and tab ↹.

Here we see the positional input argument object.

Note also that the list method append has no return value and performs an in-place update of the original list.

list_method_append

Let's assign the positional input argument object to 4. It will be appended to the end of the list y1. An inplace update of y1 will occur with this new element.

In [8]:
y1.append(4)
In [9]:
y1
Out[9]:
[0, 2, 4, 6, 8, 4]

This additional element can be popped off by using the list method pop. Once again to get details about the input arguments press shift ⇧ and tab ↹. The list method pop has a single input argument index=-1, meaning index has a default value of -1. So if we don't change it, we will pop the last value (index -1) off the list.

Note this method performs an in-place update of the list but also has a return value, which is the item popped.

list_method_pop

In [10]:
y1.pop()
Out[10]:
4
In [11]:
y1
Out[11]:
[0, 2, 4, 6, 8]
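The index argument can also be supplied explicitly, for example to pop the first element instead of the last:

```python
y1 = [0, 2, 4, 6, 8]
first = y1.pop(0)  # pop index 0, the first element
print(first)  # 0
print(y1)     # [2, 4, 6, 8]
```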

The + operator performs the operation outlined in the special method __add__. If we press shift ⇧ and tab ↹ the string form: states it is a "list object" method. For a list this special method is set up to perform concatenation.

list__special_method_add

So for two equally sized lists, we get a new list that has twice as many elements.

In [12]:
y1+y2
Out[12]:
[0, 2, 4, 6, 8, 0, 1.5, 3, 4.5, 6.0]

1D Arrays

When dealing with exclusively numeric data, a numpy array is more powerful. The attributes and methods are more specialized for numeric operations. For example the + operator will perform element-by-element addition of equally sized numpy arrays, as opposed to concatenating the two arrays together. The function np.array can be used to convert a list into an array.

In [13]:
x=np.array(x)
In [14]:
x
Out[14]:
array([0, 1, 2, 3, 4])
In [15]:
y1=np.array(y1)
In [16]:
y1
Out[16]:
array([0, 2, 4, 6, 8])
In [17]:
y2=np.array(y2)
In [18]:
y2
Out[18]:
array([0. , 1.5, 3. , 4.5, 6. ])

The attributes and methods available to the new object y1 can now be examined by typing in the object name followed by a dot . and then pressing tab ↹.

ndarray_attributes_methods

We see the list of attributes and methods is different for the object y1 now that it is a numpy array and, as expected, is more numerically focused. This object has both attributes and methods. An attribute can be thought of as an object that is referenced with respect to another object (i.e. it is another instance of a class). Typically it is a property of the object; for example shape will give the dimensions. An attribute has no input arguments and is therefore accessed without parentheses. Pressing shift ⇧ and tab ↹ will do nothing as the attribute has no input arguments.

In [19]:
y1.shape
Out[19]:
(5,)

Note the shape of the numpy array only specifies a single dimension. This means that the numpy array is neither a row nor a column but can instead be viewed as either, as convenient.

The + operator performs the operation outlined in the special method __add__. If we press shift ⇧ and tab ↹ the string form: states it is a "ndarray object" method. For a ndarray this special method is set up to perform element-by-element addition.

ndarray__special_method_add

In [20]:
y1+y2
Out[20]:
array([ 0. ,  3.5,  7. , 10.5, 14. ])

2D Arrays (Columns or Rows)

The numpy array above is neither a column nor a row; however there are times we need to specify it as one or the other for a specific numeric operation, for example concatenation. It can explicitly be reshaped using the ndarray method reshape. The positional input argument is newshape, a tuple corresponding to the new dimensions. If -1 is specified for one of the dimensions, it will take all remaining elements and apply them to that dimension.

In [21]:
x.reshape
Out[21]:
<function ndarray.reshape>

ndarray_method_reshape2

In [22]:
x=x.reshape((-1,1))
In [23]:
x
Out[23]:
array([[0],
       [1],
       [2],
       [3],
       [4]])
In [24]:
y1=y1.reshape((-1,1))
In [25]:
y1
Out[25]:
array([[0],
       [2],
       [4],
       [6],
       [8]])
In [26]:
y2=y2.reshape((-1,1))
In [27]:
y2
Out[27]:
array([[0. ],
       [1.5],
       [3. ],
       [4.5],
       [6. ]])

2D Arrays (Matrices)

The column vectors can be concatenated to create a matrix. The keyword input argument axis for a matrix can be 0, corresponding to the rows, or 1, corresponding to the columns. Recall when working with numpy arrays that when new dimensions are added, they are always added to the front of the shape tuple.

(ncols,)=array1d.shape

(nrows,ncols)=array2d.shape

(npages,nrows,ncols)=array3d.shape

(nframes,npages,nrows,ncols)=array4d.shape

...

This means for a 1d array axis=0 corresponds to columns, for a 2d array axis=0 corresponds to rows, for a 3d array axis=0 corresponds to pages and for a 4d array axis=0 corresponds to frames. More generally the axis can be selected using its negative index, where axis=-1 always corresponds to the columns and axis=-2 always corresponds to the rows.
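As a quick sketch of these axis rules, consider summing a small 2d array along different axes:

```python
import numpy as np

m = np.array([[0.0, 0.0, 0.0],
              [1.0, 2.0, 1.5]])
# axis=0 moves down the rows, producing one total per column
print(m.sum(axis=0))   # [1.  2.  1.5]
# axis=-1 (the last axis) always moves along the columns,
# producing one total per row
print(m.sum(axis=-1))  # [0.  4.5]
```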

We will move onto dataframes which have data in the form of a 2d array. Therefore when axis is used in a dataframe axis=0 corresponds to rows and axis=1 corresponds to columns.

In [28]:
matrix=np.concatenate([x,y1,y2],axis=-1)
In [29]:
matrix
Out[29]:
array([[0. , 0. , 0. ],
       [1. , 2. , 1.5],
       [2. , 4. , 3. ],
       [3. , 6. , 4.5],
       [4. , 8. , 6. ]])

The Series, DataFrame and Index Classes

The pandas Series Class

Note when the matrix was created that the names of the vectors x, y1 and y2 were lost and were instead replaced by the name of the matrix. Instead of creating numpy arrays, which only have data and a variable name, we can explicitly create an object which can possess a name. We will start by creating columns in the form of a pandas series. Note when we are instantiating a class, i.e. creating an instance or an object, we need to call the class using CamelCaseCapitalization. This notation is used in the Python programming language to distinguish class creation from functions.

Classes, like functions, use parentheses and contain positional and keyword input arguments. Once again we can access these by use of shift ⇧ and tab ↹.

pd-Series-Class

In this case all the input arguments are keyword input arguments but the keyword input data can also be used as a positional input argument. We can provide the data in the form of a list and provide the name as a string.

The datatype will be inferred from the contents of the input. For example x and y1 are assigned the datatypes int64 and int32 from the list and numpy array respectively (the exact integer width inferred from a numpy array can vary by platform). y2 is a float as the list has numbers which include a decimal point.

Note that the output has a default integer index 0,1,2,3 and 4. Each value in a series is paired with its respective index value.

In [30]:
x=[0,1,2,3,4]
x=pd.Series(data=x,name='x')
In [31]:
x
Out[31]:
0    0
1    1
2    2
3    3
4    4
Name: x, dtype: int64
In [32]:
y1=[0,2,4,6,8]
y1=np.array(y1)
y1=pd.Series(data=y1,name='y1')
In [33]:
y1
Out[33]:
0    0
1    2
2    4
3    6
4    8
Name: y1, dtype: int32
In [34]:
y2=[0,1.5,3,4.5,6.0]
y2=np.array([0,1.5,3,4.5,6.0])
y2=pd.Series(data=y2,name='y2')
In [35]:
y2
Out[35]:
0    0.0
1    1.5
2    3.0
3    4.5
4    6.0
Name: y2, dtype: float64

The attributes and methods available to the new object y1 can now be examined. Once again by typing in the object name followed by a dot . and then pressing tab ↹.

pdseries_attributes_methods

We can use the attribute name for example to return the name of the series as a string.

In [36]:
y1.name
Out[36]:
'y1'

The pandas Index Class

Note each of the series x, y1 and y2 have the same default numeric integer index with the values 0,1,2,3 and 4. This can be changed when creating the series to a custom list of strings.

In [37]:
x=[0,1,2,3,4]
rnames=['r0','r1','r2','r3','r4']
x=pd.Series(data=x,name='x',index=rnames)
In [38]:
x
Out[38]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64

Instead of using a separate list for data and index, the data can be supplied directly as a dictionary. Since we already have these as separate lists, let's zip the two lists together to make a dictionary.

In [39]:
idx=dict(zip(rnames,x))
In [40]:
idx
Out[40]:
{'r0': 0, 'r1': 1, 'r2': 2, 'r3': 3, 'r4': 4}

Now let's create a series directly from the dictionary.

In [41]:
x=pd.Series(data=idx,name='x')
In [42]:
x
Out[42]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64

If the index names are labelled appropriately (i.e. in line with valid variable names) they will be accessible as attributes of the series. Calling the attribute returns the value.

pdseries_attibutes_methods

In [43]:
x.r0
Out[43]:
0

Otherwise the value can be accessed using square brackets analogous to indexing in a dictionary.

In [44]:
x['r0']
Out[44]:
0

Also, analogous to a dictionary, where the lists of keys and values can be accessed, the index and values of a series can be accessed as attributes.

In [45]:
x.index
Out[45]:
Index(['r0', 'r1', 'r2', 'r3', 'r4'], dtype='object')
In [46]:
x.values
Out[46]:
array([0, 1, 2, 3, 4], dtype=int64)

Notice the outputs of these are a pandas Index class and a numpy array respectively. Let's now have a look at instantiating an index class directly. Recall when instantiating a class we need to use CamelCaseCapitalization.

We can access the list of input arguments by use of shift ⇧ and tab ↹. Here we see that the index class is very similar to the series class and virtually all the keyword input arguments are the same. The index class doesn't have the index keyword input argument as the index of an index is always going to consist of numerical integers.

pd-Index_Class

In [47]:
rnames=['r0','r1','r2','r3','r4']
idx=pd.Index(data=rnames)
In [48]:
idx
Out[48]:
Index(['r0', 'r1', 'r2', 'r3', 'r4'], dtype='object')

Let's now update the index of x, y1 and y2

In [49]:
rnames=['r0','r1','r2','r3','r4']
x=[0,1,2,3,4]
y1=[0,2,4,6,8]
y2=[0,1.5,3,4.5,6.0]

x=pd.Series(data=x,name='x',index=rnames)
y1=pd.Series(data=y1,name='y1',index=rnames)
y2=pd.Series(data=y2,name='y2',index=rnames)
In [50]:
x
Out[50]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64
In [51]:
y1
Out[51]:
r0    0
r1    2
r2    4
r3    6
r4    8
Name: y1, dtype: int64
In [52]:
y2
Out[52]:
r0    0.0
r1    1.5
r2    3.0
r3    4.5
r4    6.0
Name: y2, dtype: float64

The pandas DataFrame Class

We have seen an index class, which we essentially use to create row names, and a series class, which is a single column that has a number of index-value pairs analogous to a dictionary. We can now look at creating a data structure called a dataframe, which has multiple columns with a shared index. Recall again when instantiating a class we need to use CamelCaseCapitalization.

We can access the list of input arguments by use of shift ⇧ and tab ↹. Here we see that the dataframe class is very similar to the series class. We have the keyword input arguments data and index. The data supplied to a dataframe is more complicated as it is 2D as opposed to 1D, but the index keyword input argument works the same way. The keyword input argument columns is present instead of name and is used to name each of the columns.

pdDataFrame_Class

Let's have a look at creating a dataframe class using one of the previously created series.

In [53]:
df=pd.DataFrame(data=x,columns=[x.name])
In [54]:
df
Out[54]:
x
r0 0
r1 1
r2 2
r3 3
r4 4

This dataframe has only one column named x which can be accessed as an attribute.

In [55]:
df.x
Out[55]:
r0    0
r1    1
r2    2
r3    3
r4    4
Name: x, dtype: int64

This attribute is a series and recall that we can access an index in a series as an attribute.

In [56]:
df.x.r0
Out[56]:
0

We can add a new column to the dataframe by indexing into the dataframe by using square brackets and assigning a new name of the column.

In [57]:
df['y1']=y1
In [58]:
df
Out[58]:
x y1
r0 0 0
r1 1 2
r2 2 4
r3 3 6
r4 4 8

Doing so will create the dataframe attribute y1 which can now be accessed.

In [59]:
df.y1
Out[59]:
r0    0
r1    2
r2    4
r3    6
r4    8
Name: y1, dtype: int64

Note that the attribute is only created after the addition of a new column and therefore an attribute can only be accessed if the column already exists. Attempting to create a new column using attribute assignment will not work: the column is not created, and accessing an attribute for a column that doesn't yet exist results in an error.

pddataframe_series_attributes_newcolumn
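As a minimal sketch of this behaviour (the column names here are hypothetical), square bracket indexing creates a new column whereas attribute assignment only creates a plain Python attribute on the object:

```python
import pandas as pd

df = pd.DataFrame({'x': [0, 1, 2]})
df['y'] = [0, 2, 4]  # square brackets create a new column
df.z = [0, 3, 6]     # attribute assignment does NOT create a column
                     # (recent pandas versions emit a UserWarning here)
print(df.columns.tolist())  # ['x', 'y'] - no 'z' column
```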

In [60]:
df['y2']=y2
In [61]:
df
Out[61]:
x y1 y2
r0 0 0 0.0
r1 1 2 1.5
r2 2 4 3.0
r3 3 6 4.5
r4 4 8 6.0

The dataframe can also be created from a dictionary. In this form the keyword input argument data contains both the column names and the column data, therefore the keyword input argument columns is not used. The dictionary is set up so the keys are the column names and the dictionary values are the data, present as either a list or a pandas series.

In [62]:
colnamedata={x.name:x,y1.name:y1,y2.name:y2}
df=pd.DataFrame(data=colnamedata)
In [63]:
df
Out[63]:
x y1 y2
r0 0 0 0.0
r1 1 2 1.5
r2 2 4 3.0
r3 3 6 4.5
r4 4 8 6.0

Finally, it is more common to create a dataframe from a numpy matrix and assign both the column names and index (row) names as lists of strings.

In [64]:
matrix
Out[64]:
array([[0. , 0. , 0. ],
       [1. , 2. , 1.5],
       [2. , 4. , 3. ],
       [3. , 6. , 4.5],
       [4. , 8. , 6. ]])
In [65]:
cnames=['x','y1','y2']
In [66]:
rnames=['r0','r1','r2','r3','r4']
In [67]:
df=pd.DataFrame(data=matrix,
               columns=cnames,
               index=rnames)
In [68]:
df
Out[68]:
x y1 y2
r0 0.0 0.0 0.0
r1 1.0 2.0 1.5
r2 2.0 4.0 3.0
r3 3.0 6.0 4.5
r4 4.0 8.0 6.0

Reading Data from a File

Examining the dataframe above, you've likely noticed an analogy with a spreadsheet. In this section we will look at how data is stored in a spreadsheet and read it into a dataframe.

spreadsheet

Reading from a CSV File

CSV is an abbreviation for Comma Separated Values. CSV files, when opened in Excel, look like the following.

NoHeadersImg

When opened in Notepad++ (with hidden punctuation options displayed) they look like this.

CSV1-NoHeadersImg2

The comma separated file uses a comma as a delimiter (separator) as an instruction to move to the next column. This is analogous to the comma being used as a delimiter (separator) in a list e.g. l1=[1,2,3,4]. In this case the comma acts as a separator for each element in the list.

To move onto the next line we use the return key [↵].

In Notepad++ the hidden punctuation symbols can be viewed. These hidden punctuation symbols originally corresponded to typewriter commands. CR is an abbreviation for Carriage Return and is analogous to moving the typewriter's ink carriage from the end of a line on the right hand side back to the start on the left hand side of the page. LF is an abbreviation for Line Feed and would instruct the typewriter to shift the piece of paper up by the height of a line, so the typewriter is ready to begin typing on the next line.
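A minimal sketch of what this means in Python terms: each line of the file is a string ending in the line-ending characters, and splitting on the comma recovers the column values.

```python
# One raw line from a CSV file, including the CR LF line ending
line = '1,2\r\n'
fields = line.strip().split(',')  # strip the line ending, split on commas
print(fields)  # ['1', '2']
```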

To access a file using only the file name, the file must be saved in the same folder as the JupyterLab Notebook. Otherwise the full file path must be specified.

In [69]:
csv_file=pd.read_csv('CSV1-NoHeaders.csv')

Note that Python uses zero-order indexing and Excel uses first-order indexing. The first row of the Excel file was taken to be the column names instead of being read as row 0 of the data.

In [70]:
csv_file
Out[70]:
1 2
0 2 4
1 3 6
2 4 8
3 5 10
4 6 12
5 7 14
6 8 16
7 9 18
8 10 20
9 11 22
10 12 24

We can override this default behaviour by using the keyword input argument header and changing its default value from 'infer' to None.

In [71]:
csv_file=pd.read_csv('CSV1-NoHeaders.csv',
                    header=None)
In [72]:
csv_file
Out[72]:
0 1
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

We can use the keyword input argument names to name the columns while we read the csv file.

In [73]:
cnames=['c0','c1']
In [74]:
csv_file=pd.read_csv('CSV1-NoHeaders.csv',
                    header=None,names=cnames)
In [75]:
csv_file
Out[75]:
c0 c1
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

We can also look at a CSV file with headers.

CSV2-HeadersImg

In this case the default keyword input argument header='infer' can be used and the column names will be taken from row 1 (Excel first-order indexing) of the CSV file.

In [76]:
csv_file=pd.read_csv('CSV2-Headers.csv')
In [77]:
csv_file
Out[77]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

CSV3-Headers-Comments

We want to skip rows 2 and 3 (Excel first-order indexing), which are rows 1 and 2 in Python. We can use the keyword input argument skiprows and provide a list of rows to skip.

In [78]:
csv_file=pd.read_csv('CSV3-Headers-Comments.csv',
                     skiprows=[1,2])
In [79]:
csv_file
Out[79]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

If the file is a very large file, we may not want to import it all and may instead only want to examine a handful of rows as a preview. This can be done using the nrows keyword input argument. In this case we will assign it to 5.

In [80]:
csv_file=pd.read_csv('CSV3-Headers-Comments.csv',
                     skiprows=[1,2],
                     nrows=5)
In [81]:
csv_file
Out[81]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10

We can also select which columns we want using the keyword input argument usecols. In this case we only want the 2nd column with header b (Excel first-order indexing), which is column 1 in Python (zero-order indexing).

In [82]:
csv_file=pd.read_csv('CSV3-Headers-Comments.csv',
                     skiprows=[1,2],
                     nrows=5,
                     usecols=[1])
In [83]:
csv_file
Out[83]:
b
0 2
1 4
2 6
3 8
4 10

The CSV file may already have index names.

CSV4-Headers-Index

In [84]:
csv_file=pd.read_csv('CSV4-Headers-Index.csv')
In [85]:
csv_file
Out[85]:
Unnamed: 0 a b
0 r0 1 2
1 r1 2 4
2 r2 3 6
3 r3 4 8
4 r4 5 10
5 r5 6 12
6 r6 7 14
7 r7 8 16
8 r8 9 18
9 r9 10 20
10 r10 11 22
11 r11 12 24

In this case Column A (Excel) is displayed as an additional column and, as it wasn't assigned a name, it has automatically been referred to as "Unnamed: 0". We can use the keyword input argument index_col to assign column 0 (Excel column A) as the index.

In [86]:
csv_file=pd.read_csv('CSV4-Headers-Index.csv',
                    index_col=0)
In [87]:
csv_file
Out[87]:
a b
r0 1 2
r1 2 4
r2 3 6
r3 4 8
r4 5 10
r5 6 12
r6 7 14
r7 8 16
r8 9 18
r9 10 20
r10 11 22
r11 12 24

Reading from a TXT File

The function read_csv can be used with both CSV and TXT files. Let's first examine a TXT file in Excel, where it looks almost identical with the exception of the file extension.

TXT-HeadersImg

Now let's have a look at it in Notepad++. Here we see a different delimiter, which displays as an arrow but denotes a tab ↹. To type a tab in Python we use '\t'.

TXT-HeadersImg2

If we try to read this file using the default keyword input arguments of read_csv:

In [88]:
txt_file=pd.read_csv('TXT-Headers.txt')
In [89]:
txt_file
Out[89]:
a\tb
0 1\t2
1 2\t4
2 3\t6
3 4\t8
4 5\t10
5 6\t12
6 7\t14
7 8\t16
8 9\t18
9 10\t20
10 11\t22
11 12\t24

We see we get only one column that has the datatype of object (usually meaning strings) and the \t is included in the column as part of the string.

In [90]:
txt_file.dtypes
Out[90]:
a\tb    object
dtype: object

This is because the read_csv function is not using \t as a delimiter and is instead looking for a , by default. We can override this default behaviour by using the keyword input argument delimiter (or its alias sep) and assigning it to '\t'.

In [91]:
txt_file=pd.read_csv('TXT-Headers.txt',
                    sep='\t')

txt_file

The data is now read into two properly separated numeric columns a and b.

Reading from an Excel File

At first glance an XLSX file may look very similar to a CSV or TXT file. If we examine one in Excel.

XLSX_HeadersImg

However, unlike a CSV or TXT file, which is just a single sheet of raw data, an Excel file may have multiple sheets and may also be formatted using different fonts, colors and font sizes. This collectively makes it a more complicated structure.

XLSX-HeadersImg2

When opened in Notepad++ this complexity shows making the file hard to understand.

XLSX-HeadersImg3

Fortunately we are not usually interested in all of these formatting options when reading data into a dataframe. The function read_excel can be used for XLSX files and shares many of the keyword input arguments of read_csv.

In [92]:
xlsx_file=pd.read_excel('XLSX-Headers.xlsx')
In [93]:
xlsx_file
Out[93]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

In a multiple sheet workbook, the keyword input argument sheet_name can be used to specify the sheet. This keyword input argument can take the sheet name, in this case "Sheet1", as a string, or it can take an index. When using an index we use zero-order indexing in Python, so Sheet1 (first-order Excel indexing) will be index 0.

In [94]:
xlsx_file=pd.read_excel('XLSX-Headers.xlsx',
                       sheet_name='Sheet1')
In [95]:
xlsx_file
Out[95]:
a b
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
6 7 14
7 8 16
8 9 18
9 10 20
10 11 22
11 12 24

Handling Missing Data

Unfortunately when importing data we run into situations where some of the data is missing. For example cell B3 has a division by 0 error, row 8 is missing, and cells A11 and B12 are missing.

XLSX2-Headers-MissingDataImg

In [96]:
xlsx_file=pd.read_excel('XLSX2-Headers-MissingData.xlsx')
In [97]:
xlsx_file
Out[97]:
a b
0 1.0 2.0
1 2.0 NaN
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 NaN NaN
7 8.0 16.0
8 9.0 18.0
9 NaN 20.0
10 11.0 NaN
11 12.0 24.0

The missing data displays as NaN, an abbreviation for Not a Number. The presence of NaN values can distort the data, especially when used in numerical calculations.
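As a quick illustration using numpy, a single NaN propagates through a standard sum unless it is explicitly ignored:

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan])
print(values.sum())       # nan - one missing value poisons the total
print(np.nansum(values))  # 3.0 - NaN values explicitly ignored
```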

Checking for NaN Values

Let's look at how we can handle NaN values. Firstly we can use the pandas function isna to get a Boolean table which lets us know which cells contain missing data. Alternatively we can use the pandas function notna to let us know which cells contain good data.

In [98]:
pd.isna(xlsx_file)
Out[98]:
a b
0 False False
1 False True
2 False False
3 False False
4 False False
5 False False
6 True True
7 False False
8 False False
9 True False
10 False True
11 False False
In [99]:
pd.notna(xlsx_file)
Out[99]:
a b
0 True True
1 True False
2 True True
3 True True
4 True True
5 True True
6 False False
7 True True
8 True True
9 False True
10 True False
11 True True

If we assign the output to an object name, we can see that the new object is also a dataframe.

In [100]:
na_values=pd.isna(xlsx_file)
In [101]:
type(na_values)
Out[101]:
pandas.core.frame.DataFrame

Therefore we can use dataframe methods and attributes on it. Boolean data can be used in mathematical calculations, where True corresponds to 1 and False corresponds to 0.

na_values_dataframe_methods

Let's use the method sum. To find out more details about the input arguments we can once again press shift ⇧ and tab ↹. We see that it is equivalent to the numpy method sum. The default value of the keyword input argument axis is None; when left as None the sum is performed over the index (rows). The keyword input argument can be set to 0 to work on rows and 1 to work on columns. We will leave all keyword input arguments at their default values.

na_values_dataframe_method_sum

In [102]:
na_values.sum()
Out[102]:
a    2
b    3
dtype: int64

We can combine the above in one line: pd.isna(xlsx_file) returns a dataframe and .sum() is a dataframe method. This notation is a bit more concise, but shift ⇧ and tab ↹ won't yield details on the sum method input arguments like it would in the case where the output was first assigned to a new object name and the method was called directly from that object name.

In [103]:
pd.isna(xlsx_file).sum()
Out[103]:
a    2
b    3
dtype: int64
In [104]:
pd.notna(xlsx_file).sum()
Out[104]:
a    10
b     9
dtype: int64

The summation of NaN and not NaN values gives 12 values for each column as expected.

We can use the dataframe method dropna to disregard Not a Number values. Let's have a look at the input arguments using shift ⇧ and tab ↹.

The default keyword input arguments axis and how are set to 0 and 'any' respectively. This means it will act on rows and remove any row that has a Not a Number value.

na_values_dataframe_method_dropna

Dropping NaN Values

In [105]:
xlsx_file.dropna()
Out[105]:
a b
0 1.0 2.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
7 8.0 16.0
8 9.0 18.0
11 12.0 24.0

We can update the keyword input argument how to be 'all' instead and in this case it will only remove rows where all the values are NaN values.

In [106]:
xlsx_file.dropna(how='all')
Out[106]:
a b
0 1.0 2.0
1 2.0 NaN
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
7 8.0 16.0
8 9.0 18.0
9 NaN 20.0
10 11.0 NaN
11 12.0 24.0

We can also use the keyword input argument subset, to restrict the search to only specific columns, in this case only column 'b'.

In [107]:
xlsx_file.dropna(how='any',subset=['b'])
Out[107]:
a b
0 1.0 2.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
7 8.0 16.0
8 9.0 18.0
9 NaN 20.0
11 12.0 24.0

Filling Values

We can also use the method fillna to replace the NaN with another value. We can use the keyword input argument value to replace all the blanks with a static value such as 0.

na_values_dataframe_method_filna

In [108]:
xlsx_file.fillna(value=0)
Out[108]:
a b
0 1.0 2.0
1 2.0 0.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 0.0 0.0
7 8.0 16.0
8 9.0 18.0
9 0.0 20.0
10 11.0 0.0
11 12.0 24.0

We can also use a dictionary to apply a different value to each column. This may be useful if the different columns have a different scale and we wish to use the values to denote saturation values, for instance. The dictionary uses the column names as keys and the values are the desired fill value for each column's NaN values.

In [109]:
fill_dict={'a':0,'b':99}
xlsx_file.fillna(value=fill_dict)
Out[109]:
a b
0 1.0 2.0
1 2.0 99.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 0.0 99.0
7 8.0 16.0
8 9.0 18.0
9 0.0 20.0
10 11.0 99.0
11 12.0 24.0

The keyword input argument method can alternatively be used to fill in the NaN values. It can be assigned to forward fill 'ffill' or back fill 'bfill', where each missing value takes the previous non-missing value going forward down the array, or the next non-missing value going backward up the array, respectively.

In [110]:
xlsx_file.fillna(method='ffill')
Out[110]:
a b
0 1.0 2.0
1 2.0 2.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 6.0 12.0
7 8.0 16.0
8 9.0 18.0
9 9.0 20.0
10 11.0 20.0
11 12.0 24.0
In [111]:
xlsx_file.fillna(method='bfill')
Out[111]:
a b
0 1.0 2.0
1 2.0 6.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 8.0 16.0
7 8.0 16.0
8 9.0 18.0
9 11.0 20.0
10 11.0 24.0
11 12.0 24.0

We can instead use the method interpolate to interpolate across each column, provided the values are ordered, and fill in the missing values from the interpolated function. Let's have a look at the input arguments using shift ⇧ and tab ↹. Once again there is a keyword input argument method which defaults to 'linear', meaning linear interpolation will be used.

na_values_dataframe_method_interpolate

Interpolation

Because the raw data was precisely linear, this performs a perfect interpolation as expected. We can update the keyword input argument method to be 'quadratic' or 'cubic' for more complicated data.

In [112]:
xlsx_file.interpolate()
Out[112]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
In [113]:
xlsx_file.interpolate(method='quadratic')
Out[113]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
In [114]:
xlsx_file.interpolate(method='cubic')
Out[114]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0

In all the cases above the output was just printed to the console. To update the data, we can reassign the output to the object name xlsx_file.

In [115]:
xlsx_file=xlsx_file.interpolate()
In [116]:
xlsx_file
Out[116]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
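Alternatively, many of these dataframe methods also accept an inplace=True keyword input argument, which modifies the dataframe directly and returns None (assuming a pandas version where inplace is supported). A minimal sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

df=pd.DataFrame({'a':[1.0,np.nan,3.0]})
df.interpolate(inplace=True)  # modifies df directly, returns None
print(df['a'].tolist())       # [1.0, 2.0, 3.0]
```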

Writing Data to a File

Let's create a few additional dataframes from xlsx_file.

In [117]:
df=xlsx_file
In [118]:
df
Out[118]:
a b
0 1.0 2.0
1 2.0 4.0
2 3.0 6.0
3 4.0 8.0
4 5.0 10.0
5 6.0 12.0
6 7.0 14.0
7 8.0 16.0
8 9.0 18.0
9 10.0 20.0
10 11.0 22.0
11 12.0 24.0
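Note that df=xlsx_file does not copy the data; both names refer to the same dataframe, so changes made through one name show up in the other. To create an independent copy, the dataframe method copy can be used. A minimal sketch with hypothetical data:

```python
import pandas as pd

df=pd.DataFrame({'a':[1.0,2.0]})
alias=df        # same underlying object
dfcopy=df.copy()  # independent copy of the data
alias['b']=[3.0,4.0]
print('b' in df.columns)      # True - df sees the change made via alias
print('b' in dfcopy.columns)  # False - the copy is unaffected
```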
In [119]:
df2=2*df
In [120]:
df2
Out[120]:
a b
0 2.0 4.0
1 4.0 8.0
2 6.0 12.0
3 8.0 16.0
4 10.0 20.0
5 12.0 24.0
6 14.0 28.0
7 16.0 32.0
8 18.0 36.0
9 20.0 40.0
10 22.0 44.0
11 24.0 48.0
In [121]:
df3=3*df
In [122]:
df3
Out[122]:
a b
0 3.0 6.0
1 6.0 12.0
2 9.0 18.0
3 12.0 24.0
4 15.0 30.0
5 18.0 36.0
6 21.0 42.0
7 24.0 48.0
8 27.0 54.0
9 30.0 60.0
10 33.0 66.0
11 36.0 72.0

We can use the dataframe methods to_csv and to_excel to save the data to a file.

pddataframe_method_to_excel

Writing a DataFrame to an Excel File

Let's look at writing to an Excel file first using the method to_excel. We can press ⇧ and tab ↹ to get details about the input arguments.

pddataframe_method_to_excel2

There are a large number of keyword input arguments. An ExcelWriter object needs to be created to write multiple sheets to a file. To write a single sheet, all we need to do is specify the name of the file. The sheet_name will use the default string 'Sheet1'.

In [123]:
df.to_excel('df_export_xlsx.xlsx')

The file can be found in the same folder as the script file.

df_export_xlsxImg

Note the xlsx file has an index and a header which were created automatically from the dataframe. Changing the keyword input arguments index and header to None will remove these.

In [124]:
df.to_excel('df_export_xlsx2.xlsx',header=None,index=None)

df_export_xlsx2Img

Additional blank rows or columns can be added to the Excel file by using the keyword input arguments startrow and startcol respectively. For example let's set 3 blank rows and 2 blank columns.

In [125]:
df.to_excel('df_export_xlsx3.xlsx',startrow=3,startcol=2)

df_export_xlsx3Img

Writing DataFrames to an Excel File with Multiple Sheets

Writing a multiple sheet file is slightly more complicated. Using the procedure above with different sheet names will simply overwrite the file each time as opposed to appending a sheet. Instead we must create a pandas Excel writer object. Recall that when instantiating a class we need to use its CamelCase class name.

pd_ExcelWriterClass1

The positional input argument path (which also works as a keyword input argument) must be assigned to a file name.

In [126]:
writer=pd.ExcelWriter(path='df_export_xlsx4_multisheet.xlsx')

Now we can use the dataframe method to_excel to write each dataframe to an Excel sheet. Instead of specifying the file name we specify the writer object.

In [127]:
df.to_excel(writer,sheet_name='df')
df2.to_excel(writer,sheet_name='df2')
df3.to_excel(writer,sheet_name='df3')

The Excel file will not be created until the writer's method save is used. We can access the writer's methods by typing in the writer name followed by a dot . and then pressing tab ↹.

pd_excelwriter_methods

To look at the input arguments we can use shift ⇧ and tab ↹. In this case we see there are none. So we need to call the method without specifying any.

pd_ExcelWriterClass2

In [128]:
writer.save()

We see that the multi-sheet Excel file is created.

df_export_xlsx4_multisheetImg
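Note that in more recent versions of pandas the save method has been deprecated (and later removed) in favour of close; the recommended pattern is to use ExcelWriter as a context manager, which saves and closes the file automatically. A minimal sketch, assuming dataframes like those above (the file name here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = 2 * df
df3 = 3 * df

# The with statement closes the writer, saving the file, when the block ends;
# no explicit call to save or close is required.
with pd.ExcelWriter('df_export_xlsx5_multisheet.xlsx') as writer:
    df.to_excel(writer, sheet_name='df')
    df2.to_excel(writer, sheet_name='df2')
    df3.to_excel(writer, sheet_name='df3')
```

This produces the same multi-sheet file as the writer/save pattern shown above.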

Writing a DataFrame to a CSV File

The to_csv dataframe method can be used to write to a csv or txt file; its input arguments can be viewed by pressing shift ⇧ and tab ↹. CSV files are always a single sheet so there is no writer option, just the positional input argument path. Some of the other keyword input arguments shown here are shared with the dataframe method to_excel, however most of the keyword input arguments relate to formatting and default to CSV file formatting.

pddataframe_method_to_csv

In [129]:
df.to_csv('df_export_csv.csv')

We can view this in Notepad++ to see that it is a CSV file as expected.

df_export_csvImg

Writing a DataFrame to a TXT File

The method to_csv can also be used to save a txt file.

In [130]:
df.to_csv('df_export_txt.txt')

The keyword arguments are set up for CSV files by default and so the text file has the wrong delimiter: a comma as opposed to a tab ↹.

df_export_txtImg

We can use the keyword input argument sep to specify a tab as a delimiter.

In [131]:
df.to_csv('df_export_txt2.txt',sep='\t')

df_export_txt2Img

Indexing

Let's create two basic dataframes, one with both columns labelled and one with both columns and indexes labelled.

In [132]:
from numpy import random
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)
In [133]:
df
Out[133]:
c0 c1 c2 c3 c4
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 10
3 1 6 7 7 8
4 1 5 9 8 9
5 4 3 0 3 5
6 0 2 3 8 1
7 3 3 3 7 0
8 1 9 9 0 10
9 4 7 3 2 7
10 2 0 0 4 5
11 5 6 8 4 1
In [134]:
df2
Out[134]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Indexing a Column

We have seen before that we can select a column using attribute indexing and square bracket indexing

In [135]:
df.c0
Out[135]:
0     5
1     9
2     7
3     1
4     1
5     4
6     0
7     3
8     1
9     4
10    2
11    5
Name: c0, dtype: int32
In [136]:
df['c0']
Out[136]:
0     5
1     9
2     7
3     1
4     1
5     4
6     0
7     3
8     1
9     4
10    2
11    5
Name: c0, dtype: int32
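Multiple columns can also be selected in one operation by indexing with a list of column names; note the double square brackets. Unlike single-column selection, the result is a dataframe rather than a series. A small sketch (the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'c0': [5, 9, 7], 'c1': [0, 3, 6], 'c2': [3, 5, 8]})

# A list of names inside the square brackets selects several columns at once.
subset = df[['c0', 'c2']]
print(type(subset))   # a DataFrame, not a Series
print(subset)
```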

We can then select a cell from the column by using square bracket indexing with an index

In [137]:
df['c0'][1]
Out[137]:
9
In [138]:
df.c0[1]
Out[138]:
9

For the dataframe where the indexes are labelled, attribute indexing can also be used

In [139]:
df2.c0.r1
Out[139]:
9
In [140]:
df2['c0'].r1
Out[140]:
9

Indexing a Row

Let's now look at selecting a row directly. We can use the attribute loc (location) or iloc (integer location). loc is designed for use with the index labels (string names, or the numeric index when no labels are set), while iloc always uses integer positions and works whether or not the index is labelled.

In [141]:
df2.loc
Out[141]:
<pandas.core.indexing._LocIndexer at 0x1e32660f9f0>

When these are used, we get an instance of an Indexer object. We can index into this object using the row names.

In [142]:
df2.loc['r1']
Out[142]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: r1, dtype: int32

Note that the output is an instance of the Series class and is hence displayed as a column as opposed to a row. This can be seen if we assign the object to a variable name.

In [143]:
row1=df2.loc['r1']
In [144]:
type(row1)
Out[144]:
pandas.core.series.Series

Therefore we can access the series attributes and methods by typing in the object name followed by a dot . and then pressing tab ↹.

row1-pdseries-attributes

Because this is a series, we can index into it using the series index names (which in this case are the original dataframe's column names).

In [145]:
row1.c0
Out[145]:
9
In [146]:
df2.loc['r1'].c0
Out[146]:
9
In [147]:
df2.loc['r1']['c0']
Out[147]:
9

For df there are no index strings, so loc will work with the numeric indexes.

In [148]:
df.loc[1]
Out[148]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: 1, dtype: int32

Integer location iloc will work with both df and df2 using integer positions.

In [149]:
df.iloc[1]
Out[149]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: 1, dtype: int32
In [150]:
df2.iloc[1]
Out[150]:
c0    9
c1    3
c2    5
c3    2
c4    4
Name: r1, dtype: int32

Integer location iloc can be used to select multiple rows by using slices (similar to slicing a list). The output will be a dataframe.

In [151]:
df.iloc[3:5]
Out[151]:
c0 c1 c2 c3 c4
3 1 6 7 7 8
4 1 5 9 8 9

The slice 3:5 "3 to 5" starts at 3 and goes up to but excluding 5 in steps of 1 and therefore displays indexes 3 and 4.

Because the output is a dataframe a column can be indexed from it.

In [152]:
df.iloc[3:5].c0
Out[152]:
3    1
4    1
Name: c0, dtype: int32
In [153]:
df.iloc[3:5]['c0']
Out[153]:
3    1
4    1
Name: c0, dtype: int32
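Note that slicing with loc behaves differently: a label-based slice includes both endpoints, so loc['r3':'r5'] returns three rows where iloc[3:5] returns two. A small sketch:

```python
import pandas as pd

df2 = pd.DataFrame({'c0': range(6)},
                   index=['r0', 'r1', 'r2', 'r3', 'r4', 'r5'])

# iloc slices exclude the stop position, like slicing a list...
print(len(df2.iloc[3:5]))        # 2 rows: positions 3 and 4
# ...but loc slices on labels include both endpoints.
print(len(df2.loc['r3':'r5']))   # 3 rows: r3, r4 and r5
```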

Both loc and iloc can select a cell, or a selection of cells, by specifying the row and column within a single pair of square brackets.

In [154]:
df2.loc['r1','c0']
Out[154]:
9
In [155]:
df2.iloc[1,0]
Out[155]:
9

Indexing a Cell

There are also the attributes at and integer at iat, which can be used to directly select a single cell; these are the preferred attributes to use when updating the value of a cell.

In [156]:
df2.at['r1','c0']
Out[156]:
9
In [157]:
df2
Out[157]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1
In [158]:
df2.at['r1','c0']=11
In [159]:
df2
Out[159]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 11 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1
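The integer-based counterpart iat works in the same way but uses integer positions rather than labels. A small sketch on a made-up dataframe:

```python
import pandas as pd

df2 = pd.DataFrame({'c0': [5, 9], 'c1': [0, 3]}, index=['r0', 'r1'])

# iat addresses a single cell by (row position, column position).
print(df2.iat[1, 0])    # 9
df2.iat[1, 0] = 11      # fast scalar update, like at but positional
print(df2.iat[1, 0])    # 11
```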

Adding a New Column or Row

Adding a New Column

Let's have a look at the dataframe

In [160]:
df2
Out[160]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 11 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

We can create a new column from a series or array whose length matches the number of rows in the dataframe, by assigning to a new column name with square brackets.

In [161]:
df2['c5']=random.randint(low=0,high=11,size=12)
In [162]:
df2
Out[162]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6

Once the column has been created, it can be accessed by using the column name as an attribute.

In [163]:
df2.c5
Out[163]:
r0      4
r1      9
r2     10
r3     10
r4      8
r5      1
r6      1
r7      7
r8      9
r9      9
r10     3
r11     6
Name: c5, dtype: int32

Note that this attribute is created as a result of the column being created; we can only access attributes that already exist. If we attempt to assign to an attribute that doesn't already exist we get an error. This means that attribute dot notation cannot be used to create a new column and can only be used to access a column once it exists.

df-columns-cannot-be-created-from-a-new-attribute

Adding a New Row

A new row can be added by use of the attributes loc or iloc.

In [164]:
df2.loc['r12']=random.randint(low=0,high=11,size=6)
In [165]:
df2
Out[165]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6
r12 7 2 0 3 5 9

The Append Method

Alternatively the append method may be used to append a row to a dataframe. The append method can append a pandas series or dictionary to the dataframe. To get details about the input arguments press shift ⇧ and tab ↹.

pddataframe_method_append

The positional input argument other is the row to be appended; it can be either a dictionary or a pandas series. When the keyword input argument ignore_index is set to True, all the index labels are ignored and replaced with a numeric index. The keys in the dictionary must correspond to the column names and the values correspond to the values to be added for the new row.

In [166]:
df2
Out[166]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6
r12 7 2 0 3 5 9
In [167]:
new_row={'c0':9,'c1':9,'c2':9,'c3':9,'c4':9,'c5':9}
In [168]:
df2.append(new_row,ignore_index=True)
Out[168]:
c0 c1 c2 c3 c4 c5
0 5 0 3 3 7 4
1 11 3 5 2 4 9
2 7 6 8 8 10 10
3 1 6 7 7 8 10
4 1 5 9 8 9 8
5 4 3 0 3 5 1
6 0 2 3 8 1 1
7 3 3 3 7 0 7
8 1 9 9 0 10 9
9 4 7 3 2 7 9
10 2 0 0 4 5 3
11 5 6 8 4 1 6
12 7 2 0 3 5 9
13 9 9 9 9 9 9

When ignore_index is set to False only a series with a name can be appended. All the previous names in the dataframe will be unaltered.

In [169]:
new_series=pd.Series(data={'c0':9,'c1':9,'c2':9,'c3':9,'c4':9,'c5':9},name='r13')
In [170]:
new_series
Out[170]:
c0    9
c1    9
c2    9
c3    9
c4    9
c5    9
Name: r13, dtype: int64

Although the series displays as a column, it will be appended as a row.

In [171]:
df2.append(new_series)
Out[171]:
c0 c1 c2 c3 c4 c5
r0 5 0 3 3 7 4
r1 11 3 5 2 4 9
r2 7 6 8 8 10 10
r3 1 6 7 7 8 10
r4 1 5 9 8 9 8
r5 4 3 0 3 5 1
r6 0 2 3 8 1 1
r7 3 3 3 7 0 7
r8 1 9 9 0 10 9
r9 4 7 3 2 7 9
r10 2 0 0 4 5 3
r11 5 6 8 4 1 6
r12 7 2 0 3 5 9
r13 9 9 9 9 9 9
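Note that the dataframe method append was deprecated in pandas 1.4 and removed in pandas 2.0. In current versions the same result is obtained with the function pd.concat, which stacks a list of dataframes vertically. A sketch of the equivalent operation on a small made-up dataframe:

```python
import pandas as pd

df2 = pd.DataFrame({'c0': [5, 9], 'c1': [0, 3]}, index=['r0', 'r1'])

# A one-row dataframe plays the role of the old series/dictionary argument;
# its index label becomes the new row name.
new_row = pd.DataFrame({'c0': [9], 'c1': [9]}, index=['r13'])

# pd.concat stacks the frames vertically, keeping the row labels.
df3 = pd.concat([df2, new_row])
print(df3)
```

Passing ignore_index=True to pd.concat replaces the labels with a numeric index, matching the append behaviour shown earlier.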

Deleting a Column or a Row

The dataframe method drop can be used to delete a column or row. This method can be used with the labels keyword input argument and the axis keyword input argument, where labels is the name of the row or column and axis works like many of the numpy functions: axis=0 acts on rows and axis=1 acts on columns. If used this way both labels and axis need to match. Alternatively it is simpler to use the index and columns keyword input arguments as they are a bit more intuitive.

pddataframe_method_drop

Deleting a Column

To delete a column we can use the keyword input argument labels to specify a list of column name strings or a single column name string. We also need to use axis=1 to work on columns.

In [172]:
df.drop(labels=['c0','c1'],axis=1)
Out[172]:
c2 c3 c4
0 3 3 7
1 5 2 4
2 8 8 10
3 7 7 8
4 9 8 9
5 0 3 5
6 3 8 1
7 3 7 0
8 9 0 10
9 3 2 7
10 0 4 5
11 8 4 1

Deleting a Row

To delete a row we can use the keyword input argument labels to specify a list of row name strings (if labelled) or integer indexes (if unlabelled), or a single row name string or index. We also need to use axis=0 to work on rows. The index of df is unlabelled so the integer index 10 is used.

In [173]:
df.drop(labels=10,axis=0)
Out[173]:
c0 c1 c2 c3 c4
0 5 0 3 3 7
1 11 3 5 2 4
2 7 6 8 8 10
3 1 6 7 7 8
4 1 5 9 8 9
5 4 3 0 3 5
6 0 2 3 8 1
7 3 3 3 7 0
8 1 9 9 0 10
9 4 7 3 2 7
11 5 6 8 4 1

Deleting Columns and Rows

The above syntax can delete only columns or only rows in a single call. The keyword input arguments index and columns are a bit more intuitive and can be used to delete both simultaneously.

In [174]:
df.drop(index=10,columns=['c0','c2'])
Out[174]:
c1 c3 c4
0 0 3 7
1 3 2 4
2 6 8 10
3 6 7 8
4 5 8 9
5 3 3 5
6 2 8 1
7 3 7 0
8 9 0 10
9 7 2 7
11 6 4 1

So far the output has only been printed to the console and the dataframe itself has not been updated. To perform an inplace update we can use the keyword input argument inplace and assign it to True.

In [175]:
df2.drop(index='r10',columns=['c0','c2'],inplace=True)
In [176]:
df2
Out[176]:
c1 c3 c4 c5
r0 0 3 7 4
r1 3 2 4 9
r2 6 8 10 10
r3 6 7 8 10
r4 5 8 9 8
r5 3 3 5 1
r6 2 8 1 1
r7 3 7 0 7
r8 9 0 10 9
r9 7 2 7 9
r11 6 4 1 6
r12 2 3 5 9

Alternatively we can reassign the output to the dataframe's name.

In [177]:
df2=df2.drop(index=['r1','r9'],columns=['c1','c3'])
In [178]:
df2
Out[178]:
c4 c5
r0 7 4
r2 10 10
r3 8 10
r4 9 8
r5 5 1
r6 1 1
r7 0 7
r8 10 9
r11 1 6
r12 5 9

When the keyword input argument inplace is set to True, an inplace update to the dataframe will occur and the method will therefore have no return value, i.e. nothing is printed to the cell. Care should be taken to avoid using both the keyword input argument inplace and reassignment of the object name as it will lead to a NoneType instead of an updated dataframe.

In [179]:
df2=df2.drop(index=['r3','r4'],inplace=True)
In [180]:
df2

We can see that we get nothing when we ask for the output of df2. If we check its type we can see it is NoneType.

In [181]:
type(df2)
Out[181]:
NoneType

Reordering Columns or Rows

Let's first restore our dataframes.

In [182]:
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)
In [183]:
df2
Out[183]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Reordering Columns

We can reorder the columns in a dataframe by indexing into it with a reordered list of column strings. For example, if we want c0 to be at the end we would use:

In [184]:
newcolorder=['c1','c2','c3','c4','c0']
In [185]:
df2[newcolorder]
Out[185]:
c1 c2 c3 c4 c0
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

To update df2, we can reassign the object name df2 to the output.

In [186]:
df2=df2[newcolorder]
In [187]:
df2
Out[187]:
c1 c2 c3 c4 c0
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

Reordering Rows

To reorder rows, we can use the attribute loc or iloc.

In [188]:
newroworder=['r1','r2','r3','r4','r5','r6','r7','r8','r9','r10','r11','r0']
In [189]:
df2.loc[newroworder]
Out[189]:
c1 c2 c3 c4 c0
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5
r0 0 3 3 7 5

Note we can also use this to drop columns or rows by instead only making a selection.

In [190]:
newrowselorder=['r1','r2','r3','r4','r0']
In [191]:
df2.loc[newrowselorder]
Out[191]:
c1 c2 c3 c4 c0
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r0 0 3 3 7 5

Relabelling Columns or Rows

Relabelling Columns

Column names can be accessed by use of the columns attribute.

In [192]:
df2.columns
Out[192]:
Index(['c1', 'c2', 'c3', 'c4', 'c0'], dtype='object')

The columns can be renamed by reassigning the attribute to a new list, or index, of equal length to the number of existing columns.

In [193]:
new_cols=['c_1','c_2','c_3','c_4','c_0']
In [194]:
df2.columns=new_cols

This updates the column names.

In [195]:
df2
Out[195]:
c_1 c_2 c_3 c_4 c_0
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

The above requires a list of all the columns. It is possible to rename some columns without specifying them all using the dataframe method rename. To get more details about rename's input arguments, press shift ⇧ and tab ↹.

pddataframe_method_rename

We can use the keyword input argument columns and assign it to a dictionary where the dictionary keys are the old column names and the dictionary values are the new column names. In this case we wish to rename 'c_0' to 'y'.

In [196]:
col_rename={'c_0':'y'}
df2.rename(columns=col_rename)
Out[196]:
c_1 c_2 c_3 c_4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

To apply the changes we can either reassign the output to df2 or alternatively set the keyword input argument inplace to True. Recall that using inplace performs an inplace update of the dataframe and returns nothing; therefore setting inplace to True while also reassigning the output to the dataframe name will lead to a NoneType object as opposed to the updated dataframe.

In [197]:
df2.rename(columns=col_rename,inplace=True)

We can also use the str attribute of the dataframe attribute columns to access a number of string methods that operate on the column names.

In [198]:
df2.columns.str
Out[198]:
<pandas.core.strings.accessor.StringMethods at 0x1e3265c8ac0>

We can type in this attribute followed by a dot . and then access the list of string attributes and methods by pressing tab ↹.

pddataframe_columns_attribute_str_attributes

We can use the str method capitalize to capitalize the first letter of each column name. To get details about the input arguments press shift ⇧ and tab ↹.

pddataframe_columns_attribute_str_attributes_capitalize_method

In this case the method capitalize has no input arguments.

In [199]:
df2.columns.str.capitalize()
Out[199]:
Index(['C_1', 'C_2', 'C_3', 'C_4', 'Y'], dtype='object')

The output displays in the cell. To update the columns the columns attribute of the dataframe can be reassigned.

In [200]:
df2.columns=df2.columns.str.capitalize()

The dataframe now has capitalized column names.

In [201]:
df2
Out[201]:
C_1 C_2 C_3 C_4 Y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

The str methods upper and lower will set every letter in the column name to be upper and lower case respectively.

In [202]:
df2.columns.str.upper()
Out[202]:
Index(['C_1', 'C_2', 'C_3', 'C_4', 'Y'], dtype='object')
In [203]:
df2.columns.str.lower()
Out[203]:
Index(['c_1', 'c_2', 'c_3', 'c_4', 'y'], dtype='object')

Reassigning the dataframe attribute to the method's output.

In [204]:
df2.columns=df2.columns.str.lower()

Looking at the output of the dataframe

In [205]:
df2
Out[205]:
c_1 c_2 c_3 c_4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5

Sometimes the str method replace is useful if column names are read from a file and contain spaces or underscores that we wish to remove. To get details about the input arguments press shift ⇧ and tab ↹.

pddataframe_columns_attribute_str_method_replace

The positional input arguments pat and repl are abbreviations for pattern and replacement. These can also be explicitly expressed as keyword input arguments. We can look at replacing c_ with c.

In [206]:
df2.columns.str.replace(pat='c_',repl='c')
Out[206]:
Index(['c1', 'c2', 'c3', 'c4', 'y'], dtype='object')


To apply the changes we can assign the output to the dataframe columns attribute.

In [207]:
df2.columns=df2.columns.str.replace(pat='c_',repl='c')

Let's see the updated dataframe.

In [208]:
df2
Out[208]:
c1 c2 c3 c4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
r11 6 8 4 1 5
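When column names are read from a file they often need several fixes at once. Because each str method returns a new Index, the calls can be chained. A sketch of a common cleanup (the messy names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=[' First Name ', 'Last Name'])

# Strip surrounding whitespace, lower the case and replace spaces
# with underscores so the names work with attribute dot notation.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print(df.columns)   # Index(['first_name', 'last_name'], dtype='object')
```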

Relabelling Rows

The dataframe row names or index can be accessed using the attribute index.

In [209]:
df2.index
Out[209]:
Index(['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10',
       'r11'],
      dtype='object')

As both the attributes columns and index are instances of the Index class, the same attributes and methods can be accessed from them.

In [210]:
type(df2.index)
Out[210]:
pandas.core.indexes.base.Index
In [211]:
type(df2.columns)
Out[211]:
pandas.core.indexes.base.Index

This can be reassigned to a new list of string names of equal length as seen earlier with columns.

In [212]:
new_rows=['r_'+str(r) for r in range(12)]
new_rows
Out[212]:
['r_0',
 'r_1',
 'r_2',
 'r_3',
 'r_4',
 'r_5',
 'r_6',
 'r_7',
 'r_8',
 'r_9',
 'r_10',
 'r_11']
In [213]:
df2.index=new_rows

Viewing the dataframe, we can see the changes have been made.

In [214]:
df2
Out[214]:
c1 c2 c3 c4 y
r_0 0 3 3 7 5
r_1 3 5 2 4 9
r_2 6 8 8 10 7
r_3 6 7 7 8 1
r_4 5 9 8 9 1
r_5 3 0 3 5 4
r_6 2 3 8 1 0
r_7 3 3 7 0 3
r_8 9 9 0 10 1
r_9 7 3 2 7 4
r_10 0 0 4 5 2
r_11 6 8 4 1 5

We can also rename select row names using a dictionary and the dataframe method rename with the keyword input argument index.

In [215]:
row_rename={'r_11':'x'}
df2=df2.rename(index=row_rename)

Viewing the dataframe, once again we can see that the changes have been made.

In [216]:
df2
Out[216]:
c1 c2 c3 c4 y
r_0 0 3 3 7 5
r_1 3 5 2 4 9
r_2 6 8 8 10 7
r_3 6 7 7 8 1
r_4 5 9 8 9 1
r_5 3 0 3 5 4
r_6 2 3 8 1 0
r_7 3 3 7 0 3
r_8 9 9 0 10 1
r_9 7 3 2 7 4
r_10 0 0 4 5 2
x 6 8 4 1 5

str methods such as replace can once again be accessed from instances of the Index class, allowing one to replace a repeating pattern.

In [217]:
df2.index=df2.index.str.replace(pat='r_',repl='r')
In [218]:
df2
Out[218]:
c1 c2 c3 c4 y
r0 0 3 3 7 5
r1 3 5 2 4 9
r2 6 8 8 10 7
r3 6 7 7 8 1
r4 5 9 8 9 1
r5 3 0 3 5 4
r6 2 3 8 1 0
r7 3 3 7 0 3
r8 9 9 0 10 1
r9 7 3 2 7 4
r10 0 0 4 5 2
x 6 8 4 1 5

Sorting Data

Let's recreate our dataframes and have a look at how we can sort data.

In [219]:
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)
In [220]:
df2
Out[220]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

We can use the dataframe method sort_values to sort a dataframe. To get details about the input arguments we can press shift ⇧ and tab ↹.

pddataframe_method_sort_values

The positional input argument by (which also acts as a keyword input argument) can be a list of column name or row name strings. The keyword input argument axis can either be 0 (sorting the rows within a column) or 1 (sorting the columns within a row). Both by and axis must match. The keyword input argument ascending has a default value of True, meaning the data will be sorted from lowest to highest; if assigned to False it will instead sort in descending order. The keyword input argument inplace can be set to True to apply inplace changes.

Sorting by Rows within a Column

axis has a default value of 0 meaning the method sort_values will default to sorting all the rows within a column. The code below will sort all the rows within column c0.

In [221]:
df2.sort_values(by=['c0'])
Out[221]:
c0 c1 c2 c3 c4
r6 0 2 3 8 1
r3 1 6 7 7 8
r4 1 5 9 8 9
r8 1 9 9 0 10
r10 2 0 0 4 5
r7 3 3 3 7 0
r5 4 3 0 3 5
r9 4 7 3 2 7
r0 5 0 3 3 7
r11 5 6 8 4 1
r2 7 6 8 8 10
r1 9 3 5 2 4

We see that c0 has some duplicate values (r3, r4 and r8 all contain 1); these have been sorted according to their original row order. We can instead break ties using column c1, and if there are duplicate values in both c0 and c1 we can then sort by c2. To do this we specify a list containing the 0th, 1st and 2nd columns to sort the values by.

In [222]:
df2.sort_values(by=['c0','c1','c2'])
Out[222]:
c0 c1 c2 c3 c4
r6 0 2 3 8 1
r4 1 5 9 8 9
r3 1 6 7 7 8
r8 1 9 9 0 10
r10 2 0 0 4 5
r7 3 3 3 7 0
r5 4 3 0 3 5
r9 4 7 3 2 7
r0 5 0 3 3 7
r11 5 6 8 4 1
r2 7 6 8 8 10
r1 9 3 5 2 4

Alternatively if we wanted to sort in descending order we can set the keyword argument ascending to be False.

In [223]:
df2.sort_values(by=['c0','c1','c2'],ascending =False)
Out[223]:
c0 c1 c2 c3 c4
r1 9 3 5 2 4
r2 7 6 8 8 10
r11 5 6 8 4 1
r0 5 0 3 3 7
r9 4 7 3 2 7
r5 4 3 0 3 5
r7 3 3 3 7 0
r10 2 0 0 4 5
r8 1 9 9 0 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r6 0 2 3 8 1

To apply the changes we can either reassign the output to df2 or set the inplace keyword input argument to True. Recall that when the keyword input argument inplace is True the method will not return a value and therefore if the output is reassigned to the variable name and the keyword input argument inplace is assigned to True, NoneType will be assigned to the object name.

In [224]:
df2.sort_values(by=['c0','c1','c2'],ascending =False,inplace=True)

We can see the changes are made

In [225]:
df2
Out[225]:
c0 c1 c2 c3 c4
r1 9 3 5 2 4
r2 7 6 8 8 10
r11 5 6 8 4 1
r0 5 0 3 3 7
r9 4 7 3 2 7
r5 4 3 0 3 5
r7 3 3 3 7 0
r10 2 0 0 4 5
r8 1 9 9 0 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r6 0 2 3 8 1

The dataframe method sort_index can be used to sort the rows by the index name. By default the keyword input argument axis is set to 0 meaning it acts on the row names (index). The keyword input arguments ascending and inplace are also present in this method.

pddataframe_method_sort_index

Note that the strings in the index are sorted alphabetically ('r0', 'r1', 'r10', 'r11', 'r2', ...) and not naturally ('r0', 'r1', 'r2', ..., 'r11').

In [226]:
df2.sort_index()
Out[226]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r10 2 0 0 4 5
r11 5 6 8 4 1
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7

Natural sorting is unfortunately not natively supported within pandas but can be carried out using the third-party library natsort.

To use the third-party library natsort it must be installed. Open the Anaconda Powershell prompt and type in.

conda install natsort

Let's look at the dataframe.

In [227]:
df2
Out[227]:
c0 c1 c2 c3 c4
r1 9 3 5 2 4
r2 7 6 8 8 10
r11 5 6 8 4 1
r0 5 0 3 3 7
r9 4 7 3 2 7
r5 4 3 0 3 5
r7 3 3 3 7 0
r10 2 0 0 4 5
r8 1 9 9 0 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r6 0 2 3 8 1

We can import natsorted which we can use to naturally reorder the row names in the index.

In [228]:
from natsort import natsorted
natsorted(df2.index)
Out[228]:
['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10', 'r11']

Once we have them we can use them to select rows from the dataframe using the loc attribute. This will give the dataframe in natural order.

In [229]:
df2.loc[natsorted(df2.index)]
Out[229]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Recall that an index behaves like a dictionary, where the keys are numeric positions and the row names are the corresponding values.

In [230]:
df2.index.values
Out[230]:
array(['r1', 'r2', 'r11', 'r0', 'r9', 'r5', 'r7', 'r10', 'r8', 'r3', 'r4',
       'r6'], dtype=object)

The key 3 for instance yields the str value 'r0'.

In [231]:
df2.index[3]
Out[231]:
'r0'

We can use index_natsorted to get the ordered numeric keys corresponding to each of the row name values.

In [232]:
from natsort import index_natsorted
index_natsorted(df2.index)
Out[232]:
[3, 0, 1, 9, 10, 5, 11, 6, 8, 4, 7, 2]

Once we have them we can use them to select rows from the dataframe using the iloc attribute. This will give the dataframe in natural order.

In [233]:
df2.iloc[index_natsorted(df2.index)]
Out[233]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

Going further we can use the numpy function argsort.

In [234]:
np.argsort(index_natsorted(df2.index))
Out[234]:
array([ 1,  2, 11,  0,  9,  5,  7, 10,  8,  3,  4,  6], dtype=int64)

Comparing directly with the dataframe attribute index, we can see this returns, for each row name, its position in the naturally sorted order as a numeric int value.

In [235]:
df2.index
Out[235]:
Index(['r1', 'r2', 'r11', 'r0', 'r9', 'r5', 'r7', 'r10', 'r8', 'r3', 'r4',
       'r6'],
      dtype='object')

Returning to the dataframe method sort_index, we can use the keyword input argument key and assign it to a lambda function with a single dummy argument x and the expression f.

In [236]:
f=np.argsort(index_natsorted(df2.index))
df2.sort_index(key= lambda x: f)
Out[236]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

In the line above no matter what is supplied for the dummy argument x, the same output displays. i.e. x is not placed within the expression and the expression is therefore independent of the value of x.

In [237]:
la1=lambda x: f
la1(1)
Out[237]:
array([ 1,  2, 11,  0,  9,  5,  7, 10,  8,  3,  4,  6], dtype=int64)
In [238]:
la1('a')
Out[238]:
array([ 1,  2, 11,  0,  9,  5,  7, 10,  8,  3,  4,  6], dtype=int64)

The dataframe method sort_index will then use these numeric keys as opposed to the str values of the index to sort the dataframe.

To apply the changes inplace, the keyword argument inplace can be set to True.

In [239]:
df2.sort_index(key= lambda x: f,inplace=True)
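If natsort is unavailable, the same natural ordering can often be obtained with a key function that actually uses its argument, converting the numeric part of each label to an int. This sketch assumes every label is 'r' followed by digits:

```python
import pandas as pd

df2 = pd.DataFrame({'c0': [1, 2, 3]}, index=['r1', 'r10', 'r2'])

# The key function receives the Index and must return a sortable Index;
# stripping the leading 'r' and casting to int gives a natural order.
df_natural = df2.sort_index(key=lambda idx: idx.str[1:].astype(int))
print(df_natural)
```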

Sorting by Columns within a Row

Let's have a look at the dataframe.

In [240]:
df2
Out[240]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

axis has a default value of 0 meaning the dataframe method sort_values will default to sorting all the rows within a column. To sort all the columns within a row, axis should be set to 1.

In [241]:
df2.sort_values(by=['r0','r1','r2'],axis=1)
Out[241]:
c1 c3 c2 c0 c4
r0 0 3 3 5 7
r1 3 2 5 9 4
r2 6 8 8 7 10
r3 6 7 7 1 8
r4 5 8 9 1 9
r5 3 3 0 4 5
r6 2 8 3 0 1
r7 3 7 3 3 0
r8 9 0 9 1 10
r9 7 2 3 4 7
r10 0 4 0 2 5
r11 6 4 8 5 1

The dataframe method sort_index can also be used with the keyword input argument axis set to 1 to sort the dataframe using the column names.

In [242]:
df2.sort_index(axis=1)
Out[242]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9
r5 4 3 0 3 5
r6 0 2 3 8 1
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

DataTypes

Let's recreate our dataframe.

In [243]:
random.seed(0)
matrix=random.randint(low=0,high=11,size=(12,5))
cnames=['c'+str(i) for i in range(5)]
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=matrix,columns=cnames)
df2=pd.DataFrame(data=matrix,columns=cnames,index=rnames)

The dataframe method head can be used to preview the top n rows in a dataframe. To get details about the input arguments press shift ⇧ and tab ↹. The dataframe method head has a keyword input argument n which has a default value of 5.

df_method_head

We see the top 5 rows (0 to 5 in steps of 1 inclusive of the lower bound and exclusive of the upper bound).

In [244]:
df2.head()
Out[244]:
c0 c1 c2 c3 c4
r0 5 0 3 3 7
r1 9 3 5 2 4
r2 7 6 8 8 10
r3 1 6 7 7 8
r4 1 5 9 8 9

Alternatively we can use the dataframe method tail to see the last n rows in a dataframe.

In [245]:
df2.tail()
Out[245]:
c0 c1 c2 c3 c4
r7 3 3 3 7 0
r8 1 9 9 0 10
r9 4 7 3 2 7
r10 2 0 0 4 5
r11 5 6 8 4 1

We can use the dataframe method describe to get statistical details about each column in a dataframe. To get details about the input arguments press shift ⇧ and tab ↹. We will leave them all as default.

df_method_describe

In [246]:
df.describe()
Out[246]:
c0 c1 c2 c3 c4
count 12.000000 12.000000 12.000000 12.000000 12.000000
mean 3.500000 4.166667 4.833333 4.666667 5.583333
std 2.713602 2.790677 3.298301 2.806918 3.528026
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 2.750000 3.000000 2.750000 3.250000
50% 3.500000 4.000000 4.000000 4.000000 6.000000
75% 5.000000 6.000000 8.000000 7.250000 8.250000
max 9.000000 9.000000 9.000000 8.000000 10.000000

In a dataframe each column (which is an instance of the pandas Series class) has an assigned datatype. This can be checked by using the dataframe attribute dtypes.

In [247]:
df2.dtypes
Out[247]:
c0    int32
c1    int32
c2    int32
c3    int32
c4    int32
dtype: object

We see that all the columns are int32 which is expected as the columns have automatically been created using the randint function.

Be careful when working with different datatypes as you may get unexpected results. For example, let's attempt to reassign the cell 'r0', 'c1' (in the integer series 'c1') to the float 21.2.

In [248]:
df2.at['r0','c1']=21.2

We get the int 21 when we read back the cell. This is because the pandas series 'c1' uses int32, and therefore when a value in this series is assigned a float, anything following the decimal point is truncated.

In [249]:
df2.at['r0','c1']
Out[249]:
21

We can look at a series directly and use the series attribute dtype to return the dtype of a single column.

In [250]:
df2.c1.dtype
Out[250]:
dtype('int32')

We can then use the method astype to change the dtype. To do so we can use the positional input argument (which also acts as a keyword input argument) dtype to select the dtype. We can assign it to the str 'float', 'bool' or 'str'. Alternatively, as all of these (including 'int') are classes, we can assign the class directly, so we can use int, float, bool or str (without quotations).
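A quick sketch with a hypothetical series, showing that the dtype can be supplied either as a str or as the class itself:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
# the dtype can be given as a str...
print(s.astype('float').tolist())  # [1.0, 2.0, 3.0]
# ...or as the class directly
print(s.astype(float).tolist())    # [1.0, 2.0, 3.0]
```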

pdseries_method_dtype

Note a decimal point followed by a 0 will display to emphasise it is a float.

In [251]:
df2.c1.astype(dtype='float')
Out[251]:
r0     21.0
r1      3.0
r2      6.0
r3      6.0
r4      5.0
r5      3.0
r6      2.0
r7      3.0
r8      9.0
r9      7.0
r10     0.0
r11     6.0
Name: c1, dtype: float64

When converting to a bool, any non-zero value becomes True and any 0 value becomes False.
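A quick sketch with a hypothetical series containing a zero and a negative value:

```python
import pandas as pd

s = pd.Series([0, 1, -2, 5])
# any non-zero value (including negatives) becomes True; 0 becomes False
print(s.astype(bool).tolist())  # [False, True, True, True]
```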

In [252]:
df2.c1.astype(dtype='bool')
Out[252]:
r0      True
r1      True
r2      True
r3      True
r4      True
r5      True
r6      True
r7      True
r8      True
r9      True
r10    False
r11     True
Name: c1, dtype: bool

When converting to a str there is no immediate difference in the output displayed in the console. The dtype however will be object.

In [253]:
df2.c1.astype(dtype='str')
Out[253]:
r0     21
r1      3
r2      6
r3      6
r4      5
r5      3
r6      2
r7      3
r8      9
r9      7
r10     0
r11     6
Name: c1, dtype: object

To perform an inplace update, we need to reassign the output to the pandas series. For this we can use attribute indexing or square bracket indexing to reference an existing series.

In [254]:
df2.c1=df2.c1.astype(dtype='float')

Recall however if we want to add a new series to the dataframe we have to use square bracket indexing.

In [255]:
df2['c5']=df2.c1.astype(dtype='float')

Let's return the original series back to an int datatype.

In [256]:
df2.c1=df2.c1.astype(dtype='int')

Let's now have a look at the dataframe.

In [257]:
df2
Out[257]:
c0 c1 c2 c3 c4 c5
r0 5 21 3 3 7 21.0
r1 9 3 5 2 4 3.0
r2 7 6 8 8 10 6.0
r3 1 6 7 7 8 6.0
r4 1 5 9 8 9 5.0
r5 4 3 0 3 5 3.0
r6 0 2 3 8 1 2.0
r7 3 3 3 7 0 3.0
r8 1 9 9 0 10 9.0
r9 4 7 3 2 7 7.0
r10 2 0 0 4 5 0.0
r11 5 6 8 4 1 6.0

If we now update the cells.

In [258]:
df2.at['r0','c1']=21.2
df2.at['r0','c5']=21.2

We see that the column c1 is an int and c5 is a float.

In [259]:
df2
Out[259]:
c0 c1 c2 c3 c4 c5
r0 5 21 3 3 7 21.2
r1 9 3 5 2 4 3.0
r2 7 6 8 8 10 6.0
r3 1 6 7 7 8 6.0
r4 1 5 9 8 9 5.0
r5 4 3 0 3 5 3.0
r6 0 2 3 8 1 2.0
r7 3 3 3 7 0 3.0
r8 1 9 9 0 10 9.0
r9 4 7 3 2 7 7.0
r10 2 0 0 4 5 0.0
r11 5 6 8 4 1 6.0

A row can be selected from the dataframe (as a pandas series) using the attribute loc. The dtype of this pandas series will be the most general datatype of the row's values. In this case, since one of the elements in the row is a float, the datatype of the pandas series will be float.
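This upcasting can be seen with a minimal hypothetical dataframe mixing an int and a float column:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.5]})
row = df.loc[0]   # the int value is upcast to the common float64 dtype
print(row.dtype)  # float64
```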

In [260]:
df2.loc['r0']
Out[260]:
c0     5.0
c1    21.0
c2     3.0
c3     3.0
c4     7.0
c5    21.2
Name: r0, dtype: float64

If an additional pandas series of the datatype str is added to the dataframe.

In [261]:
df2['c5']=df2.c1.astype(dtype='str')

The row (which is taken as a pandas series) will take the most general datatype, which is the str, and therefore the datatype will display as object.

In [262]:
df2.loc['r0']
Out[262]:
c0     5
c1    21
c2     3
c3     3
c4     7
c5    21
Name: r0, dtype: object

Additional datatypes are available such as the category datatype for working with categorical data and the datetime or timedelta datatypes when working with time or time differences respectively.

Category DataType

Let's create a dataframe emulating users selecting and throwing a die from a box of randomly colored dice. Each die has 6 numbers and 3 colors are available.

In [263]:
random.seed(0)
dice=random.choice((1,2,3,4,5,6),size=12)
color=random.choice(('red','green','blue'),size=12)
rnames=['r'+str(i) for i in range(12)]
df=pd.DataFrame(data=dice,columns=['dice'],
                 index=rnames)
df['color']=color
In [264]:
df
Out[264]:
dice color
r0 5 red
r1 6 red
r2 1 blue
r3 4 green
r4 4 blue
r5 4 blue
r6 2 red
r7 4 green
r8 6 green
r9 3 green
r10 5 green
r11 1 red

Let's use the dataframe attribute dtypes to look at the datatypes of each column. The datatypes are int and object (str).

In [265]:
df.dtypes
Out[265]:
dice      int32
color    object
dtype: object

Setting the DataType to Category

To use the category datatype we need to assign the keyword input argument dtype to the str 'category'.
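As a minimal standalone sketch with a hypothetical series of colors:

```python
import pandas as pd

s = pd.Series(['red', 'green', 'red', 'blue'])
c = s.astype(dtype='category')
print(c.dtype)                 # category
print(list(c.cat.categories))  # ['blue', 'green', 'red']
```

Note the categories are the unique values, alphabetically ordered by default.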

In [266]:
df.dice.astype(dtype='category')
Out[266]:
r0     5
r1     6
r2     1
r3     4
r4     4
r5     4
r6     2
r7     4
r8     6
r9     3
r10    5
r11    1
Name: dice, dtype: category
Categories (6, int64): [1, 2, 3, 4, 5, 6]

Note that unlike int, float, bool and str used earlier, the category datatype is not a separate class in core Python, so using category without quotations will give a NameError.

df_pdseries_astype_category_nameerror

We can reassign both the series dice and color to the new datatype.

In [267]:
df.dice=df.dice.astype(dtype='category')
In [268]:
df.color=df.color.astype(dtype='category')

Now if we check the datatypes we can see that both series are of the type category.

In [269]:
df.dtypes
Out[269]:
dice     category
color    category
dtype: object

A category pandas series will have the attribute cat which can be used to access category attributes and methods.

pdcatseries_attribute_cat

In [270]:
df.color.cat
Out[270]:
<pandas.core.arrays.categorical.CategoricalAccessor object at 0x000001E3266D7490>

To get a list of these attributes and methods, type in an additional dot . followed by a tab ↹.

pdcatseries_attribute_cat_methods_attributes

Categories

The categories attribute will display the categories.

In [271]:
df.color.cat.categories
Out[271]:
Index(['blue', 'green', 'red'], dtype='object')

The categories are by default alphabetically ordered. Custom ordering can be applied; this can be checked using the attribute ordered, which returns False in this case as no custom ordering has been applied.

In [272]:
df.color.cat.ordered
Out[272]:
False

The attribute codes assigns an integer to each category. In this case blue is 0, green is 1 and red is 2, as the categories are alphabetically ordered by default.
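This mapping can be seen directly with a small hypothetical category series:

```python
import pandas as pd

c = pd.Series(['red', 'blue', 'green'], dtype='category')
# codes follow the (default alphabetical) category order: blue=0, green=1, red=2
print(list(c.cat.codes))  # [2, 0, 1]
```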

In [273]:
df.color.cat.codes
Out[273]:
r0     2
r1     2
r2     0
r3     1
r4     0
r5     0
r6     2
r7     1
r8     1
r9     1
r10    1
r11    2
dtype: int8
In [274]:
df.color
Out[274]:
r0       red
r1       red
r2      blue
r3     green
r4      blue
r5      blue
r6       red
r7     green
r8     green
r9     green
r10    green
r11      red
Name: color, dtype: category
Categories (3, object): ['blue', 'green', 'red']

Adding New Categories

We can index into a cell that belongs to a pandas category series to get the assigned category.

In [275]:
df.color.r0
Out[275]:
'red'

And we can replace it with an existing category using a str format.

In [276]:
df.color.r0='green'

By default we cannot replace it with a new category, for example if we try adding the value 'yellow' we get a ValueError.

pdcatseries_newcat_value_error

This is set up this way by default to prevent users from accidentally inputting several variations of what should be a single category, for example by differences in str capitalization, or by missing, including, or substituting a hyphen or dash for a space. These are common human errors when dealing with categorical data and the likes of Excel spreadsheets.

pdcatseries_humaninputcats

Let's have a look at the existing categories.

In [277]:
df.color.cat.categories
Out[277]:
Index(['blue', 'green', 'red'], dtype='object')

Let's use the cat method add_categories to add the category 'yellow'. Type in the method and then press shift ⇧ and tab ↹ to get details about the input arguments. We see that we can specify either a single category or a list of categories.

pdcatseries_attribute_cat_method_add_categories
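A minimal sketch, with a hypothetical two-color series:

```python
import pandas as pd

c = pd.Series(['red', 'blue'], dtype='category')
# add_categories accepts a single category or a list-like of categories;
# new categories are appended after the existing ones
c = c.cat.add_categories('yellow')
print(list(c.cat.categories))  # ['blue', 'red', 'yellow']
```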

In [278]:
df.color.cat.add_categories('yellow')
Out[278]:
r0     green
r1       red
r2      blue
r3     green
r4      blue
r5      blue
r6       red
r7     green
r8     green
r9     green
r10    green
r11      red
Name: color, dtype: category
Categories (4, object): ['blue', 'green', 'red', 'yellow']

To apply an inplace update we will need to reassign the output to the series df.color.

In [279]:
df.color=df.color.cat.add_categories('yellow')

Now we can select the cell within the column color and the row r0 and update it to 'yellow'.

In [280]:
df.color.r0='yellow'
In [281]:
df.color
Out[281]:
r0     yellow
r1        red
r2       blue
r3      green
r4       blue
r5       blue
r6        red
r7      green
r8      green
r9      green
r10     green
r11       red
Name: color, dtype: category
Categories (4, object): ['blue', 'green', 'red', 'yellow']

Renaming Categories

Let's have a look at our categories.

In [282]:
df.color.cat.categories
Out[282]:
Index(['blue', 'green', 'red', 'yellow'], dtype='object')

We can rename them, for example to 1 letter characters using the cat method rename_categories. To get details about the input arguments type in the method and press shift ⇧ and tab ↹.

pdcatseries_attribute_cat_method_rename_categories

We can see that we can use a list or a dict of the new values. Let's have a look using a list.

In [283]:
newcats=['b','g','r','y']
In [284]:
df.color.cat.rename_categories(newcats)
Out[284]:
r0     y
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

When using a list, the values need to be specified in the correct order and the list needs to match the length of the existing categories, otherwise we get a ValueError.

newcats=['b','g','r']

df.color.cat.rename_categories(newcats)

pdcatseries_attribute_cat_method_rename_categories_ValueError

A dictionary is more flexible, for example if we want to rename just the 'y' category.

In [285]:
newcatd={'yellow':'y'}
In [286]:
df.color.cat.rename_categories(newcatd)
Out[286]:
r0         y
r1       red
r2      blue
r3     green
r4      blue
r5      blue
r6       red
r7     green
r8     green
r9     green
r10    green
r11      red
Name: color, dtype: category
Categories (4, object): ['blue', 'green', 'red', 'y']

We can create a dictionary of the old categories and new categories by use of the zip function.

In [287]:
oldcats=df.color.cat.categories
In [288]:
newcatsd=dict(zip(oldcats,newcats))
In [289]:
newcatsd
Out[289]:
{'blue': 'b', 'green': 'g', 'red': 'r', 'yellow': 'y'}
In [290]:
df.color.cat.rename_categories(newcatsd)
Out[290]:
r0     y
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

Merging Categories

Although a dictionary can have multiple keys with the same value.

In [291]:
newcatsd['yellow']='g'
In [292]:
newcatsd
Out[292]:
{'blue': 'b', 'green': 'g', 'red': 'r', 'yellow': 'g'}

We get an error if we try to use the dictionary to merge 2 old categories into a new category.

pdcatseries_attribute_cat_method_rename_categories_ValueError2

Fortunately this is easy enough to do by indexing and reassigning a value. Let's first update the pandas series to have single letter values.

In [293]:
newcatsd['yellow']='y'
In [294]:
df.color=df.color.cat.rename_categories(newcatsd)
In [295]:
df.color
Out[295]:
r0     y
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

We can index into the series by using the condition df.color=='y'.

In [296]:
df.color[df.color=='y']
Out[296]:
r0    y
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

We can reassign this selection to an existing category.

In [297]:
df.color[df.color=='y']='g'

If we look at this series we see that all the values that were 'y' are now 'g'.

In [298]:
df.color
Out[298]:
r0     g
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (4, object): ['b', 'g', 'r', 'y']

4 categories are listed. We can remove unused categories using the method remove_unused_categories. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see the only input argument is deprecated, so we can call it without any.

pdcatseries_attribute_cat_method_remove_unused_categories

In [299]:
df.color.cat.remove_unused_categories()
Out[299]:
r0     g
r1     r
r2     b
r3     g
r4     b
r5     b
r6     r
r7     g
r8     g
r9     g
r10    g
r11    r
Name: color, dtype: category
Categories (3, object): ['b', 'g', 'r']
In [300]:
df.color=df.color.cat.remove_unused_categories()

Ordering Categories

The categories are ordered alphabetically. In some cases this may not be desirable. Let's make a new pandas series called grade; recall we have to index using square brackets to create a new series. We will copy the values in df.color and then rename the categories to 'low', 'med' and 'high'.

In [301]:
df['grade']=df.color
In [302]:
df.grade=df.grade.cat.rename_categories({'g':'low','b':'med','r':'high'})
In [303]:
df.grade
Out[303]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['med', 'low', 'high']

If we have a look at the categories we see that they retain the ordering of the original alphabetically ordered categories 'blue', 'green' and 'red'. In this case the resulting order 'med', 'low' and 'high' does not make sense.

In [304]:
df.grade.cat.categories
Out[304]:
Index(['med', 'low', 'high'], dtype='object')

We can reorder them using the cat method reorder_categories. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see the input argument can either be index-like or a list.

pdcatseries_attribute_cat_method_reorder_categories

In [305]:
new_order=['low','med','high']
In [306]:
df.grade.cat.reorder_categories(new_order)
Out[306]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['low', 'med', 'high']
In [307]:
new_order_index=pd.Index(new_order)
In [308]:
df.grade.cat.reorder_categories(new_order_index)
Out[308]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['low', 'med', 'high']

Let's reassign the output to the pd series df.grade.

In [309]:
df.grade=df.grade.cat.reorder_categories(new_order_index)

Although we have reordered the names of the categories, the categories themselves aren't ordered. If we look at the categories we can see they are separated by a comma, indicating they are unordered.

In [310]:
df.grade.cat.categories
Out[310]:
Index(['low', 'med', 'high'], dtype='object')

We can use the cat method as_ordered to set the series as ordered. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see we don't need to provide any.

pdcatseries_attribute_cat_method_as_ordered

Note the < sign is in place of the , indicating the categories are ordered.

In [311]:
df.grade.cat.as_ordered()
Out[311]:
r0      low
r1     high
r2      med
r3      low
r4      med
r5      med
r6     high
r7      low
r8      low
r9      low
r10     low
r11    high
Name: grade, dtype: category
Categories (3, object): ['low' < 'med' < 'high']
In [312]:
df.grade=df.grade.cat.as_ordered()
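Ordering matters because an ordered categorical supports comparisons and min/max. A quick sketch using a hypothetical graded series, constructed here directly with pd.CategoricalDtype rather than the cat methods above:

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=['low', 'med', 'high'], ordered=True)
s = pd.Series(['low', 'high', 'med'], dtype=dtype)

# comparisons respect the category order, not alphabetical order
print((s > 'low').tolist())  # [False, True, True]
print(s.min(), s.max())      # low high
```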

Creating Categories from Numeric Data

Let's have a look at our dataframe.

In [313]:
df
Out[313]:
dice color grade
r0 5 g low
r1 6 r high
r2 1 b med
r3 4 g low
r4 4 b med
r5 4 b med
r6 2 r high
r7 4 g low
r8 6 g low
r9 3 g low
r10 5 g low
r11 1 r high

Let's drop the series df.grade and instead create a new grade series based on the numeric data in dice.

In [314]:
df=df.drop(columns='grade')
In [315]:
df
Out[315]:
dice color
r0 5 g
r1 6 r
r2 1 b
r3 4 g
r4 4 b
r5 4 b
r6 2 r
r7 4 g
r8 6 g
r9 3 g
r10 5 g
r11 1 r

We want to create 3 grades from the numeric data in dice. Because we have n grades, we need n+1 boundaries. Each bin is exclusive of the lower bound and inclusive of the upper bound, i.e. (0, 2] excludes 0 and includes 2, (2, 4] excludes 2 and includes 4, and (4, 6] excludes 4 and includes 6. Note the categories are separated by < indicating they are ordered.
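pd.cut also accepts a labels keyword, so the bins can be named in one step. A hedged sketch with a hypothetical series:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 6])
# 3 bins need 4 boundaries; each interval is open on the left, closed on the right
g = pd.cut(s, bins=[0, 2, 4, 6], labels=['low', 'med', 'high'])
print(g.tolist())  # ['low', 'med', 'high', 'high']
```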

In [316]:
bins=[0,2,4,6]
pd.cut(df.dice,bins)
Out[316]:
r0     (4, 6]
r1     (4, 6]
r2     (0, 2]
r3     (2, 4]
r4     (2, 4]
r5     (2, 4]
r6     (0, 2]
r7     (2, 4]
r8     (4, 6]
r9     (2, 4]
r10    (4, 6]
r11    (0, 2]
Name: dice, dtype: category
Categories (3, interval[int64]): [(0, 2] < (2, 4] < (4, 6]]

We can assign the output to a new series using square brackets.

In [317]:
df['grade']=pd.cut(df.dice,bins)
In [318]:
df
Out[318]:
dice color grade
r0 5 g (4, 6]
r1 6 r (4, 6]
r2 1 b (0, 2]
r3 4 g (2, 4]
r4 4 b (2, 4]
r5 4 b (2, 4]
r6 2 r (0, 2]
r7 4 g (2, 4]
r8 6 g (4, 6]
r9 3 g (2, 4]
r10 5 g (4, 6]
r11 1 r (0, 2]
In [319]:
oldcats=df.grade.cat.categories
newcats=['low','med','high']
renamecats=dict(zip(oldcats,newcats))
df.grade=df.grade.cat.rename_categories(renamecats)
In [320]:
df
Out[320]:
dice color grade
r0 5 g high
r1 6 r high
r2 1 b low
r3 4 g med
r4 4 b med
r5 4 b med
r6 2 r low
r7 4 g med
r8 6 g high
r9 3 g med
r10 5 g high
r11 1 r low
In [321]:
df.grade
Out[321]:
r0     high
r1     high
r2      low
r3      med
r4      med
r5      med
r6      low
r7      med
r8     high
r9      med
r10    high
r11     low
Name: grade, dtype: category
Categories (3, object): ['low' < 'med' < 'high']
In [322]:
df.grade=df.grade

Grouping by Category

We can create a groupby object using the dataframe method groupby. If we type it with open parenthesis and press shift ⇧ and tab ↹ we can see details about the input arguments. The default value for the keyword argument axis is 0, meaning the method will group the rows within a column. The keyword input argument by (which also acts as a positional input argument) can therefore be used to select a column.

df_groupby
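The split-apply-combine pattern can be sketched on a tiny hypothetical dataframe:

```python
import pandas as pd

df = pd.DataFrame({'color': ['r', 'g', 'r'], 'dice': [5, 2, 3]})
# group the rows by color, then aggregate the numeric column per group
print(df.groupby(by='color').dice.sum())
# color
# g    2
# r    8
```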

In [323]:
df.groupby(by='color')
Out[323]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001E3266F53A0>

The output displays information telling us this is an object assigned to a specific location in memory. To inspect it in more detail I will assign it to a new object name:

In [324]:
gbo=df.groupby(by='color')

If we type in the groupby object name followed by a dot . and then a tab ↹, a list of attributes and methods will display:

gbo_methods_attributes

Let's select the method count. To get details about the input arguments type in the method and press shift ⇧ and tab ↹. We see there are no input arguments.

gbo_method_count

In [325]:
gbo.count()
Out[325]:
dice grade
color
b 3 3
g 6 6
r 3 3

Let's compare this with the dataframe.

In [326]:
df
Out[326]:
dice color grade
r0 5 g high
r1 6 r high
r2 1 b low
r3 4 g med
r4 4 b med
r5 4 b med
r6 2 r low
r7 4 g med
r8 6 g high
r9 3 g med
r10 5 g high
r11 1 r low

In the series df.color the value 'b' shows in 'r2', 'r4' and 'r5' i.e. 3 times.

In the series df.color the value 'g' shows in 'r0', 'r3', 'r7'. 'r8', 'r9' and 'r10' i.e. 6 times.

In the series df.color the value 'r' shows in 'r1', 'r6' and 'r11' i.e. 3 times.

In [327]:
gbo.count()
Out[327]:
dice grade
color
b 3 3
g 6 6
r 3 3

This can be combined into a single line.

In [328]:
df.groupby(by='color').count()
Out[328]:
dice grade
color
b 3 3
g 6 6
r 3 3

Although the above is a bit more concise, there is a major drawback when it comes to looking up attributes and methods. Typing a dot . and then tab ↹ after the first method is input doesn't display the list of available attributes and methods. Moreover typing in shift ⇧ and tab ↹ does not give details about the groupby method count.

The output can also be assigned to an object name and is an instance of the DataFrame class.

In [329]:
colorcount=df.groupby(by='color').count()
In [330]:
type(colorcount)
Out[330]:
pandas.core.frame.DataFrame

We can use the groupby method to group by the series color and then select another column as an attribute to perform groupby calculations on. We need a numeric series to perform numerical operations on. Let's return the series dice to the type int (recall it was set to category).

In [331]:
df.dice=df.dice.astype(int)
In [332]:
df.groupby(by='color').dice.sum()
Out[332]:
color
b     9
g    27
r     9
Name: dice, dtype: int64

Leaving the series as category and trying the above gives a TypeError.

gbo_attribute_pdseries_dice_method_sum_TypeError

Date and Time

When working with time we generally use a datetime time stamp, and the numpy library has support for these. numpy was imported at the start of this notebook. If we type in the class name followed by shift ⇧ and tab ↹ we get the docstring for the class's init signature; unfortunately we see this is lacking. However we can construct one from a str of the time stamp. For simplicity let's look at a year.

numpy_datetime64

In [333]:
np.datetime64('2019')
Out[333]:
numpy.datetime64('2019')

We can also assign it to the object name date:

In [334]:
date=np.datetime64('2019')

Fortunately the pandas series handles this datatype much better and we can view the datatype within the variable explorer properly. We can easily convert this value into a series:

In [335]:
date=pd.Series(date)
In [336]:
date
Out[336]:
0   2019-01-01
dtype: datetime64[ns]

We see the str displays not only the year but also the month and day i.e. is of the format:

'YYYY-MM-DD'

We can also include a time after this.

'YYYY-MM-DD hh:mm:ss'

We can create another date:

In [337]:
date2=pd.Series(np.datetime64('2021-02-05 20:27:30'))
In [338]:
date2
Out[338]:
0   2021-02-05 20:27:30
dtype: datetime64[ns]

We now have two different time stamps and can calculate the time difference between them:

In [339]:
date2-date
Out[339]:
0   766 days 20:27:30
dtype: timedelta64[ns]

This returns a slightly different object, a timedelta64 object i.e. an instance of the timedelta64 class.

Note the units years and months are omitted as they are inconsistent units. Most years have 365 days but a leap year has 366 days, and months range from 28-29 days in February to 31 days in January. Therefore it is more accurate to state time in days.

We can assign this output to a pandas series:

In [340]:
timediff=pd.Series(date2-date)

We can also create a timedelta64 object directly from the numpy library. If we type in the class name followed by shift ⇧ and tab ↹ we get the docstring for the class's init signature; unfortunately we see this is also lacking.

numpy_timedelta64

The first positional input argument is the value and the second is the unit. For example:

1 Day:

In [341]:
np.timedelta64(1,'D')
Out[341]:
numpy.timedelta64(1,'D')

1 hour:

In [342]:
np.timedelta64(1,'h')
Out[342]:
numpy.timedelta64(1,'h')

1 minute:

In [343]:
np.timedelta64(1,'m')
Out[343]:
numpy.timedelta64(1,'m')

1 second:

In [344]:
np.timedelta64(1,'s')
Out[344]:
numpy.timedelta64(1,'s')

We can combine these units together:

In [345]:
np.timedelta64(1,'D')+np.timedelta64(1,'s')
Out[345]:
numpy.timedelta64(86401,'s')

We can use the numpy function arange to create an array of numbers. If we type in the function name followed by shift ⇧ and tab ↹ we get the docstring for the function and see the keyword input arguments start, stop and step.

numpy_arange

In [346]:
a=np.arange(start=0,stop=6,step=1)

This creates an array starting at the value 0 and going up to (but not including) 6 in steps of 1:

In [347]:
a
Out[347]:
array([0, 1, 2, 3, 4, 5])

We can also do this with time stamps or time differences:

In [348]:
times=np.arange(start=np.timedelta64(0,'s'),stop=np.timedelta64(6,'s'),step=np.timedelta64(1,'s'))
In [349]:
times
Out[349]:
array([0, 1, 2, 3, 4, 5], dtype='timedelta64[s]')

Let's convert this into a pandas series.

In [350]:
times=pd.Series(times)
In [351]:
times
Out[351]:
0   0 days 00:00:00
1   0 days 00:00:01
2   0 days 00:00:02
3   0 days 00:00:03
4   0 days 00:00:04
5   0 days 00:00:05
dtype: timedelta64[ns]

The start and stop can instead be specified as a timestamp.

In [352]:
dates=np.arange(start=np.datetime64('2021-01-01'),stop=np.datetime64('2021-01-06'),step=np.timedelta64(1,'D'))
dates=pd.Series(dates)
In [353]:
dates
Out[353]:
0   2021-01-01
1   2021-01-02
2   2021-01-03
3   2021-01-04
4   2021-01-05
dtype: datetime64[ns]

We can create a dataframe from these series using:

In [354]:
df=pd.DataFrame(data=a)
df['times']=times
df['dates']=dates
In [355]:
df
Out[355]:
0 times dates
0 0 0 days 00:00:00 2021-01-01
1 1 0 days 00:00:01 2021-01-02
2 2 0 days 00:00:02 2021-01-03
3 3 0 days 00:00:03 2021-01-04
4 4 0 days 00:00:04 2021-01-05
5 5 0 days 00:00:05 NaT

Note that the last value of dates is NaT (Not a Time) and is due to this series only having 5 values so no value was assigned for the last index.
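NaT behaves like NaN for datetimes and can be detected with isna. A short sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.datetime64('2021-01-01'), np.datetime64('NaT')])
print(s.isna().tolist())  # [False, True]
```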

Once familiar with the Python and Data Analysis Library it is recommended to move onto the matplotlib Python plotting library. I have an additional JupyterLab notebook covering the matplotlib library here.

The Python Plotting Library