Department of Data Science

Course: Tools and Techniques for Data Science

# ---

Instructor: Muhammad Arif Butt, Ph.D.

Lecture 3.11 (Pandas-03)

# ## _Overview of Pandas Dataframe Data Structure_

# #### Read about Pandas Data Structures: https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro

# ## Learning agenda of this notebook
# 1. Anatomy of a Dataframe
# 2. Creating a Dataframe
#     - An empty dataframe
#     - Two-Dimensional NumPy Array
#     - Dictionary of Python Lists
#     - Dictionary of Pandas Series
# 3. Attributes of a Dataframe
# 4. Bonus

# In[ ]:

# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

# In[1]:

import pandas as pd
pd.__version__ , pd.__path__

# ## 2. Creating a Dataframe

# >**A Pandas Dataframe is a two-dimensional labeled data structure (like an SQL table) with heterogeneously typed columns, having both a row index and a column index.**
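# The definition above can be made concrete with a very small dataframe. The following is a minimal sketch, not part of the original lecture: the variable `df_demo`, the row labels `r0`/`r1`, and the `gpa` column are hypothetical and chosen only for illustration. It shows that each column may hold a different type, and that the rows and the columns each carry their own labels.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the definition
df_demo = pd.DataFrame(
    {"name": ["Arif", "Hadeed"], "age": [51, 22], "gpa": [3.5, 3.9]},
    index=["r0", "r1"],
)

row_labels = list(df_demo.index)      # labels of the row index
col_labels = list(df_demo.columns)    # labels of the column index

print(row_labels)
print(col_labels)
print(df_demo.dtypes)                 # one data type per column
```

# Printing `df_demo.dtypes` lists one type per column, which is what "heterogeneously typed columns" refers to.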

# **```pd.DataFrame(data=None, index=None, columns=None, dtype=None)```**
# - Where,
#     - `data`: It can be a 2-D NumPy Array, a Dictionary of Python Lists, or a Dictionary of Pandas Series (you can also create a dataframe from a file in CSV, Excel, JSON, or HTML format, or from a database table).
#     - `index`: These are the row indices. Defaults to RangeIndex (0, 1, 2, ..., n-1) if the `index` argument is not passed and no indexing information is part of the input data.
#     - `columns`: These are the column indices or labels. Defaults to RangeIndex (0, 1, 2, ..., n-1) if the `columns` argument is not passed and no column labels are part of the input data.
#     - `dtype`: Data type to force. Only a single dtype is allowed. If None, the types are inferred.

# ### a. Creating an Empty Dataframe

# In[2]:

import pandas as pd
import numpy as np
df = pd.DataFrame()
print(df)

# ### b. Creating a Dataframe from a 2-D NumPy Array

# In[4]:

arr = np.random.randint(10,100, size= (6,5))
print("Numpy Array:\n",arr)
df = pd.DataFrame(data=arr)
print("Pandas Dataframe:\n",df)

# - Note that both the row indices and the column labels/indices are implicitly set to numerical values from 0 to n-1, since neither of the two is provided while creating the dataframe object. They are also not considered part of the data in the dataframe.
# - In the majority of cases the row labels are left as default, i.e., 0,1,2,3,... However, the column labels are usually changed from 0,1,2,3,... to some meaningful values.
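# The default labels noted above, and the `dtype` parameter of the constructor, can be checked directly. This is a minimal sketch under stated assumptions: it uses a small fixed array (`arr_demo`, a hypothetical name) instead of random values so that the result is reproducible, and the column labels `A` and `B` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Small fixed array (hypothetical), so the result does not depend on random values
arr_demo = np.arange(6).reshape(3, 2)

# No index/columns passed: both axes receive a default RangeIndex starting at 0
df_default = pd.DataFrame(data=arr_demo)
default_rows = list(df_default.index)
default_cols = list(df_default.columns)

# dtype forces a single data type on every column
df_float = pd.DataFrame(data=arr_demo, columns=["A", "B"], dtype="float64")

print(df_default.index)
print(df_default.columns)
print(df_float.dtypes)
```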
# In[ ]:

# Let us name the column labels of our choice, while creating it
col_labels=['Col1', 'Col2', 'Col3', 'Col4', 'Col5']
df = pd.DataFrame(data=arr, columns=col_labels)
df

# In[ ]:

# Let us name the row labels of our choice, while creating it
df = pd.DataFrame(data=arr, index=['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5'])
df

# In[5]:

# Let us set both the row labels and the column labels to strings of our choice, while creating it
row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']
col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']
df = pd.DataFrame(data=arr, index=row_labels, columns=col_labels)
df

# In[6]:

# 'Row0' is a row label, so it is selected with .loc
# (df['Row0'] would look for a column named 'Row0' and raise a KeyError)
df.loc['Row0']

# - You can set the labels later as well, i.e., after the dataframe has been created with default indices.
# - This is done by assigning a list of labels/values to the `index` and `columns` attributes of a dataframe object.

# In[ ]:

arr = np.random.randint(10,100, size= (6,5))
df = pd.DataFrame(data=arr)
df

# In[ ]:

row_labels = ['Row0', 'Row1', 'Row2', 'Row3', 'Row4', 'Row5']
col_labels = ['Col0', 'Col1', 'Col2', 'Col3', 'Col4']
df.columns = col_labels
df.index = row_labels
df

# ### c. Creating a Dataframe from a Dictionary of Python Lists
# - You can create a dataframe object from a dictionary of Python Lists
#     - The dictionary `Keys` become the column names, and
#     - The dictionary `Values` are lists/arrays containing data for the respective columns.
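# The key-to-column mapping described in the bullets above can be verified on a tiny dictionary before moving to a larger one. The sketch below is illustrative only: `scores` and `df_scores` are hypothetical names. It also shows one consequence of the mapping that is easy to overlook: since every list supplies one value per row, lists of unequal length cannot be combined, and pandas raises a `ValueError`.

```python
import pandas as pd

# Hypothetical two-column dictionary of Python lists
scores = {"name": ["Arif", "Maaz"], "age": [51, 26]}
df_scores = pd.DataFrame(data=scores)
column_names = list(df_scores.columns)   # follows the order of the dictionary keys

# Lists of unequal length cannot be lined up into rows
try:
    pd.DataFrame(data={"name": ["Arif", "Maaz"], "age": [51]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(column_names)
print(mismatch_error)
```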
# In[ ]:

people = {
    "name" : ["Rauf", "Arif", "Maaz", "Hadeed", "Mujahid", "Mohid"],
    "age" : [52, 51, 26, 22, 18, 17],
    "address": ["Lahore", "Karachi", "Lahore", "Islamabad", "Kakul", "Karachi"],
    "cell" : ["321-123", "320-431", "321-478", "324-446", "321-967", "320-678"],
    "bg": ["B+", "A-", "B+", "O-", "A-", "B+"]
}
people

# In[ ]:

# Pass this Dictionary of Python Lists to pd.DataFrame()
df_people = pd.DataFrame(data=people)
df_people

# - Note that the column labels are set as per the keys inside the dictionary object, while the row labels/indices are set to default numerical values.
# - You can set the row indices while creating the dataframe by passing the `index` argument to `pd.DataFrame()`, or you can do that later by assigning new values to the `index` and `columns` attributes of a dataframe object.

# In[ ]:

# Let us change the row labels of above dataframe
row_labels = ['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06']
df_people.index = row_labels
df_people

# ### d. Creating a Dataframe from a Dictionary of Pandas Series
# One can think of a dataframe as a dictionary of Pandas Series:
# - `Keys` are column names, and
# - `Values` are Series objects for the respective columns.

# In[ ]:

# Named series_dict rather than dict, to avoid shadowing the built-in dict type
series_dict = {
    "name": pd.Series(['Arif', 'Hadeed', 'Mujahid']),
    "age": pd.Series([50, 22, 18]),
    "addr": pd.Series(['Lahore', 'Islamabad','Karachi']),
}
df = pd.DataFrame(data=series_dict)
df

# >Note from the above output that every series object becomes the data of the corresponding column. Moreover, the keys of the dictionary become the column labels.

# In[ ]:

series_dict = {
    "name": pd.Series(data=['Arif', 'Hadeed', 'Mujahid', 'Maaz'], index=['a','b','c', 'd']),
    "age": pd.Series(data=[50, 22,np.nan, 18], index=['a','b','c','d']),
    "addr": pd.Series(data=['Lahore', '', 'Peshawar','Karachi'], index=['a','b','c', 'd']),
}
df = pd.DataFrame(series_dict)
df

# >- In the above code and its output, note that every series object has four data values and four corresponding indices.
# >- Also note that in the `age` series we have a NaN value, and in the `addr` series we have an empty string.
# >- Another point to note is that the row indices of the three series match exactly, in number as well as in sequence/value.
# >- A question arises: what if the indices of the series are different? See the following code to understand this concept.

# In[4]:

import pandas as pd
import numpy as np

# In[5]:

# Named series_dict rather than dict, to avoid shadowing the built-in dict type
series_dict = {
    "name": pd.Series(['Arif', 'Hadeed', 'Mujahid', 'Maaz'], index=['a','b','c', 'd']),
    "age": pd.Series([50, 22,np.nan, 18], index=['a','x','y','d']),
    "addr": pd.Series(['Lahore', '','Karachi'], index=['a', 'd', 'x']),
}
df = pd.DataFrame(series_dict)
df

# In[18]:

# A negative argument to head() returns all rows except the last 4 (here, the first 2 of 6 rows)
df.head(-4)

# In[19]:

# A negative argument to tail() returns all rows except the first 4 (here, the last 2 of 6 rows)
df.tail(-4)

# >- In the above code and its output, note that the first series object has four data values and four corresponding indices. Similarly, the second series object has four data values (with one `np.nan` value) and four corresponding indices, which differ in part from those of the first series object. The third series has three data values (with one empty string) and three indices.
# >- Note that the resulting Dataframe has six rows and three columns, one row for every distinct index label found across the three series.
#     - For index 'a' we have a value in all three series objects or columns.
#     - For index 'b' we have a value in the first series object, and NaN in the second and third columns, since the second and third series objects have no value corresponding to row index 'b'.

# ## 3. Attributes of Pandas Dataframe
# - Like Series, we can access properties/attributes of a dataframe by using dot `.` notation

# In[6]:

people = {
    "name" : ["Rauf", "Arif", "Maaz", "Hadeed", "Mujahid", "Mohid"],
    "age" : [52, 51, 26, 22, 18, 17],
    "address": ["Lahore", "Karachi", "Lahore", "Islamabad", "Kakul", "Karachi"],
    "cell" : ["321-123", "320-431", "321-478", "324-446", "321-967", "320-678"],
    "bg": ["B+", "A-", "B+", "O-", "A-", "B+"]
}
df_people = pd.DataFrame(data=people, index=['MS01', 'MS02', 'MS03', 'MS04', 'MS05', 'MS06'])
df_people

# In[7]:

# `shape` attribute of a dataframe object returns a two-value tuple containing the number of rows and columns
# Note that the row count does not include the column labels, and the column count does not include the row index
df_people.shape

# In[8]:

# `ndim` attribute of a dataframe object returns the number of dimensions (which is always 2)
df_people.ndim

# In[9]:

# `size` attribute of a dataframe object returns the number of elements in the underlying data
df_people.size

# In[10]:

# `index` attribute of a dataframe object returns the row labels as an Index object, along with its datatype
df_people.index

# In[11]:

# `columns` attribute of a dataframe object returns the column labels as an Index object, along with its datatype
df_people.columns

# In[12]:

# `axes` attribute of a dataframe object returns a list holding both the row index and the column labels
df_people.axes

# In[13]:

# `values` attribute of a dataframe object returns a 2-D NumPy array having all the values in the DataFrame,
# without the row indices and column labels
df_people.values

# In[14]:

# `empty` attribute of a dataframe object returns True if the dataframe has no elements
df_people.empty

# In[15]:

# `dtypes` attribute of a dataframe object returns the data type of each column in the dataframe
df_people.dtypes

# In[ ]:

# To check the number of non-NA values in each column
df_people.count()

# # Bonus
# #### The `df.info()` Method

# In[1]:

# This method prints information about a DataFrame including the row indices, column labels,
# non-null values count in each column, datatype and memory usage
df_people.info()
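# The attributes covered in this notebook are related to one another, and those relations make a handy sanity check. The following is a minimal sketch, assuming a small hypothetical dataframe `df_attr` with one missing value; it is not part of the original lecture. It checks that `size` equals rows times columns, that `ndim` matches the number of axes, and that `count()` skips NaN values while `shape` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one missing age
df_attr = pd.DataFrame(
    {"name": ["Arif", "Hadeed", "Maaz"], "age": [51, np.nan, 26]},
    index=["MS01", "MS02", "MS03"],
)

n_rows, n_cols = df_attr.shape
total_cells = df_attr.size          # number of cells, NaN included
n_axes = len(df_attr.axes)          # one entry for rows, one for columns
non_na = df_attr.count()            # per-column count that ignores NaN

print(n_rows, n_cols, total_cells, n_axes)
print(non_na)
```

# The gap between the row count from `shape` and a column's value in `count()` is the number of missing entries in that column.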