
Python practical for Data Science¶

Introduction¶

This guide explores best practices and fundamental knowledge that will allow you to perform data analysis using Python. In this guide, you will learn how to use Jupyter notebooks and Python libraries such as Pandas, Matplotlib or Numpy to easily and transparently explore and analyze your dataset.

What is Data Science?¶

Essentially, Data Science is about extracting knowledge from massive amounts of data. It draws on several disciplines from Mathematics and Computer Science, such as statistics, probabilistic models, machine learning, data storage, and computer programming.

For more information, we invite readers to take a look at this book.

Prerequisites¶

Readers should already know the basics of Python programming before following this guide. We will also refer to basic notions of probability and statistics, as well as relational algebra.

Main purposes¶

After following this guide, readers should be familiar with the following tasks:

Creating and installing Python modules;
Editing and using Jupyter notebooks;
Data visualization with Matplotlib and Seaborn;
Handling arrays with Numpy;
Handling a dataset with Pandas.

Guide program¶

The content is organized according to the following plan:

Set up your work environment;
Get started with Python;
Manipulate data using Numpy and Matplotlib;
Deal with a large amount of data using Pandas library.

Let's get started.

1. Set up your work environment¶

Before we can analyze data with Python, we need a working environment. This section explains how to install Jupyter on your own machine.

If you do not have Python yet, you can install Python, all the necessary libraries, and the Jupyter notebook in one step by using this solution. If Python is already installed, you will need to install the Jupyter notebook, as well as the following libraries: Pandas, Matplotlib, Numpy and SciPy.

In practice, there are two possible solutions to install the Jupyter notebook and its necessary libraries:

Full installation in conjunction with the Anaconda distribution;
Installation of the Jupyter notebook exclusively (without Anaconda).

1.1. Full installation in conjunction with the Anaconda distribution

For those who have never installed Python, it is recommended to install the Anaconda distribution directly. This is what we will do in this section. Anaconda is a Python distribution developed for Data Science.

Anaconda will install everything we need, but it also installs more than we need (libraries that we won't use, and so on). You can download a lighter version of Anaconda here: Download. However, with that version the libraries, Jupyter in particular, are not installed automatically. As a result, it is a solution for experienced users.

On this page you will find the official installation instructions, in case the instructions below are no longer up to date.

In order to install Anaconda on Windows or macOS, you must:

(1) Download the setup file for Windows or macOS, then launch it by double-clicking on the downloaded file;
(2) Once the installation is finished, make sure that everything went well by launching the Jupyter program.

If you run into any issues when installing Anaconda, please discuss them in this community.

1.2. Installation of the Jupyter notebook exclusively (without Anaconda)

If you prefer not to install Anaconda, follow these instructions after you have installed Python:

Check that the pip program is installed on your machine. To do so, just type pip into a console. Usually, pip is installed at the same time as Python. Next, run these commands one after the other (the leading ! is only needed when running them from a notebook cell; omit it in a console):

In [9]:

!python -m pip install --upgrade pip    
!python -m pip install jupyter

Requirement already up-to-date: pip in /usr/local/lib/python3.6/dist-packages (19.0.3)
Requirement already satisfied: jupyter in /usr/local/lib/python3.6/dist-packages (1.0.0)
Requirement already satisfied: ipywidgets in /usr/local/lib/python3.6/dist-packages (from jupyter) (7.4.2)
Requirement already satisfied: notebook in /usr/local/lib/python3.6/dist-packages (from jupyter) (5.2.2)
Requirement already satisfied: jupyter-console in /usr/local/lib/python3.6/dist-packages (from jupyter) (6.0.0)
Requirement already satisfied: qtconsole in /usr/local/lib/python3.6/dist-packages (from jupyter) (4.4.3)
Requirement already satisfied: nbconvert in /usr/local/lib/python3.6/dist-packages (from jupyter) (5.4.1)
Requirement already satisfied: ipykernel in /usr/local/lib/python3.6/dist-packages (from jupyter) (4.6.1)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter) (5.5.0)
Requirement already satisfied: widgetsnbextension~=3.4.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter) (3.4.2)
Requirement already satisfied: nbformat>=4.2.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter) (4.4.0)
Requirement already satisfied: traitlets>=4.3.1 in /usr/local/lib/python3.6/dist-packages (from ipywidgets->jupyter) (4.3.2)
Requirement already satisfied: jupyter-client in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter) (5.2.4)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter) (2.10)
Requirement already satisfied: tornado>=4 in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter) (4.5.3)
Requirement already satisfied: terminado>=0.3.3; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter) (0.8.1)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter) (0.2.0)
Requirement already satisfied: jupyter-core in /usr/local/lib/python3.6/dist-packages (from notebook->jupyter) (4.4.0)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from jupyter-console->jupyter) (2.1.3)
Collecting prompt-toolkit<2.1.0,>=2.0.0 (from jupyter-console->jupyter)
  Downloading https://files.pythonhosted.org/packages/f7/a7/9b1dd14ef45345f186ef69d175bdd2491c40ab1dfa4b2b3e4352df719ed7/prompt_toolkit-2.0.9-py3-none-any.whl (337kB)
    100% |████████████████████████████████| 337kB 26.5MB/s 
Requirement already satisfied: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter) (0.4.2)
Requirement already satisfied: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter) (3.1.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter) (0.5.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter) (1.4.2)
Requirement already satisfied: entrypoints>=0.2.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter) (0.3)
Requirement already satisfied: mistune>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->jupyter) (0.8.4)
Requirement already satisfied: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets->jupyter) (4.6.0)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets->jupyter) (0.8.1)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets->jupyter) (40.8.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets->jupyter) (0.7.5)
Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets->jupyter) (4.4.0)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.2.0->ipywidgets->jupyter) (2.6.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.3.1->ipywidgets->jupyter) (1.11.0)
Requirement already satisfied: pyzmq>=13 in /usr/local/lib/python3.6/dist-packages (from jupyter-client->notebook->jupyter) (17.0.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from jupyter-client->notebook->jupyter) (2.5.3)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2->notebook->jupyter) (1.1.1)
Requirement already satisfied: ptyprocess; os_name != "nt" in /usr/local/lib/python3.6/dist-packages (from terminado>=0.3.3; sys_platform != "win32"->notebook->jupyter) (0.6.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.1.0,>=2.0.0->jupyter-console->jupyter) (0.1.7)
Requirement already satisfied: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->jupyter) (0.5.1)
ipython 5.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.4, but you'll have prompt-toolkit 2.0.9 which is incompatible.
Installing collected packages: prompt-toolkit
  Found existing installation: prompt-toolkit 1.0.15
    Uninstalling prompt-toolkit-1.0.15:
      Successfully uninstalled prompt-toolkit-1.0.15
Successfully installed prompt-toolkit-2.0.9

By typing the following command in your console, you can verify if the setup went well:

In [10]:

  !jupyter notebook

|INFO|google.colab serverextension initialized.
|INFO|Serving notebooks from local directory: /content
|INFO|0 active kernels
|INFO|The Jupyter Notebook is running at:
|INFO|http://localhost:8888/
|INFO|Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
|CRITICAL|received signal 2, stopping
|INFO|Shutting down 0 kernels

Now you can create a new notebook.

Next, install the libraries used in this guide by typing these commands into your console:

In [11]:

   !pip install scipy
   !pip install numpy
   !pip install matplotlib
   !pip install pandas

Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (1.1.0)
Requirement already satisfied: numpy>=1.8.2 in /usr/local/lib/python3.6/dist-packages (from scipy) (1.14.6)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (1.14.6)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (3.0.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (1.0.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (2.5.3)
Requirement already satisfied: numpy>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (1.14.6)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib) (2.3.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from kiwisolver>=1.0.1->matplotlib) (40.8.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib) (1.11.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (0.22.0)
Requirement already satisfied: numpy>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from pandas) (1.14.6)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/dist-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2 in /usr/local/lib/python3.6/dist-packages (from pandas) (2.5.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2->pandas) (1.11.0)
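Once the commands above have finished, a quick way to confirm that the libraries can be found is to ask Python itself. The following is a minimal sketch using only the standard library; the `required` list and the `missing` variable are illustrative names chosen here, not part of any of the installed libraries.

```python
import importlib.util

# Import names of the libraries used in this guide
required = ["numpy", "scipy", "matplotlib", "pandas"]

# find_spec returns None when a top-level module cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If a library is reported as missing, running the corresponding pip install command again is usually enough.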

As an alternative, you can test Jupyter now, without installing anything via this link.

Let us do our first experiment in a Jupyter notebook:

To create a new notebook, click on New from the main Jupyter window and then on Python 3, like this:

[Figure: creating a new notebook]

A file named Untitled.ipynb should now have been created in the directory from which you started Jupyter. All Python commands must be typed in the field beside the In [ ] label. You can type several instructions in the same cell, and even define functions. All variables defined in a cell are accessible from all the other cells of the notebook. When you have finished typing, press Shift+Enter to execute the cell.

To test it, type for example 2 + 5 in the empty cell located in the center of the window. Then click on the run button in the toolbar.

2. Get started with Python¶

In this section, we review the basics of Python programming. Rather than listing everything you need to master, we will simulate a complete problem: the Monty Hall problem. The game opposes a presenter and a competitor (the player). The player stands in front of three closed doors. There is a car behind one of them and a goat behind each of the others. First, the player must pick a door. Then the presenter must open a door that is neither the one chosen by the player nor the one hiding the car. The player then has the right either to open the door he initially chose, or to open the third door.

First of all, let's set up our workspace in a notebook:

In [0]:

# Display graphics inline, below the code cell,
# and not in a separate window:
%matplotlib inline

# The randint function generates random integers;
# seed initializes the random number generator:
from random import randint, seed

# An Enum is a data structure that consists of a
# set of named elements. A variable of this type can
# take one of these elements as its value.
from enum import Enum

# For displaying graphs:
import matplotlib.pyplot as plt

Here we define a subclass of Enum that will contain the possible strategies. This code uses object-oriented programming concepts, including classes and inheritance.

In [0]:

class Strategy(Enum):
    CHANGE = 1
    KEEP = 2
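To make the behaviour of such an enumeration more concrete, here is a small sketch that redefines the same Strategy class and inspects its members. It only relies on the standard enum module.

```python
from enum import Enum

class Strategy(Enum):
    CHANGE = 1
    KEEP = 2

# Each member has a name and a value
print(Strategy.CHANGE.name, Strategy.CHANGE.value)

# Iterating over the class lists its members in definition order
print([member.name for member in Strategy])

# A member can be looked up from its value
print(Strategy(2) is Strategy.KEEP)

# The class itself is not one of its members
print(isinstance(Strategy, Strategy))
```

The last line prints False: Strategy.CHANGE and Strategy.KEEP are valid strategies, but the Strategy class itself is not.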

We can now define our function. It is a very simple function that simulates a single round of the game for a given strategy.

In [0]:

# Initializes the pseudo-random number generator from the
# system time or an operating system randomness source.
seed()

def Hall_game(strategy):
    '''Simulates one round of the Monty Hall game. This function simulates the participant's choice of door, the elimination of a bad door by the presenter, and the final choice. It only returns the result of the game, because we will only need the result to perform our calculations.

    Args:
        strategy (Strategy): The player's strategy

    Returns:
        bool: Has the player won?
    '''
    doors =[0, 1, 2]
    
    good_door = randint(0,2)
    
    # Choice of player
    first_choice = randint(0,2)
    
    # We have two doors left
    doors.remove(first_choice)
    
    # The presenter eliminates a door
    if first_choice == good_door:
        doors.remove(doors[randint(0,1)])
    else:
        doors =[good_door]
    
    second_choice = 0
    # The second choice depends on the strategy
    if strategy == Strategy.CHANGE:
        second_choice = doors[0]
    elif strategy == Strategy.KEEP:
        second_choice = first_choice
    else:
        raise ValueError("Strategy not recognized!")
    
    return second_choice == good_door

The randint function returns a random integer between its two arguments, both included. For example, randint(0,1) will return 0 or 1.
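As a quick check of the bounds of randint, the sketch below draws many values and lists the distinct results. The fixed seed and the number of draws are arbitrary choices made here so that the run is reproducible.

```python
from random import randint, seed

# Fixed seed (an arbitrary choice) so that the run is reproducible
seed(0)

# Draw 1000 integers between 0 and 2, both bounds included
draws = [randint(0, 2) for _ in range(1000)]

# The distinct values obtained: both bounds appear
print(sorted(set(draws)))
```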

We will now test our function. Run the next line several times to check that the result (True or False) varies from one run to the next. Note that the argument must be a member of the Strategy enumeration, such as Strategy.CHANGE; passing the Strategy class itself makes the function raise ValueError("Strategy not recognized!").

In [36]:

Hall_game(Strategy.CHANGE)

Next, we define a function that plays the game over and over again, and returns the result of each game in a list.

In order to make calculations on these results, we no longer store them as Boolean values (True or False) but as integers (1 if the player has won, 0 if he has lost).

In [0]:

def Play(strategy, nb_turns):
    '''Simulates a sequence of games. This function returns the results of several rounds of the Monty Hall game as a list of the player's winnings.

    Args:
        strategy (Strategy): The player's strategy
        nb_turns (int): Number of games to play

    Returns:
        list: The player's winnings for each game (1 for a win, 0 for a loss)
    '''

    return [1 if Hall_game(strategy) else 0 for i in range(nb_turns)]

We now want to know which strategy is the most effective for the player over a given number of games (10000). Play returns a list that contains a 1 for each game won by the player and a 0 for each game lost. To count the games won, we just need to add up all the items of this list with the sum function.

In [43]:

print("By changing doors, the player has won {} out of 10,000 games."
      .format(sum(Play(Strategy.CHANGE, 10000))))

print("By keeping his initial choice, the player has won {} out of 10,000 games."
      .format(sum(Play(Strategy.KEEP, 10000))))

By changing doors, the player has won 6661 out of 10,000 games.
By keeping his initial choice, the player has won 3369 out of 10,000 games.
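These simulated frequencies can be compared with the exact probabilities. The sketch below is not part of the simulation above: it enumerates the nine equally likely combinations of winning door and first choice, and counts the wins for each strategy. The variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

doors = [0, 1, 2]

# All equally likely (good_door, first_choice) combinations
cases = list(product(doors, doors))

wins_keep = 0
wins_change = 0
for good_door, first_choice in cases:
    if first_choice == good_door:
        # Keeping wins only when the first choice was already right
        wins_keep += 1
    else:
        # Otherwise the presenter removes the other goat door,
        # so the remaining door hides the car and changing wins
        wins_change += 1

p_keep = Fraction(wins_keep, len(cases))
p_change = Fraction(wins_change, len(cases))
print("keep:", p_keep, "change:", p_change)  # keep: 1/3 change: 2/3
```

Changing doors therefore wins two times out of three, which is consistent with the counts obtained over 10,000 simulated games.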

3. Manipulate data using Numpy and Matplotlib¶

3.1. Manipulate data using Numpy

This section focuses on how to efficiently load, store and manipulate data. Data can come from a wide variety of sources, but it can always be considered as arrays of numbers. We're going to see a tool to manipulate these arrays: Numpy. NumPy (Numerical Python) provides an interface to store and process data. Numpy arrays are like Python lists, but Numpy makes operations much more efficient, especially for larger arrays.

Let's start by importing Numpy:

In [0]:

import numpy as np

Creating Numpy arrays

Numpy arrays, unlike Python lists, can only contain elements of a single type. There are several ways to create arrays in Numpy:

In [45]:

# Array of integers:
np.array([1, 2, 3])

Out[45]:

array([1, 2, 3])

In case there are different types of data in the starting list, Numpy will try to convert them all to the most generic type. For example, integers int will be converted to float numbers:

In [46]:

np.array([3.1, 4, 5, 6])

Out[46]:

array([3.1, 4. , 5. , 6. ])
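The conversion described above can be observed through the dtype attribute of each array. This is a minimal sketch; the variable names `ints` and `mixed` are chosen here for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([3.1, 4, 5, 6])

# An integer type; its exact width depends on the platform
print(ints.dtype)

# One float in the list is enough to make the whole array float64
print(mixed.dtype)

# np.result_type reports the common type of two types
print(np.result_type(np.int64, np.float64))  # float64
```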

As an alternative, it is also possible to manually set a type:

In [47]:

np.array([1, 2, 3], dtype='float32')

Out[47]:

array([1., 2., 3.], dtype=float32)

In many cases, particularly for large arrays, it is more efficient to create arrays directly with Numpy rather than from a Python list. Numpy has several functions for this task:

In [49]:

# An array of length 10, filled with integers equal to 0
np.zeros(10, dtype=int)

# A 3x5 array filled with floating point numbers equal to 1
np.ones((3, 5), dtype=float)

# A 3x5 array filled with 3.14
np.full((3, 5), 3.14)

# An array filled with a linear sequence
# starting at 0 and stopping before 20, with a step of 2
np.arange(0, 20, 2)

# An array of 5 values, uniformly spaced between 0 and 1
np.linspace(0, 1, 5)

# A 3x3 array of random values drawn uniformly from [0, 1).
# Try also np.random.randint and np.random.normal
np.random.random((3, 3))

# The 3x3 identity matrix
np.eye(3)

Out[49]:

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Array properties, Indexing and Slicing

Each Numpy array has some properties that are often useful.

In [50]:

np.random.seed(0)
x1 = np.random.randint(10, size=6) # One-dimensional array
print("number of dimensions of x1:", x1.ndim)
print("shape of x1:", x1.shape)
print("size of x1:", x1.size)
print("type of x1:", x1.dtype)

number of dimensions of x1: 1
shape of x1: (6,)
size of x1: 6
type of x1: int64

We will often need to access one or more elements of an array.

In [51]:

print(x1)

# To get access to the first element
print(x1[0])

# to get access to the last element
print(x1[-1])

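x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
print(x2[0,1])

# We can also assign new values. They are converted to the
# type of the array (int here): the string "1000" becomes 1000
x1[1] = "1000"
print(x1)

# ... and 3.14 is truncated to 3
x1[1] = 3.14
print(x1)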

[5 0 3 3 7 9]
5
9
5
[   5 1000    3    3    7    9]
[5 3 3 3 7 9]

In the same way that we can index elements using [], we can access a set of elements by combining [] and :. The syntax follows a simple rule: x[start:end:step].

In [52]:

print(x1[:5]) # The first five elements

print(x1[5:]) # Elements starting at index 5

print(x1[::2]) # Every other element

[5 3 3 3 7]
[9]
[5 3 7]
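The same start:end:step syntax applies to each dimension of a multi-dimensional array, with the dimensions separated by commas. The sketch below uses a small hypothetical array named `grid`. It also shows that a slice is a view on the original data, so modifying the slice modifies the array it came from.

```python
import numpy as np

# A 3x4 array containing the integers 0 to 11
grid = np.arange(12).reshape(3, 4)

# Rows 0 and 1, columns 0 to 2
first_rows = grid[:2, :3]
print(first_rows)

# Every row, column 1 only
second_column = grid[:, 1]
print(second_column)  # [1 5 9]

# A slice is a view: writing to it changes the original array
first_rows[0, 0] = 99
print(grid[0, 0])  # 99

# Use copy() to get an independent array
independent = grid[:2, :3].copy()
independent[0, 0] = -1
print(grid[0, 0])  # still 99
```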

Two or more arrays can be concatenated:

In [53]:

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

Out[53]:

array([1, 2, 3, 3, 2, 1])

So far, we have seen very basic things about Numpy arrays. From here on, we'll see what makes Numpy really essential. The reference implementation of Python, also called CPython, is very flexible, but this flexibility prevents it from using all the possible optimizations. As a result, an explicit Python loop over the elements of an array, as in the function below, is slow on large arrays.

In [54]:

def reverse_calculation(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
print(reverse_calculation(values))

wide_array = np.random.randint(1, 100, size=1000000)

[0.11111111 0.5        0.16666667 0.11111111 0.2       ]
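To get a feel for the cost of the explicit loop, one can time it against the equivalent whole-array expression 1.0 / values. The following is only a sketch: the function name `reciprocal_loop`, the array size, and the use of the standard timeit module are choices made here, and the measured times depend on the machine.

```python
import timeit
import numpy as np

def reciprocal_loop(values):
    # Computes 1 / v for each element with an explicit Python loop
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

np.random.seed(0)
wide_array = np.random.randint(1, 100, size=100_000)

# Time a single run of each approach
loop_time = timeit.timeit(lambda: reciprocal_loop(wide_array), number=1)
vector_time = timeit.timeit(lambda: 1.0 / wide_array, number=1)
print("loop:       {:.4f} s".format(loop_time))
print("vectorized: {:.4f} s".format(vector_time))

# Both approaches give the same values
same = np.allclose(reciprocal_loop(wide_array), 1.0 / wide_array)
print("same result:", same)
```

On most machines, the vectorized expression should be markedly faster, because the loop over the elements runs inside Numpy's compiled code rather than in the Python interpreter.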

Numpy avoids such loops with vectorized operations, which apply to a whole array at once. We now provide an overview of such operations:

In [55]:

# First of all, there are simple mathematical operations
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) 

x = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]

You can also call Numpy functions on Numpy arrays, and even on Python lists.

In [58]:

x = [-6, -3, 2, 5]
print("Absolute value:", np.abs(x))
print("Exponential:", np.exp(x))
print("Logarithm:", np.log(np.abs(x)))

Absolute value: [6 3 2 5]
Exponential: [2.47875218e-03 4.97870684e-02 7.38905610e+00 1.48413159e+02]
Logarithm: [1.79175947 1.09861229 0.69314718 1.60943791]

You can also execute Boolean operations on your arrays:

In [59]:

x = np.random.rand(4,4)
x > 0.7

Out[59]:

array([[ True,  True, False,  True],
       [ True, False, False, False],
       [False, False, False, False],
       [False,  True, False, False]])
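A Boolean array like this one is often used as a mask, either to count the elements that satisfy a condition or to extract them. The sketch below uses a small array with fixed, hypothetical values instead of random ones, so that the results are easy to follow.

```python
import numpy as np

x = np.array([[0.9, 0.2, 0.8],
              [0.1, 0.75, 0.3]])
mask = x > 0.7

# True counts as 1, so the sum is the number of matching elements
print(np.sum(mask))  # 3

# Indexing with the mask extracts the matching values (0.9, 0.8, 0.75)
print(x[mask])

# Does each row contain at least one value above 0.7?
print(np.any(mask, axis=1))
```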

Numpy also has functions that aggregate the values of an array, for example their sum:

In [60]:

L = np.random.random(10)
np.sum(L)

Out[60]:

4.698357490059848
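On a multi-dimensional array, the same kind of function can aggregate either all the values or the values along one axis. Here is a brief sketch on a small hypothetical 2x3 array named `M`.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

# Sum of all the elements
print(M.sum())  # 21

# Sum along axis 0: one result per column
print(M.sum(axis=0))  # [5 7 9]

# Sum along axis 1: one result per row
print(M.sum(axis=1))  # [ 6 15]

# Other aggregates work in the same way
print(M.min(), M.max(), M.mean())  # 1 6 3.5
```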

3.2. Manipulate data using Matplotlib

Matplotlib was created to generate graphics directly from Python. In this section, we focus on using Matplotlib as a visualization tool in Jupyter notebooks.

To begin the process, let's set up the workspace:

In [0]:

%matplotlib inline
import matplotlib.pyplot as plt
# Note: since Matplotlib 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import numpy as np

Let's start with a simple case, drawing the curve of the sine function:

In [62]:

fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));

The fig variable refers to a container that holds all the items of the figure (axes, labels, data, etc.). The axes correspond to the box shown above, which contains the data of the graph.
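Since the axes object holds the content of the graph, it is also the object used to add labels, a title or a legend. The following sketch draws two curves on the same axes; the label and title texts are arbitrary choices made here. In a notebook with %matplotlib inline the figure is displayed automatically, while in a plain script plt.show() would be needed.

```python
import matplotlib.pyplot as plt
import numpy as np

# Create the figure and its axes in a single call
fig, ax = plt.subplots()

x = np.linspace(0, 10, 1000)

# Each call to plot adds a curve to the same axes
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")

# Labels, title and legend are set on the axes
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Sine and cosine")
ax.legend()
```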

In the case of discrete datasets (points), we frequently use error bars to represent, for each point, the uncertainty about its exact value:

In [63]:

x = np.linspace(0, 30, 80)
dy = 0.3
y = np.sin(x) + dy * np.random.randn(80)

plt.errorbar(x, y, yerr=dy, fmt='.k');

4. Deal with a large amount of data using Pandas library¶

The Pandas library is one of the basic libraries for data science in Python. Pandas offers easy-to-use and robust data structures and the means to use them quickly. In this section, we will discuss why the Pandas library is useful, as well as the basic operations on the main object of this library, the DataFrame.

Suppose we describe a panda (the animal) by four measurements. Such a panda can be represented by a Numpy array:

In [64]:

import numpy as np
panda_numpy = np.array([200,50,100,80])
panda_numpy

Out[64]:

array([200,  50, 100,  80])

In the same way, a family of pandas can be represented by a list of arrays:

In [0]:

family = [
    np.array([100, 5, 20, 80]), # mom 
    np.array([50, 2.5, 10, 40]), # baby 
    np.array([110, 6, 22, 80]), # daddy 
]

Let's represent the family with Pandas:

In [67]:

import pandas as pd
family_df = pd.DataFrame(family)
family_df

Out[67]:

	0	1	2	3
0	100.0	5.0	20.0	80.0
1	50.0	2.5	10.0	40.0
2	110.0	6.0	22.0	80.0

The DataFrame is the Pandas object used to represent such tables of data.

In fact, we can do even better, by specifying column names and row names:

In [70]:

family_df = pd.DataFrame(family,
                         index=['mom', 'baby', 'dad'],
                         columns=['legs', 'hair', 'hands', 'belly'])
family_df

Out[70]:

	legs	hair	hands	belly
mom	100.0	5.0	20.0	80.0
baby	50.0	2.5	10.0	40.0
dad	110.0	6.0	22.0	80.0

Here are some useful features of DataFrames. First, let's access the belly column of our table. There are two possible syntaxes, which return exactly the same result:

In [71]:

family_df.belly
family_df["belly"]

Out[71]:

mom     80.0
baby    40.0
dad     80.0
Name: belly, dtype: float64
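Besides columns, rows and individual cells can also be selected. The sketch below rebuilds the same family table so that it can be run on its own, then uses the loc (by label) and iloc (by position) accessors of Pandas.

```python
import pandas as pd

family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["legs", "hair", "hands", "belly"],
)

# A row selected by its label
print(family_df.loc["baby"])

# A row selected by its position (0 is the first row)
print(family_df.iloc[0])

# A single cell: row label first, then column name
dad_legs = family_df.loc["dad", "legs"]
print(dad_legs)

# Several columns at once, by passing a list of names
subset = family_df[["legs", "belly"]]
print(subset)
```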

We will now go through the whole family one by one with the iterrows method, which returns, for each row, a tuple whose first element is the index and whose second element is the content of the row:

In [72]:

for ind, content in family_df.iterrows():
    print("Family:  %s :" % ind)
    print(content)

Family:  mom :
legs     100.0
hair       5.0
hands     20.0
belly     80.0
Name: mom, dtype: float64
Family:  baby :
legs     50.0
hair      2.5
hands    10.0
belly    40.0
Name: baby, dtype: float64
Family:  dad :
legs     110.0
hair       6.0
hands     22.0
belly     80.0
Name: dad, dtype: float64
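Iterating row by row is convenient for display, but many questions can be answered without a loop, by operating on whole columns. The sketch below rebuilds the same table so that it is self-contained; the threshold of 60 and the variable names are arbitrary choices made for illustration.

```python
import pandas as pd

family_df = pd.DataFrame(
    [[100, 5, 20, 80], [50, 2.5, 10, 40], [110, 6, 22, 80]],
    index=["mom", "baby", "dad"],
    columns=["legs", "hair", "hands", "belly"],
)

# Boolean filtering: keep the rows whose belly value exceeds 60
big_belly = family_df[family_df["belly"] > 60]
print(list(big_belly.index))  # ['mom', 'dad']

# Arithmetic between two columns is applied row by row
ratio = family_df["legs"] / family_df["hands"]
print(ratio)

# Mean of each column
print(family_df.mean())
```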

Let's now create a CSV file called dataset.csv:

In [74]:

%%writefile dataset.csv
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123

Writing dataset.csv

Pandas is a leading library for manipulating tabular data. It can, for instance, read a CSV file: it only takes one line to create a dataframe from a CSV:

In [0]:

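data = pd.read_csv("dataset.csv", header=None)

The data variable now contains a dataframe with the data from the CSV file;
The values in our CSV file are separated by commas, which is the separator that pd.read_csv expects by default; for a file that uses another symbol such as ;, pass it with the sep parameter (sep=";");
Our file has no header row, so we pass header=None to prevent the first record from being used as column names.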
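To illustrate the role of the separator, the sketch below reads two short pieces of CSV text held in memory, one separated by commas and one by semicolons. The sample records are shortened, hypothetical versions of the data above, and io.StringIO is used here only to avoid writing files. Passing dtype=str is an additional choice made here to keep the leading zeros of the postal codes.

```python
import io
import pandas as pd

comma_text = 'John,Doe,08075\n"Joan ""the bone"", Anne",Jet,00123\n'
semicolon_text = "John;Doe;08075\nJack;McGinnis;09119\n"

# Comma is the default separator; header=None because there is no header row
by_comma = pd.read_csv(io.StringIO(comma_text), header=None, dtype=str)

# For any other separator, pass it with the sep parameter
by_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";",
                           header=None, dtype=str)

print(by_comma.shape)      # (2, 3)
print(by_semicolon.shape)  # (2, 3)

# A quoted field can contain commas and doubled quotes
print(by_comma.iloc[1, 0])  # Joan "the bone", Anne

# Read as text, the postal code keeps its leading zero
print(by_comma.iloc[0, 2])  # 08075
```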