Introduction to Python for Data Sciences |
Franck Iutzeler |
import numpy as np
import pandas as pd
If we apply a NumPy function to a Pandas dataframe, the result is another Pandas dataframe with the indices preserved.
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df
A | B | C | D | |
---|---|---|---|---|
0 | 9 | 8 | 1 | 5 |
1 | 2 | 9 | 1 | 2 |
2 | 6 | 0 | 2 | 5 |
np.cos(df * np.pi / 2) - 1
A | B | C | D | |
---|---|---|---|---|
0 | -1.0 | 0.0 | -1.0 | -1.0 |
1 | -2.0 | -1.0 | -1.0 | -2.0 |
2 | -2.0 | 0.0 | -2.0 | -1.0 |
Arithmetic operations can be performed either with the operators +, -, *, / or with the dedicated methods add, multiply, etc.
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A
A | B | |
---|---|---|
0 | 4 | 10 |
1 | 2 | 2 |
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('BAC'))
B
B | A | C | |
---|---|---|---|
0 | 2 | 5 | 6 |
1 | 3 | 3 | 6 |
2 | 7 | 4 | 0 |
A+B
A | B | C | |
---|---|---|---|
0 | 9.0 | 12.0 | NaN |
1 | 5.0 | 5.0 | NaN |
2 | NaN | NaN | NaN |
The Pandas arithmetic methods also have a fill_value option that replaces an entry missing from either of the dataframes by the given value before performing the operation.
A.add(B, fill_value=0.0)
A | B | C | |
---|---|---|---|
0 | 9.0 | 12.0 | 6.0 |
1 | 5.0 | 5.0 | 6.0 |
2 | 4.0 | 7.0 | 0.0 |
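The same fill_value option exists for the other arithmetic methods (sub, mul, div, ...); for a product, a neutral fill of 1.0 may be more natural than 0.0. A minimal sketch:
A.mul(B, fill_value=1.0)  # entries missing from one frame are treated as 1 instead of producing NaN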
Thanks to index and column names, dataframes can easily be added, merged, etc. However, if some entries (columns or indices) are missing, the operations can get complicated. The most standard situations are covered here; take a look at the documentation (notably this one on merging, appending, and concatenating).
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A2 = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list('AB'))
print("A:\n",A,"\nA2:\n",A2)
A:
    A   B
0  11  15
1   2  19
A2:
    A   B
0  17   8
1  18   8
2   3  12
A.append(A2) # this does not "append to A" but creates a new dataframe
A | B | |
---|---|---|
0 | 11 | 15 |
1 | 2 | 19 |
0 | 17 | 8 |
1 | 18 | 8 |
2 | 3 | 12 |
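Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the equivalent with concat is:
pd.concat([A, A2])  # same result as A.append(A2) on recent pandas versions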
Sometimes the indices do not matter; they can be reset using ignore_index=True.
A.append(A2,ignore_index=True)
A | B | |
---|---|---|
0 | 11 | 15 |
1 | 2 | 19 |
2 | 17 | 8 |
3 | 18 | 8 |
4 | 3 | 12 |
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A2 = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list('AB'))
A3 = pd.DataFrame(np.random.randint(0, 20, (1, 3)), columns=list('CAD'))
print("A:\n",A,"\nA2:\n",A2,"\nA3:\n",A3)
A:
    A   B
0  18  16
1   7  13
A2:
   A   B
0  7  16
1  4   5
2  3   7
A3:
    C   A  D
0  13  17  1
The most important settings of the concat function are pd.concat(objs, axis=0, join='outer', ignore_index=False) where
- objs is the list of dataframes to concatenate;
- axis is the axis along which to concatenate: 0 (default) for the rows, 1 for the columns;
- join decides whether we keep all columns/indices on the other axis ('outer', default) or only their intersection ('inner');
- ignore_index decides whether we keep the previous labels (False, default) or assign new ones (True).
For a detailed view see this doc on merging, appending, and concatenating
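For instance, concatenating along the rows with join='inner' keeps only the columns shared by all inputs; a small sketch with A and A3, which only share column A:
pd.concat([A, A3], join='inner')  # only the common column A survives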
pd.concat([A,A2,A3],ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | 18 | 16.0 | NaN | NaN |
1 | 7 | 13.0 | NaN | NaN |
2 | 7 | 16.0 | NaN | NaN |
3 | 4 | 5.0 | NaN | NaN |
4 | 3 | 7.0 | NaN | NaN |
5 | 17 | NaN | 13.0 | 1.0 |
pd.concat([A,A2,A3],axis=1)
A | B | A | B | C | A | D | |
---|---|---|---|---|---|---|---|
0 | 18.0 | 16.0 | 7 | 16 | 13.0 | 17.0 | 1.0 |
1 | 7.0 | 13.0 | 4 | 5 | NaN | NaN | NaN |
2 | NaN | NaN | 3 | 7 | NaN | NaN | NaN |
pd.concat([A,A2,A3],axis=1,ignore_index=True,join='inner')
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 18 | 16 | 7 | 16 | 13 | 17 | 1 |
Dataframes can also be merged on common columns, in the manner of database joins; for a detailed view, see this doc on merging, appending, and concatenating.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
df3 = pd.merge(df1,df2)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
df4
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3,df4)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR | 2014 | Steve |
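By default, pd.merge joins on the columns the two dataframes have in common (employee above, group here) and performs an inner join; both choices can be made explicit, as in this sketch:
pd.merge(df3, df4, on='group', how='left')  # explicit join key and join type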
Before exploring the data, it is essential to verify its soundness: if it has missing or duplicated entries, the results of our tests may not be accurate. Pandas provides a collection of methods to verify the sanity of the data (recall that when data is missing for an entry, it is noted as NaN, and thus any further operation involving it will be NaN).
To explore some typical problems in a dataset, I messed with a small part of the MovieLens dataset. The ratings_mess.csv file contains 4 columns:
- userId: id of the user, integer greater than 1
- movieId: id of the movie, integer greater than 1
- rating: rating given by the user to the movie, float between 0.0 and 5.0
- timestamp: timestamp, integer
and features (man-made!) errors, some of them minor, some of them major.
ratings = pd.read_csv('data/ml-small/ratings_mess.csv')
ratings.head(7) # displays the first n rows of a dataframe, 5 by default
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | NaN |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
4 | NaN | 62.0 | 3.0 | 8.353557e+08 |
5 | 2.0 | 144.0 | NaN | 8.353560e+08 |
6 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
Pandas provides functions that check whether values are missing:
- isnull(): generates a boolean mask indicating missing values
- notnull(): the opposite of isnull()
ratings.isnull().head(5)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | False | False |
2 | False | False | False | True |
3 | False | False | False | False |
4 | True | False | False | False |
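Since booleans behave as 0/1, chaining sum() gives a quick count of missing values per column:
ratings.isnull().sum()  # number of missing values in each column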
Now that we have to prune rows of our data, this will be done using dropna() through dataframe.dropna(subset=["col_1","col_2"], inplace=True), which drops all rows with at least one missing value in the columns col_1, col_2 of dataframe, in place, that is, without a copy.
Warning: this function deletes any row with at least one missing value among the selected columns, which is not always desirable. Also, with inplace=True, the dataframe it is applied to is modified directly, making the operation irreversible; drop inplace=True (keeping the default inplace=False) to get a copy and see the result before applying it.
For instance here, userId, movieId, and rating are essential whereas the timestamp is not (it can be dropped for the prediction process). Thus, we will delete the rows where one of userId, movieId, rating is missing and fill the timestamp with 0 when it is missing.
ratings.dropna(subset=["userId","movieId","rating"],inplace=True)
ratings.head(5)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | NaN |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
6 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
To fill missing data (in a certain column), the recommended way is to use fillna() through dataframe["col"].fillna(value, inplace=True), which replaces all missing values in the column col of dataframe by value, in place, that is, without a copy (again, this is irreversible; to get a copy instead, use inplace=False).
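Note: on recent pandas versions (2.x with copy-on-write enabled, the default in 3.0), this chained inplace call may no longer modify the original dataframe; the assignment form is a safe equivalent:
ratings["timestamp"] = ratings["timestamp"].fillna(0)  # assignment form, copy-on-write safe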
ratings["timestamp"].fillna(0,inplace=True)
ratings.head(7)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
6 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
8 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
9 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
This indeed gives the correct result; however, the row index now has missing numbers. The indices can be reset with reset_index(inplace=True, drop=True).
ratings.reset_index(inplace=True,drop=True)
ratings.head(7)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
4 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
5 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
6 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
Even without the missing values, some rows are problematic as they feature values outside of the prescribed ranges (userId: id of the user, integer greater than 1; movieId: id of the movie, integer greater than 1; rating: rating given by the user to the movie, float between 0.0 and 5.0; timestamp: timestamp, integer).
ratings[ratings["userId"]<1] # Identifying a problem
userId | movieId | rating | timestamp | |
---|---|---|---|---|
4 | 0.0 | 616.0 | 3.0 | 835355932.0 |
Now, we drop the corresponding row with drop, as drop(problematic_row.index, inplace=True).
Warning: do not forget .index and inplace=True.
ratings.drop(ratings[ratings["userId"]<1].index, inplace=True)
ratings.head(7)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
5 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
6 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
7 | 3.0 | 110.0 | 4.0 | 1.298922e+09 |
pb_rows = ratings[ratings["movieId"]<1]
pb_rows
userId | movieId | rating | timestamp | |
---|---|---|---|---|
12 | 3.0 | 0.0 | 3.0 | 1.298924e+09 |
24 | 4.0 | -1.0 | 2.0 | 9.499822e+08 |
ratings.drop(pb_rows.index, inplace=True)
And finally, the ratings.
pb_rows = ratings[ratings["rating"]<0]
pb_rows2 = ratings[ratings["rating"]>5]
tot_pb_rows = pb_rows.append(pb_rows2)  # pd.concat([pb_rows, pb_rows2]) on pandas >= 2.0
tot_pb_rows
userId | movieId | rating | timestamp | |
---|---|---|---|---|
15 | 3.0 | 2858.0 | -4.0 | 1.298922e+09 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
25 | 5.0 | 3176.0 | 123.5 | 1.163374e+09 |
29 | 6.0 | 111.0 | 6.0 | 1.109258e+09 |
ratings.drop(tot_pb_rows.index, inplace=True)
ratings.reset_index(inplace=True,drop=True)
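Equivalently, the three range checks above could have been done in one pass with a combined boolean mask; a minimal sketch (redundant here since the data is already clean):
valid = ((ratings["userId"] >= 1)
         & (ratings["movieId"] >= 1)
         & ratings["rating"].between(0.0, 5.0))  # bounds included
ratings = ratings[valid].reset_index(drop=True)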
We finally have our dataset cured! Let us save it for further use: to_csv saves the dataframe as CSV into some file, and index=False drops the index names since we did not specify them.
ratings.to_csv("data/ml-small/ratings_cured.csv",index=False)
With our cured dataset, we can begin exploring.
ratings = pd.read_csv('data/ml-small/ratings_cured.csv')
ratings.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
4 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
The following table summarizes some other built-in Pandas aggregations:
Aggregation | Description |
---|---|
count() | Total number of items |
first(), last() | First and last item |
mean(), median() | Mean and median |
min(), max() | Minimum and maximum |
std(), var() | Standard deviation and variance |
mad() | Mean absolute deviation |
prod() | Product of all items |
sum() | Sum of all items |
These are all methods of DataFrame and Series objects, and describe() also provides a quick overview.
ratings.describe()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
count | 24.000000 | 24.000000 | 24.000000 | 2.400000e+01 |
mean | 3.416667 | 2465.916667 | 4.020833 | 1.056019e+09 |
std | 1.212854 | 6171.507686 | 0.926414 | 3.610777e+08 |
min | 1.000000 | 10.000000 | 2.000000 | 0.000000e+00 |
25% | 3.000000 | 213.250000 | 3.375000 | 9.498962e+08 |
50% | 3.000000 | 1028.500000 | 4.000000 | 1.163375e+09 |
75% | 4.000000 | 1628.000000 | 5.000000 | 1.298862e+09 |
max | 6.000000 | 30749.000000 | 5.000000 | 1.298923e+09 |
We see that these statistics do not make sense for all columns. Let us drop the timestamp and examine the ratings.
ratings.drop("timestamp",axis=1,inplace=True)
ratings.head()
userId | movieId | rating | |
---|---|---|---|
0 | 1.0 | 31.0 | 2.5 |
1 | 1.0 | 1029.0 | 3.0 |
2 | 2.0 | 10.0 | 4.0 |
3 | 2.0 | 720.0 | 4.0 |
4 | 3.0 | 60.0 | 3.0 |
ratings["rating"].describe()
count    24.000000
mean      4.020833
std       0.926414
min       2.000000
25%       3.375000
50%       4.000000
75%       5.000000
max       5.000000
Name: rating, dtype: float64
These ratings are linked to users and movies; in order to have a separate view per user/movie, grouping has to be used.
The GroupBy operation (which comes from SQL) accomplishes:
- splitting the DataFrame depending on the value of the specified key;
- applying a function to each of the resulting groups;
- combining the results into a new structure.
Source: [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas
ratings.head()
userId | movieId | rating | |
---|---|---|---|
0 | 1.0 | 31.0 | 2.5 |
1 | 1.0 | 1029.0 | 3.0 |
2 | 2.0 | 10.0 | 4.0 |
3 | 2.0 | 720.0 | 4.0 |
4 | 3.0 | 60.0 | 3.0 |
So to get the mean of the ratings per user, the command is
ratings.groupby("userId")["rating"].mean()
userId
1.0    2.750000
2.0    4.000000
3.0    3.444444
4.0    5.000000
5.0    4.166667
6.0    4.500000
Name: rating, dtype: float64
Filtering is the action of deleting rows depending on a boolean function. For instance, the following removes the users who rated only one movie.
ratings.groupby("userId")["rating"].count()
userId
1.0    2
2.0    2
3.0    9
4.0    7
5.0    3
6.0    1
Name: rating, dtype: int64
def filter_func(x):
return x["rating"].count() >= 2
filtered = ratings.groupby("userId").filter(filter_func)
filtered
userId | movieId | rating | |
---|---|---|---|
0 | 1.0 | 31.0 | 2.5 |
1 | 1.0 | 1029.0 | 3.0 |
2 | 2.0 | 10.0 | 4.0 |
3 | 2.0 | 720.0 | 4.0 |
4 | 3.0 | 60.0 | 3.0 |
5 | 3.0 | 110.0 | 4.0 |
6 | 3.0 | 247.0 | 3.5 |
7 | 3.0 | 592.0 | 3.0 |
8 | 3.0 | 593.0 | 3.0 |
9 | 3.0 | 595.0 | 2.0 |
10 | 3.0 | 2762.0 | 3.5 |
11 | 3.0 | 2841.0 | 4.0 |
12 | 3.0 | 2959.0 | 5.0 |
13 | 4.0 | 34.0 | 5.0 |
14 | 4.0 | 112.0 | 5.0 |
15 | 4.0 | 1028.0 | 5.0 |
16 | 4.0 | 1030.0 | 5.0 |
17 | 4.0 | 1079.0 | 5.0 |
18 | 4.0 | 1089.0 | 5.0 |
19 | 4.0 | 1097.0 | 5.0 |
20 | 5.0 | 4447.0 | 4.5 |
21 | 5.0 | 4718.0 | 3.5 |
22 | 5.0 | 30749.0 | 4.5 |
filtered.groupby("userId")["rating"].count()
userId
1.0    2
2.0    2
3.0    9
4.0    7
5.0    3
Name: rating, dtype: int64
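The same filtering can be written compactly with a lambda:
filtered = ratings.groupby("userId").filter(lambda x: x["rating"].count() >= 2)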
Transforming is the action of applying a function group by group while returning an object of the same shape as the input.
For instance, let us normalize the ratings so that they have zero mean for each user.
ratings.groupby("userId")["rating"].mean()
userId
1.0    2.750000
2.0    4.000000
3.0    3.444444
4.0    5.000000
5.0    4.166667
6.0    4.500000
Name: rating, dtype: float64
def center_ratings(x):
x["rating"] = x["rating"] - x["rating"].mean()
return x
centered = ratings.groupby("userId").apply(center_ratings)
centered.groupby("userId")["rating"].mean()
userId
1.0    0.000000e+00
2.0    0.000000e+00
3.0   -1.973730e-16
4.0    0.000000e+00
5.0   -2.960595e-16
6.0    0.000000e+00
Name: rating, dtype: float64
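Pandas also provides a dedicated transform method whose result is aligned with the original index, which is often simpler than the group-wise apply above; a minimal sketch:
ratings["rating"] - ratings.groupby("userId")["rating"].transform("mean")  # centered ratings, same shape as the input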
Aggregation lets you apply several operations at once.
ratings.groupby("userId")["rating"].aggregate([min,max,np.mean,np.median,len])
min | max | mean | median | len | |
---|---|---|---|---|---|
userId | |||||
1.0 | 2.5 | 3.0 | 2.750000 | 2.75 | 2.0 |
2.0 | 4.0 | 4.0 | 4.000000 | 4.00 | 2.0 |
3.0 | 2.0 | 5.0 | 3.444444 | 3.50 | 9.0 |
4.0 | 5.0 | 5.0 | 5.000000 | 5.00 | 7.0 |
5.0 | 3.5 | 4.5 | 4.166667 | 4.50 | 3.0 |
6.0 | 4.5 | 4.5 | 4.500000 | 4.50 | 1.0 |
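aggregate (or its alias agg) also accepts strings and, on pandas 0.25 or later, named aggregations that label the output columns directly; a sketch:
ratings.groupby("userId").agg(min_rating=("rating", "min"),
                              max_rating=("rating", "max"),
                              n_ratings=("rating", "count"))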
Exercise: Bots Discovery
In the dataset ratings_bots.csv, some users may be bots: to help a movie succeed, they add ratings (often favorable ones). To get a better recommendation, we try to remove them.
- Count the users with a mean rating above 4.7/5 and delete them.
hint: the nunique function may be helpful to count.
- Delete multiple reviews of a movie by a single user by replacing them with only the first one. What is the proportion of potential bots among the users?
hint: the groupby function can be applied to several columns; also, reset_index(drop=True) removes the groupby indexing.
hint: remember the loc function, e.g. df.loc[df['userId'] == 128] returns a dataframe of the rows where the userId is 128, and df.loc[(df['userId'] == 128) & (df['movieId'] == 3825)] returns a dataframe of the rows where the userId is 128 and the movieId is 3825. In total, 17 ratings have to be removed; for instance, user 128 has 3 ratings of the movie 3825.
This dataset has around 100 000 ratings so hand picking won't do!
import pandas as pd
import numpy as np
ratings_bots = pd.read_csv('data/ml-small/ratings_bots.csv')
Exercise: Planets discovery
We will use the Planets dataset, available via the Seaborn package. It provides information on how astronomers found planets around other stars, known as exoplanets.
- Display median, mean, and quantile information for these planets' orbital periods, masses, and distances.
- For each method, display statistics on the years in which planets were discovered using this technique.
import pandas as pd
import numpy as np
planets = pd.read_csv('data/planets.csv')
print(planets.shape)
planets.head()
(1035, 6)
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |