Introduction to Python for Data Sciences |
Franck Iutzeler |
import numpy as np
import pandas as pd
If we apply a NumPy function to a Pandas dataframe, the result is another Pandas dataframe with the indices preserved.
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df
A | B | C | D | |
---|---|---|---|---|
0 | 9 | 8 | 1 | 5 |
1 | 2 | 9 | 1 | 2 |
2 | 6 | 0 | 2 | 5 |
np.cos(df * np.pi / 2) - 1
A | B | C | D | |
---|---|---|---|---|
0 | -1.0 | 0.0 | -1.0 | -1.0 |
1 | -2.0 | -1.0 | -1.0 | -2.0 |
2 | -2.0 | 0.0 | -2.0 | -1.0 |
Arithmetic operations can be performed either with the operators +, -, *, / or with the dedicated methods add, multiply, etc.
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A
A | B | |
---|---|---|
0 | 4 | 10 |
1 | 2 | 2 |
B = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('BAC'))
B
B | A | C | |
---|---|---|---|
0 | 2 | 5 | 6 |
1 | 3 | 3 | 6 |
2 | 7 | 4 | 0 |
A+B
A | B | C | |
---|---|---|---|
0 | 9.0 | 12.0 | NaN |
1 | 5.0 | 5.0 | NaN |
2 | NaN | NaN | NaN |
The Pandas arithmetic methods also have a fill_value option that replaces an entry missing from either of the dataframes by the given value before performing the operation.
A.add(B, fill_value=0.0)
A | B | C | |
---|---|---|---|
0 | 9.0 | 12.0 | 6.0 |
1 | 5.0 | 5.0 | 6.0 |
2 | 4.0 | 7.0 | 0.0 |
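The same fill_value option exists for the other arithmetic methods (sub, mul, div, ...); for a product, a neutral fill of 1.0 may be more natural than 0.0. A minimal sketch:
A.mul(B, fill_value=1.0)  # entries missing from one frame are treated as 1 instead of producing NaN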
Thanks to index and column names, dataframes can easily be added, merged, etc. However, if some entries (columns or indices) are missing, the operations can get complicated. The most standard situations are covered here; take a look at the documentation (notably this one on merging, appending, and concatenating).
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A2 = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list('AB'))
print("A:\n",A,"\nA2:\n",A2)
A:
    A   B
0  11  15
1   2  19
A2:
    A   B
0  17   8
1  18   8
2   3  12
A.append(A2) # this does not "append to A" but creates a new dataframe
A | B | |
---|---|---|
0 | 11 | 15 |
1 | 2 | 19 |
0 | 17 | 8 |
1 | 18 | 8 |
2 | 3 | 12 |
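Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the equivalent with concat is:
pd.concat([A, A2])  # same result as A.append(A2) on recent pandas versions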
Sometimes the indices do not matter; they can be reset using ignore_index=True.
A.append(A2,ignore_index=True)
A | B | |
---|---|---|
0 | 11 | 15 |
1 | 2 | 19 |
2 | 17 | 8 |
3 | 18 | 8 |
4 | 3 | 12 |
A = pd.DataFrame(np.random.randint(0, 20, (2, 2)), columns=list('AB'))
A2 = pd.DataFrame(np.random.randint(0, 20, (3, 2)), columns=list('AB'))
A3 = pd.DataFrame(np.random.randint(0, 20, (1, 3)), columns=list('CAD'))
print("A:\n",A,"\nA2:\n",A2,"\nA3:\n",A3)
A:
    A   B
0  18  16
1   7  13
A2:
   A   B
0  7  16
1  4   5
2  3   7
A3:
    C   A  D
0  13  17  1
The most important settings of the concat function are pd.concat(objs, axis=0, join='outer', ignore_index=False) where
- objs is the list of dataframes to concatenate;
- axis is the axis along which to concatenate: 0 (default) for the rows, 1 for the columns;
- join decides whether we keep all columns/indices on the other axis ('outer', default) or only their intersection ('inner');
- ignore_index decides whether we keep the previous labels (False, default) or assign new ones (True).
For a detailed view see this doc on merging, appending, and concatenating
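For instance, concatenating along the rows with join='inner' keeps only the columns shared by all inputs; a small sketch with A and A3, which only share column A:
pd.concat([A, A3], join='inner')  # only the common column A survives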
pd.concat([A,A2,A3],ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | 18 | 16.0 | NaN | NaN |
1 | 7 | 13.0 | NaN | NaN |
2 | 7 | 16.0 | NaN | NaN |
3 | 4 | 5.0 | NaN | NaN |
4 | 3 | 7.0 | NaN | NaN |
5 | 17 | NaN | 13.0 | 1.0 |
pd.concat([A,A2,A3],axis=1)
A | B | A | B | C | A | D | |
---|---|---|---|---|---|---|---|
0 | 18.0 | 16.0 | 7 | 16 | 13.0 | 17.0 | 1.0 |
1 | 7.0 | 13.0 | 4 | 5 | NaN | NaN | NaN |
2 | NaN | NaN | 3 | 7 | NaN | NaN | NaN |
pd.concat([A,A2,A3],axis=1,ignore_index=True,join='inner')
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
0 | 18 | 16 | 7 | 16 | 13 | 17 | 1 |
Dataframes can also be merged on common columns, in the manner of database joins; for a detailed view, see this doc on merging, appending, and concatenating.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
df3 = pd.merge(df1,df2)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
df4
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3,df4)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR | 2014 | Steve |
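By default, pd.merge joins on the columns the two dataframes have in common (employee above, group here) and performs an inner join; both choices can be made explicit, as in this sketch:
pd.merge(df3, df4, on='group', how='left')  # explicit join key and join type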
Before exploring the data, it is essential to verify its soundness: if it has missing or duplicated entries, the results of our tests may not be accurate. Pandas provides a collection of methods to verify the sanity of the data (recall that when data is missing for an entry, it is noted as NaN, and thus any further operation involving it will be NaN).
To explore some typical problems in a dataset, I messed with a small part of the MovieLens dataset. The ratings_mess.csv file contains 4 columns:
- userId: id of the user, integer greater than 1
- movieId: id of the movie, integer greater than 1
- rating: rating given by the user to the movie, float between 0.0 and 5.0
- timestamp: timestamp, integer
and features (man-made!) errors, some of them minor, some of them major.
ratings = pd.read_csv('data/ml-small/ratings_mess.csv')
ratings.head(7) # displays the first n rows of a dataframe, 5 by default
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | NaN |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
4 | NaN | 62.0 | 3.0 | 8.353557e+08 |
5 | 2.0 | 144.0 | NaN | 8.353560e+08 |
6 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
Pandas provides functions that check whether values are missing:
- isnull(): generates a boolean mask indicating missing values
- notnull(): the opposite of isnull()
ratings.isnull().head(5)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | False | False |
2 | False | False | False | True |
3 | False | False | False | False |
4 | True | False | False | False |
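Since booleans behave as 0/1, chaining sum() gives a quick count of missing values per column:
ratings.isnull().sum()  # number of missing values in each column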
Now that we have to prune rows of our data, this will be done using dropna() through dataframe.dropna(subset=["col_1","col_2"], inplace=True), which drops all rows with at least one missing value in the columns col_1, col_2 of dataframe, in place, that is, without a copy.
Warning: this function deletes any row with at least one missing value among the selected columns, which is not always desirable. Also, with inplace=True, the dataframe it is applied to is modified directly, making the operation irreversible; drop inplace=True (keeping the default inplace=False) to get a copy and see the result before applying it.
For instance here, userId, movieId, and rating are essential whereas the timestamp is not (it can be dropped for the prediction process). Thus, we will delete the rows where one of userId, movieId, rating is missing and fill the timestamp with 0 when it is missing.
ratings.dropna(subset=["userId","movieId","rating"],inplace=True)
ratings.head(5)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | NaN |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
6 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
To fill missing data (in a certain column), the recommended way is to use fillna() through dataframe["col"].fillna(value, inplace=True), which replaces all missing values in the column col of dataframe by value, in place, that is, without a copy (again, this is irreversible; to get a copy instead, use inplace=False).
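Note: on recent pandas versions (2.x with copy-on-write enabled, the default in 3.0), this chained inplace call may no longer modify the original dataframe; the assignment form is a safe equivalent:
ratings["timestamp"] = ratings["timestamp"].fillna(0)  # assignment form, copy-on-write safe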
ratings["timestamp"].fillna(0,inplace=True)
ratings.head(7)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
6 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
8 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
9 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
This indeed gives the correct result; however, the row index now has missing numbers. The indices can be reset with reset_index(inplace=True, drop=True).
ratings.reset_index(inplace=True,drop=True)
ratings.head(7)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
4 | 0.0 | 616.0 | 3.0 | 8.353559e+08 |
5 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
6 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
Even without the missing values, some rows are problematic as they feature values outside of the prescribed ranges (userId: id of the user, integer greater than 1; movieId: id of the movie, integer greater than 1; rating: rating given by the user to the movie, float between 0.0 and 5.0; timestamp: timestamp, integer).
ratings[ratings["userId"]<1] # Identifying a problem
userId | movieId | rating | timestamp | |
---|---|---|---|---|
4 | 0.0 | 616.0 | 3.0 | 835355932.0 |
Now, we drop the corresponding row with drop, as drop(problematic_row.index, inplace=True).
Warning: do not forget .index and inplace=True.
ratings.drop(ratings[ratings["userId"]<1].index, inplace=True)
ratings.head(7)
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
5 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
6 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
7 | 3.0 | 110.0 | 4.0 | 1.298922e+09 |
pb_rows = ratings[ratings["movieId"]<1]
pb_rows
userId | movieId | rating | timestamp | |
---|---|---|---|---|
12 | 3.0 | 0.0 | 3.0 | 1.298924e+09 |
24 | 4.0 | -1.0 | 2.0 | 9.499822e+08 |
ratings.drop(pb_rows.index, inplace=True)
And finally, the ratings.
pb_rows = ratings[ratings["rating"]<0]
pb_rows2 = ratings[ratings["rating"]>5]
tot_pb_rows = pb_rows.append(pb_rows2)  # pd.concat([pb_rows, pb_rows2]) on pandas >= 2.0
tot_pb_rows
userId | movieId | rating | timestamp | |
---|---|---|---|---|
15 | 3.0 | 2858.0 | -4.0 | 1.298922e+09 |
3 | 2.0 | 52.0 | 83.0 | 8.353560e+08 |
25 | 5.0 | 3176.0 | 123.5 | 1.163374e+09 |
29 | 6.0 | 111.0 | 6.0 | 1.109258e+09 |
ratings.drop(tot_pb_rows.index, inplace=True)
ratings.reset_index(inplace=True,drop=True)
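Equivalently, the three range checks above could have been done in one pass with a combined boolean mask; a minimal sketch (redundant here since the data is already clean):
valid = ((ratings["userId"] >= 1)
         & (ratings["movieId"] >= 1)
         & ratings["rating"].between(0.0, 5.0))  # bounds included
ratings = ratings[valid].reset_index(drop=True)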
We finally have our dataset cured! Let us save it for further use: to_csv saves the dataframe as CSV into some file, and index=False drops the index names since we did not specify them.
ratings.to_csv("data/ml-small/ratings_cured.csv",index=False)
With our cured dataset, we can begin exploring.
ratings = pd.read_csv('data/ml-small/ratings_cured.csv')
ratings.head()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
0 | 1.0 | 31.0 | 2.5 | 1.260759e+09 |
1 | 1.0 | 1029.0 | 3.0 | 1.260759e+09 |
2 | 2.0 | 10.0 | 4.0 | 0.000000e+00 |
3 | 2.0 | 720.0 | 4.0 | 8.353560e+08 |
4 | 3.0 | 60.0 | 3.0 | 1.298862e+09 |
The following table summarizes some other built-in Pandas aggregations:
Aggregation | Description |
---|---|
count() | Total number of items |
first(), last() | First and last item |
mean(), median() | Mean and median |
min(), max() | Minimum and maximum |
std(), var() | Standard deviation and variance |
mad() | Mean absolute deviation |
prod() | Product of all items |
sum() | Sum of all items |
These are all methods of DataFrame and Series objects, and describe() also provides a quick overview.
ratings.describe()
userId | movieId | rating | timestamp | |
---|---|---|---|---|
count | 24.000000 | 24.000000 | 24.000000 | 2.400000e+01 |
mean | 3.416667 | 2465.916667 | 4.020833 | 1.056019e+09 |
std | 1.212854 | 6171.507686 | 0.926414 | 3.610777e+08 |
min | 1.000000 | 10.000000 | 2.000000 | 0.000000e+00 |
25% | 3.000000 | 213.250000 | 3.375000 | 9.498962e+08 |
50% | 3.000000 | 1028.500000 | 4.000000 | 1.163375e+09 |
75% | 4.000000 | 1628.000000 | 5.000000 | 1.298862e+09 |
max | 6.000000 | 30749.000000 | 5.000000 | 1.298923e+09 |
We see that these statistics do not make sense for all columns. Let us drop the timestamp and examine the ratings.
ratings.drop("timestamp",axis=1,inplace=True)
ratings.head()
userId | movieId | rating | |
---|---|---|---|
0 | 1.0 | 31.0 | 2.5 |
1 | 1.0 | 1029.0 | 3.0 |
2 | 2.0 | 10.0 | 4.0 |
3 | 2.0 | 720.0 | 4.0 |
4 | 3.0 | 60.0 | 3.0 |
ratings["rating"].describe()
count    24.000000
mean      4.020833
std       0.926414
min       2.000000
25%       3.375000
50%       4.000000
75%       5.000000
max       5.000000
Name: rating, dtype: float64
These ratings are linked to users and movies; in order to have a separate view per user/movie, grouping has to be used.
The GroupBy operation (which comes from SQL) accomplishes:
- splitting the DataFrame depending on the value of the specified key;
- applying a function to each of the resulting groups;
- combining the results into a new structure.
Source: [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas
ratings.head()
userId | movieId | rating | |
---|---|---|---|
0 | 1.0 | 31.0 | 2.5 |
1 | 1.0 | 1029.0 | 3.0 |
2 | 2.0 | 10.0 | 4.0 |
3 | 2.0 | 720.0 | 4.0 |
4 | 3.0 | 60.0 | 3.0 |
So to get the mean of the ratings per user, the command is
ratings.groupby("userId")["rating"].mean()
userId
1.0    2.750000
2.0    4.000000
3.0    3.444444
4.0    5.000000
5.0    4.166667
6.0    4.500000
Name: rating, dtype: float64
Filtering is the action of deleting rows depending on a boolean function. For instance, the following removes the users who rated only one movie.
ratings.groupby("userId")["rating"].count()
userId
1.0    2
2.0    2
3.0    9
4.0    7
5.0    3
6.0    1
Name: rating, dtype: int64
def filter_func(x):
return x["rating"].count() >= 2
filtered = ratings.groupby("userId").filter(filter_func)
filtered
userId | movieId | rating | |
---|---|---|---|
0 | 1.0 | 31.0 | 2.5 |
1 | 1.0 | 1029.0 | 3.0 |
2 | 2.0 | 10.0 | 4.0 |
3 | 2.0 | 720.0 | 4.0 |
4 | 3.0 | 60.0 | 3.0 |
5 | 3.0 | 110.0 | 4.0 |
6 | 3.0 | 247.0 | 3.5 |
7 | 3.0 | 592.0 | 3.0 |
8 | 3.0 | 593.0 | 3.0 |
9 | 3.0 | 595.0 | 2.0 |
10 | 3.0 | 2762.0 | 3.5 |
11 | 3.0 | 2841.0 | 4.0 |
12 | 3.0 | 2959.0 | 5.0 |
13 | 4.0 | 34.0 | 5.0 |
14 | 4.0 | 112.0 | 5.0 |
15 | 4.0 | 1028.0 | 5.0 |
16 | 4.0 | 1030.0 | 5.0 |
17 | 4.0 | 1079.0 | 5.0 |
18 | 4.0 | 1089.0 | 5.0 |
19 | 4.0 | 1097.0 | 5.0 |
20 | 5.0 | 4447.0 | 4.5 |
21 | 5.0 | 4718.0 | 3.5 |
22 | 5.0 | 30749.0 | 4.5 |
filtered.groupby("userId")["rating"].count()
userId
1.0    2
2.0    2
3.0    9
4.0    7
5.0    3
Name: rating, dtype: int64
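The same filtering can be written compactly with a lambda:
filtered = ratings.groupby("userId").filter(lambda x: x["rating"].count() >= 2)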
Transforming is the action of applying a function group by group while returning an object of the same shape as the input.
For instance, let us normalize the ratings so that they have zero mean for each user.
ratings.groupby("userId")["rating"].mean()
userId
1.0    2.750000
2.0    4.000000
3.0    3.444444
4.0    5.000000
5.0    4.166667
6.0    4.500000
Name: rating, dtype: float64
def center_ratings(x):
x["rating"] = x["rating"] - x["rating"].mean()
return x
centered = ratings.groupby("userId").apply(center_ratings)
centered.groupby("userId")["rating"].mean()
userId
1.0    0.000000e+00
2.0    0.000000e+00
3.0   -1.973730e-16
4.0    0.000000e+00
5.0   -2.960595e-16
6.0    0.000000e+00
Name: rating, dtype: float64
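Pandas also provides a dedicated transform method whose result is aligned with the original index, which is often simpler than the group-wise apply above; a minimal sketch:
ratings["rating"] - ratings.groupby("userId")["rating"].transform("mean")  # centered ratings, same shape as the input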
Aggregation lets you apply several operations at once.
ratings.groupby("userId")["rating"].aggregate([min,max,np.mean,np.median,len])
min | max | mean | median | len | |
---|---|---|---|---|---|
userId | |||||
1.0 | 2.5 | 3.0 | 2.750000 | 2.75 | 2.0 |
2.0 | 4.0 | 4.0 | 4.000000 | 4.00 | 2.0 |
3.0 | 2.0 | 5.0 | 3.444444 | 3.50 | 9.0 |
4.0 | 5.0 | 5.0 | 5.000000 | 5.00 | 7.0 |
5.0 | 3.5 | 4.5 | 4.166667 | 4.50 | 3.0 |
6.0 | 4.5 | 4.5 | 4.500000 | 4.50 | 1.0 |
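aggregate (or its alias agg) also accepts strings and, on pandas 0.25 or later, named aggregations that label the output columns directly; a sketch:
ratings.groupby("userId").agg(min_rating=("rating", "min"),
                              max_rating=("rating", "max"),
                              n_ratings=("rating", "count"))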
Exercise: Bots Discovery
In the dataset ratings_bots.csv, some users may be bots: to help a movie succeed, they add ratings (often favorable ones). To get a better recommendation, we try to remove them.
- Count the users with a mean rating above 4.7/5 and delete them.
hint: the nunique function may be helpful to count.
- Delete multiple reviews of a movie by a single user by replacing them with only the first one. What is the proportion of potential bots among the users?
hint: the groupby function can be applied to several columns; also, reset_index(drop=True) removes the groupby indexing.
hint: remember the loc function, e.g. df.loc[df['userId'] == 128] returns a dataframe of the rows where the userId is 128, and df.loc[(df['userId'] == 128) & (df['movieId'] == 3825)] returns a dataframe of the rows where the userId is 128 and the movieId is 3825. In total, 17 ratings have to be removed; for instance, user 128 has 3 ratings of the movie 3825.
This dataset has around 100 000 ratings so hand picking won't do!
import pandas as pd
import numpy as np
ratings_bots = pd.read_csv('data/ml-small/ratings_bots.csv')
Exercise: Planets discovery
We will use the Planets dataset, available via the Seaborn package. It provides information on how astronomers found planets around other stars, known as exoplanets.
- Display median, mean, and quantile information for these planets' orbital periods, masses, and distances.
- For each method, display statistics on the years in which planets were discovered using this technique.
import pandas as pd
import numpy as np
planets = pd.read_csv('data/planets.csv')
print(planets.shape)
planets.head()
(1035, 6)
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |