Author: Joseph Salmon joseph.salmon@umontpellier.fr, originally adapted into French from the work of Joris Van den Bossche:
https://github.com/jorisvandenbossche/pandas-tutorial/blob/master/01-pandas_introduction.ipynb
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.max_rows = 8
from download import download
url = "http://josephsalmon.eu/enseignement/datasets/titanic.csv"
path_target = "./titanic.csv"
download(url, path_target, replace=False)
# df: data frame
df_titanic_raw = pd.read_csv("titanic.csv")
Downloading data from http://josephsalmon.eu/enseignement/datasets/titanic.csv (60 kB) Successfully downloaded file to ./titanic.csv
df_titanic_raw.tail(n=3)
df_titanic_raw.describe()
To keep things simple, we only keep complete observations: rows containing missing values are dropped here.
df_titanic = df_titanic_raw.dropna()
df_titanic.tail(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
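As a minimal sketch of what `dropna` does, the toy frame below (made-up values, not the Titanic data) contrasts the default behaviour, which drops every row with at least one missing value, with the `subset` argument, which only checks the listed columns. The names `toy`, `complete_rows` and `known_age` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two different columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", "C123", np.nan],
})

complete_rows = toy.dropna()             # keep rows with no missing value at all
known_age = toy.dropna(subset=["Age"])   # only require Age to be present

print(len(toy), len(complete_rows), len(known_age))
```

On the Titanic data the default is quite aggressive: only 183 of the 891 rows survive, largely because `Cabin` is rarely filled in.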
df_titanic.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 183.000000 | 183.000000 | 183.000000 | 183.000000 | 183.000000 | 183.000000 | 183.000000 |
mean | 455.366120 | 0.672131 | 1.191257 | 35.674426 | 0.464481 | 0.475410 | 78.682469 |
std | 247.052476 | 0.470725 | 0.515187 | 15.643866 | 0.644159 | 0.754617 | 76.347843 |
min | 2.000000 | 0.000000 | 1.000000 | 0.920000 | 0.000000 | 0.000000 | 0.000000 |
25% | 263.500000 | 0.000000 | 1.000000 | 24.000000 | 0.000000 | 0.000000 | 29.700000 |
50% | 457.000000 | 1.000000 | 1.000000 | 36.000000 | 0.000000 | 0.000000 | 57.000000 |
75% | 676.000000 | 1.000000 | 1.000000 | 47.500000 | 1.000000 | 1.000000 | 90.000000 |
max | 890.000000 | 1.000000 | 3.000000 | 80.000000 | 3.000000 | 4.000000 | 512.329200 |
What is the age distribution of the passengers?
plt.figure(figsize=(3, 3))
plt.hist(df_titanic['Age'], density=False, bins=50)
plt.xlabel('Age')
plt.ylabel('Count')  # density=False plots counts, not proportions
plt.title("Histogram of passenger age")
Text(0.5,1,"Histogram of passenger age")
plt.figure()
# Note: recent seaborn versions replace `shade` with `fill` and `bw` with `bw_method`/`bw_adjust`.
ax = sns.kdeplot(df_titanic['Age'], shade=True, cut=0, bw=3)
plt.xlabel('Age')
plt.ylabel('Density')
ax.legend().set_visible(False)
plt.title("Kernel density estimate of passenger age")
Text(0.5,1,"Kernel density estimate of passenger age")
**How does the passenger survival rate differ between the sexes?**
df_titanic_raw.groupby('Sex')[['Survived']].mean()
Sex | Survived |
---|---|
female | 0.742038 |
male | 0.188908 |
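The computation above relies on the fact that the mean of a 0/1 column is a proportion. A minimal sketch on made-up data (the values below are hypothetical, only the column names match the Titanic file):

```python
import pandas as pd

# Toy data (hypothetical): 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 0, 1],
})

# Split by sex, then average the 0/1 flag within each group
rate = toy.groupby("Sex")["Survived"].mean()
print(rate)
```

Here both toy women survived and one of the three toy men did, so the group means are exactly the per-group survival rates.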
**Or how does it differ between the classes?**
df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
plt.figure()
df_titanic.groupby('Pclass')['Survived'].mean().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbe903c3550>
# Violin plot of age by class and sex (this is not a survival rate)
sns.catplot(x='Pclass', y="Age",
            hue="Sex", data=df_titanic_raw, kind="violin", legend=False)
plt.title("Age distribution by class and sex")
plt.tight_layout()
url = "http://josephsalmon.eu/enseignement/datasets/20080421_20160927-PA13_auto.csv"
path_target = "./20080421_20160927-PA13_auto.csv"
download(url, path_target, replace=False)
Replace is False and data exists, so doing nothing. Use replace==True to re-download the data.
'./20080421_20160927-PA13_auto.csv'
polution_df = pd.read_csv('20080421_20160927-PA13_auto.csv', sep=';',
comment='#', na_values="n/d",
converters={'heure': str})
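# The file uses '24' as an hour value, which '%H' (0-23) cannot parse:
# https://www.tutorialspoint.com/python/time_strptime.htm
# Preprocessing: hour 24 is mapped to hour 0 of the *same* date.
# Caveat: if 24 denotes the end of the day, those rows really belong to the next day.
polution_df['heure'] = polution_df['heure'].replace('24', '0')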
time_improved = pd.to_datetime(polution_df['date'] +
' ' + polution_df['heure'] + ':00',
format='%d/%m/%Y %H:%M')
polution_df['DateTime'] = time_improved
del polution_df['heure']
del polution_df['date']
polution_ts = polution_df.set_index(['DateTime'])
polution_ts = polution_ts.sort_index()
polution_ts.head()
DateTime | NO2 | O3 |
---|---|---|
2008-04-21 00:00:00 | 28.0 | 36.0 |
2008-04-21 01:00:00 | 13.0 | 74.0 |
2008-04-21 02:00:00 | 11.0 | 73.0 |
2008-04-21 03:00:00 | 13.0 | 64.0 |
2008-04-21 04:00:00 | 23.0 | 46.0 |
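One possible alternative for the hour-24 issue, sketched here under the assumption that "24" means the end of the day (i.e. midnight of the following day): parse the date alone and add the hour as a time offset. The rows below are made up; only the column names `date` and `heure` come from the file.

```python
import pandas as pd

# Toy rows (hypothetical values) using the same column names as the file
raw = pd.DataFrame({
    "date": ["21/04/2008", "21/04/2008", "22/04/2008"],
    "heure": ["23", "24", "1"],
    "NO2": [28.0, 13.0, 11.0],
})

# Parse the day, then add the hour as an offset: 24 h rolls over to the next day
timestamps = (pd.to_datetime(raw["date"], format="%d/%m/%Y")
              + pd.to_timedelta(raw["heure"].astype(int), unit="h"))
print(timestamps.tolist())
```

With this approach no string replacement is needed, and the resulting timestamps stay in chronological order.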
## Pollution in Paris over the years. Source: Airparif
polution_ts.describe()
NO2 | O3 | |
---|---|---|
count | 71008.000000 | 71452.000000 |
mean | 34.453414 | 39.610046 |
std | 20.380702 | 28.837333 |
min | 1.000000 | 0.000000 |
25% | 19.000000 | 16.000000 |
50% | 30.000000 | 38.000000 |
75% | 46.000000 | 58.000000 |
max | 167.000000 | 211.000000 |
fig, axes = plt.subplots(2, 1, figsize=(6, 4), sharex=True)
axes[0].plot(polution_ts['O3'].resample('D').mean())
axes[0].set_title("Ozone pollution: daily mean over Paris")
axes[0].set_ylabel("Concentration (µg/m³)")
axes[1].plot(polution_ts['NO2'].resample('D').mean())
axes[1].set_title("Nitrogen dioxide pollution: daily mean over Paris")
axes[1].set_ylabel("Concentration (µg/m³)")
plt.show()
**Does air pollution show a downward trend over the years?**
ax = polution_ts['2008':].resample('A').mean().plot(figsize=(4, 4))  # resample by year ('A' for annual; pandas >= 2.2 prefers 'YE')
plt.ylim(0, 50)
plt.title("Pollution trend: \n annual mean over Paris")
plt.ylabel("Concentration (µg/m³)")
plt.xlabel("Year")
Text(0.5,0,'Year')
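To make the role of `resample` concrete, here is a small sketch on a synthetic hourly series (the values are made up): each calendar day is collapsed to its mean, exactly as the daily plots do for `O3` and `NO2`.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (hypothetical values): two full days, values 0..47
idx = pd.date_range("2008-04-21", periods=48, freq="h")
hourly = pd.Series(np.arange(48.0), index=idx)

# One value per calendar day: the mean of the 24 hourly measurements
daily = hourly.resample("D").mean()
print(daily)
```

The same call with a yearly frequency gives the annual means used for the trend plot.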
# Set up the color palette
sns.set_palette("GnBu_d", n_colors=7)
polution_ts['weekday'] = polution_ts.index.weekday  # Monday=0, Sunday=6
# polution_ts['weekend'] = polution_ts['weekday'].isin([5, 6])
days = ['Monday', 'Tuesday', 'Wednesday',
        'Thursday', 'Friday', 'Saturday', 'Sunday']
polution_week_no2 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
'NO2'].mean().unstack(level=0)
polution_week_03 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
'O3'].mean().unstack(level=0)
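The `groupby(...).mean().unstack(level=0)` pattern used above builds a table with one row per hour and one column per weekday. A minimal sketch on a synthetic week of hourly data (made-up values; `ts` and `profile` are illustrative names):

```python
import numpy as np
import pandas as pd

# One synthetic week of hourly values (hypothetical); 2008-04-21 is a Monday
idx = pd.date_range("2008-04-21", periods=7 * 24, freq="h")
ts = pd.DataFrame({"NO2": np.arange(len(idx), dtype=float)}, index=idx)
ts["weekday"] = ts.index.weekday  # Monday=0, Sunday=6

# Average per (weekday, hour), then move the weekday level into the columns
profile = ts.groupby(["weekday", ts.index.hour])["NO2"].mean().unstack(level=0)
print(profile.shape)
```

Each column of `profile` is then a 24-point daily curve, which is why a single `.plot()` call draws one line per weekday.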
fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)
polution_week_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Hour of the day")
axes[0].set_title(
    "Daily NO2 pollution profile: weekend effect?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 60)
polution_week_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Hour of the day")
axes[1].set_title("Daily O3 pollution profile: weekend effect?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 70)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=days, loc='lower left', bbox_to_anchor=(1, 0.1))
plt.tight_layout()
# XXX TODO: what about seasons?
import calendar
polution_ts['month'] = polution_ts.index.month  # January=1, ..., December=12
polution_ts['month'] = polution_ts['month'].apply(lambda x: calendar.month_abbr[x])
polution_ts.head()
DateTime | NO2 | O3 | month |
---|---|---|---|
2008-04-21 00:00:00 | 28.0 | 36.0 | Apr |
2008-04-21 01:00:00 | 13.0 | 74.0 | Apr |
2008-04-21 02:00:00 | 11.0 | 73.0 | Apr |
2008-04-21 03:00:00 | 13.0 | 64.0 | Apr |
2008-04-21 04:00:00 | 23.0 | 46.0 | Apr |
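# unstack() sorts the month columns alphabetically (Apr, Aug, Dec, ...);
# reorder them chronologically so they line up with the legend labels used below.
month_order = list(calendar.month_abbr[1:])
polution_month_no2 = polution_ts.groupby(['month', polution_ts.index.hour])[
    'NO2'].mean().unstack(level=0)[month_order]
polution_month_03 = polution_ts.groupby(['month', polution_ts.index.hour])[
    'O3'].mean().unstack(level=0)[month_order]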
sns.set_palette("GnBu_d", n_colors=12)
fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)
polution_month_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Hour of the day")
axes[0].set_title(
    "Daily NO2 pollution profile: seasonal effect?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 90)
polution_month_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Hour of the day")
axes[1].set_title("Daily O3 pollution profile: seasonal effect?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 90)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=calendar.month_name[1:], loc='lower left', bbox_to_anchor=(1, 0.1))
plt.tight_layout()
For data-intensive work in Python, the Pandas library has become essential.
What is pandas? It is an environment for working with DataFrames:
Documentation: http://pandas.pydata.org/pandas-docs/stable/
When do you need Pandas? When working with tables or tabular data structures (such as R data frames, SQL tables, Excel spreadsheets, ...):
WARNING / LIMITATIONS:
Pandas is good at handling heterogeneous data and 1D/2D tables, but not every kind of data fits these structures!
Counter-examples:
A DataFrame is a tabular data structure (a multidimensional object that can hold labeled data) made of rows and columns, similar to a spreadsheet, a database table, or R's data.frame object. You can think of it as several Series objects sharing the same index.
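The "several Series sharing the same index" view can be sketched directly: build two Series on a common index and combine them. The ages and fares below are taken from the first Titanic rows; the shortened surnames used as index labels are an illustrative choice.

```python
import pandas as pd

# Two Series built on the same (illustrative) index labels
labels = ["Cumings", "Futrelle", "McCarthy"]
age = pd.Series([38.0, 35.0, 54.0], index=labels)
fare = pd.Series([71.2833, 53.1, 51.8625], index=labels)

# A DataFrame is, in effect, a dict of such Series aligned on their index
df = pd.DataFrame({"Age": age, "Fare": fare})

# Selecting a column gives back a Series carrying the shared index
print(type(df["Age"]).__name__, df["Age"].index.equals(df.index))
```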
df_titanic
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
183 rows × 12 columns
df_titanic.index
Int64Index([ 1, 3, 6, 10, 11, 21, 23, 27, 52, 54, ... 835, 853, 857, 862, 867, 871, 872, 879, 887, 889], dtype='int64', length=183)
df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
df_titanic.dtypes
PassengerId int64 Survived int64 Pclass int64 Name object ... Ticket object Fare float64 Cabin object Embarked object Length: 12, dtype: object
df_titanic.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 183 entries, 1 to 889 Data columns (total 12 columns): PassengerId 183 non-null int64 Survived 183 non-null int64 Pclass 183 non-null int64 Name 183 non-null object Sex 183 non-null object Age 183 non-null float64 SibSp 183 non-null int64 Parch 183 non-null int64 Ticket 183 non-null object Fare 183 non-null float64 Cabin 183 non-null object Embarked 183 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 23.6+ KB
# Cabin is the least complete variable, followed by Age
df_titanic_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.6+ KB
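The non-null counts printed by `info()` can also be turned around into explicit counts of missing values per column. A small sketch on a made-up frame (hypothetical values, Titanic-like column names):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with a different number of gaps per column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [np.nan, np.nan, "C123"],
    "Fare": [7.25, 71.28, 53.1],
})

# isna() marks missing cells; summing the booleans counts them per column
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```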
array_titanic = df_titanic.values  # the underlying NumPy array of values
array_titanic
array([[2, 1, 1, ..., 71.2833, 'C85', 'C'], [4, 1, 1, ..., 53.1, 'C123', 'S'], [7, 0, 1, ..., 51.8625, 'E46', 'S'], ..., [880, 1, 1, ..., 83.1583, 'C50', 'C'], [888, 1, 1, ..., 30.0, 'B42', 'S'], [890, 1, 1, ..., 30.0, 'C148', 'C']], dtype=object)
A Series is the basic container for one-dimensional labeled data.
fare = df_titanic['Fare']
fare
1 71.2833 3 53.1000 6 51.8625 10 16.7000 ... 872 5.0000 879 83.1583 887 30.0000 889 30.0000 Name: Fare, Length: 183, dtype: float64
fare.values[:10]
array([ 71.2833, 53.1 , 51.8625, 16.7 , 26.55 , 13. , 35.5 , 263. , 76.7292, 61.9792])
fare[6]  # exists, but fare[0] raises a KeyError: that row had missing values and was dropped from the DataFrame.
51.8625
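To avoid confusing labels with positions, `.loc` (by label) and `.iloc` (by position) can be used explicitly. A minimal sketch with the first three fares and their index labels from the output above:

```python
import pandas as pd

# Same labels as the first rows of `fare`: 1, 3, 6 (label 0 was dropped)
fare = pd.Series([71.2833, 53.1, 51.8625], index=[1, 3, 6])

by_label = fare.loc[6]       # the row whose label is 6
by_position = fare.iloc[0]   # the first row, whatever its label

try:
    fare.loc[0]              # no row carries the label 0
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```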
Unlike a NumPy array, this index can be something other than an integer:
df_titanic = df_titanic.set_index('Name')
df_titanic
Name | PassengerId | Survived | Pclass | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
McCarthy, Mr. Timothy J | 7 | 0 | 1 | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
Sandstrom, Miss. Marguerite Rut | 11 | 1 | 3 | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Carlsson, Mr. Frans Olof | 873 | 0 | 1 | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 880 | 1 | 1 | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
Graham, Miss. Margaret Edith | 888 | 1 | 1 | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
Behr, Mr. Karl Howell | 890 | 1 | 1 | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
183 rows × 10 columns
age = df_titanic['Age']
age['Behr, Mr. Karl Howell']
26.0
age.mean()
35.6744262295082
df_titanic[age < 2]
Name | PassengerId | Survived | Pclass | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
Becker, Master. Richard F | 184 | 1 | 2 | 1.00 | 2 | 1 | 230136 | 39.00 | F4 | S |
Allison, Master. Hudson Trevor | 306 | 1 | 1 | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
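The filter above works because `age < 2` is a boolean Series that is used as a mask. Masks can be combined with `&` and `|` (with parentheses around each condition). A sketch on a small made-up frame; the index labels are illustrative:

```python
import pandas as pd

# Toy data (values inspired by the table above, labels shortened)
toy = pd.DataFrame(
    {"Age": [1.0, 0.92, 38.0, 54.0], "Pclass": [2, 1, 1, 1]},
    index=["Becker", "Allison", "Cumings", "McCarthy"],
)

# Each comparison yields a boolean Series; & keeps rows where both are True
mask = (toy["Age"] < 2) & (toy["Pclass"] == 1)
infants_first_class = toy[mask]
print(infants_first_class)
```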
df_titanic_raw['Embarked'].value_counts()
S 644 C 168 Q 77 Name: Embarked, dtype: int64
pd.options.display.max_rows = 70
df_titanic[df_titanic['Embarked'] == 'C']  # Passengers who boarded at Cherbourg do not exactly have Gaulish names...
Name | PassengerId | Survived | Pclass | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
Harper, Mrs. Henry Sleeper (Myna Haxtun) | 53 | 1 | 1 | 49.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
Ostby, Mr. Engelhart Cornelius | 55 | 0 | 1 | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
Goldschmidt, Mr. George B | 97 | 0 | 1 | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
Greenfield, Mr. William Bertram | 98 | 1 | 1 | 23.0 | 0 | 1 | PC 17759 | 63.3583 | D10 D12 | C |
Baxter, Mr. Quigg Edmond | 119 | 0 | 1 | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
Giglio, Mr. Victor | 140 | 0 | 1 | 24.0 | 0 | 0 | PC 17593 | 79.2000 | B86 | C |
Smith, Mr. James Clinch | 175 | 0 | 1 | 56.0 | 0 | 0 | 17764 | 30.6958 | A7 | C |
Isham, Miss. Ann Elizabeth | 178 | 0 | 1 | 50.0 | 0 | 0 | PC 17595 | 28.7125 | C49 | C |
Brown, Mrs. James Joseph (Margaret Tobin) | 195 | 1 | 1 | 44.0 | 0 | 0 | PC 17610 | 27.7208 | B4 | C |
Lurette, Miss. Elise | 196 | 1 | 1 | 58.0 | 0 | 0 | PC 17569 | 146.5208 | B80 | C |
Blank, Mr. Henry | 210 | 1 | 1 | 40.0 | 0 | 0 | 112277 | 31.0000 | A31 | C |
Newell, Miss. Madeleine | 216 | 1 | 1 | 31.0 | 1 | 0 | 35273 | 113.2750 | D36 | C |
Bazzani, Miss. Albina | 219 | 1 | 1 | 32.0 | 0 | 0 | 11813 | 76.2917 | D15 | C |
Natsch, Mr. Charles H | 274 | 0 | 1 | 37.0 | 0 | 1 | PC 17596 | 29.7000 | C118 | C |
Bishop, Mrs. Dickinson H (Helen Walton) | 292 | 1 | 1 | 19.0 | 1 | 0 | 11967 | 91.0792 | B49 | C |
Levy, Mr. Rene Jacques | 293 | 0 | 2 | 36.0 | 0 | 0 | SC/Paris 2163 | 12.8750 | D | C |
Baxter, Mrs. James (Helene DeLaudeniere Chaput) | 300 | 1 | 1 | 50.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo) | 308 | 1 | 1 | 17.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C |
Francatelli, Miss. Laura Mabel | 310 | 1 | 1 | 30.0 | 0 | 0 | PC 17485 | 56.9292 | E36 | C |
Hays, Miss. Margaret Bechstein | 311 | 1 | 1 | 24.0 | 0 | 0 | 11767 | 83.1583 | C54 | C |
Ryerson, Miss. Emily Borie | 312 | 1 | 1 | 18.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
Spedden, Mrs. Frederic Oakley (Margaretta Corning Stone) | 320 | 1 | 1 | 40.0 | 1 | 1 | 16966 | 134.5000 | E34 | C |
Young, Miss. Marie Grice | 326 | 1 | 1 | 36.0 | 0 | 0 | PC 17760 | 135.6333 | C32 | C |
Hippach, Miss. Jean Gertrude | 330 | 1 | 1 | 16.0 | 0 | 1 | 111361 | 57.9792 | B18 | C |
Burns, Miss. Elizabeth Margaret | 338 | 1 | 1 | 41.0 | 0 | 0 | 16966 | 134.5000 | E40 | C |
Warren, Mrs. Frank Manley (Anna Sophia Atkinson) | 367 | 1 | 1 | 60.0 | 1 | 0 | 110813 | 75.2500 | D37 | C |
Aubart, Mme. Leontine Pauline | 370 | 1 | 1 | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C |
Harder, Mr. George Achilles | 371 | 1 | 1 | 25.0 | 1 | 0 | 11765 | 55.4417 | E50 | C |
Widener, Mr. Harry Elkins | 378 | 0 | 1 | 27.0 | 0 | 2 | 113503 | 211.5000 | C82 | C |
Newell, Miss. Marjorie | 394 | 1 | 1 | 23.0 | 1 | 0 | 35273 | 113.2750 | D36 | C |
Foreman, Mr. Benjamin Laventall | 453 | 0 | 1 | 30.0 | 0 | 0 | 113051 | 27.7500 | C111 | C |
Goldenberg, Mr. Samuel L | 454 | 1 | 1 | 49.0 | 1 | 0 | 17453 | 89.1042 | C92 | C |
Jerwan, Mrs. Amin S (Marie Marthe Thuillard) | 474 | 1 | 2 | 23.0 | 0 | 0 | SC/AH Basle 541 | 13.7917 | D | C |
Bishop, Mr. Dickinson H | 485 | 1 | 1 | 25.0 | 1 | 0 | 11967 | 91.0792 | B49 | C |
Kent, Mr. Edward Austin | 488 | 0 | 1 | 58.0 | 0 | 0 | 11771 | 29.7000 | B37 | C |
Eustis, Miss. Elizabeth Mussey | 497 | 1 | 1 | 54.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
Penasco y Castellana, Mr. Victor de Satode | 506 | 0 | 1 | 18.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C |
Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | 524 | 1 | 1 | 44.0 | 0 | 1 | 111361 | 57.9792 | B18 | C |
Frolicher, Miss. Hedwig Margaritha | 540 | 1 | 1 | 22.0 | 0 | 2 | 13568 | 49.5000 | B39 | C |
Douglas, Mr. Walter Donald | 545 | 0 | 1 | 50.0 | 1 | 0 | PC 17761 | 106.4250 | C86 | C |
Thayer, Mr. John Borland Jr | 551 | 1 | 1 | 17.0 | 0 | 2 | 17421 | 110.8833 | C70 | C |
Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan") | 557 | 1 | 1 | 48.0 | 1 | 0 | 11755 | 39.6000 | A16 | C |
Thayer, Mrs. John Borland (Marian Longstreth Morris) | 582 | 1 | 1 | 39.0 | 1 | 1 | 17421 | 110.8833 | C68 | C |
Ross, Mr. John Hugo | 584 | 0 | 1 | 36.0 | 0 | 0 | 13049 | 40.1250 | A10 | C |
Frolicher-Stehli, Mr. Maxmillian | 588 | 1 | 1 | 60.0 | 1 | 1 | 13567 | 79.2000 | B41 | C |
Stephenson, Mrs. Walter Bertram (Martha Eustis) | 592 | 1 | 1 | 52.0 | 1 | 0 | 36947 | 78.2667 | D20 | C |
Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") | 600 | 1 | 1 | 49.0 | 1 | 0 | PC 17485 | 56.9292 | A20 | C |
Stahelin-Maeglin, Dr. Max | 633 | 1 | 1 | 32.0 | 0 | 0 | 13214 | 30.5000 | B50 | C |
Sagesser, Mlle. Emma | 642 | 1 | 1 | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C |
Harper, Mr. Henry Sleeper | 646 | 1 | 1 | 48.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
Simonius-Blumer, Col. Oberst Alfons | 648 | 1 | 1 | 56.0 | 0 | 0 | 13213 | 35.5000 | A26 | C |
Newell, Mr. Arthur Webster | 660 | 0 | 1 | 58.0 | 0 | 2 | 35273 | 113.2750 | D48 | C |
Cardeza, Mr. Thomas Drake Martinez | 680 | 1 | 1 | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
Hassab, Mr. Hammad | 682 | 1 | 1 | 27.0 | 0 | 0 | PC 17572 | 76.7292 | D49 | C |
Thayer, Mr. John Borland | 699 | 0 | 1 | 49.0 | 1 | 1 | 17421 | 110.8833 | C68 | C |
Astor, Mrs. John Jacob (Madeleine Talmadge Force) | 701 | 1 | 1 | 18.0 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
Mayne, Mlle. Berthe Antonine ("Mrs de Villiers") | 711 | 1 | 1 | 24.0 | 0 | 0 | PC 17482 | 49.5042 | C90 | C |
Endres, Miss. Caroline Louise | 717 | 1 | 1 | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
Lesurer, Mr. Gustave J | 738 | 1 | 1 | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
Ryerson, Miss. Susan Parker "Suzette" | 743 | 1 | 1 | 21.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
Guggenheim, Mr. Benjamin | 790 | 0 | 1 | 46.0 | 0 | 0 | PC 17593 | 79.2000 | B82 B84 | C |
Compton, Miss. Sara Rebecca | 836 | 1 | 1 | 39.0 | 1 | 1 | PC 17756 | 83.1583 | E49 | C |
Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | 880 | 1 | 1 | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
Behr, Mr. Karl Howell | 890 | 1 | 1 | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
pd.options.display.max_rows = 8
df_titanic_raw['Survived'].sum() / df_titanic_raw['Survived'].count()
0.3838383838383838
df_titanic['Survived'].mean()
0.6721311475409836
**What was the proportion of women on the ship?**
df_titanic_raw.groupby(['Sex']).size() / df_titanic_raw['Sex'].count()
Sex female 0.352413 male 0.647587 dtype: float64
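The same kind of proportion can be obtained in one call with `value_counts(normalize=True)`. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy column (hypothetical values)
sex = pd.Series(["male", "female", "male", "male"])

# normalize=True returns shares instead of raw counts
shares = sex.value_counts(normalize=True)
print(shares)
```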
df_titanic_raw.groupby(['Sex']).mean(numeric_only=True)  # numeric_only avoids errors on text columns in recent pandas
Sex | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
female | 431.028662 | 0.742038 | 2.159236 | 27.915709 | 0.694268 | 0.649682 | 44.479818 |
male | 454.147314 | 0.188908 | 2.389948 | 30.726645 | 0.429809 | 0.235702 | 25.523893 |
Pandas natively supports a wide range of input/output formats:
# pd.read_csv?
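As an illustration of the read/write pair for one of these formats, the sketch below round-trips a tiny CSV held in memory. The content is made up; the `sep` and `na_values` arguments mirror those used for the Airparif file.

```python
import io
import pandas as pd

# A tiny semicolon-separated file held in memory (hypothetical content)
csv_text = "PassengerId;Fare\n1;7.25\n2;n/d\n"

# Read it, treating the marker "n/d" as a missing value
df = pd.read_csv(io.StringIO(csv_text), sep=";", na_values="n/d")

# Write it back out as a standard comma-separated file, without the index
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```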
df_titanic_raw.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
df_titanic_raw.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
sns.set_palette("colorblind")
sns.catplot(x='Pclass',y='Age',hue='Survived',data=df_titanic_raw, kind="violin")
<seaborn.axisgrid.FacetGrid at 0x7fbe9031bda0>
df_titanic_raw.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
df_titanic.iloc[0:2,1:8]
Name | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket |
---|---|---|---|---|---|---|---|
Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 |
Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 1 | female | 35.0 | 1 | 0 | 113803 |
df_titanic.loc['Bonnell, Miss. Elizabeth', 'Fare']
26.55
df_titanic.loc['Bonnell, Miss. Elizabeth']
PassengerId 12 Survived 1 Pclass 1 Sex female ... Ticket 113783 Fare 26.55 Cabin C103 Embarked S Name: Bonnell, Miss. Elizabeth, Length: 11, dtype: object
df_titanic.loc['Bonnell, Miss. Elizabeth','Survived']= 100
df_titanic.loc['Bonnell, Miss. Elizabeth']
PassengerId 12 Survived 100 Pclass 1 Sex female ... Ticket 113783 Fare 26.55 Cabin C103 Embarked S Name: Bonnell, Miss. Elizabeth, Length: 11, dtype: object
df_titanic.loc['Bonnell, Miss. Elizabeth','Survived']= 1 # On remet la valeur comme avant
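The cells above mix the two indexers: `.iloc[rows, columns]` takes integer positions, while `.loc[rows, columns]` takes labels and can also be used to assign. A small sketch on a two-row frame (values copied from the Titanic table, frame name `df` illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Survived": [1, 0], "Fare": [26.55, 51.8625]},
    index=["Bonnell, Miss. Elizabeth", "McCarthy, Mr. Timothy J"],
)

# Assign through .loc with a (row label, column label) pair
df.loc["Bonnell, Miss. Elizabeth", "Survived"] = 100

# Read the same cell back by position: first row, first column
first_cell = df.iloc[0, 0]
print(first_cell)
```

Writing through a single `.loc[row, column]` call modifies the frame itself, which is why the tutorial has to set the value back afterwards.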
df_titanic.groupby('Pclass').mean(numeric_only=True)
Pclass | PassengerId | Survived | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|
1 | 467.221519 | 0.670886 | 37.591266 | 0.487342 | 0.468354 | 88.683228 |
2 | 343.266667 | 0.800000 | 25.266667 | 0.400000 | 0.400000 | 18.444447 |
3 | 436.200000 | 0.500000 | 21.000000 | 0.200000 | 0.700000 | 11.027500 |
df_titanic_raw.groupby('Sex')['Pclass'].mean()  # note: this uses all rows, including those with missing values in other columns
Sex female 2.159236 male 2.389948 Name: Pclass, dtype: float64
df_titanic['AgeClass'] = pd.cut(df_titanic['Age'], bins=np.arange(0, 90, 10))  # create age classes (bins).
df_titanic['AgeClass']
Name Cumings, Mrs. John Bradley (Florence Briggs Thayer) (30, 40] Futrelle, Mrs. Jacques Heath (Lily May Peel) (30, 40] McCarthy, Mr. Timothy J (50, 60] Sandstrom, Miss. Marguerite Rut (0, 10] ... Carlsson, Mr. Frans Olof (30, 40] Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) (50, 60] Graham, Miss. Margaret Edith (10, 20] Behr, Mr. Karl Howell (20, 30] Name: AgeClass, Length: 183, dtype: category Categories (8, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80]]
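A short sketch of how `pd.cut` assigns values to these bins, on a few made-up ages. By default the intervals are closed on the right, so an age of exactly 10 falls in `(0, 10]`.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values), including two bin edges
ages = pd.Series([4.0, 10.0, 35.0, 80.0])

# Same bins as above: 0, 10, ..., 80, giving eight right-closed intervals
age_class = pd.cut(ages, bins=np.arange(0, 90, 10))
print(age_class)
```

The result is a categorical Series, so it can be passed straight to `groupby` to compute statistics per age class.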