Risk assessment analysis

In [19]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math 
%matplotlib inline
Data loading:

We load the dataset of fatal avalanche accidents from the past 15 years.

In [2]:
df_accidents = pd.read_excel('../data/accidents/df_accidents_final.xlsx')

Off-piste skiing or backcountry touring?

For which activity do we see the most accidents?

In [20]:
fig, ax = plt.subplots(1, sharey=True);

sns.countplot(x="Activity", data=df_accidents, ax=ax);
ax.set_title('Avalanche accidents per activity');
ax.set_xlabel('Activity');
ax.set_ylabel('Count');

More accidents are related to backcountry touring, but since we do not know how many people practice each activity, we cannot conclude which one is safer.

In order to limit avalanche risk, it is important to be aware of the destructive power of avalanches and to understand when and why they happen (snow and meteorological conditions). One of the main factors in avoiding accidents is awareness of the risks, and this begins with consulting the avalanche danger level before going on a ride.

Thus, it could be interesting to study skiers' behaviour and see whether, depending on the activity (backcountry or off-piste), one group tends to take inconsiderate risks with regard to the avalanche danger. Let's count the number of accidents per danger level, considering two groups: people doing backcountry touring and people doing off-piste skiing.

In [21]:
fig, ax = plt.subplots(1, sharey=True, figsize=(15,5));

g = sns.countplot(x="Activity", hue = 'Danger level', data=df_accidents, ax = ax);
ax.set_title('Avalanche accidents per activity and danger level');
ax.set_xlabel('Activity');
ax.set_ylabel('Count');

From this result, we see that backcountry-related accidents have a lower mean danger level than off-piste-related accidents. It therefore seems that people doing off-piste skiing tend to take more risks with respect to avalanche danger levels, or perhaps do not consider the danger level at all.

To be more accurate and give more weight to this statement, we will balance the two populations according to the environmental features, so that they differ only in their decision regarding the avalanche risk. We will match on altitude, group size, aspect and month of the year. Only danger levels 2 and 3 are taken into account in this analysis, as the other danger levels can be considered outliers.

Propensity score matching:

To draw valid conclusions, a propensity score matching is done. The purpose of propensity scores is to balance the dataset across treatment groups.

Propensity scores are used to match each data point from the treated group with exactly one data point from the control group. The matching maximizes the similarity between matched subjects by balancing feature-value distributions between the two groups.
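As a reminder of the definition: the propensity score of a sample is the estimated probability of belonging to the treated group given its features, typically obtained from a logistic regression. A minimal sketch on synthetic data (not the accident dataset):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))                         # 3 synthetic environmental features
t = (X[:, 0] + rng.normal(size=200) > 0).astype(int)  # group assignment depends on feature 0

model = linear_model.LogisticRegression()
model.fit(X, t)
pscore = model.predict_proba(X)[:, 1]                 # column 1 = estimated P(t = 1 | X)
print(pscore.min(), pscore.max())                     # scores lie strictly in (0, 1)
```

Two samples with similar propensity scores were about equally likely to end up in either group, which is what makes them good matching candidates.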

In [34]:
from sklearn import linear_model 

cardinals = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE', 'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']

# Keep only the two activities we want to compare
df = df_accidents[~df_accidents.Activity.isin(['Transportation', 'Building'])]

df = df[df['Danger level'].isin([2, 3])] # Keep only danger levels 2 and 3
df['month'] = df.date.apply(lambda d: d.month)
df = df.set_index(['date','Longitude','Latitude'])

# Create dict of aspect to get numerical values from aspect categories
aspect_cos = {}
aspect_sin = {}
for val, aspect in enumerate(cardinals):
    aspect_cos[aspect]=math.cos(val*(math.pi/8))
    aspect_sin[aspect]=math.sin(val*(math.pi/8))

df['cos_aspect'] = df.Aspect.replace(aspect_cos)
df['sin_aspect'] = df.Aspect.replace(aspect_sin)
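As a sanity check on this encoding (a hypothetical snippet, not part of the original analysis), every one of the 16 compass points should land on the unit circle, with N at angle 0 and E a quarter turn away; unlike a single ordinal label (0..15), the (cos, sin) pair keeps N and NNW numerically close:

```python
import math

cardinals = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
             'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']

for val, aspect in enumerate(cardinals):
    c, s = math.cos(val * math.pi / 8), math.sin(val * math.pi / 8)
    assert abs(c**2 + s**2 - 1) < 1e-12   # every aspect lies on the unit circle

# N maps to (1, 0); E (index 4) maps to (0, 1)
assert abs(math.cos(0) - 1) < 1e-12
assert abs(math.cos(4 * math.pi / 8)) < 1e-12
assert abs(math.sin(4 * math.pi / 8) - 1) < 1e-12
```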
In [24]:
# Create feature matrix (copy to avoid modifying df through a view)
features = df[['Activity','cos_aspect','sin_aspect','Elevation','month','caught','Danger level']].copy()
features['Activity'] = features['Activity'].apply(lambda x: 1 if x == 'Backcountry touring' else 0)
features = features.dropna()

# Keep the full table (with labels) aside, then drop the label columns from the features
data = features
features = features.drop(['Danger level','Activity'], axis = 1)

Use logistic regression to estimate propensity scores for all points in the dataset.

In [25]:
model = linear_model.LogisticRegression()
model.fit(features, data.Activity)
pred = model.predict(features)

accuracy = sum(pred == data.Activity) / len(data.Activity)
print('accuracy: ', accuracy)
accuracy:  0.605809128631

This low accuracy is not a problem: it means that the classifier cannot distinguish the two populations well, i.e. their feature distributions already largely overlap.
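To illustrate the point (a synthetic sketch, assuming two groups drawn from the same feature distribution), a classifier trained on perfectly overlapping populations can do little better than chance:

```python
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))       # features drawn identically for both groups
y = rng.randint(0, 2, size=1000)     # group labels independent of the features

model = linear_model.LogisticRegression()
model.fit(X, y)
accuracy = (model.predict(X) == y).mean()
print(accuracy)   # close to 0.5: the groups are indistinguishable
```

The closer the accuracy is to 0.5, the better the two groups already overlap, and the easier it is to find close propensity matches.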

The propensity scores are calculated and added to the dataframe.

In [35]:
# Get propensity score: predict_proba column 0 is P(Activity = 0);
# for the matching below only the absolute difference between scores matters
pscore = model.predict_proba(features)
data = data.assign(propensity_score = pscore[:,0])
data.head(3)
Out[35]:
Activity cos_aspect sin_aspect Elevation month caught Danger level propensity_score
date Longitude Latitude
2002-01-02 7.528077 46.546476 1 9.238795e-01 -0.382683 2360 1 1 3.0 0.502397
2002-01-03 9.815028 46.855337 0 -1.836970e-16 -1.000000 2400 1 1 3.0 0.393845
2002-01-13 9.830075 46.836754 0 6.123234e-17 1.000000 2240 1 1 2.0 0.467169

The propensity scores are used to match each data point from the backcountry touring group with exactly one data point from the off-piste skiing group. To do that, we use the networkx package and work with a DiGraph, a directed graph that allows self loops. Each matched pair then consists of two accidents that were, according to the trained classifier, about equally likely to belong to either group.

In [36]:
import networkx as nx

G = nx.DiGraph()

# add a node for each sample
for i, row in data.iterrows():
    G.add_node(row.name, propensity=row['propensity_score'], Activity=row['Activity'])

# add an edge between each treated/control pair, weighted by propensity similarity
for n1, att1 in G.nodes(data=True):
    for n2, att2 in G.nodes(data=True):
        if att1['Activity'] == 1 and att2['Activity'] == 0:
            diff = abs(att1['propensity'] - att2['propensity'])
            G.add_edge(n1, n2, weight=1-diff)

# networkx < 2.0 returns a dict mapping each node to its match
# (newer versions return a set of node pairs instead)
matchings = nx.max_weight_matching(G)

The matching output contains duplicates (each pair appears in both directions), so we filter it to keep only unique matches (93 in total, as expected).

In [37]:
true_matching = {}

# keep each pair only once: skip a node that already appears as someone's match
for treat, no_treat in matchings.items():
    if treat not in true_matching.values():
        true_matching[treat] = no_treat

The new matching is applied to our dataset.

In [38]:
# Keep only the matched off-piste datapoints
data_offpiste = data.loc[[off_piste for backcountry, off_piste in true_matching.items()]]

# Keep only the matched backcountry datapoints
data_backcountry = data.loc[[backcountry for backcountry, off_piste in true_matching.items()]]
In [39]:
data_new = pd.concat([data_offpiste, data_backcountry])
fig, ax = plt.subplots(1, sharey=True, figsize=(15,5));

g = sns.countplot(x="Activity", hue = 'Danger level', data=data_new, ax = ax);
g.set(xticklabels=['Off-piste skiing','Backcountry touring'])
ax.set_title('Avalanches accidents activity per danger level with propensity score matching');
ax.set_xlabel('Activity');
ax.set_ylabel('Count');

From our results, accidents due to backcountry touring generally show lower danger levels than accidents due to off-piste skiing, even after matching. This could suggest a difference in behaviour between the two groups: people doing backcountry touring may have a better knowledge and awareness of the risks than people doing off-piste skiing.

It could also be interesting to support our hypothesis with a statistical test. We therefore investigate the difference in mean avalanche danger level between the two groups, off-piste skiing and backcountry touring. To do that, we perform an independent two-sample t-test to check whether the difference in danger level between the groups is significant. Our hypotheses are:

  • $H_0$: There is no difference between the two means
  • $H_1$: There is a difference between the two means
In [41]:
data_offpiste['Danger level'].mean()
Out[41]:
2.774193548387097
In [42]:
data_backcountry['Danger level'].mean()
Out[42]:
2.6129032258064515
In [40]:
# Independent two-sample Student t-test
from scipy import stats
t, p = stats.ttest_ind(data_offpiste['Danger level'], data_backcountry['Danger level'])

if p < 0.05:
    print('p = %f -> The independent 2 sample student t-test rejects the null Hypothesis' % p)
else:
    print('p = %f -> The test fails to reject the null Hypothesis' % p)
p = 0.016938 -> The independent 2 sample student t-test rejects the null Hypothesis

The Student t-test gives a p-value lower than 0.05, which lets us reject $H_0$. We can conclude that backcountry tourers tend to go out in safer conditions, suggesting that better-informed skiers take fewer risks.
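For reference, the statistic computed by `stats.ttest_ind` is the difference of the two group means scaled by the pooled standard error. A synthetic sketch (the group means mimic ours, but the spread is an assumption, not the accident data):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
a = rng.normal(loc=2.77, scale=0.4, size=93)   # stand-in for the off-piste group
b = rng.normal(loc=2.61, scale=0.4, size=93)   # stand-in for the backcountry group

t, p = stats.ttest_ind(a, b)

# manual computation with the pooled variance (equal sample sizes here)
n = len(a)
sp2 = (a.var(ddof=1) + b.var(ddof=1)) / 2
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1/n + 1/n))
assert abs(t - t_manual) < 1e-10
```

Note that `ttest_ind` assumes equal variances by default; with `equal_var=False` it would run Welch's t-test instead.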