The purpose of this notebook is to allow users to become familiar with model fitting in a Google Colaboratory (Colab) environment using Jupyter notebooks. The programming language is R, not Python.
Start by going to "Runtime", selecting "Change Runtime Type", making sure that "R" is selected, and clicking "Save".
Next, run the following two cells. The first checks that we can do basic calculations. The second is an R command that generates 10 standard normal random variates, to make sure that R is working. Then, try changing the numbers to convince yourself that this is an interactive environment.
1 + 4
rnorm(10)
You can make changes to a Colab notebook, and they will persist for as long as you keep your browser tab open. But once you close it, the changes will be lost. To avoid this, make sure you save a copy of the notebook to your Google Drive by selecting File → “Save a copy in Drive”.
As is the custom with Jupyter notebooks, we start by loading some libraries that are going to be helpful. In the following code, comments indicate what each library is used for.
install.packages("Hmisc")
library(Hmisc)
install.packages("doBy")
library(doBy)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘checkmate’, ‘gridExtra’, ‘htmlTable’, ‘viridis’, ‘colorspace’, ‘Formula’

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

    format.pval, units

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘cowplot’, ‘Deriv’, ‘microbenchmark’
This tutorial presents a study in the context of auto liability insurance focusing on claim counts.
The data set is well known in the actuarial literature. The data source is the book edited by Arthur Charpentier, Computational Actuarial Science with R (CRC Press, 2016). In this tutorial, we utilize ideas from Noll, Salzmann, and Wüthrich (2020).
The dataset is fairly large, containing 678,007 observations of 14 variables; this helps underscore the value of working on remote servers such as Google Colab.
First, we import the data from the GitHub course repo. At the repo "https://github.com/OpenActTextDev/ActuarialRegression", you will find other datasets as well.
dat_frq <- read.csv("https://raw.githubusercontent.com/OpenActTextDev/ActuarialRegression/refs/heads/main/CourseCSVData/freMTPL2freq.csv")
Just to make sure that the data was read in properly, let's look at the structure, compute a summary, and examine the first few observations.
str(dat_frq)
'data.frame':	678007 obs. of  14 variables:
 $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ IDpol     : num  1 3 5 10 11 13 15 17 18 21 ...
 $ Exposure  : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 $ Area      : chr  "D" "D" "B" "B" ...
 $ VehPower  : int  5 5 6 7 7 6 6 7 7 7 ...
 $ VehAge    : int  0 0 2 0 0 2 2 0 0 0 ...
 $ DrivAge   : int  55 55 52 46 46 38 38 33 33 41 ...
 $ BonusMalus: int  50 50 50 50 50 50 50 68 68 50 ...
 $ VehBrand  : chr  "B12" "B12" "B12" "B12" ...
 $ VehGas    : chr  "Regular" "Regular" "Diesel" "Diesel" ...
 $ Density   : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
 $ Region    : chr  "R82" "R82" "R22" "R72" ...
 $ ClaimTotal: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ClaimNb   : int  0 0 0 0 0 0 0 0 0 0 ...
Remove the 'X' variable (row label)
dat_frq$X <- NULL
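As a quick optional check (not part of the original notebook), we can confirm that the column was dropped:

dim(dat_frq)   # should now report 678007 rows and 13 columns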
summary(dat_frq)
     IDpol            Exposure           Area              VehPower     
 Min.   :      1   Min.   :0.002732   Length:678007      Min.   : 4.000  
 1st Qu.:1157948   1st Qu.:0.180000   Class :character   1st Qu.: 5.000  
 Median :2272153   Median :0.490000   Mode  :character   Median : 6.000  
 Mean   :2621857   Mean   :0.528547                      Mean   : 6.455  
 3rd Qu.:4046278   3rd Qu.:0.990000                      3rd Qu.: 7.000  
 Max.   :6114330   Max.   :1.000000                      Max.   :15.000  
     VehAge           DrivAge         BonusMalus       VehBrand        
 Min.   :  0.000   Min.   : 18.0   Min.   : 50.00   Length:678007     
 1st Qu.:  2.000   1st Qu.: 34.0   1st Qu.: 50.00   Class :character  
 Median :  6.000   Median : 44.0   Median : 50.00   Mode  :character  
 Mean   :  7.044   Mean   : 45.5   Mean   : 59.76                     
 3rd Qu.: 11.000   3rd Qu.: 55.0   3rd Qu.: 64.00                     
 Max.   :100.000   Max.   :100.0   Max.   :230.00                     
    VehGas             Density          Region            ClaimTotal       
 Length:678007      Min.   :    1    Length:678007      Min.   :      0.0  
 Class :character   1st Qu.:   92    Class :character   1st Qu.:      0.0  
 Mode  :character   Median :  393    Mode  :character   Median :      0.0  
                    Mean   : 1792                       Mean   :     88.2  
                    3rd Qu.: 1658                       3rd Qu.:      0.0  
                    Max.   :27000                       Max.   :4075400.6  
    ClaimNb       
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.03891  
 3rd Qu.:0.00000  
 Max.   :5.00000  
head(dat_frq)
 | IDpol | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | ClaimTotal | ClaimNb
---|---|---|---|---|---|---|---|---|---|---|---|---|---
 | &lt;dbl&gt; | &lt;dbl&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt;
1 | 1 | 0.10 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0 | 0 |
2 | 3 | 0.77 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0 | 0 |
3 | 5 | 0.75 | B | 6 | 2 | 52 | 50 | B12 | Diesel | 54 | R22 | 0 | 0 |
4 | 10 | 0.09 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 | 0 | 0 |
5 | 11 | 0.84 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 | 0 | 0 |
6 | 13 | 0.52 | E | 6 | 2 | 38 | 50 | B12 | Regular | 3003 | R31 | 0 | 0 |
Here are some brief descriptions of the variables:
IDpol
: policy number (unique identifier)

ClaimNb
: number of claims on the given policy

Exposure
: total exposure in yearly units

Area
: area code (categorical, ordinal)

VehPower
: power of the car (categorical, ordinal)

VehAge
: age of the car in years

DrivAge
: age of the (most common) driver in years

BonusMalus
: bonus-malus level between 50 and 230 (with reference level 100)

VehBrand
: car brand (categorical, nominal)

VehGas
: diesel or regular fuel car (binary)

Density
: density of inhabitants per km^2^ in the city where the driver lives

Region
: regions in France (prior to 2016)

Note that the variable 'Area' is defined as an ordinal variable. In their development of the case, Noll, Salzmann, and Wüthrich (2020) carefully describe how some areas are bigger than others, so that the variable can be considered as nominal. This is an interesting extension. Because of its overlap with the categorical variable 'Region', we will exclude 'Area' from our analysis.
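As a quick optional check of this overlap (not part of the original analysis), one could count how many distinct area codes appear within each region:

# Distinct Area codes per Region (illustrative check only)
tapply(dat_frq$Area, dat_frq$Region, function(a) length(unique(a)))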
# Summary statistics for a numeric vector (missing values are ignored)
custom_summary <- function(x) {
  c(
    count = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    min = min(x, na.rm = TRUE),
    `1%` = quantile(x, 0.01, na.rm = TRUE, names = FALSE),
    `50%` = quantile(x, 0.5, na.rm = TRUE, names = FALSE),
    `99.99%` = quantile(x, 0.9999, na.rm = TRUE, names = FALSE),
    max = max(x, na.rm = TRUE)
  )
}
# Identify columns by type
numeric_id <- sapply(dat_frq, is.numeric)
integer_id <- sapply(dat_frq, is.integer)
char_id <- sapply(dat_frq, is.character)
numeric_vars <- dat_frq[,numeric_id]
integer_vars <- dat_frq[,integer_id]
char_vars <- dat_frq[,char_id]
#str(numeric_vars)
#str(integer_vars)
#str(char_vars)
summary_df <- sapply(numeric_vars, custom_summary)
round(t(summary_df), digits = 2)  # transpose for easier viewing
 | count | mean | sd | min | 1% | 50% | 99.99% | max
---|---|---|---|---|---|---|---|---
IDpol | 678007 | 2621857.33 | 1641789.39 | 1 | 19934.12 | 2272153.00 | 6114262.20 | 6114330 |
Exposure | 678007 | 0.53 | 0.36 | 0 | 0.01 | 0.49 | 1.00 | 1 |
VehPower | 678007 | 6.45 | 2.05 | 4 | 4.00 | 6.00 | 15.00 | 15 |
VehAge | 678007 | 7.04 | 5.67 | 0 | 0.00 | 6.00 | 69.00 | 100 |
DrivAge | 678007 | 45.50 | 14.14 | 18 | 20.00 | 44.00 | 99.00 | 100 |
BonusMalus | 678007 | 59.76 | 15.64 | 50 | 50.00 | 50.00 | 158.00 | 230 |
Density | 678007 | 1792.43 | 3958.66 | 1 | 10.00 | 393.00 | 27000.00 | 27000 |
ClaimTotal | 678007 | 88.17 | 5822.07 | 0 | 0.00 | 0.00 | 65429.46 | 4075401 |
ClaimNb | 678007 | 0.04 | 0.20 | 0 | 0.00 | 0.00 | 3.00 | 5 |
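For the character variables collected in 'char_vars', a similar quick check (an optional addition, not in the original output) counts the number of distinct levels:

# Number of distinct levels in each character variable (illustrative)
sapply(char_vars, function(x) length(unique(x)))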
Some extreme values of 'VehAge' and 'BonusMalus' are unrealistic. Let's winsorize them, capping values at the 99.99th percentile.
# Cap values below the p-th and above the (1-p)-th quantiles (winsorizing)
winsorize <- function(x, p = 0.01) {
qnt <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
x[x < qnt[1]] <- qnt[1]
x[x > qnt[2]] <- qnt[2]
return(x)
}
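To see how the function behaves, here is a small illustrative check on a toy vector (not part of the original analysis); with p = 0.25, values outside the 25th and 75th percentiles are pulled back to those limits:

winsorize(c(1, 2, 3, 100), p = 0.25)   # the extreme values 1 and 100 are capped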
dat_frqMod <- dat_frq
dat_frqMod$VehAge <- winsorize(dat_frqMod$VehAge, p = 0.0001)        # caps at the 0.01% and 99.99% quantiles
dat_frqMod$BonusMalus <- winsorize(dat_frqMod$BonusMalus, p = 0.0001)
numeric_vars <- sapply(dat_frqMod, is.numeric)
summary_df <- sapply(dat_frqMod[, numeric_vars], custom_summary)
round(t(summary_df), digits = 2); # Transpose for easier viewing
 | count | mean | sd | min | 1% | 50% | 99.99% | max
---|---|---|---|---|---|---|---|---
IDpol | 678007 | 2621857.33 | 1641789.39 | 1 | 19934.12 | 2272153.00 | 6114262.20 | 6114330 |
Exposure | 678007 | 0.53 | 0.36 | 0 | 0.01 | 0.49 | 1.00 | 1 |
VehPower | 678007 | 6.45 | 2.05 | 4 | 4.00 | 6.00 | 15.00 | 15 |
VehAge | 678007 | 7.04 | 5.63 | 0 | 0.00 | 6.00 | 69.00 | 69 |
DrivAge | 678007 | 45.50 | 14.14 | 18 | 20.00 | 44.00 | 99.00 | 100 |
BonusMalus | 678007 | 59.76 | 15.62 | 50 | 50.00 | 50.00 | 158.00 | 158 |
Density | 678007 | 1792.43 | 3958.66 | 1 | 10.00 | 393.00 | 27000.00 | 27000 |
ClaimTotal | 678007 | 88.17 | 5822.07 | 0 | 0.00 | 0.00 | 65429.46 | 4075401 |
ClaimNb | 678007 | 0.04 | 0.20 | 0 | 0.00 | 0.00 | 3.00 | 5 |
We now summarize the target variable 'ClaimNb' in a bit more detail.
We can get some quick summaries with the 'table' function in 'R'.
table(dat_frqMod$ClaimNb)
     0      1      2      3      4      5 
653069  23571   1298     62      5      2 
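For reference, the same counts can be expressed as proportions, and we can also compute the overall claim frequency per unit of exposure (an optional addition, not in the original notebook):

# Share of policies by number of claims (illustrative)
round(prop.table(table(dat_frqMod$ClaimNb)), 5)

# Overall empirical claim frequency per unit of exposure
sum(dat_frqMod$ClaimNb) / sum(dat_frqMod$Exposure)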
table(dat_frqMod$VehBrand)
table(dat_frqMod$VehGas)
table(dat_frqMod$Region)
table(dat_frqMod$Area)
table(dat_frqMod$VehPower)
    B1    B10    B11    B12    B13    B14     B2     B3     B4     B5     B6 
162730  17707  13585 166024  12178   4047 159861  53395  25179  34753  28548 

 Diesel Regular 
 332136  345871 

  R11   R21   R22   R23    R24   R25   R26   R31   R41   R42   R43 
69791  3026  7994  8784 160601 10893 10492 27285 12990  2200  1326 
  R52   R53   R54   R72    R73   R74   R82   R83   R91   R93   R94 
38751 42122 19046 31329  17141  4567 84752  5287 35799 79315  4516 

     A      B      C      D      E      F 
103957  75459 191880 151590 137167  17954 

     4      5      6      7      8      9     10     11     12     13     14 
115343 124821 148976 145401  46956  30085  31354  18352   8214   3229   2350 
    15 
  2926 
table(dat_frqMod$VehGas, dat_frqMod$ClaimNb)
               0      1      2      3      4      5
  Diesel  319436  11988    678     31      2      1
  Regular 333633  11583    620     31      3      1
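Because exposures differ across policies, a more informative comparison (an optional refinement, not in the original output) is the empirical claim frequency per unit of exposure by fuel type:

# Claim frequency per exposure year, by fuel type (illustrative)
tapply(dat_frqMod$ClaimNb, dat_frqMod$VehGas, sum) /
  tapply(dat_frqMod$Exposure, dat_frqMod$VehGas, sum)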
However, for more complex summaries, we utilize the library "Hmisc". This needs to be installed first (done at the start of the notebook), which may take a couple of minutes in Google Colab.
#install.packages("Hmisc")
#library(Hmisc)
Hmisc::summarize(dat_frqMod$DrivAge, dat_frqMod$ClaimNb, mean)
Hmisc::summarize(dat_frqMod$VehAge, dat_frqMod$ClaimNb, mean)
dat_frqMod$ClaimNb | dat_frqMod$DrivAge
---|---
0 | 45.51285
1 | 45.17216
2 | 44.60169
3 | 43.33871
4 | 40.20000
5 | 59.50000

dat_frqMod$ClaimNb | dat_frqMod$VehAge
---|---
0 | 7.028766
1 | 7.426287
2 | 6.659476
3 | 6.306452
4 | 4.400000
5 | 10.500000
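Since the 'doBy' library was loaded at the start of the notebook, the same grouped means could also be computed with doBy::summaryBy(). The following is an alternative sketch, not output produced in the original notebook:

# Grouped means of DrivAge and VehAge by claim count, using doBy (illustrative)
doBy::summaryBy(DrivAge + VehAge ~ ClaimNb, data = dat_frqMod, FUN = mean)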