The purpose of this notebook is to allow users to become familiar with model fitting in a Google Colaboratory (Colab) environment using Jupyter notebooks. The programming language is R, not Python.
Start by going to "Runtime", selecting "Change Runtime Type", making sure that "R" is selected, and clicking "Save".
Next, run the following two cells. The first checks that we can do basic calculations. The second is an R command that generates 10 standard normal random variates, to make sure that R is working. Then, try changing the numbers to convince yourself that this is an interactive environment.
1 + 4
rnorm(10)
You can make changes to a Colab notebook, and they will persist for as long as you keep your browser tab open. But once you close it, the changes will be lost. To avoid this, make sure you save a copy of the notebook to your Google Drive by selecting File → “Save a copy in Drive”.
As is the custom with Jupyter notebooks, we start by loading some libraries that are going to be helpful. In the following code, comments indicate what each library is used for.
install.packages("Hmisc")
library(Hmisc)
install.packages("doBy")
library(doBy)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘checkmate’, ‘gridExtra’, ‘htmlTable’, ‘viridis’, ‘colorspace’, ‘Formula’

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

    format.pval, units

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘cowplot’, ‘Deriv’, ‘microbenchmark’
This tutorial presents a study in the context of auto liability insurance focusing on claim counts.
The data set is well known in the actuarial literature. The data source is the book edited by Arthur Charpentier, Computational Actuarial Science with R (CRC Press, 2016). In this tutorial, we utilize ideas from Noll, Salzmann, and Wüthrich (2020).
The dataset is fairly large, containing 678,007 observations of 14 variables; this helps underscore the value of working on remote servers such as Google Colab.
First, we import the data from the GitHub course repo. At the repo "https://github.com/OpenActTextDev/ActuarialRegression", you will find other datasets as well.
dat_frq <- read.csv("https://raw.githubusercontent.com/OpenActTextDev/ActuarialRegression/refs/heads/main/CourseCSVData/freMTPL2freq.csv")
Just to make sure that the data was read in properly, let's look at the structure, compute a summary, and examine the first few observations.
str(dat_frq)
'data.frame':	678007 obs. of  14 variables:
 $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ IDpol     : num  1 3 5 10 11 13 15 17 18 21 ...
 $ Exposure  : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 $ Area      : chr  "D" "D" "B" "B" ...
 $ VehPower  : int  5 5 6 7 7 6 6 7 7 7 ...
 $ VehAge    : int  0 0 2 0 0 2 2 0 0 0 ...
 $ DrivAge   : int  55 55 52 46 46 38 38 33 33 41 ...
 $ BonusMalus: int  50 50 50 50 50 50 50 68 68 50 ...
 $ VehBrand  : chr  "B12" "B12" "B12" "B12" ...
 $ VehGas    : chr  "Regular" "Regular" "Diesel" "Diesel" ...
 $ Density   : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
 $ Region    : chr  "R82" "R82" "R22" "R72" ...
 $ ClaimTotal: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ClaimNb   : int  0 0 0 0 0 0 0 0 0 0 ...
Remove the 'X' variable (row label)
dat_frq$X <- NULL
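As a quick optional check (not part of the original notebook), we can confirm that the column was dropped:

dim(dat_frq)   # should now report 678007 rows and 13 columns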
summary(dat_frq)
     IDpol            Exposure           Area              VehPower     
 Min.   :      1   Min.   :0.002732   Length:678007      Min.   : 4.000  
 1st Qu.:1157948   1st Qu.:0.180000   Class :character   1st Qu.: 5.000  
 Median :2272153   Median :0.490000   Mode  :character   Median : 6.000  
 Mean   :2621857   Mean   :0.528547                      Mean   : 6.455  
 3rd Qu.:4046278   3rd Qu.:0.990000                      3rd Qu.: 7.000  
 Max.   :6114330   Max.   :1.000000                      Max.   :15.000  
     VehAge           DrivAge         BonusMalus       VehBrand        
 Min.   :  0.000   Min.   : 18.0   Min.   : 50.00   Length:678007     
 1st Qu.:  2.000   1st Qu.: 34.0   1st Qu.: 50.00   Class :character  
 Median :  6.000   Median : 44.0   Median : 50.00   Mode  :character  
 Mean   :  7.044   Mean   : 45.5   Mean   : 59.76                     
 3rd Qu.: 11.000   3rd Qu.: 55.0   3rd Qu.: 64.00                     
 Max.   :100.000   Max.   :100.0   Max.   :230.00                     
    VehGas             Density          Region            ClaimTotal       
 Length:678007      Min.   :    1    Length:678007      Min.   :      0.0  
 Class :character   1st Qu.:   92    Class :character   1st Qu.:      0.0  
 Mode  :character   Median :  393    Mode  :character   Median :      0.0  
                    Mean   : 1792                       Mean   :     88.2  
                    3rd Qu.: 1658                       3rd Qu.:      0.0  
                    Max.   :27000                       Max.   :4075400.6  
    ClaimNb       
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.03891  
 3rd Qu.:0.00000  
 Max.   :5.00000  
head(dat_frq)
 | IDpol | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | ClaimTotal | ClaimNb
---|---|---|---|---|---|---|---|---|---|---|---|---|---
 | &lt;dbl&gt; | &lt;dbl&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;dbl&gt; | &lt;int&gt;
1 | 1 | 0.10 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0 | 0 |
2 | 3 | 0.77 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 | 0 | 0 |
3 | 5 | 0.75 | B | 6 | 2 | 52 | 50 | B12 | Diesel | 54 | R22 | 0 | 0 |
4 | 10 | 0.09 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 | 0 | 0 |
5 | 11 | 0.84 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 | 0 | 0 |
6 | 13 | 0.52 | E | 6 | 2 | 38 | 50 | B12 | Regular | 3003 | R31 | 0 | 0 |
Here are some brief descriptions of the variables:
IDpol
: policy number (unique identifier)

ClaimNb
: number of claims on the given policy

Exposure
: total exposure in yearly units

Area
: area code (categorical, ordinal)

VehPower
: power of the car (categorical, ordinal)

VehAge
: age of the car in years

DrivAge
: age of the (most common) driver in years

BonusMalus
: bonus-malus level between 50 and 230 (with reference level 100)

VehBrand
: car brand (categorical, nominal)

VehGas
: diesel or regular fuel car (binary)

Density
: density of inhabitants per km^2^ in the city where the driver lives

Region
: regions in France (prior to 2016)

Note that the variable 'Area' is defined as an ordinal variable. In their development of the case, Noll, Salzmann, and Wüthrich (2020) carefully describe how some areas are bigger than others, so that the variable can be considered as nominal. This is an interesting extension. Because of its overlap with the categorical variable 'Region', we will exclude 'Area' from our analysis.
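As a quick optional check of this overlap (not part of the original analysis), one could count how many distinct area codes appear within each region:

# Distinct Area codes per Region (illustrative check only)
tapply(dat_frq$Area, dat_frq$Region, function(a) length(unique(a)))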
# Summary statistics for a numeric vector (missing values are ignored)
custom_summary <- function(x) {
  c(
    count = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    min = min(x, na.rm = TRUE),
    `1%` = quantile(x, 0.01, na.rm = TRUE, names = FALSE),
    `50%` = quantile(x, 0.5, na.rm = TRUE, names = FALSE),
    `99.99%` = quantile(x, 0.9999, na.rm = TRUE, names = FALSE),
    max = max(x, na.rm = TRUE)
  )
}
# Identify columns by type
numeric_id <- sapply(dat_frq, is.numeric)
integer_id <- sapply(dat_frq, is.integer)
char_id <- sapply(dat_frq, is.character)
numeric_vars <- dat_frq[,numeric_id]
integer_vars <- dat_frq[,integer_id]
char_vars <- dat_frq[,char_id]
#str(numeric_vars)
#str(integer_vars)
#str(char_vars)
summary_df <- sapply(numeric_vars, custom_summary)
round(t(summary_df), digits = 2)  # transpose for easier viewing
 | count | mean | sd | min | 1% | 50% | 99.99% | max
---|---|---|---|---|---|---|---|---
IDpol | 678007 | 2621857.33 | 1641789.39 | 1 | 19934.12 | 2272153.00 | 6114262.20 | 6114330 |
Exposure | 678007 | 0.53 | 0.36 | 0 | 0.01 | 0.49 | 1.00 | 1 |
VehPower | 678007 | 6.45 | 2.05 | 4 | 4.00 | 6.00 | 15.00 | 15 |
VehAge | 678007 | 7.04 | 5.67 | 0 | 0.00 | 6.00 | 69.00 | 100 |
DrivAge | 678007 | 45.50 | 14.14 | 18 | 20.00 | 44.00 | 99.00 | 100 |
BonusMalus | 678007 | 59.76 | 15.64 | 50 | 50.00 | 50.00 | 158.00 | 230 |
Density | 678007 | 1792.43 | 3958.66 | 1 | 10.00 | 393.00 | 27000.00 | 27000 |
ClaimTotal | 678007 | 88.17 | 5822.07 | 0 | 0.00 | 0.00 | 65429.46 | 4075401 |
ClaimNb | 678007 | 0.04 | 0.20 | 0 | 0.00 | 0.00 | 3.00 | 5 |
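For the character variables collected in 'char_vars', a similar quick check (an optional addition, not in the original output) counts the number of distinct levels:

# Number of distinct levels in each character variable (illustrative)
sapply(char_vars, function(x) length(unique(x)))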
Some extreme values of 'VehAge' and 'BonusMalus' are unrealistic. Let's winsorize them, capping values at the 99.99th percentile.
# Cap values below the p-th and above the (1-p)-th quantiles (winsorizing)
winsorize <- function(x, p = 0.01) {
qnt <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
x[x < qnt[1]] <- qnt[1]
x[x > qnt[2]] <- qnt[2]
return(x)
}
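To see how the function behaves, here is a small illustrative check on a toy vector (not part of the original analysis); with p = 0.25, values outside the 25th and 75th percentiles are pulled back to those limits:

winsorize(c(1, 2, 3, 100), p = 0.25)   # the extreme values 1 and 100 are capped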
dat_frqMod <- dat_frq
dat_frqMod$VehAge <- winsorize(dat_frqMod$VehAge, p = 0.0001)        # caps at the 0.01% and 99.99% quantiles
dat_frqMod$BonusMalus <- winsorize(dat_frqMod$BonusMalus, p = 0.0001)
numeric_vars <- sapply(dat_frqMod, is.numeric)
summary_df <- sapply(dat_frqMod[, numeric_vars], custom_summary)
round(t(summary_df), digits = 2); # Transpose for easier viewing
 | count | mean | sd | min | 1% | 50% | 99.99% | max
---|---|---|---|---|---|---|---|---
IDpol | 678007 | 2621857.33 | 1641789.39 | 1 | 19934.12 | 2272153.00 | 6114262.20 | 6114330 |
Exposure | 678007 | 0.53 | 0.36 | 0 | 0.01 | 0.49 | 1.00 | 1 |
VehPower | 678007 | 6.45 | 2.05 | 4 | 4.00 | 6.00 | 15.00 | 15 |
VehAge | 678007 | 7.04 | 5.63 | 0 | 0.00 | 6.00 | 69.00 | 69 |
DrivAge | 678007 | 45.50 | 14.14 | 18 | 20.00 | 44.00 | 99.00 | 100 |
BonusMalus | 678007 | 59.76 | 15.62 | 50 | 50.00 | 50.00 | 158.00 | 158 |
Density | 678007 | 1792.43 | 3958.66 | 1 | 10.00 | 393.00 | 27000.00 | 27000 |
ClaimTotal | 678007 | 88.17 | 5822.07 | 0 | 0.00 | 0.00 | 65429.46 | 4075401 |
ClaimNb | 678007 | 0.04 | 0.20 | 0 | 0.00 | 0.00 | 3.00 | 5 |
We now summarize the target variable 'ClaimNb' in a bit more detail.
We can get some quick summaries with the 'table' function in 'R'.
table(dat_frqMod$ClaimNb)
     0      1      2      3      4      5 
653069  23571   1298     62      5      2 
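For reference, the same counts can be expressed as proportions, and we can also compute the overall claim frequency per unit of exposure (an optional addition, not in the original notebook):

# Share of policies by number of claims (illustrative)
round(prop.table(table(dat_frqMod$ClaimNb)), 5)

# Overall empirical claim frequency per unit of exposure
sum(dat_frqMod$ClaimNb) / sum(dat_frqMod$Exposure)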
table(dat_frqMod$VehBrand)
table(dat_frqMod$VehGas)
table(dat_frqMod$Region)
table(dat_frqMod$Area)
table(dat_frqMod$VehPower)
    B1    B10    B11    B12    B13    B14     B2     B3     B4     B5     B6 
162730  17707  13585 166024  12178   4047 159861  53395  25179  34753  28548 

 Diesel Regular 
 332136  345871 

  R11   R21   R22   R23    R24   R25   R26   R31   R41   R42   R43 
69791  3026  7994  8784 160601 10893 10492 27285 12990  2200  1326 
  R52   R53   R54   R72    R73   R74   R82   R83   R91   R93   R94 
38751 42122 19046 31329  17141  4567 84752  5287 35799 79315  4516 

     A      B      C      D      E      F 
103957  75459 191880 151590 137167  17954 

     4      5      6      7      8      9     10     11     12     13     14 
115343 124821 148976 145401  46956  30085  31354  18352   8214   3229   2350 
    15 
  2926 
table(dat_frqMod$VehGas, dat_frqMod$ClaimNb)
               0      1      2      3      4      5
  Diesel  319436  11988    678     31      2      1
  Regular 333633  11583    620     31      3      1
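Because exposures differ across policies, a more informative comparison (an optional refinement, not in the original output) is the empirical claim frequency per unit of exposure by fuel type:

# Claim frequency per exposure year, by fuel type (illustrative)
tapply(dat_frqMod$ClaimNb, dat_frqMod$VehGas, sum) /
  tapply(dat_frqMod$Exposure, dat_frqMod$VehGas, sum)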
However, for more complex summaries, we utilize the library "Hmisc". This needs to be installed first (done at the start of the notebook), which may take a couple of minutes in Google Colab.
#install.packages("Hmisc")
#library(Hmisc)
Hmisc::summarize(dat_frqMod$DrivAge, dat_frqMod$ClaimNb, mean)
Hmisc::summarize(dat_frqMod$VehAge, dat_frqMod$ClaimNb, mean)
dat_frqMod$ClaimNb | dat_frqMod$DrivAge
---|---
0 | 45.51285
1 | 45.17216
2 | 44.60169
3 | 43.33871
4 | 40.20000
5 | 59.50000

dat_frqMod$ClaimNb | dat_frqMod$VehAge
---|---
0 | 7.028766
1 | 7.426287
2 | 6.659476
3 | 6.306452
4 | 4.400000
5 | 10.500000
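Since the 'doBy' library was loaded at the start of the notebook, the same grouped means could also be computed with doBy::summaryBy(). The following is an alternative sketch, not output produced in the original notebook:

# Grouped means of DrivAge and VehAge by claim count, using doBy (illustrative)
doBy::summaryBy(DrivAge + VehAge ~ ClaimNb, data = dat_frqMod, FUN = mean)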