HVT Model diagnostics and validation

Zubin Dowlaty, Shubhra Prakash, Sunuganty Achyut Raj, Praditi Shah, Somya Shambhawi, Vishwavani

Built on: 10/18/2024

1. Abstract

The HVT package is a collection of R functions that facilitate building topology preserving maps for rich multivariate data, particularly data tending towards a big-data preponderance (a large number of rows). The typical workflow is organized into the following steps:

  1. Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (a long data frame) according to a compression objective.

  2. Data Projection: Dimension projection of the compressed cells to 1D, 2D, or an interactive surface plot using Sammon's non-linear projection algorithm. This step creates a topology preserving map (also called an embedding) into the desired output dimension.

  3. Tessellation: Create the cells required for object visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, and is useful for semi-supervised tasks.

  4. Scoring: Score new data sets and record their assignment using the map objects from the above steps, in a sequence of maps if required.

2. Data Compression

This package can perform vector quantization using the following algorithms -

2.1 Using k-means

  1. The k-means algorithm randomly selects k data points as initial means
  2. k clusters are formed by assigning each data point to its closest cluster mean using the Euclidean distance
  3. Virtual means for each cluster are calculated by using all datapoints contained in a cluster

The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(n).
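To make these steps concrete, below is a minimal sketch (not the package's internal code) using the base R kmeans function on a toy matrix; trainHVT performs the equivalent step internally when quant_method = "kmeans".

set.seed(123)
toy <- matrix(rnorm(200 * 3), ncol = 3)        # 200 points in 3 dimensions
km <- stats::kmeans(toy, centers = 5, iter.max = 100)
km$centers         # the virtual means (codewords) of the 5 cells
table(km$cluster)  # number of points assigned to each cell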

2.2 Using k-medoids

  1. The k-medoids algorithm randomly selects k of the n data points as the initial medoids.
  2. k clusters are formed by assigning each data point to its closest medoid by using any common distance metric methods.
  3. The medoid of each cluster is updated to the data point that minimizes the total distance to the other points in that cluster

The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(k * (n-k)^2).
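For comparison, below is a minimal sketch of k-medoids using pam from the cluster package; the cluster package is an assumption here and is not loaded by this vignette.

library(cluster)
set.seed(123)
toy <- matrix(rnorm(200 * 3), ncol = 3)
pm <- pam(toy, k = 5, metric = "manhattan")  # k-medoids with a Manhattan distance
pm$medoids           # the 5 medoids (actual data points)
table(pm$clustering) # number of points assigned to each cell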

2.3 Hierarchical Vector Quantization

The algorithm divides the dataset recursively into cells using the \(k-means\) or \(k-medoids\) algorithm. The maximum number of subsets is decided by setting n_cells to, say, five, in order to divide the dataset into at most five subsets. These five subsets are further divided into five subsets (or fewer), resulting in a total of twenty-five (5*5) subsets. The recursion terminates when a cell contains fewer than three data points or when the stop criterion is reached. In this case, the stop criterion is that a cell's quantization error no longer exceeds the quantization threshold.

The steps for this method are as follows :

  1. Select k (the number of cells), the depth, and the quantization error threshold
  2. Perform quantization (using \(k-means\) or \(k-medoids\)) on the input dataset
  3. Calculate the quantization error for each of the k cells
  4. Compare the quantization error of each cell to the quantization error threshold
  5. Repeat steps 2 to 4 for each of the k cells whose quantization error is above the threshold, until the stop criterion is reached.

The stop criterion is met when the quantization error of a cell falls below the threshold, as explained in the following section. A minimal sketch of this recursive procedure is shown below.
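The sketch below illustrates the recursion under stated assumptions (k-means at each level, maximum L1 distance to the centroid as the cell error, data supplied as a numeric matrix); it is an illustration only, not the package's hvq implementation.

hvq_sketch <- function(data, n_cells = 5, quant_err = 0.1, depth = 2, level = 1) {
  if (nrow(data) < 3 || level > depth) return(list())
  km <- stats::kmeans(data, centers = min(n_cells, nrow(data)))
  lapply(seq_len(nrow(km$centers)), function(i) {
    cell <- data[km$cluster == i, , drop = FALSE]
    centroid <- km$centers[i, ]
    qe <- max(rowSums(abs(sweep(cell, 2, centroid))))  # max L1 distance to the centroid
    if (qe > quant_err) {
      hvq_sketch(cell, n_cells, quant_err, depth, level + 1)  # drill down into this cell
    } else {
      list(centroid = centroid, quant_error = qe, n = nrow(cell))
    }
  })
}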

2.3.1 Quantization Error

Let us try to understand quantization error with an example.

Figure 1: The Voronoi tessellation for level 1 shown for the 5 cells with the points overlayed

An example of a 2 dimensional VQ is shown above.

In the above image, we can see 5 cells with each cell containing a certain number of points. The centroid for each cell is shown in blue. These centroids are also known as codewords since they represent all the points in that cell. The set of all codewords is called a codebook.

Now we want to calculate quantization error for each cell. For the sake of simplicity, let’s consider only one cell having centroid A and m data points \(F_i\) for calculating quantization error.

For each point, we calculate the distance between the point and the centroid.

\[ d = ||A - F_i||_{p} \]

In the above equation, p = 1 means L1_Norm distance whereas p = 2 means L2_Norm distance. In the package, the L1_Norm distance is chosen by default. The user can pass either L1_Norm, L2_Norm or a custom function to calculate the distance between two points in n dimensions.

\[QE = \max(||A-F_i||_{p})\]

where

  • \(A\) is the centroid of the cell
  • \(F_i\) represents a data point in the cell
  • \(p\) is the \(p\)-norm metric. Here \(p\) = 1 represents L1 Norm and \(p\) = 2 represents L2 Norm.

Now, we take the maximum calculated distance of all m points. This gives us the furthest distance of a point in the cell from the centroid, which we refer to as Quantization Error. If the Quantization Error is higher than the given threshold, the centroid/codevector is not a good representation for the points in the cell. Now we can perform further Vector Quantization on these points and repeat the above steps.

Please note that the user can select mean or max to calculate the Quantization Error. A custom function takes a vector of m values (where each value is the distance between a point in n dimensions and the centroid) and returns a single value, which is the Quantization Error for the cell.

If we select mean as the error metric, the above Quantization Error equation will look like this :

\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]

where

  • \(A\) is the centroid of the cell
  • \(F_i\) represents a data point in the cell
  • \(m\) is the number of points in the cell
  • \(p\) is the \(p\)-norm metric. Here \(p\) = 1 represents L1 Norm and \(p\) = 2 represents L2 Norm.
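A minimal sketch of this calculation for a single cell is given below, assuming cell is a numeric matrix of the m points in the cell and centroid is its centroid; the summary function can be max (as in the first equation) or mean (as in the second).

qe_for_cell <- function(cell, centroid, p = 1, summary_fn = max) {
  d <- apply(cell, 1, function(f) sum(abs(f - centroid)^p)^(1 / p))  # p-norm distance of each point
  summary_fn(d)  # max or mean of the m distances gives the cell's Quantization Error
}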

3. Data Projection

Sammon’s projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.

Denote the distance between the \(i^{th}\) and \(j^{th}\) objects in the original space by \(d_{ij}^*\), and the distance between their projections by \(d_{ij}\). Sammon’s mapping aims to minimize the error function below, which is often referred to as Sammon’s stress or Sammon’s error:

\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]

The minimization of this can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations needs to be determined experimentally, and convergent solutions are not always guaranteed. Many implementations use the first principal components as the starting configuration.
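A minimal sketch of this projection using the sammon function from the MASS package is shown below; centroids is assumed here to be a numeric matrix of cell centroids, and trainHVT performs the equivalent step internally.

library(MASS)
d <- dist(centroids)                   # pairwise distances in the original space
proj <- sammon(d, k = 2, niter = 100)  # iterative minimisation of Sammon's stress
head(proj$points)                      # 2D embedding coordinates of the centroids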

3.1 Tessellations

A Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand and for each seed, there will be a corresponding region consisting of all points within proximity of that seed. These regions are called Voronoi cells. It is complementary to Delaunay triangulation.

Tessellate: Constructing Voronoi Tesselations

In this package, we use the sammon function from the MASS package to project higher-dimensional data to a 2D space. The function hvq, called from the trainHVT function, returns hierarchical quantized data which is the input for construction of the tessellations. The data is then represented in 2D coordinates and the tessellations are plotted using these coordinates as centroids. We use the package deldir for this purpose. The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, a transformation is performed on the 2D coordinates to fit all the points within their parent tile, and the tessellations are plotted using these transformed points as centroids. The lines in the tessellations are clipped so that they do not protrude outside the parent polygon. This is done for all the subsequent levels.
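Below is a minimal sketch of the level-1 tessellation step using the deldir package, assuming proj2d holds the 2D Sammon coordinates of the centroids; the plotting in HVT itself is done through plotHVT.

library(deldir)
dd <- deldir(proj2d[, 1], proj2d[, 2])  # Delaunay triangulation and Dirichlet (Voronoi) tessellation
tiles <- tile.list(dd)                  # one Voronoi tile per centroid
plot(tiles, showpoints = TRUE)          # draw the tessellation with the centroids overlaid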

4. Notebook Requirements

This chunk verifies that all the packages necessary to run this vignette are installed, installs any that are missing, and attaches all the packages to the session environment.

list.of.packages <- c("dplyr","HVT", "kableExtra", "geozoo", "plotly", "purrr", "DT", "data.table") # data.table provides fread() used for local file import below

new.packages <-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
  install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos='https://cloud.r-project.org/')

# Loading the required libraries
invisible(lapply(list.of.packages, library, character.only = TRUE))

5. Data Importing

5.1 Import Dataset from Local

The user can provide an absolute or relative path in the cell below to access data from his/her computer, and can set the import_data_from_local variable to TRUE to upload a dataset from local.
Note: For this notebook import_data_from_local has been set to FALSE, as we simulate a dataset in the next section.

import_data_from_local = FALSE # expects logical input

  file_name <- " " #enter the name of the local file
  file_path <- " " #enter the path of the local file

if(import_data_from_local){
  file_load <- paste0(file_path, file_name)
  dataset_updated <- as.data.frame(fread(file_load))
  if(nrow(dataset_updated) > 0){
    paste0("File ", file_name, " having ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s)",  " imported successfully. ") %>% cat("\n")
    dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. Below table showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated %>% head(10) %>%as.data.frame() %>%DT::datatable(options = options, rownames = TRUE)
  }
  
} 

5.2 Simulate Dataset

In this section, we will use a simulated dataset. If you do not want to use this option, set simulate_dataset to FALSE. Given below is a simulated dataset called torus that contains 12000 observations and 3 features.

Let us see how to generate the torus data. We use the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from 3 to 10 dimensions. It contains regular or well-known objects, e.g. cube and sphere, and some abstract objects, e.g. Boy’s surface, torus and hyper-torus.

Here, we load the data and store into a variable dataset_updated.

simulate_dataset= TRUE

if(simulate_dataset == TRUE){
  
set.seed(257)
##torus data generation
torus <- geozoo::torus(p = 3,n = 12000)
dataset_updated <- data.frame(torus$points)
colnames(dataset_updated) <- c("x","y","z")


  if(nrow(dataset_updated) > 0){
    paste0( "Dataset having ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s)",  "simulated successfully. ") %>% cat("\n")  
    dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4) 
    paste0("Code chunk executed successfully. The table below is showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated %>% head(10) %>%as.data.frame() %>%displayTable()
  }
}
## Dataset having 12000 row(s) and 3 column(s) simulated successfully.
## Code chunk executed successfully. The table below is showing first 10 row(s) of the dataset.
x y z
-1.0020 -2.3335 -0.8420
1.1021 -2.7447 -0.2878
-1.0033 1.2656 -0.9229
1.3204 0.6205 -0.8410
1.2998 -1.2470 -0.9801
-1.9606 2.0755 -0.5184
0.2807 -0.9724 0.1548
-1.5540 2.0564 -0.8164
-2.4653 1.6586 -0.2377
-2.3234 1.6933 -0.4841

6. Data Understanding

6.1 Quick Peek of the Data

Structure of Data

In the below section we can see the structure of the data.

dataset_updated %>% str()
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.2 Deleting Irrelevant Columns

The cell below allows the user to drop irrelevant columns.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# Add column names which you want to remove
want_to_delete_column <- "no"

    del_col<-c(" `column_name` ")  

if(want_to_delete_column == "yes"){
   dataset_updated <-  dataset_updated[ , !(names(dataset_updated) %in% del_col)]
  print("Code chunk executed successfully. Overview of data types after removed selected columns")
  str( dataset_updated)
}else{
  paste0("No Columns removed. Please enter column name if you want to remove that column") %>% cat("\n")
}
## No Columns removed. Please enter column name if you want to remove that column

6.3 Formatting and Renaming Columns

The code below contains a user defined function to rename or reformat any column that the user chooses.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# convert the column names to lower case
colnames( dataset_updated) <- colnames( dataset_updated) %>% casefold()

## rename column ?
want_to_rename_column <- "no" ## type "yes" if you want to rename a column

## renaming a column of a dataset 
rename_col_name <- " 'column_name` " ## use small letters
rename_col_name_to <- " `new_name` "

if(want_to_rename_column == "yes"){
  names( dataset_updated)[names( dataset_updated) == rename_col_name] <- rename_col_name_to
}

# remove space, comma, dot from column names
spaceless <- function(x) {colnames(x) <- gsub(pattern = "[^[:alnum:]]+",
                             replacement = ".",
                             names(x));x}
 dataset_updated <- spaceless( dataset_updated)

## below is the dataset summary
paste0("Successfully converted the column names to lower case and check the renamed column name if you changed") %>% cat("\n")
## Successfully converted the column names to lower case and check the renamed column name if you changed
str( dataset_updated) ## showing summary for updated 
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.4 Changing Data Type of Columns

The section allows the user to change the data type of columns of his/her choice.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# If you want to change column type, change a below variable value to "yes"
want_to_change_column_type <- "no"

# you can change column type into numeric or character only
change_column_to_type <- "character" ## numeric

if(want_to_change_column_type == "yes" && change_column_to_type == "character"){
########################################################################################
################################## User Input Needed ###################################
########################################################################################
  select_columns <- c("panel_var") ###### Add column names you want to change here #####
   dataset_updated[select_columns]<- sapply( dataset_updated[select_columns],as.character)
  paste0("Code chunk executed successfully. Datatype of selected column(s) have been changed into numerical.")
  #str( dataset_updated)
}else if(want_to_change_column_type == "yes" && change_column_to_type == "numeric"){
  select_columns <- c('gearbox_oil_temperature')
   dataset_updated[select_columns]<- sapply( dataset_updated[select_columns],as.numeric)
  paste0("Code chunk executed successfully. Datatype of selected column(s) have been changed into categorical.")
  #str( dataset_updated)
}else{
  paste0("Datatype of columns have not been changed.") %>% cat("\n")
}
## Datatype of columns have not been changed.
dataset_updated <- do.call(data.frame, dataset_updated)
str( dataset_updated)
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.5 Checking and Removing Duplicates

The presence of duplicate observations can be misleading; this section helps get rid of such rows in the dataset.

want_to_remove_duplicates <- "yes"  ## type "no" for choosing to not remove duplicates

## removing duplicate observation if present in the dataset
if(want_to_remove_duplicates == "yes"){
  
   dataset_updated <-  dataset_updated %>% unique()
  paste0("Code chunk executed successfully, duplicates if present successfully removed. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
  cat("\n")
  str( dataset_updated) ## showing summary for updated dataset
} else{
  paste0("Code chunk executed successfully, NO duplicates were removed") %>% print()
}
## [1] "Code chunk executed successfully, duplicates if present successfully removed. Updated dataset has 12000 row(s) and 3 column(s)"
## 
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.6 List of Numerical and Categorical Column Names

# Return the column type 
CheckColumnType <- function(dataVector) {
  #Check if the column type is "numeric" or "character" & decide type accordingly
  if (class(dataVector) == "integer" || class(dataVector) == "numeric") {
    columnType <- "numeric"
  } else { columnType <- "character" }
  #Return the result
  return(columnType)
}
### Loading the list of numeric columns in variable
numeric_cols <<- colnames( dataset_updated)[unlist(sapply( dataset_updated, 
                                                       FUN = function(x){ CheckColumnType(x) == "numeric"}))]

### Loading the list of categorical columns in variable
cat_cols <- colnames( dataset_updated)[unlist(sapply( dataset_updated, 
                                                   FUN = function(x){ 
                                                     CheckColumnType(x) == "character"|| CheckColumnType(x) == "factor"}))]

### Removing Date Column from the list of categorical column
paste0("Code chunk executed successfully, list of numeric and categorical variables created.") %>% cat()
## Code chunk executed successfully, list of numeric and categorical variables created.
paste0("Numerical Column(s): \n Count : ", length(numeric_cols), "\n") %>% cat()
## Numerical Column(s): 
##  Count : 3
paste0(numeric_cols) %>% print()
## [1] "x" "y" "z"
paste0("Categorical Column(s): \n Count : ", length(cat_cols), "\n") %>% cat()
## Categorical Column(s): 
##  Count : 0
paste0(cat_cols) %>% print()
## character(0)

6.7 Filtering Dataset for Analysis

In this section, the dataset can be filtered for required row(s) for further analysis.

want_to_filter_dataset <- "no" ## type "yes" in case you want to filter
filter_col <- " "  ## Enter Column name to filter
filter_val <- " "  ## Enter Value to exclude for the column selected

if(want_to_filter_dataset == "yes"){
   dataset_updated <- filter_at( dataset_updated
                              , vars(contains(filter_col))
                              , all_vars(. != filter_val))
  
  paste0("Code chunk executed successfully, dataset filtered successfully on required columns. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
  cat("\n")
  str( dataset_updated) ## showing summary for updated dataset
  
} else{
  paste0("Code chunk executed successfully, entire dataset is available for analysis.") %>% print()
}
## [1] "Code chunk executed successfully, entire dataset is available for analysis."

6.8 Missing Value Analysis

Missing values in the training data can lead to a biased model because the behavior and relationship of those values with other variables have not been analyzed correctly, which can result in wrong calculations or classifications. Missing values are examined below at the dataset level and at the column level, followed by treatment options.

Missing Value on Entire dataset

na_total <- sum(is.na( dataset_updated))/prod(dim( dataset_updated))
if(na_total == 0){
  paste0("In the uploaded dataset, there is no missing value") %>% cat("\n")
}else{
  na_percentage <- paste0(sprintf(na_total*100, fmt = '%#.2f'),"%")
  paste0("Percentage of missing value in entire dataset is ",na_percentage) %>% cat("\n")
}
## In the uploaded dataset, there is no missing value

Missing Value on Column-level

The following code visualizes the missing values (if any) using a bar chart.

The gg_miss_upset function (from the naniar package) is used to visualize the patterns of missingness, or rather the combinations of missingness across cases.

If any missing values are present, this code reports the number of missing values in each column and plots the pattern of missingness.

# Below code gives you missing value in each column
paste0("Number of missing value in each column") %>% cat("\n")
## Number of missing value in each column
print(sapply( dataset_updated, function(x) sum(is.na(x))))
## x y z 
## 0 0 0
missing_col_names <- names(which(sapply( dataset_updated, anyNA)))

total_na <- sum(is.na( dataset_updated))
# visualize the missing values (if any) using bar chart
if(total_na > 0 && length(missing_col_names) > 1){
  paste0("Code chunk executed successfully. Visualizing the missing values using bar chart") %>% cat("\n")
  gg_miss_upset( dataset_updated,
  nsets = 10,
  nintersects = NA)
}else if(total_na > 0){
   dataset_updated %>%
  DataExplorer::plot_missing() 
}else{
  paste("Code chunk executed successfully. No missing value exist.") %>% cat("\n")
}
## Code chunk executed successfully. No missing value exist.

Missing Value Treatment

In this section the user can decide how to handle missing values in the dataset. Both column(s) and row(s) can be removed from the dataset below if the user chooses to do so.

Drop Column(s) with Missing Values

The below code accepts user input and deletes the specified column.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# OR do you want to drop column specific column
drop_cloumn_name_na <- "yes" ## type "yes" to drop column(s)
# write column name that you want to drop
drop_column_name <- c(" ") #enter column name
if(drop_cloumn_name_na == "yes"){
  names_df=names( dataset_updated) %in% drop_column_name
  dataset_updated <-  dataset_updated[ , which(!names( dataset_updated) %in% drop_column_name)]
  paste0("Code chunk executed, selected column(s) dropped successfully.") %>% print()
  cat("\n")
  str( dataset_updated)
} else {
  paste0("Code chunk executed, missing value not removed (if any).") %>% cat("\n")
  cat("\n")
}
## [1] "Code chunk executed, selected column(s) dropped successfully."
## 
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

Drop Row(s) with Missing Values

The below code accepts user input and deletes rows.

# Do you want to drop row(s) containing "NA"
drop_row <- "no" ## type "yes" to delete missing value observations
if(drop_row == "yes"){
   dataset_updated <-  dataset_updated %>% na.omit()
   paste0("Code chunk executed, missing values successfully identified and removed. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
  cat("\n")
} else{
  paste0("Code chunk executed, missing value(s) not removed (if any).") %>% cat("\n")
  cat("\n")
}
## Code chunk executed, missing value(s) not removed (if any).

6.8.1 One-Hot Encoding

This technique encodes each categorical value as a binary (1 or 0) indicator column. It is applied to categorical variables because the models used here can only handle features that have numeric values.

Given below is the length of unique values in each categorical column

cat_cols <-
  colnames(dataset_updated)[unlist(sapply(
    dataset_updated,
    FUN = function(x) {
      CheckColumnType(x) == "character" ||
        CheckColumnType(x) == "factor"
    }
  ))]

apply(dataset_updated[cat_cols], 2, function(x) {
  length(unique(x))
})
## integer(0)

Selecting categorical columns with a small number of unique values for dummification

########################################################################################
################################## User Input Needed ###################################
########################################################################################
# Do you want to dummify the categorical variables?

dummify_cat <- FALSE ## TRUE,FALSE

# Select the columns on which dummification is to be performed
dum_cols <- c(" "," ") #enter column name in smalls
## [1] "One-Hot Encoding was not performed on dataset."

6.8.2 Check for Singularity

# Check data for singularity
singular_cols <- sapply(dataset_updated,function(x) length(unique(x))) %>%  # convert to dataframe
  data.frame(Unique_n = .) %>% dplyr::filter(Unique_n == 1) %>% 
  rownames() %>% data.frame(Constant_Variables = .)

if(nrow(singular_cols) != 0) {                              
  singular_cols  %>% DT::datatable()
} else {
  paste("There are no singular columns in the dataset") %>% htmltools::HTML()
}
There are no singular columns in the dataset
# Display variance of columns
data <- dataset_updated %>% dplyr::summarise_if(is.numeric, var) %>% t() %>% 
  data.frame() %>% round(3) #%>% DT::datatable(colnames = "Variance")

colnames(data) <- c("Variance")
displayTable(data)
Variance
x 2.239
y 2.214
z 0.499

6.8.3 Selecting only Numeric Cols after Dummification

numeric_cols=as.vector(sapply(dataset_updated, is.numeric))
dataset_updated=dataset_updated[,numeric_cols]
colnames(dataset_updated)
## [1] "x" "y" "z"

6.9 Final Dataset Summary

All further operations will be performed on the following dataset.

nums <- colnames(dataset_updated)[unlist(lapply(dataset_updated, is.numeric))]
cat(paste0("Final data frame contains ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s).","Code chunk executed. Below table showing first 10 row(s) of the dataset."))
## Final data frame contains 12000 row(s) and 3 column(s).Code chunk executed. Below table showing first 10 row(s) of the dataset.
dataset_updated <-  dataset_updated %>% mutate_if(is.numeric, round, digits = 4)

displayTable(dataset_updated[1:10,])
x y z
-1.0020 -2.3335 -0.8420
1.1021 -2.7447 -0.2878
-1.0033 1.2656 -0.9229
1.3204 0.6205 -0.8410
1.2998 -1.2470 -0.9801
-1.9606 2.0755 -0.5184
0.2807 -0.9724 0.1548
-1.5540 2.0564 -0.8164
-2.4653 1.6586 -0.2377
-2.3234 1.6933 -0.4841

7. Data distribution

This section displays four objects.

Variable Histograms: The histogram distribution of all the features in the dataset.

Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.

Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.

Summary EDA: The table provides descriptive statistics for all the features in the dataset.

It uses an inbuilt function called edaPlots to display the above-mentioned four objects.

Summary Table

edaPlots(dataset_updated, output_type = "summary", n_cols = 3)

Histograms

edaPlots(dataset_updated, output_type = "histogram", n_cols = 3)

Box Plots

edaPlots(dataset_updated, output_type = "boxplot", n_cols = 3)

Correlation Plot

edaPlots(dataset_updated, output_type = "correlation", n_cols = 3)

7.1 Train - Test Split

Let us split the data into train and test. We will randomly select 80% of the data as train and remaining as test.

## 80% of the sample size
smp_size <- floor(0.80 * nrow(dataset_updated))

## set the seed to make your partition reproducible
set.seed(279)
train_ind <- sample(seq_len(nrow(dataset_updated)), size = smp_size)

dataset_updated_train <- dataset_updated[train_ind, ]
dataset_updated_test <- dataset_updated[-train_ind, ]

The train data contains 9600 rows and 3 columns. The test data contains 2400 rows and 3 columns.

7.1.1 Train Distribution

Summary Table

edaPlots(dataset_updated_train, output_type = "summary", n_cols = 3)

Histograms

edaPlots(dataset_updated_train, output_type = "histogram", n_cols = 3)

Box Plots

edaPlots(dataset_updated_train, output_type = "boxplot", n_cols = 3)

Correlation Plot

edaPlots(dataset_updated_train, output_type = "correlation", n_cols = 3)

7.1.2 Test Distribution

Summary Table

edaPlots(dataset_updated_test, output_type = "summary", n_cols = 3)

Histograms

edaPlots(dataset_updated_test, output_type = "histogram", n_cols = 3)

Box Plots

edaPlots(dataset_updated_test, output_type = "boxplot", n_cols = 3)

Correlation Plot

edaPlots(dataset_updated_test, output_type = "correlation", n_cols = 3)

8. Model Training

As described in Section 3.1, trainHVT calls hvq to produce the hierarchical quantized data, projects the resulting centroids to 2D with the sammon function from the MASS package, and constructs and plots the Voronoi tessellations with the deldir package, clipping the tiles of deeper levels so that they stay within their parent polygons.

Let us try to understand the trainHVT function first.

trainHVT(
  dataset,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  normalize = TRUE,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  projection.scale,
  dim_reduction_method = c("sammon" , "tsne" , "umap")
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio,
  tsne_perplexity,tsne_theta,tsne_verbose,
  tsne_eta,tsne_max_iter,
  umap_n_neighbors,umap_min_dist
)

Each of the parameters of the trainHVT function is explained below:

The output of the trainHVT function (a list of 7 elements) is explained below, with an image attached for clearer understanding.

NOTE: The attached image is a snapshot of the output list generated from model training and can be referred to later in this section.

Figure 3: The Output list generated by trainHVT function.

More information on building an HVT model at different levels and visualizing the output can be found here.

In the section below, we build a Level 1 HVT model. The number of cells (n_cells) is set to 500.

hvt.results <-trainHVT(dataset_updated_train,
                          n_cells = 500,
                          depth = 1,
                          quant.err = 0.1,
                          normalize = FALSE,
                          distance_metric = "L2_Norm",
                          error_metric = "max",
                          quant_method = "kmeans",
                          diagnose = TRUE,
                          hvt_validation = TRUE,
                          train_validation_split_ratio=0.8,
                          dim_reduction_method = "sammon")

Initial stress        : 0.01925
stress after  10 iters: 0.01507, magic = 0.500
stress after  20 iters: 0.01507, magic = 0.500

displayTable(hvt.results[[3]][['compression_summary']])
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 500 450 0.9 n_cells: 500 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As seen in the above table, 90% of the cells have a quantization error below the threshold.

Let’s have a closer look at the Quant.Error of the cells. Here we show just the top 100 rows for brevity.

displayTable(data =hvt.results[[3]][['summary']])
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 16 1 0.085 -2.5043 1.6314 0.0128
1 1 2 20 57 0.1056 -0.8851 2.6890 0.5198
1 1 3 19 215 0.0757 0.2462 1.4518 -0.8469
1 1 4 15 317 0.0462 0.9971 0.1269 -0.1018
1 1 5 18 469 0.0955 2.8862 0.2697 -0.4118
1 1 6 12 156 0.0664 -1.2610 -0.9961 0.9169
1 1 7 7 481 0.0605 2.0628 -1.6366 0.7702
1 1 8 19 173 0.0592 -1.0148 -0.4740 0.4757
1 1 9 11 337 0.0474 1.2020 0.2838 -0.6422
1 1 10 19 147 0.0637 -0.8002 0.9398 -0.6382
1 1 11 17 119 0.0602 -1.2857 0.3566 -0.7428
1 1 12 13 235 0.0549 -0.3764 -0.9945 0.3487
1 1 13 13 345 0.0499 0.7779 -0.8996 0.5824
1 1 14 14 322 0.049 0.7621 -0.7313 -0.3297
1 1 15 19 38 0.1071 -2.5437 -1.5681 0.0274
1 1 16 14 202 0.0869 0.3834 2.2422 -0.9543
1 1 17 13 180 0.0515 -0.9200 -0.8070 -0.6288
1 1 18 13 200 0.0483 -0.7782 -0.7013 0.3026
1 1 19 11 172 0.039 -0.9769 -0.4964 -0.4267
1 1 20 7 44 0.0488 -2.2419 0.6013 -0.9435
1 1 21 22 203 0.0531 -0.0623 1.2179 -0.6222
1 1 22 19 309 0.0605 0.1553 -1.4426 0.8334
1 1 23 13 4 0.0908 -2.2674 1.8196 -0.3920
1 1 24 17 42 0.0951 -2.6043 -1.1370 0.5171
1 1 25 12 422 0.0689 1.8880 -0.1159 0.9890
1 1 26 19 175 0.0636 -0.5551 0.9269 -0.3971
1 1 27 8 129 0.0468 -1.3295 -0.3648 -0.7822
1 1 28 20 416 0.0687 2.2832 0.9731 -0.8698
1 1 29 10 353 0.0434 0.9973 -0.9066 -0.7559
1 1 30 11 258 0.0561 0.5735 0.8548 -0.2362
1 1 31 12 494 0.0811 1.7290 -2.4385 -0.0562
1 1 32 15 254 0.061 0.5261 1.0950 0.6210
1 1 33 13 286 0.0522 0.3298 -0.9508 -0.1011
1 1 34 15 410 0.1093 1.8365 2.2859 -0.3242
1 1 35 9 312 0.0429 1.1547 0.6303 -0.7266
1 1 36 35 126 0.1046 -1.2236 -2.6458 -0.3519
1 1 37 14 397 0.107 1.9521 1.7699 -0.7555
1 1 38 10 210 0.0803 0.5334 2.8564 0.4039
1 1 39 15 411 0.0912 1.7384 2.1866 0.5855
1 1 40 14 213 0.0786 0.6242 2.8695 -0.3179
1 1 41 10 340 0.0503 0.9251 -0.7349 -0.5767
1 1 42 9 265 0.0449 0.7468 1.0075 -0.6651
1 1 43 13 256 0.0548 -0.0500 -1.0343 -0.2651
1 1 44 15 7 0.0847 -2.8902 0.6925 0.1851
1 1 45 12 398 0.0594 1.4908 -0.5839 0.9140
1 1 46 8 217 0.0493 0.0586 1.0962 -0.4322
1 1 47 18 420 0.1104 0.5761 -2.4941 0.8164
1 1 48 24 161 0.059 -0.9058 0.4330 0.0810
1 1 49 20 271 0.089 -0.4221 -2.4065 0.8811
1 1 50 28 190 0.0785 -0.2653 1.2979 0.7324
1 1 51 17 417 0.089 1.9581 1.1965 0.9514
1 1 52 23 112 0.0596 -1.4479 0.1639 0.8373
1 1 53 16 436 0.0763 2.2330 -0.4262 -0.9544
1 1 54 14 105 0.0623 -1.6477 -0.8579 -0.9862
1 1 55 14 310 0.0404 0.6567 -0.7805 -0.1951
1 1 56 16 145 0.0788 0.0785 2.5186 -0.8481
1 1 57 20 101 0.0695 -1.3314 0.7496 -0.8790
1 1 58 28 28 0.1169 -1.3507 2.4653 0.5585
1 1 59 18 3 0.082 -2.7258 1.2293 -0.0555
1 1 60 16 458 0.0786 2.6461 0.8808 0.6020
1 1 61 13 448 0.0909 2.5477 1.5254 -0.1933
1 1 62 6 497 0.0553 2.4053 -1.6289 0.4151
1 1 63 18 74 0.099 -2.1565 -1.0329 0.9085
1 1 64 15 106 0.0829 -1.7278 -0.8237 0.9903
1 1 65 24 15 0.1057 -2.8328 0.2844 0.5076
1 1 66 15 445 0.0841 1.0529 -2.2667 0.8588
1 1 67 19 426 0.0621 1.7760 -0.8549 0.9945
1 1 68 15 133 0.0841 -1.4151 -1.2917 0.9928
1 1 69 17 21 0.1024 -1.8727 1.8759 0.7471
1 1 70 11 22 0.0855 -2.7889 -0.1194 -0.5972
1 1 71 18 425 0.0812 2.0334 1.8479 0.6542
1 1 72 14 152 0.0558 -1.2178 -0.5852 0.7594
1 1 73 18 178 0.0524 -0.9222 -0.3876 0.0185
1 1 74 9 292 0.0439 0.9819 0.8353 -0.7014
1 1 75 7 60 0.0577 -2.0354 0.3946 -0.9946
1 1 76 19 184 0.0957 0.0769 2.2161 0.9656
1 1 77 10 406 0.0528 1.0878 -1.4912 0.9851
1 1 78 16 114 0.0762 -0.4937 2.0190 -0.9916
1 1 79 22 154 0.0805 -1.1446 -1.3244 -0.9625
1 1 80 19 174 0.089 -0.9800 -1.8102 -0.9944
1 1 81 11 20 0.1069 -2.9200 -0.3489 -0.3049
1 1 82 20 367 0.0585 1.2396 -0.2282 0.6715
1 1 83 18 194 0.079 -0.9804 -1.5473 0.9802
1 1 84 10 275 0.0379 0.0616 -1.1076 0.4520
1 1 85 14 483 0.0814 2.8612 -0.0229 0.4950
1 1 86 19 239 0.0811 -0.6117 -1.8982 0.9896
1 1 87 15 108 0.1132 -0.4173 2.3374 0.9180
1 1 88 15 14 0.0749 -2.6803 0.8072 0.5906
1 1 89 14 487 0.0933 2.9542 -0.3518 0.1693
1 1 90 15 49 0.078 -2.3255 -0.1169 0.9387
1 1 91 11 18 0.0661 -2.5622 0.9349 -0.6795
1 1 92 17 198 0.0676 -0.8917 -1.0186 0.7597
1 1 93 14 70 0.1113 -2.2507 -0.7708 0.9096
1 1 94 18 130 0.0671 -1.3785 -0.2504 0.7958
1 1 95 20 31 0.0843 -2.7571 -0.4438 0.5964
1 1 96 18 228 0.0681 0.4626 2.1152 0.9792
1 1 97 13 477 0.0839 1.6383 -2.3006 -0.5530
1 1 98 11 472 0.0849 2.6431 -0.0930 0.7589
1 1 99 16 77 0.0977 -0.8876 2.2554 0.8966
1 1 100 19 253 0.068 0.6705 1.2463 -0.8092

Let’s take a look at the 2D HVT plot.

plotHVT(hvt.results,
        line.width = c(0.6 ),
        color.vec = c("black"),
        centroid.size = 1,
        maxDepth = 1, 
        plot.type = '2Dhvt')

8.1 Model Diagnostics and Validation

8.1.1 Diagnostics

HVT model diagnostics are used to evaluate the model fit and to investigate the proximity between centroids. The distribution of proximity values can also be used to decide an optimum Mean Absolute Deviation threshold for HVT model based scoring.

The diagnosis can be enabled by setting the diagnose parameter to TRUE while building the HVT Model.

8.1.2 Validation

Model validation is used to measure the fit/quality of the model, and measuring model fit is the key to iteratively improving it. The relevant measure of model quality here is the percentage of anomalous points, which should ideally match the level of compression achieved during modeling, where PercentageAnomalies \(\approx\) 1 - ModelCompression.

Model Validation can be enabled by setting the hvt_validation parameter to TRUE and setting the train_validation_split_ratio value while training the HVT Model.

The model trained above has a train_validation_split_ratio of 0.8, i.e. 80% of the train dataset is used for training the model while the remaining 20% is used for validation.

Note: User can skip this step, if the number of observations in train data is low.

8.2 Diagnostic Plots

The basic tool for examining the model fit is proximity plots and distribution of observations across centroids.

The proximity between objects can be measured with a distance matrix: the distances between the objects are calculated using the Manhattan or Euclidean distance and put into matrix form. In the next step we find the minimum value in each row, excluding the diagonal, since the diagonal elements of a distance matrix are zero, representing the distance from an object to itself. This minimum value gives the proximity (distance to the nearest neighbour) of each object in the data table.
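A minimal sketch of this proximity calculation is given below, assuming centroids is a numeric matrix of the HVT cell centroids.

d <- as.matrix(dist(centroids, method = "manhattan"))  # pairwise Manhattan distances
diag(d) <- Inf                                         # ignore the zero distance of each centroid to itself
nearest_neighbour_dist <- apply(d, 1, min)             # proximity (nearest-neighbour distance) per centroid
hist(nearest_neighbour_dist, breaks = 30)              # distribution examined in the diagnostic plots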

The plotModelDiagnostics() function can be used to print diagnostic plots for an HVT model or for HVT scoring.

For an HVT model, the plotModelDiagnostics() function provides 5 diagnostic plots, described in the subsections below:

Let’s have a look at the plotModelDiagnostics function, which we will use to print the diagnostic plots.

plotModelDiagnostics(hvt.results)

8.2.1 Mean Absolute Deviation based Calibration: HVT Model | Train Data

The first diagnostic plot is a calibration plot for the HVT model run on the train data. This plot is obtained by scoring the train data itself against the HVT model. It compares the percentage of anomalies at varying Mean Absolute Deviation values. It can be seen from the plot that at a Mean Absolute Deviation value of 0.1 the percentage of anomalies drops below one percent.

p3=hvt.results[[4]]$mad_plot_train+ggtitle("Mean Absolute Deviation Plot: Calibration: HVT Model | Train Data")
p3

8.2.2 Minimum Intra-DataPoint Distance Plot: Train Data

The second diagnostic plot helps us understand how the points in the training data are distributed. Shown below is a histogram of the minimum distance to the nearest neighbour for each observation in the train data.

p1=hvt.results[[4]]$datapoint_plot+ggtitle("Minimum Intra-DataPoint Distance Plot: Train Data")
p1

As seen in the plot above the mean value is 0.02.

8.2.3 Minimum Intra-Centroid Distance Plot: HVT Model | Train Data

The third diagnostic plot helps us understand how the centroids in the HVT model are distributed. Shown below is a histogram of the minimum distance to the nearest neighbour for each centroid in the HVT model.

p2=hvt.results[[4]]$cent_plot+ggtitle("Minimum Intra-Centroid Distance Plot: HVT Model | Train Data")
p2

As seen in the plot above, the mean value is 0.8. This value can be selected as the Mean Absolute Deviation threshold when scoring data using the scoreHVT function.

8.2.4 Distribution of Number of Observations in HVT Cells

The fourth diagnostic plot shows the distribution of the number of observations in each cell. Shown below is a histogram depicting the same.

p4=hvt.results[[4]]$number_plot+ggtitle("Distribution of Number of Observations in Cells: HVT Model | Train Data")
p4
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As shown in the plot above the mean number of records in each HVT cell is 15.

8.2.5 Singleton Count

The fifth diagnostic plot shows the number of singleton cells (segments/centroids containing a single observation).

p5=hvt.results[[4]]$singleton_piechart
p5

Validation

The Mean Absolute Deviation plot for the validation data has been shown in the section above. Alternatively, to fetch it separately, we can use the following code:

m1=hvt.results[[5]][["mad_plot"]]+ggtitle("Mean Absolute Deviation Plot:Validation")
m1

As seen in the plot the mean absolute deviation for the validation data from the given dataset_updated_train is 0.11.

9. Scoring

Now once we have built the model, let us try to score using our test dataset to see which cell each point belongs to.

9.1 Scoring Algorithm

The scoring algorithm recursively calculates the distance between each point in the test dataset and the cell centroids at each level. The following steps explain the scoring method for a single point in the test dataset (a minimal sketch of the nearest-centroid assignment follows the list):

  1. Calculate the distance between the point and the centroid of all the cells in the first level.
  2. Find the cell whose centroid has minimum distance to the point.
  3. Check if the cell drills down further to form more cells.
  4. If it doesn’t, return the path. Or else repeat steps 1 to 4 till we reach a level at which the cell doesn’t drill down further.
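Below is a minimal sketch of the nearest-centroid assignment used in steps 1 and 2 for a single level, assuming centroids is a k x d matrix of cell centroids and test_point is a numeric vector of length d; scoreHVT applies this logic recursively across levels.

nearest_cell <- function(test_point, centroids, p = 1) {
  d <- apply(centroids, 1, function(a) sum(abs(test_point - a)^p)^(1 / p))  # distance to every centroid
  which.min(d)  # index of the closest cell
}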

9.2 Load Test Data

The user can provide an absolute or relative path in the cell below to access the test data from his/her computer.

load_test_data=FALSE
if(load_test_data){
  file_name <- " " #enter the name of the local file for validation
  file_path <- " " #enter the path of the local file for validation
  file_load <- paste0(file_path, file_name)
  dataset_updated_test <- as.data.frame(fread(file_load))
  
  if(nrow(dataset_updated_test) > 0){

    paste0("File ", file_name, " having ", nrow(dataset_updated_test), " row(s) and ",
ncol(dataset_updated_test), " column(s)",  " imported successfully. ") %>% cat("\n")
dataset_updated_test <- dataset_updated_test %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. Below table showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated_test %>% head(10) %>%as.data.frame() %>%DT::datatable(options = options, rownames = TRUE)
  }
  
  colnames( dataset_updated_test) <- colnames( dataset_updated_test) %>% casefold()
 dataset_updated_test <- spaceless( dataset_updated_test)
}

9.2.1 Transformation of Categorical Features

In this section we will perform one-hot encoding on the test dataset, provided one-hot encoding was performed on the train dataset.

if(dummify_cat){
dummified_cols_test <- dataset_updated_test %>% dplyr::select(dum_cols) %>%
  dummies::dummy.data.frame(dummy.classes = "ALL", sep = "_")

names(dummified_cols_test) <- gsub(pattern = "[^[:alnum:]]+",
                             replacement = ".",
                             names(dummified_cols_test))

columns_difference=setdiff(dummified_cols,names(dummified_cols_test))
dummified_cols_test[,columns_difference]=0

dataset_updated_test <- dataset_updated_test %>% cbind(dummified_cols_test) %>%     
# append encoded columns
dplyr::select(-dum_cols)
# remove the old categorical columns
dummified_cols_test %>% head(5) %>% DT::datatable(options = options)
}

9.2.2 Subsetting Test Dataset

In this section we will subset the test data based on numeric columns present in train data.

dataset_updated_test=dataset_updated_test %>% dplyr::select(nums)

Now that we have the test data ready, let’s look at the scoreHVT function signature.

9.3 Scoring on Test Data

scoreHVT(dataset,
         hvt.results,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         distance_metric,
         error_metric,
         yVar,
         analysis.plots, 
         names.column)

The important parameters for the scoreHVT function are described below.

The arguments below are used only when a character column can be mapped onto the scored results. Since torus doesn’t have a character column, we are not using them in this vignette.

Here mad.threshold has been set to 0.8, based on the mean of the Minimum Intra-Centroid Distance plot shown above.

hvt.score <- scoreHVT(dataset_updated_test,
                      hvt.results,
                      child.level = 1,
                      mad.threshold = 0.8, 
                      line.width = c(0.6, 0.4, 0.2),
                      color.vec = c("navyblue", "slateblue", "lavender"),
                      distance_metric = "L1_Norm",
                      error_metric = "max")
displayTable(hvt.score[["scoredPredictedData"]], value = 0.8)
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error centroidRadius diff anomalyFlag x y z
1 1 153 1 182 0.0764 0.3141 0.2377 0 -1.0020 -2.3335 -0.8420
1 1 217 1 17 0.0633 0.2037 0.1404 0 -2.1770 1.5699 -0.7295
1 1 162 1 475 0.1043 0.1806 0.0763 0 2.0941 -1.8907 -0.5704
1 1 445 1 78 0.1248 0.3166 0.1918 0 -0.1980 2.9839 -0.1378
1 1 436 1 123 0.1272 0.2633 0.1361 0 -1.2495 -1.9487 -0.9491
1 1 269 1 413 0.1154 0.1992 0.0838 0 2.0634 0.0815 -0.9979
1 1 191 1 347 0.0521 0.1879 0.1358 0 1.5182 0.9935 -0.9826
1 1 172 1 242 0.0443 0.0816 0.0373 0 0.4102 0.9552 -0.2784
1 1 39 1 411 0.1215 0.2735 0.1520 0 1.5409 2.3249 0.6142
1 1 113 1 197 0.0838 0.1432 0.0594 0 -0.9558 -0.6764 0.5591
1 1 237 1 186 0.072 0.1856 0.1136 0 -0.0790 1.8156 -0.9832
1 1 204 1 273 0.0392 0.1506 0.1113 0 0.7280 0.6856 0.0014
1 1 59 1 3 0.1355 0.2460 0.1105 0 -2.8278 1.0012 0.0209
1 1 209 1 368 0.0661 0.2251 0.1590 0 0.3594 -1.8433 -0.9925
1 1 92 1 198 0.0623 0.2029 0.1405 0 -0.9926 -0.9557 0.7829
1 1 91 1 18 0.0686 0.1984 0.1298 0 -2.5097 1.0868 -0.6782
1 1 237 1 186 0.0526 0.1856 0.1329 0 -0.1015 1.7005 -0.9550
1 1 471 1 404 0.0618 0.2626 0.2008 0 1.7236 1.5863 0.9395
1 1 157 1 452 0.026 0.1888 0.1628 0 1.6227 -1.6186 0.9564
1 1 257 1 377 0.0682 0.1637 0.0955 0 1.3105 -0.4772 0.7960
1 1 333 1 170 0.058 0.1101 0.0522 0 -1.0363 -0.2099 0.3338
1 1 452 1 354 0.0986 0.1824 0.0838 0 1.3842 0.3319 0.8170
1 1 435 1 107 0.0493 0.2171 0.1679 0 -1.2680 1.0389 0.9327
1 1 211 1 34 0.0734 0.2149 0.1414 0 -2.6369 -0.5918 -0.7117
1 1 384 1 236 0.0638 0.2220 0.1582 0 0.7185 2.1057 -0.9744
1 1 349 1 440 0.0985 0.2480 0.1495 0 2.3827 0.6634 0.8809
1 1 486 1 127 0.1209 0.2863 0.1654 0 -1.7146 -1.6026 0.9379
1 1 154 1 311 0.0723 0.2856 0.2133 0 1.3901 2.2967 -0.7289
1 1 119 1 148 0.0545 0.1760 0.1216 0 -1.1141 -0.0945 -0.4715
1 1 62 1 497 0.0429 0.1659 0.1231 0 2.4161 -1.5692 0.4732
1 1 457 1 388 0.0724 0.2316 0.1592 0 1.7900 0.0487 -0.9779
1 1 425 1 181 0.0362 0.1801 0.1439 0 -0.3151 1.1740 -0.6202
1 1 361 1 27 0.1781 0.3374 0.1593 0 -1.4384 2.6024 -0.2288
1 1 322 1 252 0.0704 0.1364 0.0661 0 0.5732 0.8297 0.1298
1 1 357 1 94 0.1396 0.2812 0.1416 0 -2.0384 -1.7248 -0.7422
1 1 183 1 261 0.0891 0.1566 0.0675 0 0.4926 0.9927 0.4523
1 1 317 1 418 0.0205 0.2420 0.2215 0 2.1445 0.4366 -0.9821
1 1 93 1 70 0.1438 0.3340 0.1902 0 -2.1026 -0.5622 0.9843
1 1 141 1 304 0.0736 0.2559 0.1822 0 1.0673 1.6906 1.0000
1 1 267 1 315 0.0533 0.1378 0.0845 0 1.0620 0.2168 -0.4009
1 1 57 1 101 0.0997 0.2086 0.1088 0 -1.2506 0.9453 -0.9017
1 1 97 1 477 0.0881 0.2516 0.1635 0 1.7293 -2.3387 -0.4177
1 1 95 1 31 0.0598 0.2529 0.1931 0 -2.7339 -0.5932 0.6032
1 1 52 1 112 0.0617 0.1788 0.1171 0 -1.4642 0.0022 0.8444
1 1 47 1 420 0.1156 0.3313 0.2156 0 0.4330 -2.3813 0.9074
1 1 319 1 459 0.096 0.2466 0.1506 0 2.7641 0.7309 -0.5118
1 1 202 1 29 0.0745 0.2193 0.1448 0 -2.5086 0.7689 0.7815
1 1 166 1 386 0.0379 0.2457 0.2078 0 1.4787 -0.2794 0.8688
1 1 252 1 196 0.0533 0.1539 0.1006 0 -0.7918 -0.7288 -0.3828
1 1 402 1 234 0.0555 0.1477 0.0922 0 -0.4083 -1.0176 -0.4286
1 1 282 1 245 0.0933 0.1726 0.0794 0 0.3045 1.0556 0.4330
1 1 423 1 176 0.0141 0.2224 0.2083 0 -0.5818 0.9918 0.5266
1 1 68 1 133 0.0566 0.2522 0.1956 0 -1.3209 -1.2284 0.9806
1 1 123 1 111 0.0699 0.2311 0.1612 0 -1.3814 0.7997 0.9148
1 1 147 1 480 0.098 0.2042 0.1061 0 2.0533 -1.8681 0.6308
1 1 292 1 343 0.0353 0.1916 0.1564 0 0.9554 -0.3904 0.2514
1 1 175 1 429 0.0859 0.2246 0.1387 0 1.9167 -0.6986 -0.9992
1 1 241 1 241 0.1479 0.2439 0.0959 0 0.4723 2.6730 0.6997
1 1 370 1 324 0.0279 0.1619 0.1341 0 0.5408 -0.9755 0.4664
1 1 291 1 468 0.0325 0.2760 0.2436 0 2.8851 0.7376 0.2091
1 1 478 1 52 0.0728 0.2279 0.1551 0 -1.7929 1.0316 -0.9977
1 1 89 1 487 0.0485 0.2798 0.2313 0 2.9771 -0.3583 0.0534
1 1 25 1 422 0.0421 0.2068 0.1647 0 1.9528 -0.0645 0.9989
1 1 97 1 477 0.0492 0.2516 0.2024 0 1.6524 -2.2377 -0.6237
1 1 86 1 239 0.0361 0.2433 0.2072 0 -0.5211 -1.8901 0.9992
1 1 492 1 36 0.0706 0.1235 0.0529 0 -2.3845 0.6975 -0.8749
1 1 477 1 41 0.0712 0.2101 0.1389 0 -2.4536 0.1933 -0.8873
1 1 28 1 416 0.0566 0.2060 0.1494 0 2.2249 1.0713 -0.8830
1 1 153 1 182 0.1569 0.3141 0.1572 0 -0.8697 -2.1554 -0.9460
1 1 13 1 345 0.0761 0.1496 0.0735 0 0.8484 -0.9514 0.6884
1 1 261 1 374 0.0411 0.1669 0.1257 0 0.6624 -1.5950 -0.9620
1 1 363 1 464 0.0593 0.2063 0.1470 0 2.2880 -0.8983 0.8890
1 1 177 1 307 0.0037 0.1773 0.1736 0 0.3260 -1.1945 0.6478
1 1 55 1 310 0.0716 0.1211 0.0495 0 0.6464 -0.8397 -0.3404
1 1 200 1 415 0.1246 0.2336 0.1089 0 1.6026 2.5123 0.1994
1 1 244 1 342 0.0622 0.1769 0.1147 0 1.1013 -0.0970 0.4471
1 1 460 1 230 0.0622 0.2442 0.1820 0 -0.4813 -1.5246 -0.9160
1 1 424 1 372 0.083 0.2654 0.1824 0 1.3371 -0.1317 -0.7544
1 1 73 1 178 0.0482 0.1572 0.1090 0 -0.9502 -0.3186 0.0659
1 1 102 1 462 0.0623 0.2230 0.1607 0 2.0180 -1.1153 0.9521
1 1 198 1 2 0.1735 0.2656 0.0922 0 -2.0152 2.0241 0.5166
1 1 366 1 100 0.0822 0.2644 0.1822 0 -1.0988 1.5419 -0.9943
1 1 262 1 6 0.1624 0.2576 0.0953 0 -1.9070 2.2917 -0.1921
1 1 268 1 205 0.0861 0.2128 0.1267 0 -0.7320 -1.3653 -0.8926
1 1 33 1 286 0.0814 0.1565 0.0751 0 0.3054 -0.9587 0.1109
1 1 311 1 281 0.0287 0.1955 0.1668 0 0.7924 1.1246 0.7812
1 1 48 1 161 0.0934 0.1770 0.0837 0 -0.8322 0.5737 0.1468
1 1 496 1 206 0.0658 0.1103 0.0445 0 -0.7373 -0.6758 -0.0178
1 1 484 1 53 0.1166 0.3132 0.1966 0 -1.9716 -2.2501 -0.1285
1 1 64 1 106 0.056 0.2486 0.1926 0 -1.8333 -0.7709 0.9999
1 1 97 1 477 0.1057 0.2516 0.1459 0 1.7131 -2.3705 -0.3806
1 1 119 1 148 0.0332 0.1760 0.1428 0 -1.0690 0.0172 -0.3653
1 1 328 1 493 0.133 0.2825 0.1495 0 2.6297 -1.2025 -0.4528
1 1 300 1 121 0.1444 0.3057 0.1613 0 -1.4048 -2.4167 -0.6061
1 1 148 1 120 0.0915 0.1618 0.0703 0 -1.4259 0.4853 0.8696
1 1 69 1 21 0.1422 0.3072 0.1650 0 -1.7295 2.1032 0.6909
1 1 268 1 205 0.0359 0.2128 0.1769 0 -0.7591 -1.6176 -0.9770
1 1 180 1 19 0.1275 0.2597 0.1322 0 -2.5488 0.6408 -0.7781
1 1 169 1 135 0.0943 0.2027 0.1084 0 -1.2734 0.1981 0.7029
1 1 388 1 291 0.0543 0.1174 0.0632 0 0.8795 0.4824 -0.0794

The plotModelDiagnostics() function can be called on the scoring object as well. Shown below is a comparison of the Mean Absolute Deviation plot for the train data and the test data.

plotModelDiagnostics(hvt.score)

9.3.1 Quantization Error Comparison (Train vs. Test)

The table below shows the cell(s) containing anomalous test data points. A data point is flagged as anomalous when its quantization error is greater than that of its assigned centroid, based on the chosen error metric. A comparison between the scored/test and fitted quantization errors of the cell(s) is provided for further insight.

Number of test data points: 2400 | Number of anomalous data points: 0 | Percentage of anomalous data points: 0.00%
Mean QE for fitted data: 0.075 | Mean QE for test data: 0.0753 | Difference in QE between fitted and test data: 3e-04

plotQuantErrorHistogram(hvt.results,hvt.score)

9.3.2 Anomalous Observations

The anomalous observations are shown below in the datatable.

NOTE: Since no anomalies were found in the torus dataset that we use, the table below is empty.

QECompareDf <- hvt.score$QECompareDf %>% filter(anomalyFlag == 1)
displayTable(QECompareDf)
Segment.Level Segment.Parent Segment.Child anomalyFlag n Fitted.Quant.Error Scored.Quant.Error Quant.Error.Diff Quant.Error.Diff (%)

10. Applications

  1. Pricing Segmentation - The package can be used to discover groups of similar customers based on the customer spend pattern and understand price sensitivity of customers

  2. Market Segmentation - The package can be helpful in market segmentation where we have to identify micro and macro segments. The method used in this package can do both kinds of segmentation in one go

  3. Anomaly Detection - This method can help us categorize system behaviour over time and find anomalies when there are changes in the system, e.g. finding fraudulent claims in healthcare insurance

  4. The package can help us understand the underlying structure of the data. Suppose we want to analyze a curved surface such as sphere or vase, we can approximate it by a lot of small low-order polygons in the form of tessellations using this package

  5. In biology, Voronoi diagrams are used to model a number of different biological structures, including cells and bone microarchitecture

  6. Using the base idea of Systems Dynamics, these diagrams can also be used to depict customer state changes over a period of time

11. References

  1. Topology Preserving Maps

  2. Vector Quantization

  3. K-means

  4. Sammon’s Projection

  5. Voronoi Tessellations