1. Abstract

The HVT package is a collection of R functions to facilitate building topology preserving maps for rich multivariate data analysis, see Figure 1 as an example of a 3D torus map generated from the package. Tending towards a big data preponderance, a large number of rows. A collection of R functions for this typical workflow is organized below:

Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective.
Data Projection: Dimension projection of the compressed cells to 1D,2D or Interactive surface plot with the Sammons Non-linear Algorithm. This step creates topology preserving map (also called embeddings) coordinates into the desired output dimension.
Tessellation: Create cells required for object visualization using the Voronoi Tessellation method, package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map useful for semi-supervised tasks.
Scoring: Scoring new data sets or test data and recording their assignment using the map objects from the above steps, in a sequence of maps if required.
Temporal Analysis and Visualization: Collection of functions that analyzes time series data for its underlying patterns, calculation of transitioning probabilities and the visualizations for the flow of data over time.
Dynamic Forecasting: Simulate future states of dynamic systems using Monte Carlo simulations of Markov Chain (MSM), enabling ex-ante predictions for time-series data.

2. Data Compression

This package can perform vector quantization using the following algorithms -

Hierarchical Vector Quantization using \(k-means\)
Hierarchical Vector Quantization using \(k-medoids\)

2.1 Using k-means

The k-means algorithm randomly selects k data points as initial means
k clusters are formed by assigning each data point to its closest cluster mean using the Euclidean distance
Virtual means for each cluster are calculated by using all datapoints contained in a cluster

The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(n).

2.2 Using k-medoids

The k-medoids algorithm randomly selects k data points as initial means out of the n data points as the medoids.
k clusters are formed by assigning each data point to its closest medoid by using any common distance metric methods.
Virtual means for each cluster are calculated by using all datapoints contained in a cluster

The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(k * (n-k)^2).

2.3 Hierarchical Vector Quantization

The algorithm divides the dataset recursively into cells using \(k-means\) or \(k-medoids\) algorithm. The maximum number of subsets are decided by setting \(n-cells\) to, say five, in order to divide the dataset into maximum of five subsets. These five subsets are further divided into five subsets(or less), resulting in a total of twenty five (5*5) subsets. The recursion terminates when the cells either contain less than three data point or a stop criterion is reached. In this case, the stop criterion is set to when the cell error exceeds the quantization threshold.

The steps for this method are as follows :

Select k(number of cells), depth and quantization error threshold
Perform quantization (using \(k-means\) or \(k-medoids\)) on the input dataset
Calculate quantization error for each of the k cells
Compare the quantization error for each cell to quantization error threshold
Repeat steps 2 to 4 for each of the k cells whose quantization error is above threshold until stop criterion is reached.

The stop criterion is when the quantization error of a cell satisfies one of the below conditions

reaches below quantization error threshold
there are less than three data point in the cell
the user specified depth has been attained

2.3.1 Quantization Error

Let us try to understand quantization error with an example.

Figure 1: The Voronoi tessellation for level 1 shown for the 5 cells with the points overlayed

An example of a 2 dimensional VQ is shown above.

In the above image, we can see 5 cells with each cell containing a certain number of points. The centroid for each cell is shown in blue. These centroids are also known as codewords since they represent all the points in that cell. The set of all codewords is called a codebook.

Now we want to calculate quantization error for each cell. For the sake of simplicity, let’s consider only one cell having centroid A and m data points \(F_i\) for calculating quantization error.

For each point, we calculate the distance between the point and the centroid.

\[ d = ||A - F_i||_{p} \]

In the above equation, p = 1 means L1_Norm distance whereas p = 2 means L2_Norm distance. In the package, the L1_Norm distance is chosen by default. The user can pass either L1_Norm, L2_Norm or a custom function to calculate the distance between two points in n dimensions.

\[QE = \max(||A-F_i||_{p})\]

where

\(A\) is the centroid of the cell
\(F_i\) represents a data point in the cell
\(p\) is the \(p\)-norm metric. Here \(p\) = 1 represents L1 Norm and \(p\) = 2 represents L2 Norm.

Now, we take the maximum calculated distance of all m points. This gives us the furthest distance of a point in the cell from the centroid, which we refer to as Quantization Error. If the Quantization Error is higher than the given threshold, the centroid/codevector is not a good representation for the points in the cell. Now we can perform further Vector Quantization on these points and repeat the above steps.

Please note that the user can select mean or max to calculate the Quantization Error. The custom function takes a vector of m value (where each value is a distance between point in n dimensions and centroids) and returns a single value which is the Quantization Error for the cell.

If we select mean as the error metric, the above Quantization Error equation will look like this :

\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]

where

\(A\) is the centroid of the cell
\(F_i\) represents a data point in the cell
\(m\) is the number of points in the cell
\(p\) is the \(p\)-norm metric. Here \(p\) = 1 represents L1 Norm and \(p\) = 2 represents L2 Norm.

3. Data Projection

Sammon’s projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.

Denoting the distance between \(i^{th}\) and \(j^{th}\) objects in the original space by \(d_{ij}^*\), and the distance between their projections by \(d_{ij}\). Sammon’s mapping aims to minimize the below error function, which is often referred to as Sammon’s stress or Sammon’s error

\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]

The minimization of this can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations need to be experimentally determined and convergent solutions are not always guaranteed. Many implementations prefer to use the first Principal Components as a starting configuration.

3.1 Tessellations

A Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand and for each seed, there will be a corresponding region consisting of all points within proximity of that seed. These regions are called Voronoi cells. It is complementary to Delaunay triangulation.

Tessellate: Constructing Voronoi Tesselations

In this package, we use sammons from the package MASS to project higher dimensional data to a 2D space. The function hvq called from the trainHVT function returns hierarchical quantized data which will be the input for construction of the tessellations. The data is then represented in 2D coordinates and the tessellations are plotted using these coordinates as centroids. We use the package deldir for this purpose. The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, transformation is performed on the 2D coordinates to get all the points within its parent tile. Tessellations are plotted using these transformed points as centroids. The lines in the tessellations are chopped in places so that they do not protrude outside the parent polygon. This is done for all the subsequent levels.

4. Notebook Requirements

This chunk verifies the installation of all the necessary packages to successfully run this vignette, if not, installs them and attach all the packages in the session environment.

list.of.packages <- c("dplyr","HVT", "kableExtra", "geozoo", "plotly", "purrr", "DT")

new.packages <-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
  install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos='https://cloud.r-project.org/')

# Loading the required libraries
invisible(lapply(list.of.packages, library, character.only = TRUE))

5. Data Importing

5.1 Import Dataset from Local

The user can provide an absolute or relative path in the cell below to access the data from his/her computer. User can set import_data_from_local variable to TRUE to upload dataset from local.
Note: For this notebook import_data_from_local has been set to FALSE as we are simulating a dataset in next section.

import_data_from_local = FALSE # expects logical input

  file_name <- " " #enter the name of the local file
  file_path <- " " #enter the path of the local file

if(import_data_from_local){
  file_load <- paste0(file_path, file_name)
  dataset_updated <- as.data.frame(fread(file_load))
  if(nrow(dataset_updated) > 0){
    paste0("File ", file_name, " having ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s)",  " imported successfully. ") %>% cat("\n")
    dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. Below table showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated %>% head(10) %>%as.data.frame() %>%DT::datatable(options = options, rownames = TRUE)
  }
  
}

5.2 Simulate Dataset

In this section, we will use a simulated dataset. If you are not using this option set simulate_dataset to FALSE. Given below is a simulated dataset called torus that contains 12000 observations and 3 features.

Let us see how to generate data for torus. We are using a library geozoo for this purpose. Geo Zoo (stands for Geometric Zoo) is a compilation of geometric objects ranging from 3 to 10 dimensions. Geo Zoo contains regular or well-known objects, eg cube and sphere, and some abstract objects, e.g. Boy’s surface, Torus and Hyper-Torus.

Here, we load the data and store into a variable dataset_updated.

simulate_dataset= TRUE

if(simulate_dataset == TRUE){
  
set.seed(257)
##torus data generation
torus <- geozoo::torus(p = 3,n = 12000)
dataset_updated <- data.frame(torus$points)
colnames(dataset_updated) <- c("x","y","z")


  if(nrow(dataset_updated) > 0){
    paste0( "Dataset having ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s)",  "simulated successfully. ") %>% cat("\n")  
    dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4) 
    paste0("Code chunk executed successfully. The table below is showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated %>% head(10) %>%as.data.frame() %>%displayTable()
  }
}

## Dataset having 12000 row(s) and 3 column(s)simulated successfully.  
## Code chunk executed successfully. The table below is showing first 10 row(s) of the dataset.

x	y	z
-1.0020	-2.3335	-0.8420
1.1021	-2.7447	-0.2878
-1.0033	1.2656	-0.9229
1.3204	0.6205	-0.8410
1.2998	-1.2470	-0.9801
-1.9606	2.0755	-0.5184
0.2807	-0.9724	0.1548
-1.5540	2.0564	-0.8164
-2.4653	1.6586	-0.2377
-2.3234	1.6933	-0.4841

6. Data Understanding

6.1 Quick Peek of the Data

Structure of Data

In the below section we can see the structure of the data.

dataset_updated %>% str()

## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.2 Deleting Irrelevant Columns

The cell below will allow user to drop irrelevant column.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# Add column names which you want to remove
want_to_delete_column <- "no"

    del_col<-c(" `column_name` ")  

if(want_to_delete_column == "yes"){
   dataset_updated <-  dataset_updated[ , !(names(dataset_updated) %in% del_col)]
  print("Code chunk executed successfully. Overview of data types after removed selected columns")
  str( dataset_updated)
}else{
  paste0("No Columns removed. Please enter column name if you want to remove that column") %>% cat("\n")
}

## No Columns removed. Please enter column name if you want to remove that column

6.3 Formatting and Renaming Columns

The code below contains a user defined function to rename or reformat any column that the user chooses.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# convert the column names to lower case
colnames( dataset_updated) <- colnames( dataset_updated) %>% casefold()

## rename column ?
want_to_rename_column <- "no" ## type "yes" if you want to rename a column

## renaming a column of a dataset 
rename_col_name <- " 'column_name` " ## use small letters
rename_col_name_to <- " `new_name` "

if(want_to_rename_column == "yes"){
  names( dataset_updated)[names( dataset_updated) == rename_col_name] <- rename_col_name_to
}

# remove space, comma, dot from column names
spaceless <- function(x) {colnames(x) <- gsub(pattern = "[^[:alnum:]]+",
                             replacement = ".",
                             names(x));x}
 dataset_updated <- spaceless( dataset_updated)

## below is the dataset summary
paste0("Successfully converted the column names to lower case and check the renamed column name if you changed") %>% cat("\n")

## Successfully converted the column names to lower case and check the renamed column name if you changed

str( dataset_updated) ## showing summary for updated

## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.4 Changing Data Type of Columns

The section allows the user to change the data type of columns of his/her choice.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# If you want to change column type, change a below variable value to "yes"
want_to_change_column_type <- "no"

# you can change column type into numeric or character only
change_column_to_type <- "character" ## numeric

if(want_to_change_column_type == "yes" && change_column_to_type == "character"){
########################################################################################
################################## User Input Needed ###################################
########################################################################################
  select_columns <- c("panel_var") ###### Add column names you want to change here #####
   dataset_updated[select_columns]<- sapply( dataset_updated[select_columns],as.character)
  paste0("Code chunk executed successfully. Datatype of selected column(s) have been changed into numerical.")
  #str( dataset_updated)
}else if(want_to_change_column_type == "yes" && change_column_to_type == "numeric"){
  select_columns <- c('gearbox_oil_temperature')
   dataset_updated[select_columns]<- sapply( dataset_updated[select_columns],as.numeric)
  paste0("Code chunk executed successfully. Datatype of selected column(s) have been changed into categorical.")
  #str( dataset_updated)
}else{
  paste0("Datatype of columns have not been changed.") %>% cat("\n")
}

## Datatype of columns have not been changed.

dataset_updated <- do.call(data.frame, dataset_updated)
str(dataset_updated)

## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.5 Checking and Removing Duplicates

Presence of duplicate observations can be misleading, this sections helps get rid of such rows in the dataset.

want_to_remove_duplicates <- "yes"  ## type "no" for choosing to not remove duplicates

## removing duplicate observation if present in the dataset
if(want_to_remove_duplicates == "yes"){
  
   dataset_updated <-  dataset_updated %>% unique()
  paste0("Code chunk executed successfully, duplicates if present successfully removed. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
  cat("\n")
  str( dataset_updated) ## showing summary for updated dataset
} else{
  paste0("Code chunk executed successfully, NO duplicates were removed") %>% print()
}

## [1] "Code chunk executed successfully, duplicates if present successfully removed. Updated dataset has 12000 row(s) and 3 column(s)"
## 
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

6.6 List of Numerical and Categorical Column Names

# Return the column type 
CheckColumnType <- function(dataVector) {
  #Check if the column type is "numeric" or "character" & decide type accordingly
  if (class(dataVector) == "integer" || class(dataVector) == "numeric") {
    columnType <- "numeric"
  } else { columnType <- "character" }
  #Return the result
  return(columnType)
}
### Loading the list of numeric columns in variable
numeric_cols <<- colnames( dataset_updated)[unlist(sapply( dataset_updated, 
                                                       FUN = function(x){ CheckColumnType(x) == "numeric"}))]

### Loading the list of categorical columns in variable
cat_cols <- colnames( dataset_updated)[unlist(sapply( dataset_updated, 
                                                   FUN = function(x){ 
                                                     CheckColumnType(x) == "character"|| CheckColumnType(x) == "factor"}))]

### Removing Date Column from the list of categorical column
paste0("Code chunk executed successfully, list of numeric and categorical variables created.") %>% cat()

## Code chunk executed successfully, list of numeric and categorical variables created.

paste0("Numerical Column(s): \n Count : ", length(numeric_cols), "\n") %>% cat()

## Numerical Column(s): 
##  Count : 3

paste0(numeric_cols) %>% print()

## [1] "x" "y" "z"

paste0("Categorical Column(s): \n Count : ", length(cat_cols), "\n") %>% cat()

## Categorical Column(s): 
##  Count : 0

paste0(cat_cols) %>% print()

## character(0)

6.7 Filtering Dataset for Analysis

In this section, the dataset can be filtered for required row(s) for further analysis.

want_to_filter_dataset <- "no" ## type "yes" in case you want to filter
filter_col <- " "  ## Enter Column name to filter
filter_val <- " "  ## Enter Value to exclude for the column selected

if(want_to_filter_dataset == "yes"){
   dataset_updated <- filter_at( dataset_updated
                              , vars(contains(filter_col))
                              , all_vars(. != filter_val))
  
  paste0("Code chunk executed successfully, dataset filtered successfully on required columns. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
  cat("\n")
  str( dataset_updated) ## showing summary for updated dataset
  
} else{
  paste0("Code chunk executed successfully, entire dataset is available for analysis.") %>% print()
}

## [1] "Code chunk executed successfully, entire dataset is available for analysis."

6.8 Missing Value Analysis

Missing values in the training data can lead to a biased model because we have not analyzed the behavior and relationship of those values with other variables correctly. It can lead to a wrong calculation or classification. Missing values can be of 3 types:

Missing Completely At Random (MCAR): When missing data are MCAR, the presence/absence of data is completely independent of observable variables and parameters of interest. This is a case when the probability of missing variables is the same for all observations. For example, respondents of the data collection process decide that they will declare they’re earning after tossing a fair coin. If a head occurs, the respondent declares his / her earnings & vice versa.
Missing At Random (MAR): When missing data is not random but can be related to an observed variable where there is complete information. This kind of missing data can induce a bias in the analysis, especially if it unbalances the data because of many missing values in a certain category. For example, we are collecting data for age and female has higher missing value compare to male.
Missing Depending on Unobserved Predictors: This is a case when the missing values are not random and are related to the unobserved input variable.

Missing Value on Entire dataset

na_total <- sum(is.na( dataset_updated))/prod(dim( dataset_updated))
if(na_total == 0){
  paste0("In the uploaded dataset, there is no missing value") %>% cat("\n")
}else{
  na_percentage <- paste0(sprintf(na_total*100, fmt = '%#.2f'),"%")
  paste0("Percentage of missing value in entire dataset is ",na_percentage) %>% cat("\n")
}

## In the uploaded dataset, there is no missing value

Missing Value on Column-level

The following code is to visualize the missing values (if any) using bar chart.

gg_miss_upset function are using to visualize the patterns of missingness, or rather the combinations of missingness across cases.

This function gives us(if any missing value present):

how many variables have missing values
which variable has the most missing values
And give us the variables which have missing values together

# Below code gives you missing value in each column
paste0("Number of missing value in each column") %>% cat("\n")

## Number of missing value in each column

print(sapply( dataset_updated, function(x) sum(is.na(x))))

## x y z 
## 0 0 0

missing_col_names <- names(which(sapply( dataset_updated, anyNA)))

total_na <- sum(is.na( dataset_updated))
# visualize the missing values (if any) using bar chart
if(total_na > 0 && length(missing_col_names) > 1){
  paste0("Code chunk executed successfully. Visualizing the missing values using bar chart") %>% cat("\n")
  gg_miss_upset( dataset_updated,
  nsets = 10,
  nintersects = NA)
}else if(total_na > 0){
   dataset_updated %>%
  DataExplorer::plot_missing() 
}else{
  paste("Code chunk executed successfully. No missing value exist.") %>% cat("\n")
}

## Code chunk executed successfully. No missing value exist.

Missing Value Treatment

In this section user can make decisions, how to tackle missing values in dataset. Both column(s) and row(s) can be removed in the following dataset based on the user choose to do so.

Drop Column(s) with Missing Values

The below code accepts user input and deletes the specified column.

########################################################################################
################################## User Input Needed ###################################
########################################################################################

# OR do you want to drop column specific column
drop_cloumn_name_na <- "yes" ## type "yes" to drop column(s)
# write column name that you want to drop
drop_column_name <- c(" ") #enter column name
if(drop_cloumn_name_na == "yes"){
  names_df=names( dataset_updated) %in% drop_column_name
  dataset_updated <-  dataset_updated[ , which(!names( dataset_updated) %in% drop_column_name)]
  paste0("Code chunk executed, selected column(s) dropped successfully.") %>% print()
  cat("\n")
  str( dataset_updated)
} else {
  paste0("Code chunk executed, missing value not removed (if any).") %>% cat("\n")
  cat("\n")
}

## [1] "Code chunk executed, selected column(s) dropped successfully."
## 
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -1 1.1 -1 1.32 1.3 ...
##  $ y: num  -2.333 -2.745 1.266 0.621 -1.247 ...
##  $ z: num  -0.842 -0.288 -0.923 -0.841 -0.98 ...

Drop Row(s) with Missing Values

The below code accepts user input and deletes rows.

# Do you want to drop row(s) containing "NA"
drop_row <- "no" ## type "yes" to delete missing value observations
if(drop_row == "yes"){
   dataset_updated <-  dataset_updated %>% na.omit()
   paste0("Code chunk executed, missing values successfully identified and removed. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
  cat("\n")
} else{
  paste0("Code chunk executed, missing value(s) not removed (if any).") %>% cat("\n")
  cat("\n")
}

## Code chunk executed, missing value(s) not removed (if any).

6.8.1 One-Hot Encoding

This technique bins all categorical values as either 1 or 0. It is used for categorical variables with 2 classes. This is done because classification models can only handle features that have numeric values.

Given below is the length of unique values in each categorical column

cat_cols <-
  colnames(dataset_updated)[unlist(sapply(
    dataset_updated,
    FUN = function(x) {
      CheckColumnType(x) == "character" ||
        CheckColumnType(x) == "factor"
    }
  ))]

apply(dataset_updated[cat_cols], 2, function(x) {
  length(unique(x))
})

## integer(0)

Selecting categorical columns with smaller unique values for dummification

########################################################################################
################################## User Input Needed ###################################
########################################################################################
# Do you want to dummify the categorical variables?

dummify_cat <- FALSE ## TRUE,FALSE

# Select the columns on which dummification is to be performed
dum_cols <- c(" "," ") #enter column name in smalls

## [1] "One-Hot Encoding was not performed on dataset."

6.8.2 Check for Singularity

# Check data for singularity
singular_cols <- sapply(dataset_updated,function(x) length(unique(x))) %>%  # convert to dataframe
  data.frame(Unique_n = .) %>% dplyr::filter(Unique_n == 1) %>% 
  rownames() %>% data.frame(Constant_Variables = .)

if(nrow(singular_cols) != 0) {                              
  singular_cols  %>% DT::datatable()
} else {
  paste("There are no singular columns in the dataset") %>% htmltools::HTML()
}

There are no singular columns in the dataset

# Display variance of columns
data <- dataset_updated %>% dplyr::summarise_if(is.numeric, var) %>% t() %>% 
  data.frame() %>% round(3) #%>% DT::datatable(colnames = "Variance")

colnames(data) <- c("Variance")
displayTable(data)

	Variance
x	2.239
y	2.214
z	0.499

6.8.3 Selecting only Numeric Cols after Dummification

numeric_cols=as.vector(sapply(dataset_updated, is.numeric))
dataset_updated=dataset_updated[,numeric_cols]
colnames(dataset_updated)

## [1] "x" "y" "z"

6.9 Final Dataset Summary

All further operations will be performed on the following dataset.

nums <- colnames(dataset_updated)[unlist(lapply(dataset_updated, is.numeric))]
cat(paste0("Final data frame contains ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s).","Code chunk executed. Below table showing first 10 row(s) of the dataset."))

## Final data frame contains 12000 row(s) and 3 column(s).Code chunk executed. Below table showing first 10 row(s) of the dataset.

dataset_updated <-  dataset_updated %>% mutate_if(is.numeric, round, digits = 4)

displayTable(dataset_updated[1:10,])

x	y	z
-1.0020	-2.3335	-0.8420
1.1021	-2.7447	-0.2878
-1.0033	1.2656	-0.9229
1.3204	0.6205	-0.8410
1.2998	-1.2470	-0.9801
-1.9606	2.0755	-0.5184
0.2807	-0.9724	0.1548
-1.5540	2.0564	-0.8164
-2.4653	1.6586	-0.2377
-2.3234	1.6933	-0.4841

7. Data distribution

This section displays four objects.

Variable Histograms: The histogram distribution of all the features in the dataset.

Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.

Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.

Summary EDA: The table provides descriptive statistics for all the features in the dataset.

variable: The features/columns of the dataset
min: Minimum value of that feature/column
1st Quartile: The value that splits the lower 25% of the data when arranged in ascending order
median: Middle value in the ascendingly ordered dataset
mean: Sum of all values in the dataset divided by the total number of values
sd: Measure of the dispersion of dataset relative to its mean.
3rd Quartile: The value that splits the lower 75% of the data when arranged in ascending order
max: Maximum value of that feature/column
hist: The basic barchart of the data distribution of a feature/column
n_row: Number of rows for that feature/column
n_missing: Number of missing values/NAs for that feature/column

It uses an inbuilt function called edaPlots to display the above-mentioned four objects.

Summary Table

edaPlots(dataset_updated)

Histograms

edaPlots(dataset_updated, output_type = "histogram")

Box Plots

edaPlots(dataset_updated, output_type = "boxplot")

Correlation Plot

edaPlots(dataset_updated, output_type = "correlation")

7.1 Train - Test Split

Let us split the data into train and test. We will randomly select 80% of the data as train and remaining as test.

## 80% of the sample size
smp_size <- floor(0.80 * nrow(dataset_updated))

## set the seed to make your partition reproducible
set.seed(279)
train_ind <- sample(seq_len(nrow(dataset_updated)), size = smp_size)

dataset_updated_train <- dataset_updated[train_ind, ]
dataset_updated_test <- dataset_updated[-train_ind, ]

The train data contains 9600 rows and 3 columns. The test data contains 2400 rows and 3 columns.

7.1.1 Train Distribution

Summary Table

edaPlots(dataset_updated_train)

Histograms

edaPlots(dataset_updated_train, output_type = "histogram")

Box Plots

edaPlots(dataset_updated_train, output_type = "boxplot")

Correlation Plot

edaPlots(dataset_updated_train, output_type = "correlation")

7.1.2 Test Distribution

Summary Table

edaPlots(dataset_updated_test)

Histograms

edaPlots(dataset_updated_test, output_type = "histogram")

Box Plots

edaPlots(dataset_updated_test, output_type = "boxplot")

Correlation Plot

edaPlots(dataset_updated_test, output_type = "correlation")

8. Model Training

In HVT, we use sammons from the package MASS to project higher dimensional data to a 2D space. The function hvq called from the trainHVT function returns hierarchical quantized data which will be the input for construction of the tessellations. The data is then represented in 2D coordinates and the tessellations are plotted using these coordinates as centroids. We use the package deldir for this purpose. The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, transformation is performed on the 2D coordinates to get all the points within its parent tile. Tessellations are plotted using these transformed points as centroids. The lines in the tessellations are chopped in places so that they do not protrude outside the parent polygon. This is done for all the subsequent levels.

Let us try to understand the trainHVT function first.

trainHVT(
  dataset,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  normalize = TRUE,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  projection.scale,
  dim_reduction_method = c("sammon" , "tsne" , "umap")
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio,
  tsne_perplexity,tsne_theta,tsne_verbose,
  tsne_eta,tsne_max_iter,
  umap_n_neighbors,umap_min_dist
)

Each of the parameters of trainHVT function have been explained below:

dataset - A dataframe, with numeric columns (features) that will be used for training the model.
min_compression_perc - An integer, indicating the minimum compression percentage to be achieved for the dataset. It indicates the desired level of reduction in dataset size compared to its original size.
n_cells - An integer, indicating the number of cells per hierarchy (level). This parameter determines the granularity or level of detail in the hierarchical vector quantization. Minimum n_cells per hierarchy is 3.
depth - An integer, indicating the number of levels. A depth of 1 means no hierarchy (single level), while higher values indicate multiple levels (hierarchy).
quant.err - A number indicating the quantization error threshold. A cell will only breakdown into further cells if the quantization error of the cell is above the defined quantization error threshold.
normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, scales the values of all features to have a mean of 0 and a standard deviation of 1 (Z-score)
distance_metric - The distance metric can be L1_Norm(Manhattan) or L2_Norm(Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n dimensional point and centroid.
error_metric - The error metric can be mean or max. max is selected by default. max will return the max of m values and mean will take mean of m values where each value is a distance between a point and centroid of the cell.
quant_method - The quantization method can be kmeans or kmedoids. Kmeans uses means (centroids) as cluster centers while Kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.
dim_reduction_method - The dimensionality reduction method to be chosen. options are ‘tsne’ , ‘umap’ & ‘sammon’. Default is ‘sammon’.
scale_summary - A list with user defined mean and standard deviation values for all the features in the dataset. Pass the scale summary when normalize is set to FALSE.
diagnose - A logical value indicating whether user wants to perform diagnostics on the model. Default value is FALSE.
hvt_validation - A logical value indicating whether user wants to holdout a validation set and find mean absolute deviation of the validation points from the centroid. Default value is FALSE.
train_validation_split_ratio - A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8
tsne_perplexity - A numeric, balances the attention t-SNE gives to local and global aspects of the data. Lower values focus more on local structure, while higher values consider more global structure. It is recommended to be between 5 and 50. Default value is 30.
tsne_theta - A numeric, speed/accuracy trade-off parameter for Barnes-Hut approximation. If set to 0, exact t-SNE is performed, which is slower. If set to greater than 0, an approximation is used, which speeds up the process but may reduce accuracy. Default value is 0.5
tsne_eta (learning_rate) - A numeric, learning rate for t-SNE optimization.Determines the step size during optimization. If too low, the algorithm might get stuck in local minima; if too high, the solution may become unstable. Default value is 200.
tsne_max_iter - An integer, maximum number of iterations. Number of iterations for the optimization process. More iterations can improve results but increase computation time. Default value is 1000.
umap_n_neighbors - An integer, the size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation, controls the balance between local and global structure in the data, smaller values focus on local structure, while larger values capture more global structures. Default value is 15.
umap_min_dist - A numeric, the minimum distance between points in the embedded space, controls how tightly UMAP packs points together, lower values result in a more clustered embedding. Default value is 0.1

The output of trainHVT function (list of 7 elements) have been explained below with an image attached for clear understanding.

NOTE: Here the attached image is the snapshot of output list generated from model training which can be referred later in this section

Figure 3: The Output list generated by trainHVT function.

The ‘1st element’ is a list containing information related to plotting tessellations. This information might include coordinates, boundaries, or other details necessary for visualizing the tessellations
The ‘2nd element’ is a list containing information related to Sammon’s projection coordinates of the data points in the reduced-dimensional space.
The ‘3rd element’ is a list containing detailed information about the hierarchical vector quantized data along with a summary section containing no of points, Quantization Error and the centroids for each cell for 2D.
The ‘4th element’ is a list that contains all the diagnostics information of the model when diagnose is set to TRUE. Otherwise NA.
The ‘5th element’ is a list that contains all the information required to generates a Mean Absolute Deviation (MAD) plot, if hvt_validation is set to TRUE. Otherwise NA
The ‘6th element’ is a list containing detailed information about the hierarchical vector quantized data along with a summary section containing no of points, Quantization Error and the centroids for each cell which is the output of hvq.
The ‘7th element’ (model info) is a list that contains model generated time, input parameters passed to the model, validation results and the dimensionality reduction evaluation metrics table.

More information on building an HVT model at different levels and visualizing the output can be found here.

In the section below, we build a Level 1 HVT model. The number of cells (n_cells) is set to 500.

hvt.results <-trainHVT(dataset_updated_train,
                          n_cells = 500,
                          depth = 1,
                          quant.err = 0.1,
                          normalize = FALSE,
                          distance_metric = "L2_Norm",
                          error_metric = "max",
                          quant_method = "kmeans",
                          diagnose = TRUE,
                          hvt_validation = TRUE,
                          train_validation_split_ratio=0.8,
                          dim_reduction_method = "sammon")

summary(hvt.results)

segmentLevel	noOfCells	noOfCellsBelowQuantizationError	percentOfCellsBelowQuantizationErrorThreshold	parameters
1	500	450	0.9	n_cells: 500 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As seen in the above table, 90% of the cells have a quantization error below the threshold.

Let’s have a closer look at Quant.Error of the cells. Here we are showing just top 20 rows for the sake of brevity.

displayTable(hvt.results[[3]][['summary']])

Segment.Level	Segment.Parent	Segment.Child	n	Cell.ID	Quant.Error	x	y	z
1	1	1	16	1	0.0850	-2.5043	1.6314	0.0128
1	1	2	20	57	0.1056	-0.8851	2.6890	0.5198
1	1	3	19	215	0.0757	0.2462	1.4518	-0.8469
1	1	4	15	317	0.0462	0.9971	0.1269	-0.1018
1	1	5	18	469	0.0955	2.8862	0.2697	-0.4118
1	1	6	12	156	0.0664	-1.2610	-0.9961	0.9169
1	1	7	7	481	0.0605	2.0628	-1.6366	0.7702
1	1	8	19	173	0.0592	-1.0148	-0.4740	0.4757
1	1	9	11	337	0.0474	1.2020	0.2838	-0.6422
1	1	10	19	147	0.0637	-0.8002	0.9398	-0.6382
1	1	11	17	119	0.0602	-1.2857	0.3566	-0.7428
1	1	12	13	235	0.0549	-0.3764	-0.9945	0.3487
1	1	13	13	345	0.0499	0.7779	-0.8996	0.5824
1	1	14	14	322	0.0490	0.7621	-0.7313	-0.3297
1	1	15	19	38	0.1071	-2.5437	-1.5681	0.0274
1	1	16	14	202	0.0869	0.3834	2.2422	-0.9543
1	1	17	13	180	0.0515	-0.9200	-0.8070	-0.6288
1	1	18	13	200	0.0483	-0.7782	-0.7013	0.3026
1	1	19	11	172	0.0390	-0.9769	-0.4964	-0.4267
1	1	20	7	44	0.0488	-2.2419	0.6013	-0.9435

Let’s take a look at the 2D HVT plot.

plotHVT(hvt.results,plot.type = '2Dhvt')

8.1 Model Diagnostics and Validation

8.1.1 Diagnostics

HVT model diagnostics are used to evaluate the model fit and investigate the proximity between centroids. The distribution of proximity value can also be used to decided an optimum Mean Absolute Deviation threshold for HVT model based scoring.

The diagnosis can be enabled by setting the diagnose parameter to TRUE while building the HVT Model.

8.1.2 Validation

Model validation is used to measure the fit/quality of the model. Measuring model fit is the key to iteratively improving the models. The relevant measure of model quality here is percentage of anomalous points. The percentage anomalies ideally should match with level of compression achieved during modeling, where PercentageAnomalies \(\approx\) 1-ModelCompression.

Model Validation can be enabled by setting the hvt_validation parameter to TRUE and setting the train_validation_split_ratio value while training the HVT Model.

The model trained above has a train_validation_split_ratio of 0.8, i.e 80% of the Train dataset is used for training the model while the remaining 20% will be used for validation

Note: User can skip this step, if the number of observations in train data is low.

8.2 Diagnostic Plots

The basic tool for examining the model fit is proximity plots and distribution of observations across centroids.

The proximity between object can be measured as distance matrix. The distances between the objects is calculated using Manhattan or Euclidean Distance and put into a matrix form. In the next step we find the minimum value for each row, excluding the diagonal values, as the diagonal elements of distance matrix are zero representing distance from an object to itself. This minimum distance value gives the proximity(distance to nearest neighbour) of other object in the datatable.

`plotModelDiagnostics() function can be used to print diagnostic plots for HVT model or HVT scoring.

The plotModelDiagnostics() function for HVT Model provides 5 diagnostic plots which are as follows:

Mean Absolute Deviation Plot: Calibration: HVT Model | Train Data
Minimum Intra-DataPoint Distance Plot: Train Data
Minimum Intra-Centroid Distance Plot: HVT Model | Train Data
Distribution of Number of Observations in Cells: HVT Model | Train Data
Singletons Pie Chart: HVT Model | Train Data

Let’s have look at the function plotModelDiagnostics which we will use to print the diagnostic plots.

plotModelDiagnostics(hvt.results)

8.2.1 Mean Absolute Deviation based Calibration: HVT Model | Train Data

The first diagnostic plot is a calibration plot for HVT Model run on train data. This plot is obtained by executing the train data itself on the HVT model. It is a comparison of Percentage_Anomalies with at varying Mean Absolute Deviation values. It can be seen from the plot that at 0.1 Mean Absolute Deviation value the percentage anomalies drop below one percent.

p3=hvt.results[[4]]$mad_plot_train+ggtitle("Mean Absolute Deviation Plot: Calibration: HVT Model | Train Data")
p3

8.2.2 Minimum Intra-DataPoint Distance Plot: Train Data

The second diagnostics plot helps us in finding out how the points in the training data are distributed. Shown below is a histogram of minimum distances with nearest neighbour for each observation in train data.

p1=hvt.results[[4]]$datapoint_plot+ggtitle("Minimum Intra-DataPoint Distance Plot: Train Data")
p1

As seen in the plot above the mean value is 0.02.

8.2.3 Minimum Intra-Centroid Distance Plot: HVT Model | Train Data

The third diagnostics plot helps us in finding out how the centroids in the HVT Model are distributed. Shown below is a histogram of minimum distances with nearest neighbour for each centroid in HVT Model

p2=hvt.results[[4]]$cent_plot+ggtitle("Minimum Intra-Centroid Distance Plot: HVT Model | Train Data")
p2

As seen in the plot above the mean value is 0.8. This value can be selected as the Mean Absolute Deviation Threshold for scoring data using scoreHVT function

8.2.4 Distribution of Number of Observations in HVT Cells

The fourth diagnostics plot finds out distribution of number of observations in each centroid. Shown below is a histogram to depict the same.

p4=hvt.results[[4]]$number_plot+ggtitle("Distribution of Number of Observations in Cells: HVT Model | Train Data")
p4

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As shown in the plot above the mean number of records in each HVT cell is 15.

8.2.5 Singleton Count

The fifth diagnostics plot finds out number of Singleton centroids (Segments/Centroids with single observation.)

p5=hvt.results[[4]]$singleton_piechart
p5

Validation

The Mean Absolute Deviation Plot for Validation Data has been shown in above section. Alternatively to fetch it separately, we can use the following code:

m1=hvt.results[[5]][["mad_plot"]]+ggtitle("Mean Absolute Deviation Plot:Validation")
m1

As seen in the plot the mean absolute deviation for the validation data from the given dataset_updated_train is 0.11.

9. Scoring

Now once we have built the model, let us try to score using our test dataset to see which cell each point belongs to.

9.1 Scoring Algorithm

The Scoring algorithm recursively calculates the distance between each point in the test dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the test dataset :

Calculate the distance between the point and the centroid of all the cells in the first level.
Find the cell whose centroid has minimum distance to the point.
Check if the cell drills down further to form more cells.
If it doesn’t, return the path. Or else repeat steps 1 to 4 till we reach a level at which the cell doesn’t drill down further.

9.2 Load Test Data

The user can provide an absolute or relative path in the cell below to access the test data from his/her computer.

load_test_data=FALSE
if(load_test_data){
  file_name <- " " #enter the name of the local file for validation
  file_path <- " " #enter the path of the local file for validation
  file_load <- paste0(file_path, file_name)
  dataset_updated_test <- as.data.frame(fread(file_load))
  
  if(nrow(dataset_updated_test) > 0){

    paste0("File ", file_name, " having ", nrow(dataset_updated_test), " row(s) and ",
ncol(dataset_updated_test), " column(s)",  " imported successfully. ") %>% cat("\n")
dataset_updated_test <- dataset_updated_test %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. Below table showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated_test %>% head(10) %>%as.data.frame() %>%DT::datatable(options = options, rownames = TRUE)
  }
  
  colnames( dataset_updated_test) <- colnames( dataset_updated_test) %>% casefold()
 dataset_updated_test <- spaceless( dataset_updated_test)
}

9.2.1 Transformation of Categorical Features

In this section we will perform one hot encoding on test dataset, based on whether one hot encoding has been performed on train dataset.

if(dummify_cat){
dummified_cols_test <- dataset_updated_test %>% dplyr::select(dum_cols) %>%
  dummies::dummy.data.frame(dummy.classes = "ALL", sep = "_")

names(dummified_cols_test) <- gsub(pattern = "[^[:alnum:]]+",
                             replacement = ".",
                             names(dummified_cols_test))

columns_difference=setdiff(dummified_cols,names(dummified_cols_test))
dummified_cols_test[,columns_difference]=0

dataset_updated_test <- dataset_updated_test %>% cbind(dummified_cols_test) %>%     
# append encoded columns
dplyr::select(-dum_cols)
# remove the old categorical columns
dummified_cols_test %>% head(5) %>% DT::datatable(options = options)
}

9.2.2 Subsetting Test Dataset

In this section we will subset the test data based on numeric columns present in train data.

dataset_updated_test=dataset_updated_test %>% dplyr::select(nums)

Now once we have the test data ready, lets look at how the scoreHVT function looks like.

9.3 Scoring on Test Data

scoreHVT(dataset,
         hvt.results,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         distance_metric,
         error_metric,
         yVar,
         analysis.plots, 
         names.column)

The important parameters for the function scoreHVT are as below

dataset - A dataframe containing the test dataset. The dataframe should have all the variable(features) used for training.
hvt.results.model - A list obtained from the trainHVT function while performing hierarchical vector quantization on training data. This list provides an overview of the hierarchical vector quantized data, including diagnostics, tessellation details, Sammon’s projection coordinates, and model input information.
child.level - A number indicating the depth for which the heat map is to be plotted. Each depth represents a different level of clustering or partitioning of the data.
mad.threshold - A numeric value indicating the permissible Mean Absolute Deviation which is obtained from Minimum Intra centroid plot(when diagnose is set to TRUE in trainHVT). mad.threshold value is important since it is used in anomaly detection. Default value is 0.2 NOTE: for a given datapoint, when the quantization error is above mad.threshold it is denoted as anomaly else not.
line.width - A vector indicating the line widths of the tessellation boundaries for each layer. (Optional Parameter)
color.vec - A vector indicating the colors of the tessellations boundaries at each layer. (Optional Parameter)
normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, the data (testing dataset) is standardized by mean and sd of the training dataset referred from the trainHVT(). When set to FALSE, the data is used as such without any changes.
distance_metric - The distance metric can be L1_Norm(Manhattan) or L2_Norm(Euclidean). The metric is used when calculating distance between each datapoint(in test dataset) with the centroids obtained from results of trainHVT. Default is L1_Norm.
error_metric - The error metric can be mean or max. max will return the max of m values and mean will take mean of m values where each value is a distance between the datapoint and centroid of the cell. This helps in calculating the scored quantization error. Default value is max.
yVar - A character or a vector representing the name of the dependent variable(s)

The below given arguments are used only when character column can be mapped over the scored results. since torus doesn’t have a character column, we are not using them in this vignette.

analysis.plots - A logical value to indicate whether to include the insight plots which are useful in viewing the contents and clusters of cells. Default is FALSE.
names.column - The column of names of the datapoints which will be displayed as the contents of the cell in ‘scoredPlotly’. Default is NULL.

Here the mad_threshold has been selected as 0.8 which is based on Mean of Minimum Intra-Centroid Distance plot from above.

hvt.score <- scoreHVT(dataset_updated_test,
                      hvt.results,
                      child.level = 1,
                      mad.threshold = 0.8, 
                      line.width = c(0.6, 0.4, 0.2),
                      color.vec = c("navyblue", "slateblue", "lavender"),
                      distance_metric = "L1_Norm",
                      error_metric = "max")

summary(hvt.score)

Row_Number	Segment.Level	Segment.Parent	Segment.Child	n	Cell.ID	Quant.Error	centroidRadius	diff	x	y	z
1	1	1	153	1	182	0.0764	0.3141	0.2377	-1.0020	-2.3335	-0.8420
2	1	1	217	1	17	0.0633	0.2037	0.1404	-2.1770	1.5699	-0.7295
3	1	1	162	1	475	0.1043	0.1806	0.0763	2.0941	-1.8907	-0.5704
4	1	1	445	1	78	0.1248	0.3166	0.1918	-0.1980	2.9839	-0.1378
5	1	1	436	1	123	0.1272	0.2633	0.1361	-1.2495	-1.9487	-0.9491
6	1	1	269	1	413	0.1154	0.1992	0.0838	2.0634	0.0815	-0.9979
7	1	1	191	1	347	0.0521	0.1879	0.1358	1.5182	0.9935	-0.9826
8	1	1	172	1	242	0.0443	0.0816	0.0373	0.4102	0.9552	-0.2784
9	1	1	39	1	411	0.1215	0.2735	0.1520	1.5409	2.3249	0.6142
10	1	1	113	1	197	0.0838	0.1432	0.0594	-0.9558	-0.6764	0.5591
11	1	1	237	1	186	0.072	0.1856	0.1136	-0.0790	1.8156	-0.9832
12	1	1	204	1	273	0.0392	0.1506	0.1113	0.7280	0.6856	0.0014
13	1	1	59	1	3	0.1355	0.2460	0.1105	-2.8278	1.0012	0.0209
14	1	1	209	1	368	0.0661	0.2251	0.1590	0.3594	-1.8433	-0.9925
15	1	1	92	1	198	0.0623	0.2029	0.1405	-0.9926	-0.9557	0.7829
16	1	1	91	1	18	0.0686	0.1984	0.1298	-2.5097	1.0868	-0.6782
17	1	1	237	1	186	0.0526	0.1856	0.1329	-0.1015	1.7005	-0.9550
18	1	1	471	1	404	0.0618	0.2626	0.2008	1.7236	1.5863	0.9395
19	1	1	157	1	452	0.026	0.1888	0.1628	1.6227	-1.6186	0.9564
20	1	1	257	1	377	0.0682	0.1637	0.0955	1.3105	-0.4772	0.7960

The plotModelDiagnostics() function can be called for scoring object as well. Shown below is the comparison of Mean Absolute Deviation Plot for train data and test data.

plotModelDiagnostics(hvt.score)

9.3.1 Quantization Error Comparison (Train vs. Test)

Table below shows cell(s) containing anomalous test data points. Datapoints are flagged as anomalous when Quantization Error for such is greater than the same of assigned centroid based on error metric. Comparison between scored/test and fitted Quantization error of cell(s) is provided for further insights.

Number of test data points: 2400 | Number of anomalous data points: 0 | Percentage of anomalous data points: 0.00%
Mean QE for fitted data: 0.075 | Mean QE for test data: 0.0753 | Difference in QE between fitted and test data: 3^{-4}

plotQuantErrorHistogram(hvt.results,hvt.score)

9.3.2 Anomalous Observations

The anomalous observations are shown below in the datatable.

NOTE: Since there are no anomaly found in torus dataset that we use, the table below will be empty.

QECompareDf <- hvt.score$QECompareDf %>% filter(anomalyFlag == 1)
displayTable(QECompareDf)

Segment.Level	Segment.Parent	Segment.Child	anomalyFlag	n	Fitted.Quant.Error	Scored.Quant.Error	Quant.Error.Diff	Quant.Error.Diff (%)

HVT Model diagnostics and validation

Zubin Dowlaty, Shubhra Prakash, Sunuganty Achyut Raj, Praditi Shah, Somya Shambhawi,Vishwavani

Created Date: 2018-05-17 Modified Date: 2025-03-24

1. Abstract

2. Data Compression

2.1 Using k-means

2.2 Using k-medoids

2.3 Hierarchical Vector Quantization

2.3.1 Quantization Error

3. Data Projection

3.1 Tessellations

4. Notebook Requirements

5. Data Importing

5.1 Import Dataset from Local

5.2 Simulate Dataset

6. Data Understanding

6.1 Quick Peek of the Data

6.2 Deleting Irrelevant Columns

6.3 Formatting and Renaming Columns

6.4 Changing Data Type of Columns

6.5 Checking and Removing Duplicates

6.6 List of Numerical and Categorical Column Names

6.7 Filtering Dataset for Analysis

6.8 Missing Value Analysis

6.8.1 One-Hot Encoding

6.8.2 Check for Singularity

6.8.3 Selecting only Numeric Cols after Dummification

6.9 Final Dataset Summary

7. Data distribution

Summary Table

Histograms

Box Plots

Correlation Plot

7.1 Train - Test Split

7.1.1 Train Distribution

Summary Table

Histograms

Box Plots

Correlation Plot

7.1.2 Test Distribution

Summary Table

Histograms

Box Plots

Correlation Plot

8. Model Training

8.1 Model Diagnostics and Validation

8.1.1 Diagnostics

8.1.2 Validation

8.2 Diagnostic Plots

8.2.1 Mean Absolute Deviation based Calibration: HVT Model | Train Data

8.2.2 Minimum Intra-DataPoint Distance Plot: Train Data

8.2.3 Minimum Intra-Centroid Distance Plot: HVT Model | Train Data

8.2.4 Distribution of Number of Observations in HVT Cells

8.2.5 Singleton Count

9. Scoring

9.1 Scoring Algorithm

9.2 Load Test Data

9.2.1 Transformation of Categorical Features

9.2.2 Subsetting Test Dataset

9.3 Scoring on Test Data

9.3.1 Quantization Error Comparison (Train vs. Test)

9.3.2 Anomalous Observations

10. Applications

11. References

Created Date: 2018-05-17
Modified Date: 2025-03-24