The HVT package is a collection of R functions for building topology-preserving maps of rich multivariate data, particularly large datasets with many rows. The functions for this typical workflow are organized as follows:
Data Compression: Vector quantization (VQ) and hierarchical vector quantization (HVQ) using means or medians. This step compresses the rows of a long data frame according to a compression objective.
Data Projection: Projection of the compressed cells to 1D, 2D, and interactive surface plots with Sammon's non-linear mapping algorithm. This step creates topology-preserving map coordinates (also called embeddings) in the desired output dimension.
Tessellation: Creation of the cells required for visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology-preserving map, and is useful for semi-supervised tasks.
Scoring: Scoring new datasets and recording their cell assignments using the map objects from the steps above, chained through a sequence of maps if required. A minimal end-to-end sketch is shown below.
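The snippet below is a minimal, illustrative sketch of this workflow; the parameter values are placeholders, and df_train and df_test are assumed to be numeric data frames:

library(HVT)

# Steps 1-3: compress, project, and tessellate in one call
hvt.results <- trainHVT(df_train, n_cells = 100, depth = 1, quant.err = 0.2)

# Step 4: score, i.e. assign new rows to the trained map's cells
hvt.score <- scoreHVT(df_test, hvt.results, child.level = 1, mad.threshold = 0.2)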
This package can perform vector quantization using the following algorithms:
k-means: Each point is assigned to the nearest of k centroids, and each centroid is then recomputed as the mean of its assigned points. The assignment and update steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime of the algorithm is O(n).
k-medoids: Similar to k-means, except that each cluster center is an actual data point (a medoid). The assignment and update steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime of the algorithm is O(k * (n-k)^2).
The algorithm divides the dataset recursively into cells using the \(k-means\) or \(k-medoids\) algorithm. The maximum number of subsets is decided by setting \(n-cells\): setting it to, say, five divides the dataset into a maximum of five subsets, and each of these five subsets is further divided into at most five subsets, resulting in a total of up to twenty-five (5 * 5) subsets. The recursion terminates when a cell contains fewer than three data points or when the stop criterion is reached; here, the stop criterion is that the quantization error of the cell no longer exceeds the quantization threshold. A quick check of the maximum cell counts per level is shown below.
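This is simple arithmetic, illustrated here for n_cells = 5 and depths 1 to 3:

n_cells <- 5
sapply(1:3, function(depth) n_cells^depth)  # maximum cells per level
## [1]   5  25 125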
The steps of this method are applied recursively: each cell is quantized and then subdivided until the stop criterion is met. The stop criterion for a cell is met when one of the conditions below holds:
- the quantization error of the cell falls below the quantization error threshold, or
- the cell contains fewer than three data points.
Let us try to understand quantization error with an example.
Consider an example of two-dimensional VQ with 5 cells, each containing a certain number of points. The centroid of each cell is shown in blue. These centroids are also known as codewords, since each one represents all the points in its cell; the set of all codewords is called a codebook.
Now we want to calculate the quantization error for each cell. For the sake of simplicity, let's consider a single cell with centroid \(A\) and \(m\) data points \(F_i\).
For each point, we calculate the distance between the point and the centroid.
\[ d = ||A - F_i||_{p} \]
In the above equation, \(p = 1\) gives the L1_Norm (Manhattan) distance, whereas \(p = 2\) gives the L2_Norm (Euclidean) distance. In the package, the L1_Norm distance is chosen by default. The user can pass either L1_Norm, L2_Norm, or a custom function to calculate the distance between two points in n dimensions.
\[QE = \max(||A-F_i||_{p})\]
where \(A\) is the centroid of the cell, \(F_i\) is the \(i^{th}\) of the \(m\) points in the cell, and \(p\) is the order of the norm.
Now we take the maximum of the calculated distances over all \(m\) points. This gives us the furthest distance of a point in the cell from the centroid, which we refer to as the Quantization Error. If the Quantization Error is higher than the given threshold, the centroid/codevector is not a good representation of the points in the cell, so we perform further vector quantization on these points and repeat the above steps.
Please note that the user can select mean or max to calculate the Quantization Error. A custom function takes a vector of \(m\) values (where each value is the distance between a point in \(n\) dimensions and the centroid) and returns a single value, the Quantization Error for the cell.
If we select mean as the error metric, the Quantization Error equation becomes:

\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]

where \(A\) is the centroid of the cell, \(F_i\) is the \(i^{th}\) of the \(m\) points in the cell, and \(p\) is the order of the norm. A minimal sketch of both calculations is shown below.
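Here is a minimal sketch of both Quantization Error calculations for a single cell, assuming a centroid A and a small matrix of points Fmat (illustrative values only):

A    <- c(0, 0)                                 # cell centroid
Fmat <- rbind(c(1, 1), c(-2, 0.5), c(0.5, -1))  # m = 3 points in the cell
p    <- 1                                       # 1 = L1_Norm, 2 = L2_Norm

d <- apply(Fmat, 1, function(Fi) sum(abs(A - Fi)^p)^(1/p))  # ||A - F_i||_p per point
QE_max  <- max(d)    # "max" error metric (the package default)
QE_mean <- mean(d)   # "mean" error metric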
Sammon’s projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.
Denote the distance between the \(i^{th}\) and \(j^{th}\) objects in the original space by \(d_{ij}^*\), and the distance between their projections by \(d_{ij}\). Sammon's mapping aims to minimize the error function below, often referred to as Sammon's stress or Sammon's error:
\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]
The minimization can be performed either by gradient descent, as proposed initially, or by other, usually iterative, methods. The number of iterations needs to be determined experimentally, and convergent solutions are not always guaranteed. Many implementations prefer to use the first principal components as a starting configuration. A minimal standalone sketch using MASS is shown below.
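For illustration, Sammon's projection can be run directly with MASS::sammon on random data (this is not the HVT pipeline itself, just the underlying algorithm):

library(MASS)

set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3)   # 50 points in 3 dimensions
proj <- sammon(dist(X), k = 2)         # minimize Sammon's stress; prints the stress per iteration
plot(proj$points, xlab = "Dim 1", ylab = "Dim 2")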
A Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. These regions are called Voronoi cells. The Voronoi diagram is the dual of the Delaunay triangulation.
Tessellate: Constructing Voronoi Tessellations
In this package, we use sammon from the MASS package to project higher-dimensional data to a 2D space. The function hvq, called from the trainHVT function, returns hierarchically quantized data which becomes the input for constructing the tessellations. The data is then represented in 2D coordinates, and the tessellations are plotted using these coordinates as centroids. We use the deldir package for this purpose. The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, a transformation is performed on the 2D coordinates to bring all the points within their parent tile, and tessellations are plotted using these transformed points as centroids. The lines in the tessellations are clipped so that they do not protrude outside the parent polygon. This is done for all subsequent levels. A minimal standalone deldir sketch is shown below.
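For illustration, a standalone Voronoi tessellation with deldir looks roughly like this (random points, not the HVT centroids):

library(deldir)

set.seed(2)
x <- runif(20); y <- runif(20)
dxy <- deldir(x, y)          # Delaunay triangulation and Dirichlet (Voronoi) tessellation
plot(dxy, wlines = "tess")   # draw only the Voronoi tile boundaries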
This chunk verifies that all the packages necessary to run this vignette are installed; if any are missing, it installs them, and then attaches all the packages to the session environment.
<- c("dplyr","HVT", "kableExtra", "geozoo", "plotly", "purrr", "DT")
list.of.packages
<-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
new.packages if (length(new.packages))
install.packages(new.packages, dependencies = TRUE, verbose = FALSE, repos='https://cloud.r-project.org/')
# Loading the required libraries
invisible(lapply(list.of.packages, library, character.only = TRUE))
The user can provide an absolute or relative path in the cell below to access data from their computer, and can set the import_data_from_local variable to TRUE to upload a dataset from local storage.
Note: For this notebook, import_data_from_local has been set to FALSE, as we simulate a dataset in the next section.
import_data_from_local = FALSE # expects a logical input

file_name <- " " # enter the name of the local file
file_path <- " " # enter the path of the local file

if (import_data_from_local) {
  file_load <- paste0(file_path, file_name)
  dataset_updated <- as.data.frame(fread(file_load)) # fread() requires the data.table package
  if (nrow(dataset_updated) > 0) {
    paste0("File ", file_name, " having ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s)", " imported successfully. ") %>% cat("\n")
    dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. Below table showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated %>% head(10) %>% as.data.frame() %>% DT::datatable(options = options, rownames = TRUE)
  }
}
In this section, we will use a simulated dataset. If you are not using this option, set simulate_dataset to FALSE. Given below is a simulated dataset called torus that contains 12000 observations and 3 features.
Let us see how to generate the torus data. We use the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from 3 to 10 dimensions. It contains regular or well-known objects, e.g. the cube and sphere, and some abstract objects, e.g. Boy's surface, the torus, and the hyper-torus.
Here, we load the data and store it in a variable called dataset_updated.
simulate_dataset = TRUE

if (simulate_dataset == TRUE) {
  set.seed(257)
  ## torus data generation
  torus <- geozoo::torus(p = 3, n = 12000)
  dataset_updated <- data.frame(torus$points)
  colnames(dataset_updated) <- c("x", "y", "z")
  if (nrow(dataset_updated) > 0) {
    paste0("Dataset having ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s) ", "simulated successfully.") %>% cat("\n")
    dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. The table below is showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated %>% head(10) %>% as.data.frame() %>% displayTable()
  }
}
## Dataset having 12000 row(s) and 3 column(s) simulated successfully.
## Code chunk executed successfully. The table below is showing first 10 row(s) of the dataset.
x | y | z |
---|---|---|
-1.0020 | -2.3335 | -0.8420 |
1.1021 | -2.7447 | -0.2878 |
-1.0033 | 1.2656 | -0.9229 |
1.3204 | 0.6205 | -0.8410 |
1.2998 | -1.2470 | -0.9801 |
-1.9606 | 2.0755 | -0.5184 |
0.2807 | -0.9724 | 0.1548 |
-1.5540 | 2.0564 | -0.8164 |
-2.4653 | 1.6586 | -0.2377 |
-2.3234 | 1.6933 | -0.4841 |
Structure of Data
The section below shows the structure of the data.
dataset_updated %>% str()
## 'data.frame': 12000 obs. of 3 variables:
## $ x: num -1 1.1 -1 1.32 1.3 ...
## $ y: num -2.333 -2.745 1.266 0.621 -1.247 ...
## $ z: num -0.842 -0.288 -0.923 -0.841 -0.98 ...
The cell below allows the user to drop irrelevant columns.
########################################################################################
################################## User Input Needed ###################################
########################################################################################
# Add the column names which you want to remove
want_to_delete_column <- "no"

del_col <- c(" `column_name` ")

if (want_to_delete_column == "yes") {
  dataset_updated <- dataset_updated[ , !(names(dataset_updated) %in% del_col)]
  print("Code chunk executed successfully. Overview of data types after removing selected columns")
  str(dataset_updated)
} else {
  paste0("No Columns removed. Please enter column name if you want to remove that column") %>% cat("\n")
}
## No Columns removed. Please enter column name if you want to remove that column
The code below contains a user-defined function to rename or reformat any column that the user chooses.
########################################################################################
################################## User Input Needed ###################################
########################################################################################
# convert the column names to lower case
colnames(dataset_updated) <- colnames(dataset_updated) %>% casefold()

## rename a column?
want_to_rename_column <- "no" ## type "yes" if you want to rename a column

## renaming a column of the dataset
rename_col_name    <- " `column_name` " ## use small letters
rename_col_name_to <- " `new_name` "

if (want_to_rename_column == "yes") {
  names(dataset_updated)[names(dataset_updated) == rename_col_name] <- rename_col_name_to
}

# remove space, comma, dot from column names
spaceless <- function(x) {
  colnames(x) <- gsub(pattern = "[^[:alnum:]]+", replacement = ".", names(x))
  x
}
dataset_updated <- spaceless(dataset_updated)

## below is the dataset summary
paste0("Successfully converted the column names to lower case and check the renamed column name if you changed") %>% cat("\n")

## Successfully converted the column names to lower case and check the renamed column name if you changed

str(dataset_updated) ## showing summary for the updated dataset
## 'data.frame': 12000 obs. of 3 variables:
## $ x: num -1 1.1 -1 1.32 1.3 ...
## $ y: num -2.333 -2.745 1.266 0.621 -1.247 ...
## $ z: num -0.842 -0.288 -0.923 -0.841 -0.98 ...
This section allows the user to change the data type of selected columns.
########################################################################################
################################## User Input Needed ###################################
########################################################################################
# If you want to change a column's type, change the variable below to "yes"
want_to_change_column_type <- "no"

# you can change a column's type to numeric or character only
change_column_to_type <- "character" ## or "numeric"

if (want_to_change_column_type == "yes" && change_column_to_type == "character") {
  ########################################################################################
  ################################## User Input Needed ###################################
  ########################################################################################
  select_columns <- c("panel_var") ###### Add the column names you want to change here #####
  dataset_updated[select_columns] <- sapply(dataset_updated[select_columns], as.character)
  paste0("Code chunk executed successfully. Datatype of selected column(s) has been changed to character.")
  # str(dataset_updated)
} else if (want_to_change_column_type == "yes" && change_column_to_type == "numeric") {
  select_columns <- c('gearbox_oil_temperature')
  dataset_updated[select_columns] <- sapply(dataset_updated[select_columns], as.numeric)
  paste0("Code chunk executed successfully. Datatype of selected column(s) has been changed to numeric.")
  # str(dataset_updated)
} else {
  paste0("Datatype of columns have not been changed.") %>% cat("\n")
}
## Datatype of columns have not been changed.
dataset_updated <- do.call(data.frame, dataset_updated)
str(dataset_updated)
## 'data.frame': 12000 obs. of 3 variables:
## $ x: num -1 1.1 -1 1.32 1.3 ...
## $ y: num -2.333 -2.745 1.266 0.621 -1.247 ...
## $ z: num -0.842 -0.288 -0.923 -0.841 -0.98 ...
The presence of duplicate observations can be misleading; this section helps get rid of such rows in the dataset.
<- "yes" ## type "no" for choosing to not remove duplicates
want_to_remove_duplicates
## removing duplicate observation if present in the dataset
if(want_to_remove_duplicates == "yes"){
<- dataset_updated %>% unique()
dataset_updated paste0("Code chunk executed successfully, duplicates if present successfully removed. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
cat("\n")
str( dataset_updated) ## showing summary for updated dataset
else{
} paste0("Code chunk executed successfully, NO duplicates were removed") %>% print()
}
## [1] "Code chunk executed successfully, duplicates if present successfully removed. Updated dataset has 12000 row(s) and 3 column(s)"
##
## 'data.frame': 12000 obs. of 3 variables:
## $ x: num -1 1.1 -1 1.32 1.3 ...
## $ y: num -2.333 -2.745 1.266 0.621 -1.247 ...
## $ z: num -0.842 -0.288 -0.923 -0.841 -0.98 ...
# Return the column type
CheckColumnType <- function(dataVector) {
  # Check if the column type is "numeric" or "character" & decide the type accordingly
  if (class(dataVector) == "integer" || class(dataVector) == "numeric") {
    columnType <- "numeric"
  } else {
    columnType <- "character"
  }
  # Return the result
  return(columnType)
}

### Loading the list of numeric columns into a variable
numeric_cols <<- colnames(dataset_updated)[unlist(sapply(dataset_updated,
                   FUN = function(x) { CheckColumnType(x) == "numeric" }))]

### Loading the list of categorical columns into a variable
cat_cols <- colnames(dataset_updated)[unlist(sapply(dataset_updated,
              FUN = function(x) {
                CheckColumnType(x) == "character" || CheckColumnType(x) == "factor" }))]

paste0("Code chunk executed successfully, list of numeric and categorical variables created.") %>% cat()
## Code chunk executed successfully, list of numeric and categorical variables created.
paste0("Numerical Column(s): \n Count : ", length(numeric_cols), "\n") %>% cat()
## Numerical Column(s):
## Count : 3
paste0(numeric_cols) %>% print()
## [1] "x" "y" "z"
paste0("Categorical Column(s): \n Count : ", length(cat_cols), "\n") %>% cat()
## Categorical Column(s):
## Count : 0
paste0(cat_cols) %>% print()
## character(0)
In this section, the dataset can be filtered for required row(s) for further analysis.
<- "no" ## type "yes" in case you want to filter
want_to_filter_dataset <- " " ## Enter Column name to filter
filter_col <- " " ## Enter Value to exclude for the column selected
filter_val
if(want_to_filter_dataset == "yes"){
<- filter_at( dataset_updated
dataset_updated vars(contains(filter_col))
, all_vars(. != filter_val))
,
paste0("Code chunk executed successfully, dataset filtered successfully on required columns. Updated dataset has ", nrow( dataset_updated), " row(s) and ", ncol( dataset_updated), " column(s)") %>% print()
cat("\n")
str( dataset_updated) ## showing summary for updated dataset
else{
} paste0("Code chunk executed successfully, entire dataset is available for analysis.") %>% print()
}
## [1] "Code chunk executed successfully, entire dataset is available for analysis."
Missing values in the training data can lead to a biased model, because the behavior and relationship of those values with other variables has not been analyzed correctly; this can lead to wrong calculations or classifications. Missing values are commonly of 3 types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Missing Value on Entire dataset
na_total <- sum(is.na(dataset_updated)) / prod(dim(dataset_updated))
if (na_total == 0) {
  paste0("In the uploaded dataset, there is no missing value") %>% cat("\n")
} else {
  na_percentage <- paste0(sprintf(na_total * 100, fmt = '%#.2f'), "%")
  paste0("Percentage of missing value in entire dataset is ", na_percentage) %>% cat("\n")
}
## In the uploaded dataset, there is no missing value
Missing Value on Column-level
The following code visualizes the missing values (if any) using a bar chart. The gg_miss_upset function (from the naniar package) is used to visualize the patterns of missingness, or rather the combinations of missingness across cases. If any missing values are present, this function shows which variables, and which combinations of variables, they occur in.
# The code below gives the missing-value count for each column
paste0("Number of missing value in each column") %>% cat("\n")

## Number of missing value in each column

print(sapply(dataset_updated, function(x) sum(is.na(x))))

## x y z
## 0 0 0
missing_col_names <- names(which(sapply(dataset_updated, anyNA)))
total_na <- sum(is.na(dataset_updated))

# visualize the missing values (if any) using a bar chart
if (total_na > 0 && length(missing_col_names) > 1) {
  paste0("Code chunk executed successfully. Visualizing the missing values using bar chart") %>% cat("\n")
  gg_miss_upset(dataset_updated,
                nsets = 10,
                nintersects = NA)
} else if (total_na > 0) {
  dataset_updated %>% DataExplorer::plot_missing()
} else {
  paste("Code chunk executed successfully. No missing value exist.") %>% cat("\n")
}
## Code chunk executed successfully. No missing value exist.
Missing Value Treatment
In this section the user can decide how to tackle missing values in the dataset. Both column(s) and row(s) can be removed from the dataset if the user chooses to do so.
Drop Column(s) with Missing Values
The below code accepts user input and deletes the specified column.
########################################################################################
################################## User Input Needed ###################################
########################################################################################
# Do you want to drop a specific column?
drop_column_name_na <- "yes" ## type "yes" to drop column(s)
# write the column name(s) that you want to drop
drop_column_name <- c(" ") # enter column name
if (drop_column_name_na == "yes") {
  dataset_updated <- dataset_updated[ , which(!names(dataset_updated) %in% drop_column_name)]
  paste0("Code chunk executed, selected column(s) dropped successfully.") %>% print()
  cat("\n")
  str(dataset_updated)
} else {
  paste0("Code chunk executed, missing value not removed (if any).") %>% cat("\n")
  cat("\n")
}
## [1] "Code chunk executed, selected column(s) dropped successfully."
##
## 'data.frame': 12000 obs. of 3 variables:
## $ x: num -1 1.1 -1 1.32 1.3 ...
## $ y: num -2.333 -2.745 1.266 0.621 -1.247 ...
## $ z: num -0.842 -0.288 -0.923 -0.841 -0.98 ...
Drop Row(s) with Missing Values
The below code accepts user input and deletes rows.
# Do you want to drop row(s) containing "NA"?
drop_row <- "no" ## type "yes" to delete missing-value observations
if (drop_row == "yes") {
  dataset_updated <- dataset_updated %>% na.omit()
  paste0("Code chunk executed, missing values successfully identified and removed. Updated dataset has ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s)") %>% print()
  cat("\n")
} else {
  paste0("Code chunk executed, missing value(s) not removed (if any).") %>% cat("\n")
  cat("\n")
}
## Code chunk executed, missing value(s) not removed (if any).
This technique encodes each categorical value as either 1 or 0, and is used for categorical variables with 2 classes. This is done because classification models can only handle features that have numeric values. A minimal base-R sketch is shown below.
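For illustration, one-hot encoding can be done with base R's model.matrix; the column color here is hypothetical, and the vignette's own pipeline uses dummies::dummy.data.frame later on:

df <- data.frame(color = c("red", "blue", "red"), value = c(1, 2, 3))
dummies <- model.matrix(~ color - 1, data = df)  # one 0/1 indicator column per class
cbind(df["value"], dummies)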
Given below is the number of unique values in each categorical column.
cat_cols <- colnames(dataset_updated)[unlist(sapply(
  dataset_updated, FUN = function(x) {
    CheckColumnType(x) == "character" ||
      CheckColumnType(x) == "factor"
  }
))]

apply(dataset_updated[cat_cols], 2, function(x) {
  length(unique(x))
})
## integer(0)
Selecting categorical columns with fewer unique values for dummification:
########################################################################################
################################## User Input Needed ###################################
########################################################################################
# Do you want to dummify the categorical variables?
dummify_cat <- FALSE ## TRUE, FALSE

# Select the columns on which dummification is to be performed
dum_cols <- c(" ", " ") # enter column names in lower case
## [1] "One-Hot Encoding was not performed on dataset."
# Check data for singularity
singular_cols <- sapply(dataset_updated, function(x) length(unique(x))) %>% # convert to dataframe
  data.frame(Unique_n = .) %>% dplyr::filter(Unique_n == 1) %>%
  rownames() %>% data.frame(Constant_Variables = .)

if (nrow(singular_cols) != 0) {
  singular_cols %>% DT::datatable()
} else {
  paste("There are no singular columns in the dataset") %>% htmltools::HTML()
}
# Display the variance of each numeric column
data <- dataset_updated %>% dplyr::summarise_if(is.numeric, var) %>% t() %>%
  data.frame() %>% round(3)
colnames(data) <- c("Variance")
displayTable(data)
 | Variance |
---|---|
x | 2.239 |
y | 2.214 |
z | 0.499 |
numeric_cols = as.vector(sapply(dataset_updated, is.numeric))
dataset_updated = dataset_updated[, numeric_cols]
colnames(dataset_updated)
## [1] "x" "y" "z"
All further operations will be performed on the following dataset.
nums <- colnames(dataset_updated)[unlist(lapply(dataset_updated, is.numeric))]
cat(paste0("Final data frame contains ", nrow(dataset_updated), " row(s) and ", ncol(dataset_updated), " column(s). ", "Code chunk executed. Below table showing first 10 row(s) of the dataset."))

## Final data frame contains 12000 row(s) and 3 column(s). Code chunk executed. Below table showing first 10 row(s) of the dataset.

dataset_updated <- dataset_updated %>% mutate_if(is.numeric, round, digits = 4)

displayTable(dataset_updated[1:10, ])
x | y | z |
---|---|---|
-1.0020 | -2.3335 | -0.8420 |
1.1021 | -2.7447 | -0.2878 |
-1.0033 | 1.2656 | -0.9229 |
1.3204 | 0.6205 | -0.8410 |
1.2998 | -1.2470 | -0.9801 |
-1.9606 | 2.0755 | -0.5184 |
0.2807 | -0.9724 | 0.1548 |
-1.5540 | 2.0564 | -0.8164 |
-2.4653 | 1.6586 | -0.2377 |
-2.3234 | 1.6933 | -0.4841 |
This section displays four objects.
Variable Histograms: The histogram distribution of all the features in the dataset.
Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.
Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.
Summary EDA: The table provides descriptive statistics for all the features in the dataset.
It uses an inbuilt function called edaPlots to display the four objects mentioned above.
edaPlots(dataset_updated, output_type = "summary", n_cols = 3)
edaPlots(dataset_updated, output_type = "histogram", n_cols = 3)
edaPlots(dataset_updated, output_type = "boxplot", n_cols = 3)
edaPlots(dataset_updated, output_type = "correlation", n_cols = 3)
Let us split the data into train and test sets. We will randomly select 80% of the data as train and the remainder as test.
## 80% of the sample size
smp_size <- floor(0.80 * nrow(dataset_updated))

## set the seed to make your partition reproducible
set.seed(279)
train_ind <- sample(seq_len(nrow(dataset_updated)), size = smp_size)

dataset_updated_train <- dataset_updated[train_ind, ]
dataset_updated_test  <- dataset_updated[-train_ind, ]
The train data contains 9600 rows and 3 columns. The test data contains 2400 rows and 3 columns.
edaPlots(dataset_updated_train, output_type = "summary", n_cols = 3)
edaPlots(dataset_updated_train, output_type = "histogram", n_cols = 3)
edaPlots(dataset_updated_train, output_type = "boxplot", n_cols = 3)
edaPlots(dataset_updated_train, output_type = "correlation", n_cols = 3)
edaPlots(dataset_updated_test, output_type = "summary", n_cols = 3)
edaPlots(dataset_updated_test, output_type = "histogram", n_cols = 3)
edaPlots(dataset_updated_test, output_type = "boxplot", n_cols = 3)
edaPlots(dataset_updated_test, output_type = "correlation", n_cols = 3)
As described in the Tessellate section above, HVT uses sammon from the MASS package to project the quantized centroids to 2D, and the deldir package to construct the Voronoi tessellations at each level.
Let us try to understand the trainHVT function first.
trainHVT(
  dataset,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  normalize = TRUE,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  projection.scale,
  dim_reduction_method = c("sammon", "tsne", "umap"),
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio,
  tsne_perplexity, tsne_theta, tsne_verbose,
  tsne_eta, tsne_max_iter,
  umap_n_neighbors, umap_min_dist
)
Each of the parameters of the trainHVT function is explained below:

dataset - A dataframe with numeric columns (features) that will be used for training the model.

min_compression_perc - An integer indicating the minimum compression percentage to be achieved for the dataset, i.e. the desired reduction in dataset size compared to its original size.

n_cells - An integer indicating the number of cells per hierarchy (level). This parameter determines the granularity or level of detail in the hierarchical vector quantization.

depth - An integer indicating the number of levels. A depth of 1 means no hierarchy (single level), while higher values produce multiple levels (a hierarchy).

quant.err - A number indicating the quantization error threshold. A cell will only break down into further cells if its quantization error is above the defined threshold.

normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, scales the values of all features to have a mean of 0 and a standard deviation of 1 (Z-score).

distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and a centroid.

error_metric - The error metric can be mean or max; max is selected by default. max returns the maximum of the m values and mean takes their mean, where each value is a distance between a point and the centroid of its cell.

quant_method - The quantization method can be kmeans or kmedoids. kmeans uses means (centroids) as cluster centers, while kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.

projection.scale - A number indicating the scale factor for the tessellations, so that the sub-tessellations are visualized well enough. It adjusts the visual representation of the hierarchy to make the sub-tessellations more visible. Default is 10.

dim_reduction_method - The dimensionality reduction method to be used; options are 'tsne', 'umap' and 'sammon'. Default is 'sammon'.

scale_summary - A list with user-defined mean and standard deviation values for all the features in the dataset. Pass the scale summary when normalize is set to FALSE.

diagnose - A logical value indicating whether the user wants to perform diagnostics on the model. Default value is FALSE.

hvt_validation - A logical value indicating whether the user wants to hold out a validation set and find the mean absolute deviation of the validation points from the centroids. Default value is FALSE.

train_validation_split_ratio - A numeric value indicating the train/validation split ratio. This argument is only used when hvt_validation is set to TRUE. Default value is 0.8.

tsne_perplexity - A numeric that balances the attention t-SNE gives to local and global aspects of the data. Lower values focus more on local structure, while higher values consider more global structure. Recommended to be between 5 and 50. Default value is 30.

tsne_theta - A numeric speed/accuracy trade-off parameter for the Barnes-Hut approximation. If set to 0, exact t-SNE is performed, which is slower. If set to greater than 0, an approximation is used, which speeds up the process but may reduce accuracy. Default value is 0.5.

tsne_eta (learning_rate) - A numeric learning rate for t-SNE optimization that determines the step size during optimization. If too low, the algorithm might get stuck in local minima; if too high, the solution may become unstable. Default value is 200.

tsne_max_iter - An integer, the maximum number of iterations for the optimization process. More iterations can improve results but increase computation time. Default value is 1000.

umap_n_neighbors - An integer, the size of the local neighborhood (in terms of the number of neighboring sample points) used for manifold approximation. It controls the balance between local and global structure in the data: smaller values focus on local structure, while larger values capture more global structure. Default value is 15.

umap_min_dist - A numeric, the minimum distance between points in the embedded space. It controls how tightly UMAP packs points together; lower values result in a more clustered embedding. Default value is 0.1.
The output of the trainHVT function (a list of 7 elements) is explained below, with an image attached for clarity.
NOTE: The attached image is a snapshot of the output list generated from model training, which can be referred to later in this section.
The 1st element is a list containing information related to plotting tessellations, such as coordinates, boundaries, or other details necessary for visualizing the tessellations.
The 2nd element is a list containing the Sammon's projection coordinates of the data points in the reduced-dimensional space.
The 3rd element is a list containing detailed information about the hierarchically quantized data, along with a summary section containing the number of points, quantization error, and centroid of each cell in 2D.
The 4th element is a list that contains all the diagnostics information of the model when diagnose is set to TRUE; otherwise NA.
The 5th element is a list that contains all the information required to generate a Mean Absolute Deviation (MAD) plot if hvt_validation is set to TRUE; otherwise NA.
The 6th element is a list containing detailed information about the hierarchically quantized data, along with a summary section containing the number of points, quantization error, and centroid of each cell, which is the output of hvq.
The 7th element (model info) is a list that contains the model generation time, the input parameters passed to the model, the validation results, and the dimensionality reduction evaluation metrics table.
More information on building an HVT model at different levels and visualizing the output can be found here.
In the section below, we build a Level 1 HVT model. The number of cells (n_cells) is set to 500.
hvt.results <- trainHVT(dataset_updated_train,
                        n_cells = 500,
                        depth = 1,
                        quant.err = 0.1,
                        normalize = FALSE,
                        distance_metric = "L2_Norm",
                        error_metric = "max",
                        quant_method = "kmeans",
                        diagnose = TRUE,
                        hvt_validation = TRUE,
                        train_validation_split_ratio = 0.8,
                        dim_reduction_method = "sammon")
## Initial stress        : 0.01925
## stress after  10 iters: 0.01507, magic = 0.500
## stress after  20 iters: 0.01507, magic = 0.500
displayTable(hvt.results[[3]][['compression_summary']])
segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
---|---|---|---|---|
1 | 500 | 450 | 0.9 | n_cells: 500 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans |
As seen in the above table, 90% of the cells have a quantization error below the threshold.
Let’s have a closer look at the Quant.Error of the cells. Here we show just the top 100 rows for the sake of brevity.

displayTable(data = hvt.results[[3]][['summary']])
Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | x | y | z |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 16 | 1 | 0.085 | -2.5043 | 1.6314 | 0.0128 |
1 | 1 | 2 | 20 | 57 | 0.1056 | -0.8851 | 2.6890 | 0.5198 |
1 | 1 | 3 | 19 | 215 | 0.0757 | 0.2462 | 1.4518 | -0.8469 |
1 | 1 | 4 | 15 | 317 | 0.0462 | 0.9971 | 0.1269 | -0.1018 |
1 | 1 | 5 | 18 | 469 | 0.0955 | 2.8862 | 0.2697 | -0.4118 |
1 | 1 | 6 | 12 | 156 | 0.0664 | -1.2610 | -0.9961 | 0.9169 |
1 | 1 | 7 | 7 | 481 | 0.0605 | 2.0628 | -1.6366 | 0.7702 |
1 | 1 | 8 | 19 | 173 | 0.0592 | -1.0148 | -0.4740 | 0.4757 |
1 | 1 | 9 | 11 | 337 | 0.0474 | 1.2020 | 0.2838 | -0.6422 |
1 | 1 | 10 | 19 | 147 | 0.0637 | -0.8002 | 0.9398 | -0.6382 |
1 | 1 | 11 | 17 | 119 | 0.0602 | -1.2857 | 0.3566 | -0.7428 |
1 | 1 | 12 | 13 | 235 | 0.0549 | -0.3764 | -0.9945 | 0.3487 |
1 | 1 | 13 | 13 | 345 | 0.0499 | 0.7779 | -0.8996 | 0.5824 |
1 | 1 | 14 | 14 | 322 | 0.049 | 0.7621 | -0.7313 | -0.3297 |
1 | 1 | 15 | 19 | 38 | 0.1071 | -2.5437 | -1.5681 | 0.0274 |
1 | 1 | 16 | 14 | 202 | 0.0869 | 0.3834 | 2.2422 | -0.9543 |
1 | 1 | 17 | 13 | 180 | 0.0515 | -0.9200 | -0.8070 | -0.6288 |
1 | 1 | 18 | 13 | 200 | 0.0483 | -0.7782 | -0.7013 | 0.3026 |
1 | 1 | 19 | 11 | 172 | 0.039 | -0.9769 | -0.4964 | -0.4267 |
1 | 1 | 20 | 7 | 44 | 0.0488 | -2.2419 | 0.6013 | -0.9435 |
1 | 1 | 21 | 22 | 203 | 0.0531 | -0.0623 | 1.2179 | -0.6222 |
1 | 1 | 22 | 19 | 309 | 0.0605 | 0.1553 | -1.4426 | 0.8334 |
1 | 1 | 23 | 13 | 4 | 0.0908 | -2.2674 | 1.8196 | -0.3920 |
1 | 1 | 24 | 17 | 42 | 0.0951 | -2.6043 | -1.1370 | 0.5171 |
1 | 1 | 25 | 12 | 422 | 0.0689 | 1.8880 | -0.1159 | 0.9890 |
1 | 1 | 26 | 19 | 175 | 0.0636 | -0.5551 | 0.9269 | -0.3971 |
1 | 1 | 27 | 8 | 129 | 0.0468 | -1.3295 | -0.3648 | -0.7822 |
1 | 1 | 28 | 20 | 416 | 0.0687 | 2.2832 | 0.9731 | -0.8698 |
1 | 1 | 29 | 10 | 353 | 0.0434 | 0.9973 | -0.9066 | -0.7559 |
1 | 1 | 30 | 11 | 258 | 0.0561 | 0.5735 | 0.8548 | -0.2362 |
1 | 1 | 31 | 12 | 494 | 0.0811 | 1.7290 | -2.4385 | -0.0562 |
1 | 1 | 32 | 15 | 254 | 0.061 | 0.5261 | 1.0950 | 0.6210 |
1 | 1 | 33 | 13 | 286 | 0.0522 | 0.3298 | -0.9508 | -0.1011 |
1 | 1 | 34 | 15 | 410 | 0.1093 | 1.8365 | 2.2859 | -0.3242 |
1 | 1 | 35 | 9 | 312 | 0.0429 | 1.1547 | 0.6303 | -0.7266 |
1 | 1 | 36 | 35 | 126 | 0.1046 | -1.2236 | -2.6458 | -0.3519 |
1 | 1 | 37 | 14 | 397 | 0.107 | 1.9521 | 1.7699 | -0.7555 |
1 | 1 | 38 | 10 | 210 | 0.0803 | 0.5334 | 2.8564 | 0.4039 |
1 | 1 | 39 | 15 | 411 | 0.0912 | 1.7384 | 2.1866 | 0.5855 |
1 | 1 | 40 | 14 | 213 | 0.0786 | 0.6242 | 2.8695 | -0.3179 |
1 | 1 | 41 | 10 | 340 | 0.0503 | 0.9251 | -0.7349 | -0.5767 |
1 | 1 | 42 | 9 | 265 | 0.0449 | 0.7468 | 1.0075 | -0.6651 |
1 | 1 | 43 | 13 | 256 | 0.0548 | -0.0500 | -1.0343 | -0.2651 |
1 | 1 | 44 | 15 | 7 | 0.0847 | -2.8902 | 0.6925 | 0.1851 |
1 | 1 | 45 | 12 | 398 | 0.0594 | 1.4908 | -0.5839 | 0.9140 |
1 | 1 | 46 | 8 | 217 | 0.0493 | 0.0586 | 1.0962 | -0.4322 |
1 | 1 | 47 | 18 | 420 | 0.1104 | 0.5761 | -2.4941 | 0.8164 |
1 | 1 | 48 | 24 | 161 | 0.059 | -0.9058 | 0.4330 | 0.0810 |
1 | 1 | 49 | 20 | 271 | 0.089 | -0.4221 | -2.4065 | 0.8811 |
1 | 1 | 50 | 28 | 190 | 0.0785 | -0.2653 | 1.2979 | 0.7324 |
1 | 1 | 51 | 17 | 417 | 0.089 | 1.9581 | 1.1965 | 0.9514 |
1 | 1 | 52 | 23 | 112 | 0.0596 | -1.4479 | 0.1639 | 0.8373 |
1 | 1 | 53 | 16 | 436 | 0.0763 | 2.2330 | -0.4262 | -0.9544 |
1 | 1 | 54 | 14 | 105 | 0.0623 | -1.6477 | -0.8579 | -0.9862 |
1 | 1 | 55 | 14 | 310 | 0.0404 | 0.6567 | -0.7805 | -0.1951 |
1 | 1 | 56 | 16 | 145 | 0.0788 | 0.0785 | 2.5186 | -0.8481 |
1 | 1 | 57 | 20 | 101 | 0.0695 | -1.3314 | 0.7496 | -0.8790 |
1 | 1 | 58 | 28 | 28 | 0.1169 | -1.3507 | 2.4653 | 0.5585 |
1 | 1 | 59 | 18 | 3 | 0.082 | -2.7258 | 1.2293 | -0.0555 |
1 | 1 | 60 | 16 | 458 | 0.0786 | 2.6461 | 0.8808 | 0.6020 |
1 | 1 | 61 | 13 | 448 | 0.0909 | 2.5477 | 1.5254 | -0.1933 |
1 | 1 | 62 | 6 | 497 | 0.0553 | 2.4053 | -1.6289 | 0.4151 |
1 | 1 | 63 | 18 | 74 | 0.099 | -2.1565 | -1.0329 | 0.9085 |
1 | 1 | 64 | 15 | 106 | 0.0829 | -1.7278 | -0.8237 | 0.9903 |
1 | 1 | 65 | 24 | 15 | 0.1057 | -2.8328 | 0.2844 | 0.5076 |
1 | 1 | 66 | 15 | 445 | 0.0841 | 1.0529 | -2.2667 | 0.8588 |
1 | 1 | 67 | 19 | 426 | 0.0621 | 1.7760 | -0.8549 | 0.9945 |
1 | 1 | 68 | 15 | 133 | 0.0841 | -1.4151 | -1.2917 | 0.9928 |
1 | 1 | 69 | 17 | 21 | 0.1024 | -1.8727 | 1.8759 | 0.7471 |
1 | 1 | 70 | 11 | 22 | 0.0855 | -2.7889 | -0.1194 | -0.5972 |
1 | 1 | 71 | 18 | 425 | 0.0812 | 2.0334 | 1.8479 | 0.6542 |
1 | 1 | 72 | 14 | 152 | 0.0558 | -1.2178 | -0.5852 | 0.7594 |
1 | 1 | 73 | 18 | 178 | 0.0524 | -0.9222 | -0.3876 | 0.0185 |
1 | 1 | 74 | 9 | 292 | 0.0439 | 0.9819 | 0.8353 | -0.7014 |
1 | 1 | 75 | 7 | 60 | 0.0577 | -2.0354 | 0.3946 | -0.9946 |
1 | 1 | 76 | 19 | 184 | 0.0957 | 0.0769 | 2.2161 | 0.9656 |
1 | 1 | 77 | 10 | 406 | 0.0528 | 1.0878 | -1.4912 | 0.9851 |
1 | 1 | 78 | 16 | 114 | 0.0762 | -0.4937 | 2.0190 | -0.9916 |
1 | 1 | 79 | 22 | 154 | 0.0805 | -1.1446 | -1.3244 | -0.9625 |
1 | 1 | 80 | 19 | 174 | 0.089 | -0.9800 | -1.8102 | -0.9944 |
1 | 1 | 81 | 11 | 20 | 0.1069 | -2.9200 | -0.3489 | -0.3049 |
1 | 1 | 82 | 20 | 367 | 0.0585 | 1.2396 | -0.2282 | 0.6715 |
1 | 1 | 83 | 18 | 194 | 0.079 | -0.9804 | -1.5473 | 0.9802 |
1 | 1 | 84 | 10 | 275 | 0.0379 | 0.0616 | -1.1076 | 0.4520 |
1 | 1 | 85 | 14 | 483 | 0.0814 | 2.8612 | -0.0229 | 0.4950 |
1 | 1 | 86 | 19 | 239 | 0.0811 | -0.6117 | -1.8982 | 0.9896 |
1 | 1 | 87 | 15 | 108 | 0.1132 | -0.4173 | 2.3374 | 0.9180 |
1 | 1 | 88 | 15 | 14 | 0.0749 | -2.6803 | 0.8072 | 0.5906 |
1 | 1 | 89 | 14 | 487 | 0.0933 | 2.9542 | -0.3518 | 0.1693 |
1 | 1 | 90 | 15 | 49 | 0.078 | -2.3255 | -0.1169 | 0.9387 |
1 | 1 | 91 | 11 | 18 | 0.0661 | -2.5622 | 0.9349 | -0.6795 |
1 | 1 | 92 | 17 | 198 | 0.0676 | -0.8917 | -1.0186 | 0.7597 |
1 | 1 | 93 | 14 | 70 | 0.1113 | -2.2507 | -0.7708 | 0.9096 |
1 | 1 | 94 | 18 | 130 | 0.0671 | -1.3785 | -0.2504 | 0.7958 |
1 | 1 | 95 | 20 | 31 | 0.0843 | -2.7571 | -0.4438 | 0.5964 |
1 | 1 | 96 | 18 | 228 | 0.0681 | 0.4626 | 2.1152 | 0.9792 |
1 | 1 | 97 | 13 | 477 | 0.0839 | 1.6383 | -2.3006 | -0.5530 |
1 | 1 | 98 | 11 | 472 | 0.0849 | 2.6431 | -0.0930 | 0.7589 |
1 | 1 | 99 | 16 | 77 | 0.0977 | -0.8876 | 2.2554 | 0.8966 |
1 | 1 | 100 | 19 | 253 | 0.068 | 0.6705 | 1.2463 | -0.8092 |
Let’s take a look at the 2D HVT plot.
plotHVT(hvt.results,
        line.width = c(0.6),
        color.vec = c("black"),
        centroid.size = 1,
        maxDepth = 1,
        plot.type = '2Dhvt')
HVT model diagnostics are used to evaluate the model fit and investigate the proximity between centroids. The distribution of proximity values can also be used to decide an optimum Mean Absolute Deviation threshold for HVT-model-based scoring.
The diagnostics can be enabled by setting the diagnose parameter to TRUE while building the HVT model.
Model validation is used to measure the fit/quality of the model. Measuring model fit is the key to iteratively improving a model. The relevant measure of model quality here is the percentage of anomalous points, which should ideally match the level of compression achieved during modeling, where PercentageAnomalies \(\approx\) 1 - ModelCompression.
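As a quick illustrative check, using the roughly 90% compression reported for the model built later in this vignette:

model_compression <- 0.90   # fraction of cells below the quantization error threshold
1 - model_compression       # ~0.10, i.e. roughly 10% of points expected to be anomalous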
Model validation can be enabled by setting the hvt_validation parameter to TRUE and setting the train_validation_split_ratio value while training the HVT model.
The model trained above has a train_validation_split_ratio of 0.8, i.e. 80% of the train dataset is used for training the model while the remaining 20% is used for validation.
Note: The user can skip this step if the number of observations in the train data is low.
The basic tools for examining the model fit are proximity plots and the distribution of observations across centroids.
The proximity between objects can be measured as a distance matrix. The distances between the objects are calculated using the Manhattan or Euclidean distance and put into matrix form. In the next step we find the minimum value in each row, excluding the diagonal, since the diagonal elements of a distance matrix are zero (the distance from an object to itself). This minimum value gives the proximity (distance to the nearest neighbour) of each object; a minimal sketch is shown below.
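A minimal sketch of this proximity calculation, assuming a numeric matrix X (for instance, the cell centroids):

set.seed(3)
X <- matrix(rnorm(10 * 3), ncol = 3)
d <- as.matrix(dist(X, method = "manhattan"))  # pairwise Manhattan distances
diag(d) <- Inf                                 # exclude the zero self-distances
nearest <- apply(d, 1, min)                    # distance to each object's nearest neighbour
hist(nearest, main = "Nearest-neighbour distances")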
The plotModelDiagnostics() function can be used to print diagnostic plots for an HVT model or for HVT scoring. For an HVT model it provides 5 diagnostic plots, which are shown one by one below. Let's have a look at the function plotModelDiagnostics, which we will use to print the diagnostic plots.
plotModelDiagnostics(hvt.results)
The first diagnostic plot is a calibration plot for the HVT model run on the train data, obtained by scoring the train data against the HVT model itself. It compares the percentage of anomalies at varying Mean Absolute Deviation values. It can be seen from the plot that at a Mean Absolute Deviation value of 0.1 the percentage of anomalies drops below one percent.
p3 = hvt.results[[4]]$mad_plot_train + ggtitle("Mean Absolute Deviation Plot: Calibration: HVT Model | Train Data")
p3
The second diagnostic plot helps us find out how the points in the training data are distributed. Shown below is a histogram of the minimum distance to the nearest neighbour for each observation in the train data.
p1 = hvt.results[[4]]$datapoint_plot + ggtitle("Minimum Intra-DataPoint Distance Plot: Train Data")
p1
As seen in the plot above, the mean value is 0.02.
The third diagnostic plot helps us find out how the centroids in the HVT model are distributed. Shown below is a histogram of the minimum distance to the nearest neighbour for each centroid in the HVT model.
p2 = hvt.results[[4]]$cent_plot + ggtitle("Minimum Intra-Centroid Distance Plot: HVT Model | Train Data")
p2
As seen in the plot above, the mean value is 0.8. This value can be selected as the Mean Absolute Deviation threshold when scoring data using the scoreHVT function.
The fourth diagnostic plot shows the distribution of the number of observations in each cell. Shown below is a histogram depicting the same.
p4 = hvt.results[[4]]$number_plot + ggtitle("Distribution of Number of Observations in Cells: HVT Model | Train Data")
p4
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As shown in the plot above, the mean number of records in each HVT cell is 15.
The fifth diagnostic plot shows the number of singleton centroids (segments/centroids with a single observation).
p5 = hvt.results[[4]]$singleton_piechart
p5
Validation
The Mean Absolute Deviation plot for the validation data has been shown in the section above. To fetch it separately, we can use the following code:
=hvt.results[[5]][["mad_plot"]]+ggtitle("Mean Absolute Deviation Plot:Validation")
m1 m1
As seen in the plot, the mean absolute deviation for the validation data drawn from the given dataset_updated_train is 0.11.
Now that we have built the model, let us score our test dataset to see which cell each point belongs to.
The scoring algorithm recursively calculates the distance between each point in the test dataset and the cell centroids at each level, assigning the point to its nearest cell. A minimal sketch of this assignment for a single level is shown below.
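The sketch below illustrates this assignment for one test point at a single level, assuming a matrix of cell centroids (illustrative values only):

set.seed(4)
centroids <- matrix(rnorm(5 * 3), ncol = 3)   # 5 cell centroids in 3 dimensions
point <- rnorm(3)                             # one test observation

d <- apply(centroids, 1, function(cc) sum(abs(point - cc)))  # L1_Norm distances
assigned_cell <- which.min(d)   # the point is scored into its nearest cell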
The user can provide an absolute or relative path in the cell below to access the test data from their computer.
load_test_data = FALSE
if (load_test_data) {
  file_name <- " " # enter the name of the local file for validation
  file_path <- " " # enter the path of the local file for validation
  file_load <- paste0(file_path, file_name)
  dataset_updated_test <- as.data.frame(fread(file_load))

  if (nrow(dataset_updated_test) > 0) {
    paste0("File ", file_name, " having ", nrow(dataset_updated_test), " row(s) and ",
           ncol(dataset_updated_test), " column(s)", " imported successfully. ") %>% cat("\n")
    dataset_updated_test <- dataset_updated_test %>% mutate_if(is.numeric, round, digits = 4)
    paste0("Code chunk executed successfully. Below table showing first 10 row(s) of the dataset.") %>% cat("\n")
    dataset_updated_test %>% head(10) %>% as.data.frame() %>% DT::datatable(options = options, rownames = TRUE)
  }
  colnames(dataset_updated_test) <- colnames(dataset_updated_test) %>% casefold()
  dataset_updated_test <- spaceless(dataset_updated_test)
}
In this section we perform one-hot encoding on the test dataset, depending on whether one-hot encoding was performed on the train dataset.
if (dummify_cat) {
  dummified_cols_test <- dataset_updated_test %>% dplyr::select(dum_cols) %>%
    dummies::dummy.data.frame(dummy.classes = "ALL", sep = "_")

  names(dummified_cols_test) <- gsub(pattern = "[^[:alnum:]]+",
                                     replacement = ".",
                                     names(dummified_cols_test))

  columns_difference = setdiff(dummified_cols, names(dummified_cols_test))
  dummified_cols_test[, columns_difference] = 0

  dataset_updated_test <- dataset_updated_test %>% cbind(dummified_cols_test) %>% # append encoded columns
    dplyr::select(-dum_cols) # remove the old categorical columns

  dummified_cols_test %>% head(5) %>% DT::datatable(options = options)
}
In this section we subset the test data to the numeric columns present in the train data.
dataset_updated_test = dataset_updated_test %>% dplyr::select(nums)
Now that we have the test data ready, let's look at the scoreHVT function.
scoreHVT(dataset,
         hvt.results.model,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         distance_metric,
         error_metric,
         yVar,
         analysis.plots,
         names.column)
The important parameters of the scoreHVT function are given below:

dataset - A dataframe containing the test dataset. The dataframe should have all the variables (features) used for training.

hvt.results.model - A list obtained from the trainHVT function while performing hierarchical vector quantization on the training data. This list provides an overview of the hierarchically quantized data, including diagnostics, tessellation details, Sammon's projection coordinates, and model input information.

child.level - A number indicating the depth for which the heat map is to be plotted. Each depth represents a different level of clustering or partitioning of the data.

mad.threshold - A numeric value indicating the permissible Mean Absolute Deviation, which is obtained from the Minimum Intra-Centroid Distance plot (when diagnose is set to TRUE in trainHVT). The mad.threshold value is important since it is used in anomaly detection. Default value is 0.2. NOTE: a given datapoint is denoted as an anomaly when its quantization error is above mad.threshold, and not otherwise.

line.width - A vector indicating the line widths of the tessellation boundaries for each layer. (Optional parameter)

color.vec - A vector indicating the colors of the tessellation boundaries at each layer. (Optional parameter)

normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, the data (test dataset) is standardized by the mean and sd of the training dataset referred from trainHVT(). When set to FALSE, the data is used as such without any changes.

distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). The metric is used when calculating the distance between each datapoint (in the test dataset) and the centroids obtained from the results of trainHVT. Default is L1_Norm.

error_metric - The error metric can be mean or max. max will return the maximum of the m values and mean will take the mean of the m values, where each value is a distance between a datapoint and the centroid of its cell. This is used to calculate the scored quantization error. Default value is max.

yVar - A character or a vector representing the name of the dependent variable(s).

The arguments below are used only when a character column can be mapped over the scored results; since the torus dataset doesn't have a character column, we are not using them in this vignette.

analysis.plots - A logical value indicating whether to include the insight plots, which are useful for viewing the contents and clusters of cells. Default is FALSE.

names.column - The column of names of the datapoints, which will be displayed as the contents of the cells in 'scoredPlotly'. Default is NULL.
Here mad.threshold has been set to 0.8, based on the mean of the Minimum Intra-Centroid Distance plot shown above.
hvt.score <- scoreHVT(dataset_updated_test,
                      hvt.results,
                      child.level = 1,
                      mad.threshold = 0.8,
                      line.width = c(0.6, 0.4, 0.2),
                      color.vec = c("navyblue", "slateblue", "lavender"),
                      distance_metric = "L1_Norm",
                      error_metric = "max")
displayTable(hvt.score[["scoredPredictedData"]], value = 0.8)
Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | centroidRadius | diff | anomalyFlag | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 153 | 1 | 182 | 0.0764 | 0.3141 | 0.2377 | 0 | -1.0020 | -2.3335 | -0.8420 |
1 | 1 | 217 | 1 | 17 | 0.0633 | 0.2037 | 0.1404 | 0 | -2.1770 | 1.5699 | -0.7295 |
1 | 1 | 162 | 1 | 475 | 0.1043 | 0.1806 | 0.0763 | 0 | 2.0941 | -1.8907 | -0.5704 |
1 | 1 | 445 | 1 | 78 | 0.1248 | 0.3166 | 0.1918 | 0 | -0.1980 | 2.9839 | -0.1378 |
1 | 1 | 436 | 1 | 123 | 0.1272 | 0.2633 | 0.1361 | 0 | -1.2495 | -1.9487 | -0.9491 |
1 | 1 | 269 | 1 | 413 | 0.1154 | 0.1992 | 0.0838 | 0 | 2.0634 | 0.0815 | -0.9979 |
1 | 1 | 191 | 1 | 347 | 0.0521 | 0.1879 | 0.1358 | 0 | 1.5182 | 0.9935 | -0.9826 |
1 | 1 | 172 | 1 | 242 | 0.0443 | 0.0816 | 0.0373 | 0 | 0.4102 | 0.9552 | -0.2784 |
1 | 1 | 39 | 1 | 411 | 0.1215 | 0.2735 | 0.1520 | 0 | 1.5409 | 2.3249 | 0.6142 |
1 | 1 | 113 | 1 | 197 | 0.0838 | 0.1432 | 0.0594 | 0 | -0.9558 | -0.6764 | 0.5591 |
1 | 1 | 237 | 1 | 186 | 0.072 | 0.1856 | 0.1136 | 0 | -0.0790 | 1.8156 | -0.9832 |
1 | 1 | 204 | 1 | 273 | 0.0392 | 0.1506 | 0.1113 | 0 | 0.7280 | 0.6856 | 0.0014 |
1 | 1 | 59 | 1 | 3 | 0.1355 | 0.2460 | 0.1105 | 0 | -2.8278 | 1.0012 | 0.0209 |
1 | 1 | 209 | 1 | 368 | 0.0661 | 0.2251 | 0.1590 | 0 | 0.3594 | -1.8433 | -0.9925 |
1 | 1 | 92 | 1 | 198 | 0.0623 | 0.2029 | 0.1405 | 0 | -0.9926 | -0.9557 | 0.7829 |
1 | 1 | 91 | 1 | 18 | 0.0686 | 0.1984 | 0.1298 | 0 | -2.5097 | 1.0868 | -0.6782 |
1 | 1 | 237 | 1 | 186 | 0.0526 | 0.1856 | 0.1329 | 0 | -0.1015 | 1.7005 | -0.9550 |
1 | 1 | 471 | 1 | 404 | 0.0618 | 0.2626 | 0.2008 | 0 | 1.7236 | 1.5863 | 0.9395 |
1 | 1 | 157 | 1 | 452 | 0.026 | 0.1888 | 0.1628 | 0 | 1.6227 | -1.6186 | 0.9564 |
1 | 1 | 257 | 1 | 377 | 0.0682 | 0.1637 | 0.0955 | 0 | 1.3105 | -0.4772 | 0.7960 |
1 | 1 | 333 | 1 | 170 | 0.058 | 0.1101 | 0.0522 | 0 | -1.0363 | -0.2099 | 0.3338 |
1 | 1 | 452 | 1 | 354 | 0.0986 | 0.1824 | 0.0838 | 0 | 1.3842 | 0.3319 | 0.8170 |
1 | 1 | 435 | 1 | 107 | 0.0493 | 0.2171 | 0.1679 | 0 | -1.2680 | 1.0389 | 0.9327 |
1 | 1 | 211 | 1 | 34 | 0.0734 | 0.2149 | 0.1414 | 0 | -2.6369 | -0.5918 | -0.7117 |
1 | 1 | 384 | 1 | 236 | 0.0638 | 0.2220 | 0.1582 | 0 | 0.7185 | 2.1057 | -0.9744 |
1 | 1 | 349 | 1 | 440 | 0.0985 | 0.2480 | 0.1495 | 0 | 2.3827 | 0.6634 | 0.8809 |
1 | 1 | 486 | 1 | 127 | 0.1209 | 0.2863 | 0.1654 | 0 | -1.7146 | -1.6026 | 0.9379 |
1 | 1 | 154 | 1 | 311 | 0.0723 | 0.2856 | 0.2133 | 0 | 1.3901 | 2.2967 | -0.7289 |
1 | 1 | 119 | 1 | 148 | 0.0545 | 0.1760 | 0.1216 | 0 | -1.1141 | -0.0945 | -0.4715 |
1 | 1 | 62 | 1 | 497 | 0.0429 | 0.1659 | 0.1231 | 0 | 2.4161 | -1.5692 | 0.4732 |
1 | 1 | 457 | 1 | 388 | 0.0724 | 0.2316 | 0.1592 | 0 | 1.7900 | 0.0487 | -0.9779 |
1 | 1 | 425 | 1 | 181 | 0.0362 | 0.1801 | 0.1439 | 0 | -0.3151 | 1.1740 | -0.6202 |
1 | 1 | 361 | 1 | 27 | 0.1781 | 0.3374 | 0.1593 | 0 | -1.4384 | 2.6024 | -0.2288 |
1 | 1 | 322 | 1 | 252 | 0.0704 | 0.1364 | 0.0661 | 0 | 0.5732 | 0.8297 | 0.1298 |
1 | 1 | 357 | 1 | 94 | 0.1396 | 0.2812 | 0.1416 | 0 | -2.0384 | -1.7248 | -0.7422 |
1 | 1 | 183 | 1 | 261 | 0.0891 | 0.1566 | 0.0675 | 0 | 0.4926 | 0.9927 | 0.4523 |
1 | 1 | 317 | 1 | 418 | 0.0205 | 0.2420 | 0.2215 | 0 | 2.1445 | 0.4366 | -0.9821 |
1 | 1 | 93 | 1 | 70 | 0.1438 | 0.3340 | 0.1902 | 0 | -2.1026 | -0.5622 | 0.9843 |
1 | 1 | 141 | 1 | 304 | 0.0736 | 0.2559 | 0.1822 | 0 | 1.0673 | 1.6906 | 1.0000 |
1 | 1 | 267 | 1 | 315 | 0.0533 | 0.1378 | 0.0845 | 0 | 1.0620 | 0.2168 | -0.4009 |
1 | 1 | 57 | 1 | 101 | 0.0997 | 0.2086 | 0.1088 | 0 | -1.2506 | 0.9453 | -0.9017 |
1 | 1 | 97 | 1 | 477 | 0.0881 | 0.2516 | 0.1635 | 0 | 1.7293 | -2.3387 | -0.4177 |
1 | 1 | 95 | 1 | 31 | 0.0598 | 0.2529 | 0.1931 | 0 | -2.7339 | -0.5932 | 0.6032 |
1 | 1 | 52 | 1 | 112 | 0.0617 | 0.1788 | 0.1171 | 0 | -1.4642 | 0.0022 | 0.8444 |
1 | 1 | 47 | 1 | 420 | 0.1156 | 0.3313 | 0.2156 | 0 | 0.4330 | -2.3813 | 0.9074 |
1 | 1 | 319 | 1 | 459 | 0.096 | 0.2466 | 0.1506 | 0 | 2.7641 | 0.7309 | -0.5118 |
1 | 1 | 202 | 1 | 29 | 0.0745 | 0.2193 | 0.1448 | 0 | -2.5086 | 0.7689 | 0.7815 |
1 | 1 | 166 | 1 | 386 | 0.0379 | 0.2457 | 0.2078 | 0 | 1.4787 | -0.2794 | 0.8688 |
1 | 1 | 252 | 1 | 196 | 0.0533 | 0.1539 | 0.1006 | 0 | -0.7918 | -0.7288 | -0.3828 |
1 | 1 | 402 | 1 | 234 | 0.0555 | 0.1477 | 0.0922 | 0 | -0.4083 | -1.0176 | -0.4286 |
1 | 1 | 282 | 1 | 245 | 0.0933 | 0.1726 | 0.0794 | 0 | 0.3045 | 1.0556 | 0.4330 |
1 | 1 | 423 | 1 | 176 | 0.0141 | 0.2224 | 0.2083 | 0 | -0.5818 | 0.9918 | 0.5266 |
1 | 1 | 68 | 1 | 133 | 0.0566 | 0.2522 | 0.1956 | 0 | -1.3209 | -1.2284 | 0.9806 |
1 | 1 | 123 | 1 | 111 | 0.0699 | 0.2311 | 0.1612 | 0 | -1.3814 | 0.7997 | 0.9148 |
1 | 1 | 147 | 1 | 480 | 0.098 | 0.2042 | 0.1061 | 0 | 2.0533 | -1.8681 | 0.6308 |
1 | 1 | 292 | 1 | 343 | 0.0353 | 0.1916 | 0.1564 | 0 | 0.9554 | -0.3904 | 0.2514 |
1 | 1 | 175 | 1 | 429 | 0.0859 | 0.2246 | 0.1387 | 0 | 1.9167 | -0.6986 | -0.9992 |
1 | 1 | 241 | 1 | 241 | 0.1479 | 0.2439 | 0.0959 | 0 | 0.4723 | 2.6730 | 0.6997 |
1 | 1 | 370 | 1 | 324 | 0.0279 | 0.1619 | 0.1341 | 0 | 0.5408 | -0.9755 | 0.4664 |
1 | 1 | 291 | 1 | 468 | 0.0325 | 0.2760 | 0.2436 | 0 | 2.8851 | 0.7376 | 0.2091 |
1 | 1 | 478 | 1 | 52 | 0.0728 | 0.2279 | 0.1551 | 0 | -1.7929 | 1.0316 | -0.9977 |
1 | 1 | 89 | 1 | 487 | 0.0485 | 0.2798 | 0.2313 | 0 | 2.9771 | -0.3583 | 0.0534 |
1 | 1 | 25 | 1 | 422 | 0.0421 | 0.2068 | 0.1647 | 0 | 1.9528 | -0.0645 | 0.9989 |
1 | 1 | 97 | 1 | 477 | 0.0492 | 0.2516 | 0.2024 | 0 | 1.6524 | -2.2377 | -0.6237 |
1 | 1 | 86 | 1 | 239 | 0.0361 | 0.2433 | 0.2072 | 0 | -0.5211 | -1.8901 | 0.9992 |
1 | 1 | 492 | 1 | 36 | 0.0706 | 0.1235 | 0.0529 | 0 | -2.3845 | 0.6975 | -0.8749 |
1 | 1 | 477 | 1 | 41 | 0.0712 | 0.2101 | 0.1389 | 0 | -2.4536 | 0.1933 | -0.8873 |
1 | 1 | 28 | 1 | 416 | 0.0566 | 0.2060 | 0.1494 | 0 | 2.2249 | 1.0713 | -0.8830 |
1 | 1 | 153 | 1 | 182 | 0.1569 | 0.3141 | 0.1572 | 0 | -0.8697 | -2.1554 | -0.9460 |
1 | 1 | 13 | 1 | 345 | 0.0761 | 0.1496 | 0.0735 | 0 | 0.8484 | -0.9514 | 0.6884 |
1 | 1 | 261 | 1 | 374 | 0.0411 | 0.1669 | 0.1257 | 0 | 0.6624 | -1.5950 | -0.9620 |
1 | 1 | 363 | 1 | 464 | 0.0593 | 0.2063 | 0.1470 | 0 | 2.2880 | -0.8983 | 0.8890 |
1 | 1 | 177 | 1 | 307 | 0.0037 | 0.1773 | 0.1736 | 0 | 0.3260 | -1.1945 | 0.6478 |
1 | 1 | 55 | 1 | 310 | 0.0716 | 0.1211 | 0.0495 | 0 | 0.6464 | -0.8397 | -0.3404 |
1 | 1 | 200 | 1 | 415 | 0.1246 | 0.2336 | 0.1089 | 0 | 1.6026 | 2.5123 | 0.1994 |
1 | 1 | 244 | 1 | 342 | 0.0622 | 0.1769 | 0.1147 | 0 | 1.1013 | -0.0970 | 0.4471 |
1 | 1 | 460 | 1 | 230 | 0.0622 | 0.2442 | 0.1820 | 0 | -0.4813 | -1.5246 | -0.9160 |
1 | 1 | 424 | 1 | 372 | 0.083 | 0.2654 | 0.1824 | 0 | 1.3371 | -0.1317 | -0.7544 |
1 | 1 | 73 | 1 | 178 | 0.0482 | 0.1572 | 0.1090 | 0 | -0.9502 | -0.3186 | 0.0659 |
1 | 1 | 102 | 1 | 462 | 0.0623 | 0.2230 | 0.1607 | 0 | 2.0180 | -1.1153 | 0.9521 |
1 | 1 | 198 | 1 | 2 | 0.1735 | 0.2656 | 0.0922 | 0 | -2.0152 | 2.0241 | 0.5166 |
1 | 1 | 366 | 1 | 100 | 0.0822 | 0.2644 | 0.1822 | 0 | -1.0988 | 1.5419 | -0.9943 |
1 | 1 | 262 | 1 | 6 | 0.1624 | 0.2576 | 0.0953 | 0 | -1.9070 | 2.2917 | -0.1921 |
1 | 1 | 268 | 1 | 205 | 0.0861 | 0.2128 | 0.1267 | 0 | -0.7320 | -1.3653 | -0.8926 |
1 | 1 | 33 | 1 | 286 | 0.0814 | 0.1565 | 0.0751 | 0 | 0.3054 | -0.9587 | 0.1109 |
1 | 1 | 311 | 1 | 281 | 0.0287 | 0.1955 | 0.1668 | 0 | 0.7924 | 1.1246 | 0.7812 |
1 | 1 | 48 | 1 | 161 | 0.0934 | 0.1770 | 0.0837 | 0 | -0.8322 | 0.5737 | 0.1468 |
1 | 1 | 496 | 1 | 206 | 0.0658 | 0.1103 | 0.0445 | 0 | -0.7373 | -0.6758 | -0.0178 |
1 | 1 | 484 | 1 | 53 | 0.1166 | 0.3132 | 0.1966 | 0 | -1.9716 | -2.2501 | -0.1285 |
1 | 1 | 64 | 1 | 106 | 0.056 | 0.2486 | 0.1926 | 0 | -1.8333 | -0.7709 | 0.9999 |
1 | 1 | 97 | 1 | 477 | 0.1057 | 0.2516 | 0.1459 | 0 | 1.7131 | -2.3705 | -0.3806 |
1 | 1 | 119 | 1 | 148 | 0.0332 | 0.1760 | 0.1428 | 0 | -1.0690 | 0.0172 | -0.3653 |
1 | 1 | 328 | 1 | 493 | 0.133 | 0.2825 | 0.1495 | 0 | 2.6297 | -1.2025 | -0.4528 |
1 | 1 | 300 | 1 | 121 | 0.1444 | 0.3057 | 0.1613 | 0 | -1.4048 | -2.4167 | -0.6061 |
1 | 1 | 148 | 1 | 120 | 0.0915 | 0.1618 | 0.0703 | 0 | -1.4259 | 0.4853 | 0.8696 |
1 | 1 | 69 | 1 | 21 | 0.1422 | 0.3072 | 0.1650 | 0 | -1.7295 | 2.1032 | 0.6909 |
1 | 1 | 268 | 1 | 205 | 0.0359 | 0.2128 | 0.1769 | 0 | -0.7591 | -1.6176 | -0.9770 |
1 | 1 | 180 | 1 | 19 | 0.1275 | 0.2597 | 0.1322 | 0 | -2.5488 | 0.6408 | -0.7781 |
1 | 1 | 169 | 1 | 135 | 0.0943 | 0.2027 | 0.1084 | 0 | -1.2734 | 0.1981 | 0.7029 |
1 | 1 | 388 | 1 | 291 | 0.0543 | 0.1174 | 0.0632 | 0 | 0.8795 | 0.4824 | -0.0794 |
The plotModelDiagnostics() function can be called on the scoring object as well. Shown below is a comparison of the Mean Absolute Deviation plots for the train data and the test data.
plotModelDiagnostics(hvt.score)
The table below shows the cell(s) containing anomalous test data points. A datapoint is flagged as anomalous when its quantization error is greater than that of its assigned centroid, based on the error metric. A comparison between the scored/test and fitted quantization errors of the cell(s) is provided for further insight.
Number of test data points: 2400 | Number of anomalous data points: 0 | Percentage of anomalous data points: 0.00%
Mean QE for fitted data: 0.075 | Mean QE for test data: 0.0753 | Difference in QE between fitted and test data: 3e-04
plotQuantErrorHistogram(hvt.results,hvt.score)
The anomalous observations are shown below in the datatable.
NOTE: Since no anomalies were found in the torus dataset that we use, the table below will be empty.
QECompareDf <- hvt.score$QECompareDf %>% filter(anomalyFlag == 1)
displayTable(QECompareDf)
Segment.Level | Segment.Parent | Segment.Child | anomalyFlag | n | Fitted.Quant.Error | Scored.Quant.Error | Quant.Error.Diff | Quant.Error.Diff (%) |
---|---|---|---|---|---|---|---|---|
Pricing Segmentation - The package can be used to discover groups of similar customers based on customer spend patterns and to understand the price sensitivity of customers.
Market Segmentation - The package can be helpful in market segmentation, where we have to identify micro and macro segments. The method used in this package can do both kinds of segmentation in one go.
Anomaly Detection - This method can help us categorize system behaviour over time and find anomalies when the system changes, e.g. finding fraudulent claims in healthcare insurance.
The package can help us understand the underlying structure of the data. Suppose we want to analyze a curved surface such as a sphere or a vase; we can approximate it by many small low-order polygons in the form of tessellations using this package.
In biology, Voronoi diagrams are used to model a number of different biological structures, including cells and bone microarchitecture.
Using the base idea of System Dynamics, these diagrams can also be used to depict customer state changes over a period of time.