1. Abstract

The HVT package is a collection of R functions to facilitate building topology preserving maps for rich multivariate data analysis, with a focus on big data, that is, datasets with a large number of rows. The R functions for the typical workflow are organized below:

Data Compression: Vector quantization (VQ) and hierarchical vector quantization (HVQ) using means or medians. This step compresses the rows (long data frame) using a compression objective.
Data Projection: Dimension projection of the compressed cells to 1D, 2D or an interactive surface plot with Sammon’s non-linear algorithm. This step creates topology preserving map (also called embedding) coordinates in the desired output dimension.
Tessellation: Creates the cells required for object visualization using the Voronoi tessellation method. The package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, and is useful for semi-supervised tasks.
Scoring: Scores new datasets and records their assignment using the map objects from the above steps, in a sequence of maps if required.
Temporal Analysis and Visualization: A collection of new functions that extend the HVT package to time series data by analyzing its underlying patterns, calculating transition probabilities and visualizing the flow of data over time.

2. Import Code Modules

The steps below install the most recent version of the HVT package.

###direct installation###
#install.packages("HVT")

#or

###git repo installation###
#library(devtools)
#devtools::install_github(repo = "Mu-Sigma/HVT")

NOTE: At the time of writing this vignette, the latest changes were not yet on CRAN, hence we are sourcing the scripts from the R folder directly into the session environment.

# Sourcing required code scripts for HVT
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) { source(file, echo = FALSE); }))

3. Example : HVT with the Torus dataset

In this section, we will see how to use the package to visualize multidimensional data by projecting it to two dimensions using Sammon’s projection, and then use the resulting map for scoring.

Data Understanding

First, let us see how to generate data for a torus. We use the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. It contains regular or well-known objects, e.g. cube and sphere, and some abstract objects, e.g. Boy’s surface, torus and hyper-torus.

Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
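
To make the definition concrete, the sketch below samples points on a 3D torus in base R using the standard parametrization of a surface of revolution. It is not geozoo’s implementation: the radii R_major = 2 and r_minor = 1 are assumptions chosen to be consistent with the coordinate ranges in the generated data (x and y within about ±3, z within ±1), and sampling the two angles uniformly is also an assumption.

```r
# Minimal sketch: sample n points on a 3D torus (not geozoo's implementation).
# R_major and r_minor are assumed values, not taken from the package.
set.seed(1)
n <- 1000
R_major <- 2   # distance from the axis of revolution to the tube centre (assumed)
r_minor <- 1   # radius of the revolved circle (assumed)

theta <- runif(n, 0, 2 * pi)  # angle around the tube
phi   <- runif(n, 0, 2 * pi)  # angle around the axis of revolution

sketch_torus <- data.frame(
  x = (R_major + r_minor * cos(theta)) * cos(phi),
  y = (R_major + r_minor * cos(theta)) * sin(phi),
  z = r_minor * sin(theta)
)

# Every point satisfies the implicit torus equation
# (sqrt(x^2 + y^2) - R)^2 + z^2 = r^2, so the residual is numerically zero
residual <- with(sketch_torus, (sqrt(x^2 + y^2) - R_major)^2 + z^2 - r_minor^2)
max(abs(residual))
```

The residual check is a quick way to confirm that a set of points lies on the torus surface rather than inside the solid.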

Raw Torus Dataset

The torus dataset includes the following columns:

x: This column represents the X-coordinate of each point in the torus.
y: This column represents the Y-coordinate of each point in the torus.
z: This column represents the Z-coordinate of each point in the torus.

Let’s explore the torus dataset containing 12000 points. For the sake of brevity, we are displaying the first 6 rows.

library(dplyr)  # provides the %>% pipe used below
set.seed(240)
# p is the dimension of the object, n is the number of points
torus <- geozoo::torus(p = 3, n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x", "y", "z")
torus_df <- torus_df %>% round(4)
# Table() is a display helper assumed to be defined earlier in the vignette
Table(head(torus_df), scroll = FALSE)

x	y	z
-2.6282	0.5656	-0.7253
-1.4179	-0.8903	0.9455
-1.0308	1.1066	-0.8731
1.8847	0.1895	0.9944
-1.9506	-2.2507	0.2071
-1.4824	0.9229	0.9672

Now let’s have a look at the structure of the torus dataset.

str(torus_df)

## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -2.63 -1.42 -1.03 1.88 -1.95 ...
##  $ y: num  0.566 -0.89 1.107 0.19 -2.251 ...
##  $ z: num  -0.725 0.946 -0.873 0.994 0.207 ...

Data distribution

This section displays four objects.

Variable Histograms: The histogram distribution of all the features in the dataset.

Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.

Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.

Summary EDA: The table provides descriptive statistics for all the features in the dataset.

variable: The features/columns of the dataset
min: Minimum value of that feature/column
1st Quartile: The value that splits the lower 25% of the data when arranged in ascending order
median: Middle value in the ascendingly ordered dataset
mean: Sum of all values in the dataset divided by the total number of values
sd: Measure of the dispersion of dataset relative to its mean.
3rd Quartile: The value that splits the lower 75% of the data when arranged in ascending order
max: Maximum value of that feature/column
hist: The basic barchart of the data distribution of a feature/column
n_row: Number of rows for that feature/column
n_missing: Number of missing values/NAs for that feature/column

The package function edaPlots displays the four objects mentioned above.

edaPlots(torus_df, output_type = "summary", n_cols = 3)

edaPlots(torus_df, output_type = "histogram", n_cols = 3)

edaPlots(torus_df, output_type = "boxplot", n_cols = 3)

edaPlots(torus_df, output_type = "correlation", n_cols = 3)
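
As a rough cross-check of the Summary EDA table and the correlation matrix, the same statistics can be computed in base R. The sketch below is not how edaPlots is implemented; toy_df and summarise_column are hypothetical names introduced only for this illustration.

```r
# Minimal sketch: per-column descriptive statistics in base R.
# `toy_df` is a hypothetical stand-in for torus_df.
toy_df <- data.frame(x = c(1, 2, 3, 4, NA), y = c(10, 20, 30, 40, 50))

summarise_column <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(min = min(v, na.rm = TRUE),
    q1 = q[1],
    median = q[2],
    mean = mean(v, na.rm = TRUE),
    sd = sd(v, na.rm = TRUE),
    q3 = q[3],
    max = max(v, na.rm = TRUE),
    n_row = length(v),
    n_missing = sum(is.na(v)))
}

# One row per variable, one column per statistic
eda_summary <- t(sapply(toy_df, summarise_column))
eda_summary

# Pearson correlation between the numeric columns, ignoring incomplete rows
toy_cor <- cor(toy_df, use = "complete.obs", method = "pearson")
toy_cor
```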

Train - Test Split

Let us split the torus dataset into train and test. We will randomly select 80% of the torus dataset as train and remaining as test.

smp_size <- floor(0.80 * nrow(torus_df))
set.seed(279)
train_ind <- sample(seq_len(nrow(torus_df)), size = smp_size)
torus_train <- torus_df[train_ind, ]
torus_test <- torus_df[-train_ind, ]
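
With 12000 rows, floor(0.80 * 12000) gives 9600 training rows and leaves 2400 test rows. The sketch below repeats the same index-based split on a small hypothetical data frame (toy_df) and checks that the two parts are disjoint and together cover every row.

```r
# Minimal sketch: 80/20 split by row index on a toy data frame.
# `toy_df` is a hypothetical stand-in for torus_df.
set.seed(279)
toy_df <- data.frame(id = 1:50, value = rnorm(50))

n_train <- floor(0.80 * nrow(toy_df))
train_rows <- sample(seq_len(nrow(toy_df)), size = n_train)
toy_train <- toy_df[train_rows, ]
toy_test <- toy_df[-train_rows, ]   # negative index keeps the remaining rows

# The two parts do not share rows and together cover the full dataset
nrow(toy_train)
nrow(toy_test)
length(intersect(toy_train$id, toy_test$id))
```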

Training Dataset

Now, let’s have a look at the selected training dataset (containing 9600 data points). For the sake of brevity, we are displaying the first six rows.

rownames(torus_train) <- NULL
Table(head(torus_train), scroll= FALSE)

x	y	z
1.7958	-0.4204	-0.9878
0.7115	-2.3528	-0.8889
1.9285	1.2034	0.9620
1.0175	0.0344	-0.1894
-0.2736	1.1298	-0.5464
1.8976	2.2391	0.3545

Now let’s have a look at the structure of the training dataset.

str(torus_train)

## 'data.frame':    9600 obs. of  3 variables:
##  $ x: num  1.796 0.712 1.929 1.018 -0.274 ...
##  $ y: num  -0.4204 -2.3528 1.2034 0.0344 1.1298 ...
##  $ z: num  -0.988 -0.889 0.962 -0.189 -0.546 ...

Data Distribution

edaPlots(torus_train, output_type = "summary", n_cols = 3)

edaPlots(torus_train,output_type = "histogram", n_cols = 3)

edaPlots(torus_train, output_type = "boxplot", n_cols = 3)

edaPlots(torus_train, output_type = "correlation", n_cols = 3)

Testing Dataset

Now, let’s have a look at the testing dataset (containing 2400 data points). For the sake of brevity, we are displaying the first six rows.

rownames(torus_test) <- NULL
Table(head(torus_test), scroll = FALSE)

x	y	z
-2.6282	0.5656	-0.7253
2.7471	-0.9987	-0.3848
-2.4446	-1.6528	0.3097
-2.6487	-0.5745	0.7040
-0.2676	-1.0800	-0.4611
-1.1130	-0.6516	-0.7040

Now let’s have a look at the structure of the testing dataset.

str(torus_test)

## 'data.frame':    2400 obs. of  3 variables:
##  $ x: num  -2.628 2.747 -2.445 -2.649 -0.268 ...
##  $ y: num  0.566 -0.999 -1.653 -0.575 -1.08 ...
##  $ z: num  -0.725 -0.385 0.31 0.704 -0.461 ...

Data Distribution

edaPlots(torus_test, output_type = "summary", n_cols = 3)

edaPlots(torus_test,output_type = "histogram", n_cols = 3)

edaPlots(torus_test, output_type = "boxplot", n_cols = 3)

edaPlots(torus_test, output_type = "correlation", n_cols = 3)

4. Map A : Base Compressed Map

Let us try to visualize the compressed Map A from the diagram below.

Figure 1: Data Segregation with highlighted bounding box in red around compressed map A

This package can perform vector quantization using the following algorithms -

Hierarchical Vector Quantization using k−means
Hierarchical Vector Quantization using k−medoids

For more information on vector quantization, refer to the following link.

The trainHVT function constructs highly compressed hierarchical Voronoi tessellations. The raw data is first scaled, and the scaled data is supplied as input to the vector quantization algorithm. The algorithm compresses the dataset until a user-defined compression percentage is achieved, using the quantization error as a threshold that determines the compression percentage. In other words, when the user-defined compression percentage is reached with ‘n’ cells, at least that percentage of the cells have a quantization error below the threshold.
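
The thresholding idea can be sketched for a single level with stats::kmeans. This is not the trainHVT implementation: the simulated data, the number of cells and the threshold value are illustrative assumptions, and the error metric shown is the maximum Euclidean distance within each cell.

```r
# Minimal sketch of single-level vector quantization with stats::kmeans.
# This is not trainHVT; data, n_cells and threshold are illustrative assumptions.
set.seed(240)
toy_data <- matrix(rnorm(2000 * 3), ncol = 3,
                   dimnames = list(NULL, c("x", "y", "z")))
toy_scaled <- scale(toy_data)          # z-score each column before quantizing

n_cells <- 20
fit <- kmeans(toy_scaled, centers = n_cells, iter.max = 50, nstart = 3)

# Euclidean distance of every point to the centroid of its own cell
dist_to_centroid <- sqrt(rowSums((toy_scaled - fit$centers[fit$cluster, ])^2))

# Quantization error per cell, here the maximum distance within the cell
cell_qe <- tapply(dist_to_centroid, fit$cluster, max)

# Share of cells whose error is below the threshold
threshold <- 2.5
compression_perc <- mean(cell_qe < threshold)
compression_perc
```

If the share falls short of the target, increasing n_cells or relaxing the threshold and rerunning is the natural next step, which mirrors the tuning advice given for trainHVT.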

Let’s try to comprehend the trainHVT first before moving ahead.

trainHVT(
  data,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  normalize,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  dim_reduction_method = c("sammon", "tsne", "umap"),
  scale_summary = NA,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8,
  projection.scale,
  tsne_perplexity,tsne_theta,tsne_verbose,
  tsne_eta,tsne_max_iter,
  umap_n_neighbors,umap_min_dist
)

Each of the parameters of the trainHVT function is explained below:

data - A dataframe, with numeric columns (features) that will be used for training the model.
min_compression_perc - An integer, indicating the minimum compression percentage to be achieved for the dataset. It indicates the desired level of reduction in dataset size compared to its original size.
n_cells - An integer, indicating the number of cells per hierarchy (level). This parameter determines the granularity or level of detail in the hierarchical vector quantization.
depth - An integer, indicating the number of levels. A depth of 1 means no hierarchy (single level), while higher values indicate multiple levels (hierarchy).
quant.err - A number indicating the quantization error threshold. A cell will only breakdown into further cells if the quantization error of the cell is above the defined quantization error threshold.
normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, the values of all features are scaled to have a mean of 0 and a standard deviation of 1 (Z-score).
distance_metric - The distance metric can be L1_Norm(Manhattan) or L2_Norm(Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n dimensional point and centroid.
error_metric - The error metric can be mean or max. max is selected by default. max will return the max of m values and mean will take mean of m values where each value is a distance between a point and centroid of the cell.
quant_method - The quantization method can be kmeans or kmedoids. Kmeans uses means (centroids) as cluster centers while Kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.
projection.scale - A number indicating the scale factor for the tessellations to visualize the sub-tessellations well enough. It helps in adjusting the visual representation of the hierarchy to make the sub-tessellations more visible. Default is 10.
dim_reduction_method - The dimensionality reduction method to be chosen. Options are ‘tsne’, ‘umap’ and ‘sammon’. Default is ‘sammon’.
scale_summary - A list with user defined mean and standard deviation values for all the features in the dataset. Pass the scale summary when normalize is set to FALSE.
diagnose - A logical value indicating whether user wants to perform diagnostics on the model. Default value is FALSE.
hvt_validation - A logical value indicating whether user wants to holdout a validation set and find mean absolute deviation of the validation points from the centroid. Default value is FALSE.
train_validation_split_ratio - A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8.
tsne_verbose - A logical value indicating whether the t-SNE algorithm should print detailed information about its progress to the console.
tsne_perplexity - A numeric value that balances the attention t-SNE gives to local and global aspects of the data. Lower values focus more on local structure, while higher values consider more global structure. It is recommended to be between 5 and 50. Default value is 30.
tsne_theta - A numeric value, the speed/accuracy trade-off parameter for the Barnes-Hut approximation. If set to 0, exact t-SNE is performed, which is slower. If set to greater than 0, an approximation is used, which speeds up the process but may reduce accuracy. Default value is 0.5.
tsne_eta (learning_rate) - A numeric value, the learning rate for t-SNE optimization. It determines the step size during optimization. If too low, the algorithm might get stuck in local minima; if too high, the solution may become unstable. Default value is 200.
tsne_max_iter - An integer, the maximum number of iterations for the optimization process. More iterations can improve results but increase computation time. Default value is 1000.
umap_n_neighbors - An integer, the size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. It controls the balance between local and global structure in the data: smaller values focus on local structure, while larger values capture more global structure. Default value is 15.
umap_min_dist - A numeric value, the minimum distance between points in the embedded space. It controls how tightly UMAP packs points together; lower values result in a more clustered embedding. Default value is 0.1.
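
To illustrate how distance_metric and error_metric interact, the sketch below computes the quantization error of one small cell under all four combinations. The points in cell_points are hypothetical and the code is an illustration of the definitions above, not the package’s internal code.

```r
# Minimal sketch: quantization error of one cell under each metric choice.
# `cell_points` is a hypothetical 2-D cell; this is not the package's code.
cell_points <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2), c(1, 4))
centroid <- colMeans(cell_points)            # centroid of the cell

diffs <- sweep(cell_points, 2, centroid)     # point minus centroid, row by row
l1 <- rowSums(abs(diffs))                    # L1_Norm (Manhattan) distances
l2 <- sqrt(rowSums(diffs^2))                 # L2_Norm (Euclidean) distances

# error_metric = "mean" averages the m distances, "max" takes the largest
quant_error <- c(
  L1_mean = mean(l1), L1_max = max(l1),
  L2_mean = mean(l2), L2_max = max(l2)
)
round(quant_error, 3)
```

For any cell the max error is at least as large as the mean error, so a given quant.err threshold is stricter under error_metric = "max".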

The output of the trainHVT function (a list of 7 elements) is explained below, with an image attached for clarity.

NOTE: The attached image is a snapshot of the output list generated from map A, which can be referred to later in this section.

Figure 2: The Output list generated by trainHVT function.

The ‘1st element’ is a list containing information related to plotting tessellations. This information might include coordinates, boundaries, or other details necessary for visualizing the tessellations.
The ‘2nd element’ is a list containing information related to Sammon’s projection coordinates of the data points in the reduced-dimensional space.
The ‘3rd element’ is a list containing detailed information about the hierarchical vector quantized data, along with a summary section containing the number of points, the quantization error and the centroids for each cell in 2D.
The ‘4th element’ is a list that contains all the diagnostics information of the model when diagnose is set to TRUE. Otherwise NA.
The ‘5th element’ is a list that contains all the information required to generate a Mean Absolute Deviation (MAD) plot, if hvt_validation is set to TRUE. Otherwise NA.
The ‘6th element’ is a list containing detailed information about the hierarchical vector quantized data, along with a summary section containing the number of points, the quantization error and the centroids for each cell, which is the output of hvq.
The ‘7th element’ (model info) is a list that contains the model-generated timestamp, the input parameters passed to the model, the validation results and the dimensionality reduction evaluation metrics table.

We will use the trainHVT function to compress our data while preserving essential features of the dataset. Our goal is to achieve data compression of at least 80%. In situations where the compression ratio does not meet the desired target, we can explore adjusting the model parameters as a potential solution. This involves modifying parameters such as the quantization error threshold or increasing the number of cells, and then rerunning the trainHVT function.

This is already covered in the HVT vignette; please refer to it for more information.

Model Parameters

Number of Cells at each Level = 500
Maximum Depth = 1
Quantization Error Threshold = 0.1
Error Metric = Max
Distance Metric = Euclidean
Dimensionality Reduction method = Sammon

set.seed(240)
torus_mapA <- trainHVT(
  torus_train,
  n_cells = 500,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L2_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon"
)

Let’s check the compression summary for torus.

displayTable(data = torus_mapA[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")

segmentLevel	noOfCells	noOfCellsBelowQuantizationError	percentOfCellsBelowQuantizationErrorThreshold	parameters
1	500	448	0.9	n_cells: 500 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

With the n_cells parameter set to 500, 448 of the 500 cells (89.6%, shown as 0.9 in the table) have a quantization error below the threshold, which meets the 80% target. The next step is to perform data projection on the compressed data: the compressed data is transformed and projected onto a lower-dimensional space so that it can be visualized and analyzed in a more manageable form.

As per the manual, torus_mapA[[3]] gives us detailed information about the hierarchical vector quantized data. torus_mapA[[3]][['summary']] gives a table containing the number of points, the quantization error and the codebook.

The datatable displayed below is the summary from torus_mapA showing Cell.ID, Centroids and Quantization Error for each of the 500 cells. For the sake of brevity, we are displaying only the first 100 rows.

displayTable(data =torus_mapA[[3]][['summary']], columnName= 'Quant.Error', value = 0.1, tableType = "summary", scroll = TRUE)

Segment.Level	Segment.Parent	Segment.Child	n	Cell.ID	Quant.Error	x	y	z
1	1	1	25	133	0.08	-0.92	-0.74	0.57
1	1	2	19	145	0.06	-0.21	-1.17	-0.58
1	1	3	14	174	0.04	-1.06	-0.01	0.33
1	1	4	9	491	0.05	2.16	1.86	0.53
1	1	5	18	199	0.08	-1.67	1.53	-0.96
1	1	6	18	306	0.08	1.73	-1.15	0.99
1	1	7	24	85	0.08	-2.39	0.64	-0.87
1	1	8	15	164	0.04	-0.97	-0.29	0.18
1	1	9	16	458	0.07	1.95	1.33	0.93
1	1	10	22	413	0.1	-0.05	2.68	-0.72
1	1	11	11	495	0.06	1.92	2.18	0.42
1	1	12	13	30	0.05	-1.76	-1.72	0.88
1	1	13	10	317	0.05	-0.69	1.90	1.00
1	1	14	23	27	0.09	-2.50	-0.86	-0.75
1	1	15	17	358	0.07	1.80	-0.42	-0.99
1	1	16	16	209	0.06	-1.44	1.27	-0.99
1	1	17	10	479	0.06	1.35	2.51	0.53
1	1	18	12	295	0.05	-0.05	1.03	0.25
1	1	19	33	203	0.07	0.60	-1.29	0.81
1	1	20	16	465	0.07	2.58	0.64	0.74
1	1	21	13	370	0.05	0.28	1.51	0.88
1	1	22	15	426	0.07	2.37	-0.21	0.92
1	1	23	19	139	0.07	0.38	-1.77	0.98
1	1	24	23	131	0.06	-0.57	-1.01	-0.54
1	1	25	27	242	0.07	0.76	-0.87	0.53
1	1	26	27	330	0.05	0.64	0.78	-0.12
1	1	27	16	178	0.09	-2.20	1.94	-0.32
1	1	28	22	31	0.1	-0.82	-2.52	-0.75
1	1	29	16	163	0.05	-0.97	-0.26	-0.10
1	1	30	19	37	0.11	-0.78	-2.40	0.83
1	1	31	25	175	0.06	-1.18	0.34	-0.64
1	1	32	21	363	0.09	-0.86	2.63	-0.63
1	1	33	19	355	0.06	-0.43	1.97	1.00
1	1	34	22	297	0.08	1.37	-0.78	0.90
1	1	35	24	108	0.11	1.37	-2.65	-0.06
1	1	36	23	249	0.07	0.75	-0.66	0.01
1	1	37	19	219	0.08	-1.87	2.14	-0.53
1	1	38	31	104	0.1	1.13	-2.63	0.48
1	1	39	19	245	0.11	2.03	-1.93	-0.58
1	1	40	27	36	0.08	-2.36	-1.02	0.81
1	1	41	9	300	0.04	0.16	0.99	-0.05
1	1	42	22	357	0.08	1.78	-0.78	0.99
1	1	43	8	485	0.05	1.71	2.15	0.66
1	1	44	16	424	0.08	2.81	-1.01	0.02
1	1	45	16	56	0.08	-2.04	-0.84	0.97
1	1	46	17	142	0.05	-0.88	-0.59	-0.34
1	1	47	17	492	0.07	2.69	1.30	-0.10
1	1	48	24	155	0.06	-0.40	-0.99	0.35
1	1	49	19	172	0.06	0.13	-1.27	0.69
1	1	50	15	21	0.08	-1.63	-2.04	0.78
1	1	51	18	128	0.08	-1.79	0.64	-0.99
1	1	52	20	445	0.08	2.64	0.07	-0.76
1	1	53	37	220	0.08	0.48	-0.89	0.14
1	1	54	17	59	0.07	-1.94	-0.76	-0.99
1	1	55	20	158	0.08	1.25	-2.15	-0.87
1	1	56	14	442	0.09	0.33	2.73	0.65
1	1	57	15	493	0.08	1.67	2.43	-0.31
1	1	58	10	345	0.05	0.59	0.95	0.47
1	1	59	25	273	0.07	1.18	-0.58	-0.72
1	1	60	19	40	0.09	-2.91	0.05	-0.40
1	1	61	25	34	0.12	-0.26	-2.96	0.19
1	1	62	14	283	0.05	-0.40	1.38	-0.83
1	1	63	16	391	0.08	1.38	0.66	0.88
1	1	64	13	341	0.05	0.25	1.23	0.66
1	1	65	22	275	0.09	-1.49	2.32	0.64
1	1	66	16	462	0.07	2.77	0.26	0.61
1	1	67	23	126	0.06	-1.09	-0.51	-0.61
1	1	68	11	226	0.07	-1.90	2.04	0.60
1	1	69	32	375	0.1	2.45	-1.68	0.18
1	1	70	22	188	0.11	1.75	-2.34	-0.36
1	1	71	17	3	0.09	-2.32	-1.86	0.19
1	1	72	20	266	0.09	-1.36	1.94	0.92
1	1	73	15	441	0.08	0.40	2.75	-0.61
1	1	74	16	89	0.06	-0.79	-1.36	-0.90
1	1	75	16	394	0.06	1.68	0.14	0.95
1	1	76	28	168	0.07	-0.17	-1.01	0.20
1	1	77	23	461	0.09	2.93	-0.15	0.32
1	1	78	21	153	0.06	-1.04	-0.25	-0.36
1	1	79	22	314	0.05	1.24	-0.37	0.71
1	1	80	12	166	0.04	0.36	-1.34	-0.79
1	1	81	23	136	0.08	-0.49	-1.14	0.64
1	1	82	26	88	0.11	-2.69	0.88	0.53
1	1	83	11	453	0.07	1.26	2.04	0.91
1	1	84	20	389	0.08	-0.43	2.52	0.82
1	1	85	12	490	0.07	1.63	2.43	0.37
1	1	86	24	84	0.11	0.62	-2.57	0.75
1	1	87	28	321	0.07	1.42	-0.23	-0.82
1	1	88	30	187	0.1	0.74	-1.62	0.97
1	1	89	25	152	0.05	-0.87	-0.49	-0.03
1	1	90	18	331	0.05	0.54	0.85	0.14
1	1	91	25	149	0.08	-1.49	0.47	-0.90
1	1	92	19	214	0.05	-0.87	0.49	0.02
1	1	93	19	235	0.06	-0.90	0.89	0.68
1	1	94	28	208	0.09	0.42	-1.07	0.52
1	1	95	16	487	0.11	1.28	2.68	-0.19
1	1	96	17	15	0.09	-2.50	-1.34	0.53
1	1	97	19	234	0.05	-0.70	0.73	0.16
1	1	98	17	228	0.08	-1.56	1.53	0.98
1	1	99	16	359	0.08	-0.82	2.40	0.83
1	1	100	17	417	0.08	-0.12	2.69	0.71

Now let us understand what each column in the above table means:

Segment.Level - Level of the cell. In this case, we have performed vector quantization for depth 1. Hence Segment.Level is 1.
Segment.Parent - Parent segment of the cell.
Segment.Child (Cell.Number) - The child index of the cell within its parent. In this case, the values run from 1 up to the total number of cells at which we achieved the defined compression percentage.
n - Number of points in each cell.
Cell.ID - Cell IDs are generated for the multivariate data using the 1D Sammon’s projection algorithm.
Quant.Error - Quantization Error for each cell.

All the columns after this will contain centroids for each cell. They can also be called a codebook, which represents a collection of all centroids or codewords.
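
The idea behind Cell.ID can be sketched with MASS::sammon: project the centroids to one dimension and use their order along that axis as the ID. Ranking the 1D coordinate is an assumption made for this illustration and may differ from how HVT assigns Cell.ID internally; toy_centroids is a hypothetical codebook.

```r
# Minimal sketch: order centroids with a 1D Sammon projection (MASS::sammon).
# Ranking the 1D coordinate to obtain IDs is an assumption for illustration,
# not a statement of how HVT assigns Cell.ID internally.
library(MASS)

set.seed(240)
toy_centroids <- matrix(runif(30 * 3, min = -3, max = 3), ncol = 3,
                        dimnames = list(NULL, c("x", "y", "z")))

# k = 1 requests a one-dimensional embedding of the pairwise distances
proj_1d <- sammon(dist(toy_centroids), k = 1, trace = FALSE)$points[, 1]

# Centroids that are close in the original space tend to get nearby IDs
toy_cell_id <- rank(proj_1d, ties.method = "first")
head(cbind(Cell.ID = toy_cell_id, round(toy_centroids, 2)))
```

This explains why Cell.ID and Segment.Child differ in the summary table: the former reflects position along the 1D projection, while the latter is simply the cell number from the quantization step.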

Now let’s try to understand the plotHVT function. Its parameters are explained in detail below:

plotHVT(hvt.results, line.width, color.vec, pch1, centroid.size,
        centroid.color, title, maxDepth, child.level, hmap.cols,
        quant.error.hmap, n_cells.hmap, label.size,
        sepration_width, layer_opacity, cell_id,
        dim_size, plot.type = '2Dhvt')

hvt.results - (1D/2Dproj/2Dhvt/2Dheatmap/surface_plot) A list obtained from the trainHVT function while performing hierarchical vector quantization on training dataset. This list provides an overview of the hierarchical vector quantized data, including diagnostics, tessellation details, Sammon’s projection coordinates, and model input information.
line.width - (2Dhvt/2Dheatmap) A vector indicating the line widths of the tessellation boundaries for each layer.
color.vec - (2Dhvt/2Dheatmap) A vector indicating the colors of the tessellations boundaries at each layer.
pch1 - (2Dhvt/2Dheatmap) The symbol used to plot the centroids in the tessellations, such as a solid circle, bullet, filled square or filled diamond. (default = 21, i.e. filled circle)
centroid.size - (2Dhvt/2Dheatmap) Vector of Size of centroids for each level of tessellations.
centroid.color - (2Dhvt/2Dheatmap) Vector of color of centroids for each level of tessellations.
title - (2Dhvt) Set a title for the plot (default = NULL).
maxDepth - (2Dhvt) An integer indicating the number of levels.
cell_id - (2Dhvt) Logical. To indicate whether the plot should have Cell IDs or not for the level 1. (default = FALSE)
child.level - (2Dheatmap/surface_plot) A Number indicating the level for which the heat map is to be plotted.
hmap.cols - (2Dheatmap/surface_plot) A Number or a Character which is the column number or column name from the dataset indicating the variables for which the heat map is to be plotted.
label.size - (2Dheatmap) The size by which the tessellation labels should be scaled.(default = 0.5)
quant.error.hmap - (2Dheatmap) A number indicating the quantization error threshold.
sepration_width - (surface_plot) An integer indicating the width between two Levels.
layer_opacity - (surface_plot) A vector indicating the opacity of each layer/ level.
dim_size - (surface_plot) An integer indicating the dimension size used to create the matrix for the plot.
plot.type - A Character indicating which type of plot should be generated. Accepted entries are ‘1D’,‘2Dproj’, ‘2Dhvt’,‘2Dheatmap’, ‘surface_plot’. Default value is ‘2Dhvt’.

Let’s plot the Voronoi tessellation for layer 1 (map A).

plotHVT(torus_mapA,
        line.width = c(0.4), 
        color.vec = c("navy blue"),
        centroid.size = 0.01,
        maxDepth = 1,
        plot.type = "2Dhvt")

Figure 3: The Voronoi Tessellation for layer 1 (map A) shown for the 500 cells in the dataset ’torus’

4.1 Heatmaps

Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus dataset for better visualization and interpretation of data patterns and distributions.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). The green shades highlight regions with higher values in each heatmap, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations and relationships between these features within the torus dataset.

plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "x",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
)

Figure 4: The Voronoi Tessellation with the heat map overlaid for variable ’x’ in the ’torus’ dataset

plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "y",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
)

Figure 5: The Voronoi Tessellation with the heat map overlaid for variable ’y’ in the ’torus’ dataset

plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "z",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
)

Figure 6: The Voronoi Tessellation with the heat map overlaid for variable ’z’ in the ’torus’ dataset

5. Map B : Compressed Novelty Map

Let us try to visualize the Map B from the diagram below.

Figure 7: Data Segregation with highlighted bounding box in red around map B

In this section, we will manually identify the novelty cells from the plotted torus_mapA and store them in the identified_Novelty_cells variable.

Note: To select the novelty cells from map A manually, one can make the plot interactive by adding plotly elements to the code. Hovering over the centroid of a cell then displays a tag containing its segment child information, so users can explore the map and choose the novelty cells they wish to consider. An image is attached for reference.

Figure 8: Manually selecting novelty cells

The removeNovelty function removes the identified novelty cell(s) from the training dataset (containing 9600 datapoints) and stores those records separately.

It takes input as the cell number (Segment.Child) of the manually identified novelty cell(s) and the compressed HVT map (torus_mapA) with 500 cells. It returns a list of two items: data with novelty, and data without novelty.

NOTE: As we are using torus dataset here, the identified novelty cells given are for demo purpose.

identified_Novelty_cells <<- c(273,44,61,486,185,425)   # as an example
output_list <- removeNovelty(identified_Novelty_cells, torus_mapA)
data_with_novelty <- output_list[[1]]
data_without_novelty <- output_list[[2]]
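
Conceptually, the split amounts to partitioning the records by whether their cell number is in the flagged set. The sketch below shows that idea on a small hypothetical data frame (toy_scored, toy_novelty_cells); it is an illustration only, since removeNovelty itself operates on the trainHVT output rather than on a plain data frame.

```r
# Minimal sketch of splitting records by novelty cell membership.
# `toy_scored` and `toy_novelty_cells` are hypothetical; removeNovelty itself
# works on the trainHVT output rather than on a plain data frame.
toy_scored <- data.frame(
  Cell.Number = c(1, 1, 2, 3, 3, 3, 4),
  x = c(0.1, 0.2, 1.5, 2.9, 3.0, 2.8, -1.0),
  y = c(0.0, 0.1, 1.4, -0.5, -0.6, -0.7, 2.0)
)
toy_novelty_cells <- c(3, 4)

is_novel <- toy_scored$Cell.Number %in% toy_novelty_cells
toy_with_novelty <- toy_scored[is_novel, ]      # records in the flagged cells
toy_without_novelty <- toy_scored[!is_novel, ]  # everything else

nrow(toy_with_novelty)
nrow(toy_without_novelty)
```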

Let’s have a look at the data with novelty (containing 115 records).

novelty_data <- data_with_novelty
novelty_data$Row.No <- row.names(novelty_data)
novelty_data <- novelty_data %>% dplyr::select("Row.No","Cell.ID","Cell.Number","x","y","z")
colnames(novelty_data) <- c("Row.No","Cell.ID","Segment.Child","x","y","z")
Table(novelty_data,scroll = TRUE)

Row.No	Cell.ID	Segment.Child	x	y	z
1	424	44	2.7839	-1.0776	-0.1712
2	424	44	2.8089	-1.0384	0.1027
3	424	44	2.8404	-0.9040	0.1952
4	424	44	2.7834	-1.0866	0.1544
5	424	44	2.8208	-0.9473	0.2193
6	424	44	2.7804	-1.0582	-0.2226
7	424	44	2.8795	-0.8408	0.0226
8	424	44	2.7738	-1.1262	-0.1121
9	424	44	2.7538	-1.1860	-0.0569
10	424	44	2.8513	-0.9218	-0.0828
11	424	44	2.8754	-0.8550	0.0168
12	424	44	2.8450	-0.8996	0.1792
13	424	44	2.8239	-0.9397	0.2172
14	424	44	2.7871	-1.0527	-0.2026
15	424	44	2.7875	-1.1082	-0.0220
16	424	44	2.7661	-1.1507	0.0905
17	34	61	-0.3149	-2.9384	0.2958
18	34	61	-0.3078	-2.9675	0.1812
19	34	61	-0.1469	-2.9921	0.0927
20	34	61	-0.3766	-2.9762	0.0092
21	34	61	-0.0344	-2.9993	0.0303
22	34	61	-0.2807	-2.9525	0.2592
23	34	61	-0.3967	-2.9725	0.0484
24	34	61	-0.2519	-2.9034	0.4049
25	34	61	-0.3169	-2.9822	0.0443
26	34	61	-0.1057	-2.9757	0.2107
27	34	61	0.0958	-2.9784	0.1994
28	34	61	-0.3598	-2.9046	0.3757
29	34	61	-0.5300	-2.9485	0.0921
30	34	61	-0.2574	-2.9769	0.1544
31	34	61	-0.4312	-2.9677	0.0486
32	34	61	0.0796	-2.9885	0.1440
33	34	61	-0.2803	-2.9049	0.3957
34	34	61	-0.4258	-2.9397	0.2417
35	34	61	-0.3847	-2.9574	0.1871
36	34	61	-0.1814	-2.9475	0.3027
37	34	61	-0.4657	-2.9341	0.2396
38	34	61	-0.2817	-2.9829	0.0871
39	34	61	-0.3100	-2.9449	0.2759
40	34	61	-0.0367	-2.9262	0.3764
41	34	61	-0.0928	-2.9950	0.0848
42	75	185	-2.8203	0.9904	-0.1467
43	75	185	-2.8178	1.0260	0.0499
44	75	185	-2.7501	1.1484	-0.1977
45	75	185	-2.8307	0.8870	-0.2570
46	75	185	-2.9216	0.6631	-0.0905
47	75	185	-2.7794	1.1095	-0.1211
48	75	185	-2.8862	0.7563	-0.1801
49	75	185	-2.7889	1.0811	-0.1333
50	75	185	-2.8045	1.0304	0.1555
51	75	185	-2.8893	0.7432	-0.1815
52	75	185	-2.8085	1.0402	-0.1003
53	75	185	-2.7684	1.1089	-0.1877
54	75	185	-2.8008	1.0713	-0.0508
55	75	185	-2.8734	0.8593	-0.0420
56	75	185	-2.8926	0.7896	0.0560
57	75	185	-2.8014	1.0351	0.1638
58	75	185	-2.8382	0.9661	-0.0614
59	75	185	-2.7733	1.1066	-0.1675
60	75	185	-2.8765	0.8519	-0.0099
61	75	185	-2.9258	0.6607	-0.0332
62	75	185	-2.8318	0.9591	0.1427
63	439	273	2.9450	-0.5316	0.1218
64	439	273	2.9041	-0.7280	0.1098
65	439	273	2.9111	-0.6332	0.2030
66	439	273	2.9095	-0.6207	0.2223
67	439	273	2.8605	-0.7913	0.2510
68	439	273	2.9184	-0.6856	-0.0661
69	439	273	2.8971	-0.7568	0.1061
70	439	273	2.8758	-0.6541	0.3144
71	439	273	2.9496	-0.4882	0.1430
72	439	273	2.9188	-0.6454	0.1457
73	439	273	2.9351	-0.5220	0.1932
74	439	273	2.8530	-0.8358	0.2313
75	439	273	2.8969	-0.5663	0.3069
76	439	273	2.8809	-0.8085	0.1250
77	439	273	2.8340	-0.8588	0.2755
78	460	425	0.5660	2.9195	0.2270
79	460	425	0.4825	2.9331	-0.2327
80	460	425	0.2922	2.9667	0.1938
81	460	425	0.7219	2.8642	0.3005
82	460	425	0.5100	2.9548	0.0551
83	460	425	0.5103	2.9319	0.2180
84	460	425	0.6264	2.9337	-0.0202
85	460	425	0.4241	2.9696	-0.0208
86	460	425	0.4568	2.9565	-0.1292
87	460	425	0.4127	2.9640	0.1212
88	460	425	0.2388	2.9833	0.1195
89	460	425	0.4408	2.9674	0.0030
90	460	425	0.5544	2.9221	0.2254
91	460	425	0.3024	2.9847	0.0031
92	460	425	0.3711	2.9462	0.2453
93	460	425	0.4730	2.9532	0.1347
94	19	486	-0.9027	-2.8262	0.2552
95	19	486	-0.7470	-2.9053	0.0186
96	19	486	-0.9246	-2.8381	0.1728
97	19	486	-0.9065	-2.8593	0.0313
98	19	486	-0.7323	-2.9085	-0.0371
99	19	486	-1.0349	-2.7844	0.2410
100	19	486	-1.1207	-2.7825	0.0230
101	19	486	-1.0549	-2.7973	0.1442
102	19	486	-0.8786	-2.8665	-0.0609
103	19	486	-0.9398	-2.7706	0.3783
104	19	486	-0.8161	-2.8680	0.1897
105	19	486	-1.0239	-2.8185	-0.0510
106	19	486	-0.9253	-2.7881	0.3475
107	19	486	-0.9820	-2.8178	0.1782
108	19	486	-0.8810	-2.8624	0.1005
109	19	486	-0.7873	-2.8533	0.2804
110	19	486	-1.0393	-2.7889	0.2167
111	19	486	-0.5913	-2.9309	0.1414
112	19	486	-0.9948	-2.8299	0.0252
113	19	486	-0.7686	-2.8947	-0.1001
114	19	486	-0.9815	-2.8025	0.2455
115	19	486	-0.7111	-2.8678	0.2977

5.1 Voronoi Tessellation with highlighted novelty cell

The plotNovelCells function plots the Voronoi tessellation from the compressed HVT map (torus_mapA), which contains 500 cells, and highlights the identified novelty cells (6 cells containing 115 records) in red on the map.

plotNovelCells(identified_Novelty_cells, torus_mapA,line.width = c(0.4),centroid.size = 0.01)

Figure 9: The Voronoi Tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red

We pass the data frame of novelty records (115 records) to the trainHVT function, along with the model parameters listed below, to generate map B (layer 2).

Model Parameters

Number of Cells at each Level = 11
Maximum Depth = 1
Quantization Error Threshold = 0.1
Error Metric = Max
Distance Metric = Euclidean
Dimensionality Reduction method = Sammon

colnames(data_with_novelty) <- c("Cell.ID","Segment.Child","x","y","z")
data_with_novelty <- data_with_novelty[,-1:-2]
mapA_scale_summary = torus_mapA[[3]]$scale_summary
torus_mapB <- trainHVT(data_with_novelty,
                  n_cells = 11,   
                  depth = 1,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L2_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  dim_reduction_method = "sammon")

The datatable displayed below is the summary from map B (layer 2) showing Cell.ID, Centroids and Quantization Error for each of the 11 cells.

displayTable(data =torus_mapB[[3]][['summary']], columnName= 'Quant.Error', value = 0.1, tableType = "summary")

Segment.Level	Segment.Parent	Segment.Child	n	Cell.ID	Quant.Error	x	y	z
1	1	1	6	6	0.05	-0.03	-2.99	0.13
1	1	2	7	2	0.07	-0.94	-2.84	0.00
1	1	3	9	10	0.1	0.46	2.94	0.20
1	1	4	7	11	0.06	0.46	2.96	-0.05
1	1	5	15	8	0.08	2.90	-0.68	0.18
1	1	6	21	9	0.1	-2.83	0.95	-0.07
1	1	7	11	5	0.05	-0.38	-2.96	0.12
1	1	8	16	7	0.08	2.81	-1.01	0.02
1	1	9	8	4	0.07	-0.25	-2.93	0.34
1	1	10	9	1	0.05	-0.98	-2.80	0.24
1	1	11	6	3	0.06	-0.73	-2.89	0.15

Now let’s check the compression summary for HVT (torus_mapB). For each level, the table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.

displayTable(data = torus_mapB[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")

segmentLevel	noOfCells	noOfCellsBelowQuantizationError	percentOfCellsBelowQuantizationErrorThreshold	parameters
1	11	10	0.91	n_cells: 11 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 91% of the cells meet the quantization error threshold. Since this exceeds the desired compression percentage of 80%, we will not subdivide the cells further.
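The percentage reported in the compression summary is the share of cells whose quantization error falls within the threshold. A minimal base R sketch of that calculation is given here; the data frame `cell_summary` and its values are hypothetical (not taken from torus_mapB), and treating a cell exactly at the threshold as passing (`<=`) is an assumption rather than confirmed package behaviour.

```r
# Hypothetical per-cell summary; values are illustrative only
cell_summary <- data.frame(
  Cell.ID = 1:5,
  Quant.Error = c(0.04, 0.07, 0.12, 0.09, 0.08)
)
quant_err_threshold <- 0.1

# Assumption: a cell meets the threshold when its error is at or below it
below <- cell_summary$Quant.Error <= quant_err_threshold
n_below <- sum(below)
share_below <- n_below / nrow(cell_summary)

# Subdivide further only if fewer than 80% of the cells meet the threshold
needs_more_depth <- share_below < 0.8
```

With these illustrative values, four of the five cells pass, so the 80% target is just met and no further subdivision would be needed.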

6. Map C : Compressed Map without Novelty

Let us try to visualize the compressed Map C from the diagram below.

Figure 10: Data Segregation with highlighted bounding box in red around compressed map C

6.1 Iteration 1

With the novelties removed, we construct another hierarchical Voronoi tessellation, map C (layer 2), on the data without novelty (containing 9485 records) using the model parameters listed below.

Model Parameters

Number of Cells at Level 1 = 10
Number of Cells at Level 2 = 100 (10 x 10)
Maximum Depth = 2
Quantization Error Threshold = 0.1
Error Metric = Max
Distance Metric = Euclidean
Dimensionality Reduction method = Sammon

torus_mapC <- trainHVT(dataset  = data_without_novelty,
                  n_cells = 10,
                  depth = 2,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L2_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  dim_reduction_method = "sammon")

Now let’s check the compression summary for HVT (torus_mapC), where n_cells was set to 10. For each level, the table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.

displayTable(data = torus_mapC[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")

segmentLevel	noOfCells	noOfCellsBelowQuantizationError	percentOfCellsBelowQuantizationErrorThreshold	parameters
1	10	0	0	n_cells: 10 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans
2	100	0	0	n_cells: 10 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 0% of the cells meet the quantization error threshold at level 1, and 0% of the cells meet it at level 2.
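The level-wise cell counts listed in the model parameters follow directly from n_cells and depth: each cell at one level is split into n_cells children at the next, giving 10 x 10 = 100 cells at level 2 here. The helper below, `max_cells_per_level`, is a hypothetical name introduced only to make that arithmetic explicit when choosing a larger n_cells.

```r
# Number of cells at each level when every cell is split into n_cells children
max_cells_per_level <- function(n_cells, depth) {
  n_cells^seq_len(depth)
}

max_cells_per_level(10, 2)  # 10 100
max_cells_per_level(46, 2)  # 46 2116
```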

6.2 Iteration 2

Since we have yet to achieve at least 80% compression at depth 2, let’s try to compress again using the model parameters listed below on the data without novelty (containing 9485 records).

Model Parameters

Number of Cells at Level 1 = 46
Number of Cells at Level 2 = 2116 (46 x 46)
Maximum Depth = 2
Quantization Error Threshold = 0.1
Error Metric = Max
Distance Metric = Euclidean
Dimensionality Reduction method = Sammon

torus_mapC <- trainHVT(data_without_novelty,
                  n_cells = 46,    
                  depth = 2,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L2_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  dim_reduction_method = "sammon")

The datatable displayed below is the summary from map C (layer 2), showing Cell.ID, Centroids and Quantization Error.

displayTable(data =torus_mapC[[3]][['summary']], columnName= 'Quant.Error', value = 0.1, tableType = "summary", scroll = T)

Segment.Level	Segment.Parent	Segment.Child	n	Cell.ID	Quant.Error	x	y	z
1	1	1	183	567	0.31	-1.57	2.40	0.00
1	1	2	236	355	0.31	-1.52	-1.28	-0.92
1	1	3	162	88	0.29	-1.79	-2.27	0.01
1	1	4	167	1865	0.29	2.51	-1.41	-0.20
1	1	5	183	874	0.3	0.46	-2.67	-0.51
1	1	6	251	1120	0.23	-0.16	1.00	-0.03
1	1	7	194	1576	0.25	1.39	-0.01	0.76
1	1	8	196	1208	0.32	-0.43	2.51	-0.70
1	1	9	189	2042	0.3	1.72	2.23	0.35
1	1	10	273	609	0.28	-1.19	-0.19	0.60
1	1	11	248	1320	0.28	0.24	1.46	0.81
1	1	12	257	1537	0.24	1.26	0.09	-0.63
1	1	13	187	602	0.3	-0.13	-2.43	0.79
1	1	14	207	331	0.32	-2.30	1.38	-0.56
1	1	15	154	2118	0.29	2.66	1.18	0.04
1	1	16	288	1465	0.28	0.59	1.27	-0.77
1	1	17	148	2003	0.29	2.76	-0.16	0.49
1	1	18	269	886	0.29	-0.92	1.40	-0.89
1	1	19	170	153	0.32	-2.53	-0.57	-0.71
1	1	20	243	1251	0.23	0.83	-0.67	0.34
1	1	21	189	1908	0.31	1.86	1.25	-0.88
1	1	22	265	880	0.26	-0.18	-1.08	0.39
1	1	23	265	467	0.31	-1.64	0.11	-0.89
1	1	24	258	807	0.23	-0.92	0.43	-0.16
1	1	25	184	1657	0.28	1.92	-1.17	0.88
1	1	26	151	352	0.32	-0.83	-2.43	-0.71
1	1	27	264	695	0.24	-0.83	-0.63	-0.25
1	1	28	166	1387	0.29	1.45	-2.41	0.40
1	1	29	288	1404	0.23	0.80	0.63	0.12
1	1	30	177	1852	0.28	0.94	2.46	-0.64
1	1	31	217	1101	0.28	0.70	-1.62	0.92
1	1	32	177	1362	0.27	1.38	-1.93	-0.83
1	1	33	182	287	0.27	-2.08	-0.47	0.93
1	1	34	238	815	0.28	-0.18	-1.40	-0.76
1	1	35	172	1983	0.28	2.55	0.13	-0.72
1	1	36	151	64	0.29	-2.46	-1.40	0.32
1	1	37	205	968	0.29	-0.83	2.08	0.89
1	1	38	238	1184	0.23	0.74	-0.96	-0.58
1	1	39	221	436	0.29	-1.18	-1.51	0.93
1	1	40	172	1607	0.32	0.24	2.65	0.62
1	1	41	143	125	0.31	-2.86	0.04	0.20
1	1	42	138	1891	0.24	2.15	0.58	0.92
1	1	43	229	412	0.3	-2.13	1.14	0.81
1	1	44	170	1646	0.24	1.81	-0.83	-0.94
1	1	45	190	1753	0.29	1.42	1.28	0.93
1	1	46	230	814	0.25	-1.06	0.93	0.78
2	1	1	3	845	0.03	-1.16	2.63	0.48
2	1	2	4	365	0.07	-2.10	2.11	0.23
2	1	3	3	409	0.09	-2.01	2.11	0.41
2	1	4	5	542	0.08	-1.60	2.41	-0.45
2	1	5	3	585	0.06	-1.52	2.14	-0.78
2	1	6	3	497	0.09	-1.76	2.38	0.26
2	1	7	2	604	0.03	-1.48	2.59	-0.20
2	1	8	5	434	0.13	-1.90	2.26	-0.31
2	1	9	2	766	0.02	-1.28	2.31	-0.77
2	1	10	4	957	0.1	-0.96	2.76	0.37
2	1	11	7	665	0.1	-1.42	2.60	0.27
2	1	12	4	878	0.07	-1.11	2.77	0.16
2	1	13	6	505	0.13	-1.70	2.45	-0.18
2	1	14	2	836	0.05	-1.16	2.75	-0.18
2	1	15	5	498	0.12	-1.74	2.06	-0.71
2	1	16	2	343	0.03	-2.20	1.95	0.33
2	1	17	8	790	0.14	-1.28	2.64	0.35
2	1	18	7	593	0.1	-1.51	2.29	-0.67
2	1	19	3	784	0.09	-1.27	2.71	0.10
2	1	20	5	644	0.05	-1.47	2.47	0.48
2	1	21	4	648	0.13	-1.43	2.55	-0.39
2	1	22	6	433	0.09	-1.91	2.11	-0.53
2	1	23	3	367	0.05	-2.08	2.15	0.13
2	1	24	6	479	0.12	-1.77	2.26	-0.49
2	1	25	3	341	0.08	-2.17	2.06	0.05
2	1	26	5	1004	0.08	-0.87	2.86	0.12
2	1	27	4	442	0.13	-1.89	2.32	0.09
2	1	28	3	627	0.06	-1.45	2.63	0.02
2	1	29	5	440	0.1	-1.93	2.22	0.34
2	1	30	4	405	0.07	-1.99	2.24	-0.08
2	1	31	3	420	0.07	-2.00	2.00	0.56
2	1	32	4	753	0.07	-1.33	2.68	-0.11
2	1	33	4	809	0.05	-1.20	2.58	-0.53
2	1	34	4	916	0.12	-1.03	2.81	-0.13
2	1	35	6	735	0.11	-1.34	2.48	-0.57
2	1	36	4	601	0.1	-1.53	2.35	0.60
2	1	37	5	910	0.1	-1.05	2.75	-0.31
2	1	38	7	553	0.13	-1.61	2.42	0.41
2	1	39	2	336	0.06	-2.18	2.05	-0.14
2	1	40	4	493	0.12	-1.82	2.06	0.66
2	1	41	2	379	0.04	-2.11	1.97	0.45
2	1	42	2	587	0.05	-1.52	2.59	0.01
2	1	43	3	555	0.09	-1.63	2.20	0.67
2	1	44	2	494	0.04	-1.80	2.22	0.51
2	1	45	3	615	0.03	-1.51	2.28	0.68
2	1	46	2	334	0.02	-2.22	2.00	0.17
2	2	1	6	297	0.07	-1.85	-0.84	-1.00
2	2	2	6	351	0.07	-1.33	-1.54	-1.00
2	2	3	4	386	0.07	-1.50	-1.04	-0.98
2	2	4	6	448	0.09	-1.35	-0.95	-0.94
2	2	5	7	346	0.11	-1.67	-0.90	-0.99
2	2	6	7	475	0.11	-0.93	-1.57	-0.98
2	2	7	5	178	0.07	-2.14	-1.22	-0.89
2	2	8	5	575	0.07	-0.87	-1.14	-0.83

Now let’s check the compression summary for HVT (torus_mapC). For each level, the table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.

displayTable(data = torus_mapC[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")

segmentLevel	noOfCells	noOfCellsBelowQuantizationError	percentOfCellsBelowQuantizationErrorThreshold	parameters
1	46	0	0	n_cells: 46 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans
2	2116	1748	0.83	n_cells: 46 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 0% of the cells meet the quantization error threshold at level 1, and 83% of the cells meet it at level 2.

Let’s plot the Voronoi tessellation for layer 2 (map C)

plotHVT(torus_mapC,
        line.width = c(0.2,0.1), 
        color.vec = c("navyblue","steelblue"),
        centroid.size = 0.1,
        maxDepth = 2, 
        plot.type = '2Dhvt')

Figure 11: The Voronoi Tessellation for layer 2 (map C) shown for the 928 cells in the dataset ’torus’ at level 2

6.3 Heatmaps

Now let’s plot all the features for each cell at level two as a heatmap for better visualization.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each feature (x, y, z). The green shades highlight regions with higher values in each heatmap, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations and relationships between these features within the torus dataset.

  plotHVT(
  torus_mapC,
  child.level = 2,
  hmap.cols = "x",
  line.width = c(0.2,0.1),
  color.vec = c("navyblue","steelblue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
)

Figure 12: The Voronoi Tessellation with the heat map overlaid for feature x in the ’torus’ dataset

  plotHVT(
  torus_mapC,
  child.level = 2,
  hmap.cols = "y",
  line.width = c(0.2,0.1),
  color.vec = c("navyblue","steelblue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
)

Figure 13: The Voronoi Tessellation with the heat map overlaid for feature y in the ’torus’ dataset

  plotHVT(
  torus_mapC,
  child.level = 2,
  hmap.cols = "z",
  line.width = c(0.2,0.1),
  color.vec = c("navyblue","steelblue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
)

Figure 14: The Voronoi Tessellation with the heat map overlaid for feature z in the ’torus’ dataset

We now have the set of maps (map A, map B and map C) that will be used to determine which map and cell each test record is assigned to.

7. Scoring

Now that the model is built, let us score the testing dataset (containing 2400 data points) to find which cell and which layer each point belongs to.

The scoreLayeredHVT function scores the testing dataset against the set of maps built above. It takes as input a testing dataset and a set of maps (map A, map B, map C).

Now, let us understand the scoreLayeredHVT function.

scoreLayeredHVT(data,
                hvt_mapA,
                hvt_mapB,
                hvt_mapC,
                child.level = 1,
                mad.threshold = 0.2,
                normalize = TRUE,
                distance_metric="L1_Norm",
                error_metric="max",
                yVar)

The scoreLayeredHVT function uses the scoreHVT function to score the test data against each result of trainHVT, referred to here as a ‘map’. It scores the test dataset against maps A, B and C, then processes and merges the results into the final output. The arguments passed on to scoreHVT therefore matter for smooth execution of the function.

Each of the parameters of the scoreLayeredHVT function is explained below:

data - A dataframe containing the test dataset. The dataframe should have all the variables (features) used for training.
hvt_mapA - Result obtained from the trainHVT function while performing hierarchical vector quantization on the training data. This list contains information about the hierarchical vector quantized data along with a summary section.
hvt_mapB - Result obtained from the trainHVT function while performing hierarchical vector quantization on the data with novelty. This data is a subset of the training data obtained as a result of the removeNovelty function (1st element).
hvt_mapC - Result obtained from the trainHVT function while performing hierarchical vector quantization on the data without novelty. This data is a subset of the training data obtained as a result of the removeNovelty function (2nd element).
child.level - A number indicating the depth for which the heat map is to be plotted. Each depth represents a different level of clustering or partitioning of the data.
mad.threshold - A numeric value indicating the permissible Mean Absolute Deviation, which is obtained from the Minimum Intra centroid plot (when diagnose is set to TRUE in trainHVT). The mad.threshold value is important since it is used in anomaly detection. Default value is 0.2. NOTE: for a given datapoint, when the quantization error is above mad.threshold it is flagged as an anomaly; otherwise it is not.
normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, the data (testing dataset) is standardized by the mean and sd of the training dataset referred from trainHVT(). When set to FALSE, the data is used as such without any changes.
distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). The metric is used when calculating the distance between each datapoint (in the test dataset) and the centroids obtained from the results of trainHVT. Default is L1_Norm.
error_metric - The error metric can be mean or max. max will return the max of m values and mean will take the mean of m values, where each value is a distance between the datapoint and the centroid of the cell. This helps in calculating the scored quantization error. Default value is max.
yVar - A character or a vector representing the name of the dependent variable(s).

When normalize is set to TRUE, the scoreHVT function has an inbuilt feature to standardize the testing dataset based on the mean and standard deviation of the training dataset from the trainHVT results.

Steps involved in Normalizing the data:
From the trainHVT results (here torus_mapA, torus_mapB and torus_mapC), the summary is accessed, which is the third list object.
From the summary list, the scale_summary object is taken, which has two entries that store the mean_data and std_data of all the columns in the training dataset.
Those two entries are passed as the center and scale parameters of scale(), which normalizes the testing dataset in the same way as the training dataset in scoreHVT().
Here, in the scoreLayeredHVT function, the testing dataset is scored against all the maps (A, B and C) using the scoreHVT function, and the results are merged and further processed.
This approach ensures that the model is evaluated on a testing dataset scaled in the same way as the training dataset, maintaining consistency and improving the model’s ability to generalize to new, unseen data.
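The normalization steps can be illustrated with a small base R sketch. The data frames `train` and `test` are hypothetical; `mean_data` and `std_data` are named after the entries described above but are computed here directly rather than read from a trainHVT result, so this shows only the idea of reusing training statistics, not the package internals.

```r
# Hypothetical training and test data with the same columns
train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
test  <- data.frame(x = c(2.5, 5), y = c(25, 0))

# Column means and standard deviations of the training data
mean_data <- colMeans(train)
std_data  <- apply(train, 2, sd)

# Standardize the test data with the training statistics, not its own
test_scaled <- scale(test, center = mean_data, scale = std_data)
test_scaled
```

The first test row equals the training means, so it maps to zero in both columns. Scaling the test set with its own mean and standard deviation instead would place its records on a different scale from the map centroids.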

The function scores records based on the HVT maps (map A, map B and map C) constructed using the trainHVT function. For each test record, the function assigns that record to Layer1 or Layer2. Layer1 contains the cell ids from map A, and Layer2 contains the cell ids from map B (novelty map) and map C (map without novelty).

Scoring Algorithm

The Scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the test dataset:

Calculate the distance between the point and the centroid of all the cells in the first level.
Find the cell whose centroid has minimum distance to the point.
Check if the cell drills down further to form more cells.
If it doesn’t, return the path; otherwise, repeat steps 1 to 4 until we reach a level at which the cell doesn’t drill down further.
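The four steps above can be sketched as a short recursive function in base R. This is an illustration under stated assumptions, not the package's implementation: the nested-list map structure (`centroids`, `children`), the function name `score_point`, and the use of Euclidean distance are all hypothetical choices made for this example.

```r
# Hypothetical map structure: each node holds a matrix of centroids (one row
# per cell) and a list of child nodes (NULL where a cell is not subdivided)
score_point <- function(point, node, path = integer(0)) {
  # Step 1: Euclidean distance from the point to every centroid at this level
  dists <- sqrt(rowSums(sweep(node$centroids, 2, point)^2))
  # Step 2: cell whose centroid is nearest
  nearest <- which.min(dists)
  path <- c(path, nearest)
  # Steps 3 and 4: descend if that cell has children, otherwise return the path
  child <- node$children[[nearest]]
  if (is.null(child)) return(path)
  score_point(point, child, path)
}

# Two cells at level 1; only the second is split into two cells at level 2
leaf <- list(
  centroids = rbind(c(0.9, 0.9), c(1.1, 1.1)),
  children  = vector("list", 2)
)
root <- list(
  centroids = rbind(c(0, 0), c(1, 1)),
  children  = list(NULL, leaf)
)

score_point(c(1.2, 1.0), root)   # path through level 1, then level 2
score_point(c(0.1, -0.1), root)  # stops at level 1
```

The returned vector is the path of cell indices, one per level visited, which is why a point landing in an unsplit cell yields a shorter path.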

Note : The Scoring algorithm will not work if some of the variables used to perform quantization are missing. In the testing dataset, we should not remove any features.

validation_data <- torus_test
new_score <- scoreLayeredHVT(
    data=validation_data,
    hvt_mapA = torus_mapA,
    hvt_mapB = torus_mapB,
    hvt_mapC = torus_mapC,
    normalize = FALSE
  )

Let’s see which cell and layer each point belongs to and check the Mean Absolute Difference for each of the 2400 records. For the sake of brevity, we are only displaying the first 100 rows.

act_pred <- new_score[["actual_predictedTable"]]
rownames(act_pred) <- NULL
act_pred %>% head(100) %>% as.data.frame() %>% Table(scroll = TRUE)

Row.Number	act_x	act_y	act_z	Layer1.Cell.ID	Layer2.Cell.ID	pred_x	pred_y	pred_z	diff
1	-2.6282	0.5656	-0.7253	A85	C153	-2.5258976	-0.5697529	-0.7072982	0.4185524
2	2.7471	-0.9987	-0.3848	A425	C1865	2.5077850	-1.4109928	-0.1993299	0.2790259
3	-2.4446	-1.6528	0.3097	A3	C64	-2.4619927	-1.3983722	0.3219391	0.0946865
4	-2.6487	-0.5745	0.7040	A41	C287	-2.0844505	-0.4682857	0.9330797	0.2998478
5	-0.2676	-1.0800	-0.4611	A157	C815	-0.1826176	-1.4024576	-0.7633130	0.2365510
6	-1.1130	-0.6516	-0.7040	A126	C695	-0.8306652	-0.6299318	-0.2497557	0.2527491
7	2.0288	1.9519	0.5790	A491	C2042	1.7210566	2.2261741	0.3547593	0.2687527
8	-2.4799	1.6863	-0.0470	A140	C331	-2.2995517	1.3755594	-0.5613184	0.3351357
9	-0.4105	-1.1610	-0.6398	A119	C815	-0.1826176	-1.4024576	-0.7633130	0.1976176
10	-0.2545	-1.6160	-0.9314	A83	C815	-0.1826176	-1.4024576	-0.7633130	0.1511706
11	1.1500	0.3945	-0.6205	A352	C1537	1.2572988	0.0921132	-0.6335630	0.1409162
12	-1.2557	-1.1369	0.9520	A67	C436	-1.1822271	-1.5123679	0.9261489	0.1582640
13	-0.5449	-2.6892	-0.6684	A43	C352	-0.8252530	-2.4340675	-0.7088662	0.1919839
14	2.9093	0.7222	-0.0697	A478	C2118	2.6593221	1.1768851	0.0397240	0.2713623
15	2.3205	1.2520	-0.7711	A476	C1908	1.8601725	1.2505847	-0.8798926	0.1901785
16	1.4772	-0.5194	-0.9008	A298	C1646	1.8050471	-0.8284412	-0.9447865	0.2269582
17	-1.3176	-2.6541	0.2690	A11	C88	-1.7876407	-2.2655926	0.0079136	0.3732115
18	1.0687	0.1211	-0.3812	A316	C1537	1.2572988	0.0921132	-0.6335630	0.1566495
19	-0.9632	0.3283	-0.1866	A195	C807	-0.9247605	0.4324310	-0.1556399	0.0578435
20	2.5616	0.4634	0.7976	A465	C1891	2.1489362	0.5766913	0.9229370	0.2170973
21	2.8473	-0.9303	-0.0955	A424	B7	2.8100750	-1.0120500	0.0204813	0.0783188
22	-0.5293	-0.8566	0.1173	A154	C880	-0.1846958	-1.0806849	0.3878845	0.2797579
23	-1.9898	-2.1766	0.3150	A2	C88	-1.7876407	-2.2655926	0.0079136	0.1994128
24	-0.8845	-1.2219	-0.8709	A105	C355	-1.5176280	-1.2759788	-0.9228856	0.2463975
25	0.1553	2.2566	0.9651	A405	C1607	0.2441140	2.6498128	0.6154320	0.2772316
26	2.4262	-0.6069	-0.8655	A383	C1646	1.8050471	-0.8284412	-0.9447865	0.3073269
27	-0.0667	-1.4627	-0.8444	A120	C815	-0.1826176	-1.4024576	-0.7633130	0.0857490
28	-0.0655	-1.3311	-0.7448	A151	C815	-0.1826176	-1.4024576	-0.7633130	0.0689961
29	1.9592	1.5104	0.8806	A458	C1753	1.4176163	1.2761653	0.9340511	0.2764232
30	1.2332	2.5452	0.5603	A479	C2042	1.7210566	2.2261741	0.3547593	0.3374744
31	-0.8720	0.4903	0.0287	A214	C807	-0.9247605	0.4324310	-0.1556399	0.0983231
32	0.2194	-1.7686	0.9760	A139	C1101	0.6990641	-1.6243336	0.9238327	0.2253659
33	1.5052	0.0445	-0.8694	A351	C1537	1.2572988	0.0921132	-0.6335630	0.1771171
34	-2.8410	-0.8651	0.2439	A17	C125	-2.8632007	0.0406280	0.1977916	0.3246790
35	1.3203	-2.5967	0.4077	A104	C1387	1.4533301	-2.4118663	0.3963452	0.1097396
36	-1.5648	1.5577	0.9781	A228	C412	-2.1320349	1.1370603	0.8097262	0.3854162
37	0.3589	-1.0419	-0.4400	A205	C1184	0.7356987	-0.9623697	-0.5815899	0.1993063
38	-0.2900	-2.0106	0.9995	A76	C602	-0.1333930	-2.4335626	0.7935524	0.2618390
39	0.5300	1.3668	0.8455	A374	C1320	0.2436560	1.4647407	0.8135907	0.1387313
40	1.0254	-0.6738	0.6344	A279	C1251	0.8259309	-0.6708288	0.3378642	0.1663254
41	-0.9306	0.3664	0.0154	A214	C807	-0.9247605	0.4324310	-0.1556399	0.0809702
42	2.3888	-1.0670	0.7875	A384	C1657	1.9231207	-1.1734554	0.8763315	0.2203221
43	-0.9830	-0.2043	-0.0897	A163	C695	-0.8306652	-0.6299318	-0.2497557	0.2460074
44	0.9499	0.3135	0.0261	A326	C1404	0.7957924	0.6327927	0.1192434	0.1888479
45	-1.8079	-1.4936	0.9386	A44	C436	-1.1822271	-1.5123679	0.9261489	0.2189640
46	1.8399	-1.9295	-0.7459	A245	C1362	1.3773650	-1.9334028	-0.8315864	0.1840414
47	-0.3304	-1.8481	0.9925	A76	C602	-0.1333930	-2.4335626	0.7935524	0.3271390
48	-2.2806	-1.8984	0.2536	A3	C64	-2.4619927	-1.3983722	0.3219391	0.2499199
49	-2.3323	1.7320	0.4252	A207	C412	-2.1320349	1.1370603	0.8097262	0.3932437
50	0.5520	0.8441	0.1308	A331	C1404	0.7957924	0.6327927	0.1192434	0.1555521
51	-0.9449	2.2273	0.9078	A289	C968	-0.8343712	2.0839820	0.8888732	0.0909246
52	0.2334	-1.4612	-0.8540	A132	C815	-0.1826176	-1.4024576	-0.7633130	0.1884824
53	2.7387	0.9703	0.4244	A481	C2118	2.6593221	1.1768851	0.0397240	0.2235463
54	0.3561	1.1619	-0.6199	A340	C1465	0.5898708	1.2696326	-0.7664024	0.1626686
55	1.7006	1.5569	-0.9522	A452	C1908	1.8601725	1.2505847	-0.8798926	0.1793984
56	1.7244	-0.5698	0.9829	A357	C1657	1.9231207	-1.1734554	0.8763315	0.3029815
57	0.9922	1.1438	-0.8741	A373	C1465	0.5898708	1.2696326	-0.7664024	0.2119531
58	-0.3022	-1.3611	0.7956	A130	C880	-0.1846958	-1.0806849	0.3878845	0.2685449
59	-0.9693	1.0602	0.8261	A236	C814	-1.0599848	0.9345948	0.7788200	0.0878567
60	1.1313	-0.3595	-0.5824	A294	C1537	1.2572988	0.0921132	-0.6335630	0.2095917
61	-0.7561	-2.5384	-0.7611	A31	C352	-0.8252530	-2.4340675	-0.7088662	0.0752397
62	2.3168	1.8924	0.1302	A499	C2118	2.6593221	1.1768851	0.0397240	0.3828377
63	1.2363	-2.6444	-0.3939	A108	C874	0.4550169	-2.6699721	-0.5138344	0.3089299
64	-1.3204	-0.6281	0.8430	A111	C609	-1.1912839	-0.1941707	0.6042769	0.2672562
65	1.3733	1.1877	0.9829	A409	C1753	1.4176163	1.2761653	0.9340511	0.0605435
66	1.0874	-0.1278	0.4251	A333	C1576	1.3876845	-0.0060711	0.7561144	0.2510093
67	2.1300	-1.2171	-0.8914	A301	C1646	1.8050471	-0.8284412	-0.9447865	0.2556661
68	1.6863	-0.5945	0.9773	A357	C1657	1.9231207	-1.1734554	0.8763315	0.3055815
69	0.8504	1.0927	-0.7882	A373	C1465	0.5898708	1.2696326	-0.7664024	0.1530865
70	0.3029	1.0731	0.4656	A336	C1320	0.2436560	1.4647407	0.8135907	0.2662918
71	-1.4724	1.1331	0.9899	A210	C814	-1.0599848	0.9345948	0.7788200	0.2740001
72	-0.5452	-1.2243	0.7514	A136	C880	-0.1846958	-1.0806849	0.3878845	0.2892116
73	-1.6866	2.1137	0.7101	A226	C968	-0.8343712	2.0839820	0.8888732	0.3535733
74	1.2012	-2.0386	-0.9305	A158	C1362	1.3773650	-1.9334028	-0.8315864	0.1267586
75	-0.2108	2.3579	0.9301	A405	C968	-0.8343712	2.0839820	0.8888732	0.3129054
76	-0.5982	1.3776	-0.8671	A265	C886	-0.9236587	1.3992335	-0.8902766	0.1234229
77	-0.2116	-1.0573	-0.3878	A157	C815	-0.1826176	-1.4024576	-0.7633130	0.2498843
78	-0.7802	-0.9000	-0.5880	A118	C695	-0.8306652	-0.6299318	-0.2497557	0.2195926
79	1.0850	-1.6815	1.0000	A196	C1101	0.6990641	-1.6243336	0.9238327	0.1730899
80	1.5563	0.1715	-0.9008	A351	C1537	1.2572988	0.0921132	-0.6335630	0.2152083
81	-0.3790	1.4273	0.8522	A318	C1320	0.2436560	1.4647407	0.8135907	0.2329020
82	-1.2769	-0.2633	0.7178	A122	C609	-1.1912839	-0.1941707	0.6042769	0.0894228
83	-1.6039	2.4566	0.3575	A257	C567	-1.5739055	2.3988776	-0.0018896	0.1490355
84	-0.9297	2.4281	-0.8000	A309	C1208	-0.4306005	2.5131337	-0.7020495	0.2273612
85	0.5324	-0.8526	0.1016	A220	C1251	0.8259309	-0.6708288	0.3378642	0.2371888
86	0.3928	1.5433	-0.9132	A362	C1465	0.5898708	1.2696326	-0.7664024	0.2058453
87	1.0031	0.3850	-0.3786	A327	C1537	1.2572988	0.0921132	-0.6335630	0.2673495
88	-0.7562	0.7889	-0.4207	A232	C807	-0.9247605	0.4324310	-0.1556399	0.2633632
89	-1.0870	-0.7523	-0.7350	A102	C695	-0.8306652	-0.6299318	-0.2497557	0.2879824
90	-1.8671	-0.8423	-0.9988	A59	C355	-1.5176280	-1.2759788	-0.9228856	0.2863551
91	0.8325	-0.9413	0.6689	A242	C1251	0.8259309	-0.6708288	0.3378642	0.2026920
92	-0.3355	0.9636	0.2005	A277	C1120	-0.1584637	1.0003434	-0.0305096	0.1482631
93	-1.0089	-0.6007	0.5639	A133	C609	-1.1912839	-0.1941707	0.6042769	0.2097634
94	1.7725	1.7153	-0.8845	A452	C1908	1.8601725	1.2505847	-0.8798926	0.1856651
95	0.5539	-0.8888	0.3037	A220	C1251	0.8259309	-0.6708288	0.3378642	0.1747221
96	0.8149	-2.6016	0.6874	A84	C1387	1.4533301	-2.4118663	0.3963452	0.3730729
97	0.1104	1.7654	-0.9729	A379	C1465	0.5898708	1.2696326	-0.7664024	0.3939119
98	1.0107	0.3118	0.3349	A326	C1404	0.7957924	0.6327927	0.1192434	0.2505190
99	2.2697	-0.3642	0.9543	A403	C1891	2.1489362	0.5766913	0.9229370	0.3643394
100	0.4983	-0.8672	-0.0185	A220	C1251	0.8259309	-0.6708288	0.3378642	0.2934554

hist(act_pred$diff, breaks = 30, col = "blue", main = "Mean Absolute Difference", xlab = "Difference")

Figure 16: Mean Absolute Difference
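The diff column in the table above is consistent with the mean of the absolute differences between the actual and predicted coordinates of each record. The sketch below recomputes it for rows 1, 3 and 21 of that table. The variable names are hypothetical, and comparing this value against mad.threshold (0.2, the default stated earlier) to flag anomalies is an assumption about how the threshold is applied.

```r
# Rows 1, 3 and 21 of the scored table (actual and predicted x, y, z)
actual <- rbind(
  c(-2.6282,  0.5656, -0.7253),
  c(-2.4446, -1.6528,  0.3097),
  c( 2.8473, -0.9303, -0.0955)
)
predicted <- rbind(
  c(-2.5258976, -0.5697529, -0.7072982),
  c(-2.4619927, -1.3983722,  0.3219391),
  c( 2.8100750, -1.0120500,  0.0204813)
)

# Mean absolute difference across the three features, per record
mad_diff <- rowMeans(abs(actual - predicted))
round(mad_diff, 4)  # 0.4186 0.0947 0.0783, matching the diff column

# Assumption: a record is flagged when its value exceeds mad.threshold
mad_threshold <- 0.2
is_anomaly <- mad_diff > mad_threshold
```

Under that assumption, only the first of these three records would be flagged.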

8. Executive Summary

We have used the torus dataset to create a scored sequence of maps using scoreLayeredHVT() in this vignette.
Our goal is to achieve data compression of at least 80%.
We construct a compressed HVT map (torus_mapA) using trainHVT() on the training dataset by setting n_cells to 500 and quant.err to 0.1, and we were able to attain a compression of 90%.
Based on the output of the above step, we manually identify the novelty cell(s) from the plotted map A. For this dataset, we identify 6 cells as the novelty cells (since the torus dataset does not have outliers, we use these for demonstration purposes).
We pass the identified novelty cell(s) as a parameter to removeNovelty() along with the HVT map torus_mapA. The function removes the novelty cell(s) from the dataset and stores them separately. It also returns the data without novelty.
The plotNovelCells() function constructs hierarchical Voronoi tessellations and highlights the identified novelty cell(s) in red.
The data with novelty is then passed to trainHVT() to construct another HVT map (torus_mapB). Here, we set the parameters n_cells = 11, depth = 1 etc. when constructing the map.
The data without novelty is then passed to trainHVT() to construct another HVT map (torus_mapC). Here, we set the parameters n_cells = 46, depth = 2 etc. when constructing the map.
Finally, the set of maps (torus_mapA, torus_mapB, torus_mapC) is passed to scoreLayeredHVT() along with the test dataset to score which map and which cell each test record is assigned to.
The output of scoreLayeredHVT is a dataset that includes the two columns Layer1.Cell.ID and Layer2.Cell.ID. Layer1.Cell.ID contains cell ids from map A in the form A1, A2, A3, ... and Layer2.Cell.ID contains cell ids from map B as B1, B2, ... (depending on the identified novelties) and from map C as C1, C2, C3, ...
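Because the cell ids carry the map letter as a prefix, the map a record was routed to can be recovered from the id itself. The following base R sketch splits a few Layer2.Cell.ID values taken from the scored table; the variable names are hypothetical and this is only one simple way to do it.

```r
# Layer2.Cell.ID values taken from the scored table
layer2_ids <- c("C153", "C1865", "C64", "B7")

# The first character names the map; the rest is the cell number in that map
map_label   <- substr(layer2_ids, 1, 1)
cell_number <- as.integer(substring(layer2_ids, 2))

# Number of records routed to each second-layer map
table(map_label)
```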

HVT Scoring Cells with Layers using scoreLayeredHVT

Zubin Dowlaty, Srinivasan Sudarsanam, Somya Shambhawi, Vishwavani

2024-09-05
