HVT Scoring Cells with Layers using scoreLayeredHVT

Zubin Dowlaty, Srinivasan Sudarsanam, Somya Shambhawi, Vishwavani

2024-09-05

1. Abstract

The HVT package is a collection of R functions that facilitates building topology preserving maps for rich multivariate data analysis, tending towards big data preponderance, i.e., a large number of rows. A collection of R functions for this typical workflow is organized below:

  1. Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective.

  2. Data Projection: Dimension projection of the compressed cells to a 1D, 2D, or interactive surface plot with Sammon's non-linear projection algorithm. This step creates the topology preserving map (also called an embedding) coordinates in the desired output dimension.

  3. Tessellation: Create the cells required for object visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, and is useful for semi-supervised tasks.

  4. Scoring: Scoring new data sets and recording their assignment using the map objects from the above steps, in a sequence of maps if required.

  5. Temporal Analysis and Visualization: A collection of functions that extends the capacity of the HVT package by analyzing time series data for underlying patterns, calculating transition probabilities, and visualizing the flow of data over time.
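Tying the steps together, an end-to-end run looks roughly as follows. This is a minimal sketch: the function names come from the HVT package, but the arguments shown are illustrative, and `df` and `new_df` are assumed stand-ins for a training and a scoring data frame.

```r
# 1.-3. Compress, project, and tessellate the training data
map <- trainHVT(df, n_cells = 500, depth = 1, quant.err = 0.1,
                normalize = TRUE, quant_method = "kmeans",
                dim_reduction_method = "sammon")

# Visualize the resulting Voronoi tessellation
plotHVT(map, plot.type = "2Dhvt")

# 4. Score a new dataset against the trained map
scored <- scoreHVT(new_df, map)
```

Scoring across a sequence of maps (the subject of this vignette) is done with scoreLayeredHVT, covered in the later sections.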

2. Import Code Modules

Here is the guide to install the HVT package. This helps the user install the most recent version of the package.

###direct installation###
#install.packages("HVT")

#or

###git repo installation###
#library(devtools)
#devtools::install_github(repo = "Mu-Sigma/HVT")

NOTE: At the time of documenting this vignette, the updated changes were not yet on CRAN, hence we are sourcing the scripts from the R folder directly into the session environment.

# Sourcing required code scripts for HVT
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) source(file, echo = FALSE)))

3. Example: HVT with the Torus dataset

In this section, we will see how we can use the package to visualize multidimensional data by projecting it to two dimensions using Sammon’s projection, and then use the resulting map for scoring.

Data Understanding

First, let us see how to generate data for a torus. We are using the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. Geo Zoo contains regular or well-known objects, e.g., the cube and sphere, and some abstract objects, e.g., Boy’s surface, the torus, and the hyper-torus.

Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
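For reference, the standard parametrization of such a torus, with major radius $R$ (distance from the axis to the center of the tube) and minor radius $r$ (radius of the tube), is:

```latex
x(\theta,\phi) = (R + r\cos\theta)\cos\phi,\quad
y(\theta,\phi) = (R + r\cos\theta)\sin\phi,\quad
z(\theta,\phi) = r\sin\theta,
\qquad \theta,\phi \in [0, 2\pi)
```

The coordinate ranges seen in the sampled data below (x and y roughly in [-3, 3], z in [-1, 1]) are consistent with $R = 2$ and $r = 1$.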

Raw Torus Dataset

The torus dataset includes three columns: x, y, and z.

Let’s explore the torus dataset containing 12000 points. For the sake of brevity, we are displaying the first 6 rows.

set.seed(240)
# Here p represents dimension of object, n represents number of points
torus <- geozoo::torus(p = 3,n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x","y","z")
torus_df <- torus_df %>% round(4)
Table(head(torus_df), scroll = FALSE)
x y z
-2.6282 0.5656 -0.7253
-1.4179 -0.8903 0.9455
-1.0308 1.1066 -0.8731
1.8847 0.1895 0.9944
-1.9506 -2.2507 0.2071
-1.4824 0.9229 0.9672

Now let’s have a look at structure of the torus dataset.

str(torus_df)
## 'data.frame':    12000 obs. of  3 variables:
##  $ x: num  -2.63 -1.42 -1.03 1.88 -1.95 ...
##  $ y: num  0.566 -0.89 1.107 0.19 -2.251 ...
##  $ z: num  -0.725 0.946 -0.873 0.994 0.207 ...

Data distribution

This section displays four objects.

Variable Histograms: The histogram distribution of all the features in the dataset.

Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.

Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.
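For two numeric columns $x$ and $y$ with $n$ rows, the Pearson correlation computed here is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```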

Summary EDA: The table provides descriptive statistics for all the features in the dataset.

It uses an inbuilt function called edaPlots to display the above-mentioned four objects.

edaPlots(torus_df, output_type = "summary", n_cols = 3)
edaPlots(torus_df, output_type = "histogram", n_cols = 3)

edaPlots(torus_df, output_type = "boxplot", n_cols = 3)

edaPlots(torus_df, output_type = "correlation", n_cols = 3)

Train - Test Split

Let us split the torus dataset into train and test sets. We will randomly select 80% of the torus dataset as train and the remaining 20% as test.

smp_size <- floor(0.80 * nrow(torus_df))
set.seed(279)
train_ind <- sample(seq_len(nrow(torus_df)), size = smp_size)
torus_train <- torus_df[train_ind, ]
torus_test <- torus_df[-train_ind, ]

Training Dataset

Now, let’s have a look at the selected training dataset (containing 9600 data points). For the sake of brevity we are displaying the first six rows.

rownames(torus_train) <- NULL
Table(head(torus_train), scroll= FALSE)
x y z
1.7958 -0.4204 -0.9878
0.7115 -2.3528 -0.8889
1.9285 1.2034 0.9620
1.0175 0.0344 -0.1894
-0.2736 1.1298 -0.5464
1.8976 2.2391 0.3545

Now let’s have a look at the structure of the training dataset.

str(torus_train)
## 'data.frame':    9600 obs. of  3 variables:
##  $ x: num  1.796 0.712 1.929 1.018 -0.274 ...
##  $ y: num  -0.4204 -2.3528 1.2034 0.0344 1.1298 ...
##  $ z: num  -0.988 -0.889 0.962 -0.189 -0.546 ...

Data Distribution

edaPlots(torus_train, output_type = "summary", n_cols = 3)
edaPlots(torus_train,output_type = "histogram", n_cols = 3)

edaPlots(torus_train, output_type = "boxplot", n_cols = 3)

edaPlots(torus_train, output_type = "correlation", n_cols = 3)

Testing Dataset

Now, let’s have a look at the testing dataset (containing 2400 data points). For the sake of brevity we are displaying the first six rows.

rownames(torus_test) <- NULL
Table(head(torus_test), scroll = FALSE)
x y z
-2.6282 0.5656 -0.7253
2.7471 -0.9987 -0.3848
-2.4446 -1.6528 0.3097
-2.6487 -0.5745 0.7040
-0.2676 -1.0800 -0.4611
-1.1130 -0.6516 -0.7040

Now let’s have a look at the structure of the testing dataset.

str(torus_test)
## 'data.frame':    2400 obs. of  3 variables:
##  $ x: num  -2.628 2.747 -2.445 -2.649 -0.268 ...
##  $ y: num  0.566 -0.999 -1.653 -0.575 -1.08 ...
##  $ z: num  -0.725 -0.385 0.31 0.704 -0.461 ...

Data Distribution

edaPlots(torus_test, output_type = "summary", n_cols = 3)
edaPlots(torus_test,output_type = "histogram", n_cols = 3)

edaPlots(torus_test, output_type = "boxplot", n_cols = 3)

edaPlots(torus_test, output_type = "correlation", n_cols = 3)

4. Map A : Base Compressed Map

Let us try to visualize the compressed Map A from the diagram below.

Figure 1: Data Segregation with highlighted bounding box in red around compressed map A


This package can perform vector quantization using the following algorithms: k-means and k-medoids.

For more information on vector quantization, refer to the following link.

The trainHVT function constructs highly compressed hierarchical Voronoi tessellations. The raw data is first scaled, and this scaled data is supplied as input to the vector quantization algorithm. The algorithm compresses the dataset until a user-defined compression percentage is achieved, using a parameter called the quantization error as a threshold that determines the compression percentage: for a given user-defined compression percentage we get ‘n’ cells, each of which has a quantization error below the threshold quantization error.
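To make the threshold idea concrete, here is a small base-R sketch of the check involved. This is illustrative only, not the package’s internal code: it clusters stand-in data with k-means, computes each cell’s quantization error using the L2 norm and the ‘max’ error metric, and reports the share of cells below the threshold.

```r
set.seed(240)
dat <- matrix(runif(3000, -1, 1), ncol = 3)   # stand-in for scaled data

k  <- 50
km <- kmeans(dat, centers = k, iter.max = 100)

# Per-cell quantization error: max L2 distance from a point to its centroid
quant_err <- sapply(seq_len(k), function(cell) {
  pts <- dat[km$cluster == cell, , drop = FALSE]
  ctr <- km$centers[cell, ]
  max(sqrt(colSums((t(pts) - ctr)^2)))
})

# Share of cells whose error is below the threshold -- the quantity reported
# in the compression summary as percentOfCellsBelowQuantizationErrorThreshold
threshold <- 0.3
pct_below <- mean(quant_err < threshold)
```

If `pct_below` falls short of the target, one would lower `k` granularity or relax the threshold and rerun, which mirrors the parameter tuning described below for trainHVT.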

Let’s try to comprehend the trainHVT first before moving ahead.

trainHVT(
  data,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  normalize,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  dim_reduction_method = c("sammon", "tsne", "umap"),
  scale_summary = NA,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8,
  projection.scale,
  tsne_perplexity, tsne_theta, tsne_verbose,
  tsne_eta, tsne_max_iter,
  umap_n_neighbors, umap_min_dist
)

Each of the parameters of the trainHVT function has been explained below:

The output of the trainHVT function (a list of 7 elements) has been explained below, with an image attached for clear understanding.

NOTE: Here the attached image is a snapshot of the output list generated from map A, which can be referred to later in this section.

Figure 2: The Output list generated by trainHVT function.


We will use the trainHVT function to compress our data while preserving the essential features of the dataset. Our goal is to achieve data compression of at least 80%. In situations where the compression ratio does not meet the desired target, we can explore adjusting the model parameters. This involves modifying parameters such as the quantization error threshold or increasing the number of cells, and then rerunning the trainHVT function.

As this is already covered in the HVT vignette, please refer to it for more information.

Model Parameters

set.seed(240)
torus_mapA <- trainHVT(
  torus_train,
  n_cells = 500,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L2_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon"
)

Let’s check the compression summary for torus.

displayTable(data = torus_mapA[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 500 448 0.9 n_cells: 500 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

We successfully compressed 90% of the data with the n_cells parameter set to 500. The next step involves performing data projection on the compressed data: the compressed data will be transformed and projected onto a lower-dimensional space to visualize and analyze it in a more manageable form.

As per the manual, torus_mapA[[3]] gives us detailed information about the hierarchical vector quantized data, and torus_mapA[[3]][['summary']] gives a tabular summary containing the number of points, the quantization error, and the codebook.
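The list components can be inspected directly in the session. This assumes torus_mapA from the training chunk above; the component names follow the output list shown in Figure 2.

```r
# Components of the hierarchical vector quantization output
names(torus_mapA[[3]])

# Per-cell summary: counts, Cell.ID, quantization error, and centroids
summary_df <- torus_mapA[[3]][["summary"]]
head(summary_df[, c("Cell.ID", "Quant.Error")])
```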

The data table displayed below is the summary from torus_mapA showing the Cell.ID, centroids, and quantization error for each of the 500 cells. For the sake of brevity, we are displaying only the first 100 rows.

displayTable(data =torus_mapA[[3]][['summary']], columnName= 'Quant.Error', value = 0.1, tableType = "summary", scroll = TRUE)
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 25 133 0.08 -0.92 -0.74 0.57
1 1 2 19 145 0.06 -0.21 -1.17 -0.58
1 1 3 14 174 0.04 -1.06 -0.01 0.33
1 1 4 9 491 0.05 2.16 1.86 0.53
1 1 5 18 199 0.08 -1.67 1.53 -0.96
1 1 6 18 306 0.08 1.73 -1.15 0.99
1 1 7 24 85 0.08 -2.39 0.64 -0.87
1 1 8 15 164 0.04 -0.97 -0.29 0.18
1 1 9 16 458 0.07 1.95 1.33 0.93
1 1 10 22 413 0.1 -0.05 2.68 -0.72
1 1 11 11 495 0.06 1.92 2.18 0.42
1 1 12 13 30 0.05 -1.76 -1.72 0.88
1 1 13 10 317 0.05 -0.69 1.90 1.00
1 1 14 23 27 0.09 -2.50 -0.86 -0.75
1 1 15 17 358 0.07 1.80 -0.42 -0.99
1 1 16 16 209 0.06 -1.44 1.27 -0.99
1 1 17 10 479 0.06 1.35 2.51 0.53
1 1 18 12 295 0.05 -0.05 1.03 0.25
1 1 19 33 203 0.07 0.60 -1.29 0.81
1 1 20 16 465 0.07 2.58 0.64 0.74
1 1 21 13 370 0.05 0.28 1.51 0.88
1 1 22 15 426 0.07 2.37 -0.21 0.92
1 1 23 19 139 0.07 0.38 -1.77 0.98
1 1 24 23 131 0.06 -0.57 -1.01 -0.54
1 1 25 27 242 0.07 0.76 -0.87 0.53
1 1 26 27 330 0.05 0.64 0.78 -0.12
1 1 27 16 178 0.09 -2.20 1.94 -0.32
1 1 28 22 31 0.1 -0.82 -2.52 -0.75
1 1 29 16 163 0.05 -0.97 -0.26 -0.10
1 1 30 19 37 0.11 -0.78 -2.40 0.83
1 1 31 25 175 0.06 -1.18 0.34 -0.64
1 1 32 21 363 0.09 -0.86 2.63 -0.63
1 1 33 19 355 0.06 -0.43 1.97 1.00
1 1 34 22 297 0.08 1.37 -0.78 0.90
1 1 35 24 108 0.11 1.37 -2.65 -0.06
1 1 36 23 249 0.07 0.75 -0.66 0.01
1 1 37 19 219 0.08 -1.87 2.14 -0.53
1 1 38 31 104 0.1 1.13 -2.63 0.48
1 1 39 19 245 0.11 2.03 -1.93 -0.58
1 1 40 27 36 0.08 -2.36 -1.02 0.81
1 1 41 9 300 0.04 0.16 0.99 -0.05
1 1 42 22 357 0.08 1.78 -0.78 0.99
1 1 43 8 485 0.05 1.71 2.15 0.66
1 1 44 16 424 0.08 2.81 -1.01 0.02
1 1 45 16 56 0.08 -2.04 -0.84 0.97
1 1 46 17 142 0.05 -0.88 -0.59 -0.34
1 1 47 17 492 0.07 2.69 1.30 -0.10
1 1 48 24 155 0.06 -0.40 -0.99 0.35
1 1 49 19 172 0.06 0.13 -1.27 0.69
1 1 50 15 21 0.08 -1.63 -2.04 0.78
1 1 51 18 128 0.08 -1.79 0.64 -0.99
1 1 52 20 445 0.08 2.64 0.07 -0.76
1 1 53 37 220 0.08 0.48 -0.89 0.14
1 1 54 17 59 0.07 -1.94 -0.76 -0.99
1 1 55 20 158 0.08 1.25 -2.15 -0.87
1 1 56 14 442 0.09 0.33 2.73 0.65
1 1 57 15 493 0.08 1.67 2.43 -0.31
1 1 58 10 345 0.05 0.59 0.95 0.47
1 1 59 25 273 0.07 1.18 -0.58 -0.72
1 1 60 19 40 0.09 -2.91 0.05 -0.40
1 1 61 25 34 0.12 -0.26 -2.96 0.19
1 1 62 14 283 0.05 -0.40 1.38 -0.83
1 1 63 16 391 0.08 1.38 0.66 0.88
1 1 64 13 341 0.05 0.25 1.23 0.66
1 1 65 22 275 0.09 -1.49 2.32 0.64
1 1 66 16 462 0.07 2.77 0.26 0.61
1 1 67 23 126 0.06 -1.09 -0.51 -0.61
1 1 68 11 226 0.07 -1.90 2.04 0.60
1 1 69 32 375 0.1 2.45 -1.68 0.18
1 1 70 22 188 0.11 1.75 -2.34 -0.36
1 1 71 17 3 0.09 -2.32 -1.86 0.19
1 1 72 20 266 0.09 -1.36 1.94 0.92
1 1 73 15 441 0.08 0.40 2.75 -0.61
1 1 74 16 89 0.06 -0.79 -1.36 -0.90
1 1 75 16 394 0.06 1.68 0.14 0.95
1 1 76 28 168 0.07 -0.17 -1.01 0.20
1 1 77 23 461 0.09 2.93 -0.15 0.32
1 1 78 21 153 0.06 -1.04 -0.25 -0.36
1 1 79 22 314 0.05 1.24 -0.37 0.71
1 1 80 12 166 0.04 0.36 -1.34 -0.79
1 1 81 23 136 0.08 -0.49 -1.14 0.64
1 1 82 26 88 0.11 -2.69 0.88 0.53
1 1 83 11 453 0.07 1.26 2.04 0.91
1 1 84 20 389 0.08 -0.43 2.52 0.82
1 1 85 12 490 0.07 1.63 2.43 0.37
1 1 86 24 84 0.11 0.62 -2.57 0.75
1 1 87 28 321 0.07 1.42 -0.23 -0.82
1 1 88 30 187 0.1 0.74 -1.62 0.97
1 1 89 25 152 0.05 -0.87 -0.49 -0.03
1 1 90 18 331 0.05 0.54 0.85 0.14
1 1 91 25 149 0.08 -1.49 0.47 -0.90
1 1 92 19 214 0.05 -0.87 0.49 0.02
1 1 93 19 235 0.06 -0.90 0.89 0.68
1 1 94 28 208 0.09 0.42 -1.07 0.52
1 1 95 16 487 0.11 1.28 2.68 -0.19
1 1 96 17 15 0.09 -2.50 -1.34 0.53
1 1 97 19 234 0.05 -0.70 0.73 0.16
1 1 98 17 228 0.08 -1.56 1.53 0.98
1 1 99 16 359 0.08 -0.82 2.40 0.83
1 1 100 17 417 0.08 -0.12 2.69 0.71

Now let us understand what each column in the above table means:

All the columns after Quant.Error contain the centroids of each cell. Together they can also be called a codebook, which represents the collection of all centroids or codewords.

Now let’s try to understand plotHVT function. The parameters have been explained in detail below:

plotHVT(hvt.results, line.width, color.vec, pch1, centroid.size,
        centroid.color, title, maxDepth, child.level, hmap.cols,
        quant.error.hmap, n_cells.hmap, label.size,
        sepration_width, layer_opacity, cell_id,
        dim_size, plot.type = '2Dhvt')

Let’s plot the Voronoi tessellation for layer 1 (map A).

plotHVT(torus_mapA,
        line.width = c(0.4), 
        color.vec = c("navy blue"),
        centroid.size = 0.01,
        maxDepth = 1,
        plot.type = "2Dhvt") 

Figure 3: The Voronoi Tessellation for layer 1 (map A) shown for the 500 cells in the dataset ’torus’

4.1 Heatmaps

Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus dataset for better visualization and interpretation of data patterns and distributions.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). The green shades highlight regions with higher values in each of the heatmaps, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations in and relationships between these features within the torus dataset.

  plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "x",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 

Figure 4: The Voronoi Tessellation with the heat map overlaid for variable ’x’ in the ’torus’ dataset

  plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "y",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 

Figure 5: The Voronoi Tessellation with the heat map overlaid for variable ’y’ in the ’torus’ dataset

  plotHVT(
  torus_mapA,
  child.level = 1,
  hmap.cols = "z",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 

Figure 6: The Voronoi Tessellation with the heat map overlaid for variable ’z’ in the ’torus’ dataset

5. Map B : Compressed Novelty Map

Let us try to visualize the Map B from the diagram below.

Figure 7: Data Segregation with highlighted bounding box in red around map B


In this section, we will manually identify the novelty cells from the plotted torus_mapA and store them in the identified_Novelty_cells variable.

Note: For manually selecting the novelty cells from map A, one can enhance interactivity by adding plotly elements to the code. This transforms map A into an interactive plot, allowing users to actively engage with the data: hovering over a cell centroid displays a tag containing its segment child information, so users can explore the map and selectively choose the novelty cells they wish to consider. An image is attached for reference.

Figure 8: Manually selecting novelty cells

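As a sketch, the interactivity described above could be added with plotly. This assumes plotHVT returns a ggplot-compatible object; treat the call as illustrative rather than the package’s prescribed method.

```r
library(plotly)

# Build map A as before, then wrap it for interactive hovering
p <- plotHVT(torus_mapA,
             line.width = c(0.4),
             color.vec = c("navy blue"),
             centroid.size = 0.01,
             maxDepth = 1,
             plot.type = "2Dhvt")
ggplotly(p)   # hover over centroids to read off the segment child numbers
```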

The removeNovelty function removes the identified novelty cell(s) from the training dataset (containing 9600 datapoints) and stores those records separately.

It takes input as the cell number (Segment.Child) of the manually identified novelty cell(s) and the compressed HVT map (torus_mapA) with 500 cells. It returns a list of two items: data with novelty, and data without novelty.

NOTE: As we are using the torus dataset here, the identified novelty cells are given for demonstration purposes.

identified_Novelty_cells <- c(273,44,61,486,185,425)   # as an example
output_list <- removeNovelty(identified_Novelty_cells, torus_mapA)
data_with_novelty <- output_list[[1]]
data_without_novelty <- output_list[[2]]

Let’s have a look at the data with novelty (containing 115 records).

novelty_data <- data_with_novelty
novelty_data$Row.No <- row.names(novelty_data)
novelty_data <- novelty_data %>% dplyr::select("Row.No","Cell.ID","Cell.Number","x","y","z")
colnames(novelty_data) <- c("Row.No","Cell.ID","Segment.Child","x","y","z")
Table(novelty_data,scroll = TRUE)
Row.No Cell.ID Segment.Child x y z
1 424 44 2.7839 -1.0776 -0.1712
2 424 44 2.8089 -1.0384 0.1027
3 424 44 2.8404 -0.9040 0.1952
4 424 44 2.7834 -1.0866 0.1544
5 424 44 2.8208 -0.9473 0.2193
6 424 44 2.7804 -1.0582 -0.2226
7 424 44 2.8795 -0.8408 0.0226
8 424 44 2.7738 -1.1262 -0.1121
9 424 44 2.7538 -1.1860 -0.0569
10 424 44 2.8513 -0.9218 -0.0828
11 424 44 2.8754 -0.8550 0.0168
12 424 44 2.8450 -0.8996 0.1792
13 424 44 2.8239 -0.9397 0.2172
14 424 44 2.7871 -1.0527 -0.2026
15 424 44 2.7875 -1.1082 -0.0220
16 424 44 2.7661 -1.1507 0.0905
17 34 61 -0.3149 -2.9384 0.2958
18 34 61 -0.3078 -2.9675 0.1812
19 34 61 -0.1469 -2.9921 0.0927
20 34 61 -0.3766 -2.9762 0.0092
21 34 61 -0.0344 -2.9993 0.0303
22 34 61 -0.2807 -2.9525 0.2592
23 34 61 -0.3967 -2.9725 0.0484
24 34 61 -0.2519 -2.9034 0.4049
25 34 61 -0.3169 -2.9822 0.0443
26 34 61 -0.1057 -2.9757 0.2107
27 34 61 0.0958 -2.9784 0.1994
28 34 61 -0.3598 -2.9046 0.3757
29 34 61 -0.5300 -2.9485 0.0921
30 34 61 -0.2574 -2.9769 0.1544
31 34 61 -0.4312 -2.9677 0.0486
32 34 61 0.0796 -2.9885 0.1440
33 34 61 -0.2803 -2.9049 0.3957
34 34 61 -0.4258 -2.9397 0.2417
35 34 61 -0.3847 -2.9574 0.1871
36 34 61 -0.1814 -2.9475 0.3027
37 34 61 -0.4657 -2.9341 0.2396
38 34 61 -0.2817 -2.9829 0.0871
39 34 61 -0.3100 -2.9449 0.2759
40 34 61 -0.0367 -2.9262 0.3764
41 34 61 -0.0928 -2.9950 0.0848
42 75 185 -2.8203 0.9904 -0.1467
43 75 185 -2.8178 1.0260 0.0499
44 75 185 -2.7501 1.1484 -0.1977
45 75 185 -2.8307 0.8870 -0.2570
46 75 185 -2.9216 0.6631 -0.0905
47 75 185 -2.7794 1.1095 -0.1211
48 75 185 -2.8862 0.7563 -0.1801
49 75 185 -2.7889 1.0811 -0.1333
50 75 185 -2.8045 1.0304 0.1555
51 75 185 -2.8893 0.7432 -0.1815
52 75 185 -2.8085 1.0402 -0.1003
53 75 185 -2.7684 1.1089 -0.1877
54 75 185 -2.8008 1.0713 -0.0508
55 75 185 -2.8734 0.8593 -0.0420
56 75 185 -2.8926 0.7896 0.0560
57 75 185 -2.8014 1.0351 0.1638
58 75 185 -2.8382 0.9661 -0.0614
59 75 185 -2.7733 1.1066 -0.1675
60 75 185 -2.8765 0.8519 -0.0099
61 75 185 -2.9258 0.6607 -0.0332
62 75 185 -2.8318 0.9591 0.1427
63 439 273 2.9450 -0.5316 0.1218
64 439 273 2.9041 -0.7280 0.1098
65 439 273 2.9111 -0.6332 0.2030
66 439 273 2.9095 -0.6207 0.2223
67 439 273 2.8605 -0.7913 0.2510
68 439 273 2.9184 -0.6856 -0.0661
69 439 273 2.8971 -0.7568 0.1061
70 439 273 2.8758 -0.6541 0.3144
71 439 273 2.9496 -0.4882 0.1430
72 439 273 2.9188 -0.6454 0.1457
73 439 273 2.9351 -0.5220 0.1932
74 439 273 2.8530 -0.8358 0.2313
75 439 273 2.8969 -0.5663 0.3069
76 439 273 2.8809 -0.8085 0.1250
77 439 273 2.8340 -0.8588 0.2755
78 460 425 0.5660 2.9195 0.2270
79 460 425 0.4825 2.9331 -0.2327
80 460 425 0.2922 2.9667 0.1938
81 460 425 0.7219 2.8642 0.3005
82 460 425 0.5100 2.9548 0.0551
83 460 425 0.5103 2.9319 0.2180
84 460 425 0.6264 2.9337 -0.0202
85 460 425 0.4241 2.9696 -0.0208
86 460 425 0.4568 2.9565 -0.1292
87 460 425 0.4127 2.9640 0.1212
88 460 425 0.2388 2.9833 0.1195
89 460 425 0.4408 2.9674 0.0030
90 460 425 0.5544 2.9221 0.2254
91 460 425 0.3024 2.9847 0.0031
92 460 425 0.3711 2.9462 0.2453
93 460 425 0.4730 2.9532 0.1347
94 19 486 -0.9027 -2.8262 0.2552
95 19 486 -0.7470 -2.9053 0.0186
96 19 486 -0.9246 -2.8381 0.1728
97 19 486 -0.9065 -2.8593 0.0313
98 19 486 -0.7323 -2.9085 -0.0371
99 19 486 -1.0349 -2.7844 0.2410
100 19 486 -1.1207 -2.7825 0.0230
101 19 486 -1.0549 -2.7973 0.1442
102 19 486 -0.8786 -2.8665 -0.0609
103 19 486 -0.9398 -2.7706 0.3783
104 19 486 -0.8161 -2.8680 0.1897
105 19 486 -1.0239 -2.8185 -0.0510
106 19 486 -0.9253 -2.7881 0.3475
107 19 486 -0.9820 -2.8178 0.1782
108 19 486 -0.8810 -2.8624 0.1005
109 19 486 -0.7873 -2.8533 0.2804
110 19 486 -1.0393 -2.7889 0.2167
111 19 486 -0.5913 -2.9309 0.1414
112 19 486 -0.9948 -2.8299 0.0252
113 19 486 -0.7686 -2.8947 -0.1001
114 19 486 -0.9815 -2.8025 0.2455
115 19 486 -0.7111 -2.8678 0.2977

5.1 Voronoi Tessellation with highlighted novelty cell

The plotNovelCells function plots the Voronoi tessellation using the compressed HVT map (torus_mapA) containing 500 cells and highlights the identified novelty cells, i.e., the 6 cells (containing 115 records), in red on the map.

plotNovelCells(identified_Novelty_cells, torus_mapA,line.width = c(0.4),centroid.size = 0.01)

Figure 9: The Voronoi Tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red

We pass the dataframe with the novelty records (115 records) to the trainHVT function, along with the other model parameters mentioned below, to generate map B (layer 2).

Model Parameters

colnames(data_with_novelty) <- c("Cell.ID","Segment.Child","x","y","z")
data_with_novelty <- data_with_novelty[,-1:-2]
mapA_scale_summary = torus_mapA[[3]]$scale_summary
torus_mapB <- trainHVT(data_with_novelty,
                  n_cells = 11,   
                  depth = 1,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L2_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  dim_reduction_method = "sammon")

The data table displayed below is the summary from map B (layer 2) showing the Cell.ID, centroids, and quantization error for each of the 11 cells.

displayTable(data =torus_mapB[[3]][['summary']], columnName= 'Quant.Error', value = 0.1, tableType = "summary")
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 6 6 0.05 -0.03 -2.99 0.13
1 1 2 7 2 0.07 -0.94 -2.84 0.00
1 1 3 9 10 0.1 0.46 2.94 0.20
1 1 4 7 11 0.06 0.46 2.96 -0.05
1 1 5 15 8 0.08 2.90 -0.68 0.18
1 1 6 21 9 0.1 -2.83 0.95 -0.07
1 1 7 11 5 0.05 -0.38 -2.96 0.12
1 1 8 16 7 0.08 2.81 -1.01 0.02
1 1 9 8 4 0.07 -0.25 -2.93 0.34
1 1 10 9 1 0.05 -0.98 -2.80 0.24
1 1 11 6 3 0.06 -0.73 -2.89 0.15

Now let’s check the compression summary for HVT (torus_mapB). The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.

displayTable(data = torus_mapB[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 11 10 0.91 n_cells: 11 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 91% of the cells have quantization error below the threshold. Since we successfully attained the desired compression percentage, we will not subdivide the cells further.

6. Map C : Compressed Map without Novelty

Let us try to visualize the compressed Map C from the diagram below.

Figure 10:Data Segregation with highlighted bounding box in red around compressed map C


6.1 Iteration 1

With the novelties removed, we construct another hierarchical Voronoi tessellation, map C (layer 2), on the data without novelty (containing 9485 records) using the model parameters mentioned below.

Model Parameters

torus_mapC <- trainHVT(data_without_novelty,
                  n_cells = 10,
                  depth = 2,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L2_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  dim_reduction_method = "sammon")

Now let’s check the compression summary for HVT (torus_mapC), where n_cells was set to 10. The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of such cells for each level.

displayTable(data = torus_mapC[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 10 0 0 n_cells: 10 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans
2 100 0 0 n_cells: 10 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 0% of the cells have quantization error below the threshold at level 1, and 0% at level 2.

6.2 Iteration 2

Since we have yet to achieve at least 80% compression at depth 2, let’s try to compress again using the model parameters mentioned below and the data without novelty (containing 9485 records).

Model Parameters

torus_mapC <- trainHVT(data_without_novelty,
                  n_cells = 46,    
                  depth = 2,
                  quant.err = 0.1,
                  normalize = FALSE,
                  distance_metric = "L2_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  dim_reduction_method = "sammon")

The data table displayed below is the summary from map C (layer 2), showing the Cell.ID, centroids, and quantization error.

displayTable(data =torus_mapC[[3]][['summary']], columnName= 'Quant.Error', value = 0.1, tableType = "summary", scroll = T)
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error x y z
1 1 1 183 567 0.31 -1.57 2.40 0.00
1 1 2 236 355 0.31 -1.52 -1.28 -0.92
1 1 3 162 88 0.29 -1.79 -2.27 0.01
1 1 4 167 1865 0.29 2.51 -1.41 -0.20
1 1 5 183 874 0.3 0.46 -2.67 -0.51
1 1 6 251 1120 0.23 -0.16 1.00 -0.03
1 1 7 194 1576 0.25 1.39 -0.01 0.76
1 1 8 196 1208 0.32 -0.43 2.51 -0.70
1 1 9 189 2042 0.3 1.72 2.23 0.35
1 1 10 273 609 0.28 -1.19 -0.19 0.60
1 1 11 248 1320 0.28 0.24 1.46 0.81
1 1 12 257 1537 0.24 1.26 0.09 -0.63
1 1 13 187 602 0.3 -0.13 -2.43 0.79
1 1 14 207 331 0.32 -2.30 1.38 -0.56
1 1 15 154 2118 0.29 2.66 1.18 0.04
1 1 16 288 1465 0.28 0.59 1.27 -0.77
1 1 17 148 2003 0.29 2.76 -0.16 0.49
1 1 18 269 886 0.29 -0.92 1.40 -0.89
1 1 19 170 153 0.32 -2.53 -0.57 -0.71
1 1 20 243 1251 0.23 0.83 -0.67 0.34
1 1 21 189 1908 0.31 1.86 1.25 -0.88
1 1 22 265 880 0.26 -0.18 -1.08 0.39
1 1 23 265 467 0.31 -1.64 0.11 -0.89
1 1 24 258 807 0.23 -0.92 0.43 -0.16
1 1 25 184 1657 0.28 1.92 -1.17 0.88
1 1 26 151 352 0.32 -0.83 -2.43 -0.71
1 1 27 264 695 0.24 -0.83 -0.63 -0.25
1 1 28 166 1387 0.29 1.45 -2.41 0.40
1 1 29 288 1404 0.23 0.80 0.63 0.12
1 1 30 177 1852 0.28 0.94 2.46 -0.64
1 1 31 217 1101 0.28 0.70 -1.62 0.92
1 1 32 177 1362 0.27 1.38 -1.93 -0.83
1 1 33 182 287 0.27 -2.08 -0.47 0.93
1 1 34 238 815 0.28 -0.18 -1.40 -0.76
1 1 35 172 1983 0.28 2.55 0.13 -0.72
1 1 36 151 64 0.29 -2.46 -1.40 0.32
1 1 37 205 968 0.29 -0.83 2.08 0.89
1 1 38 238 1184 0.23 0.74 -0.96 -0.58
1 1 39 221 436 0.29 -1.18 -1.51 0.93
1 1 40 172 1607 0.32 0.24 2.65 0.62
1 1 41 143 125 0.31 -2.86 0.04 0.20
1 1 42 138 1891 0.24 2.15 0.58 0.92
1 1 43 229 412 0.3 -2.13 1.14 0.81
1 1 44 170 1646 0.24 1.81 -0.83 -0.94
1 1 45 190 1753 0.29 1.42 1.28 0.93
1 1 46 230 814 0.25 -1.06 0.93 0.78
2 1 1 3 845 0.03 -1.16 2.63 0.48
2 1 2 4 365 0.07 -2.10 2.11 0.23
2 1 3 3 409 0.09 -2.01 2.11 0.41
2 1 4 5 542 0.08 -1.60 2.41 -0.45
2 1 5 3 585 0.06 -1.52 2.14 -0.78
2 1 6 3 497 0.09 -1.76 2.38 0.26
2 1 7 2 604 0.03 -1.48 2.59 -0.20
2 1 8 5 434 0.13 -1.90 2.26 -0.31
2 1 9 2 766 0.02 -1.28 2.31 -0.77
2 1 10 4 957 0.1 -0.96 2.76 0.37
2 1 11 7 665 0.1 -1.42 2.60 0.27
2 1 12 4 878 0.07 -1.11 2.77 0.16
2 1 13 6 505 0.13 -1.70 2.45 -0.18
2 1 14 2 836 0.05 -1.16 2.75 -0.18
2 1 15 5 498 0.12 -1.74 2.06 -0.71
2 1 16 2 343 0.03 -2.20 1.95 0.33
2 1 17 8 790 0.14 -1.28 2.64 0.35
2 1 18 7 593 0.1 -1.51 2.29 -0.67
2 1 19 3 784 0.09 -1.27 2.71 0.10
2 1 20 5 644 0.05 -1.47 2.47 0.48
2 1 21 4 648 0.13 -1.43 2.55 -0.39
2 1 22 6 433 0.09 -1.91 2.11 -0.53
2 1 23 3 367 0.05 -2.08 2.15 0.13
2 1 24 6 479 0.12 -1.77 2.26 -0.49
2 1 25 3 341 0.08 -2.17 2.06 0.05
2 1 26 5 1004 0.08 -0.87 2.86 0.12
2 1 27 4 442 0.13 -1.89 2.32 0.09
2 1 28 3 627 0.06 -1.45 2.63 0.02
2 1 29 5 440 0.1 -1.93 2.22 0.34
2 1 30 4 405 0.07 -1.99 2.24 -0.08
2 1 31 3 420 0.07 -2.00 2.00 0.56
2 1 32 4 753 0.07 -1.33 2.68 -0.11
2 1 33 4 809 0.05 -1.20 2.58 -0.53
2 1 34 4 916 0.12 -1.03 2.81 -0.13
2 1 35 6 735 0.11 -1.34 2.48 -0.57
2 1 36 4 601 0.1 -1.53 2.35 0.60
2 1 37 5 910 0.1 -1.05 2.75 -0.31
2 1 38 7 553 0.13 -1.61 2.42 0.41
2 1 39 2 336 0.06 -2.18 2.05 -0.14
2 1 40 4 493 0.12 -1.82 2.06 0.66
2 1 41 2 379 0.04 -2.11 1.97 0.45
2 1 42 2 587 0.05 -1.52 2.59 0.01
2 1 43 3 555 0.09 -1.63 2.20 0.67
2 1 44 2 494 0.04 -1.80 2.22 0.51
2 1 45 3 615 0.03 -1.51 2.28 0.68
2 1 46 2 334 0.02 -2.22 2.00 0.17
2 2 1 6 297 0.07 -1.85 -0.84 -1.00
2 2 2 6 351 0.07 -1.33 -1.54 -1.00
2 2 3 4 386 0.07 -1.50 -1.04 -0.98
2 2 4 6 448 0.09 -1.35 -0.95 -0.94
2 2 5 7 346 0.11 -1.67 -0.90 -0.99
2 2 6 7 475 0.11 -0.93 -1.57 -0.98
2 2 7 5 178 0.07 -2.14 -1.22 -0.89
2 2 8 5 575 0.07 -0.87 -1.14 -0.83

Now let’s check the compression summary for HVT (torus_mapC). The table below shows the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold for each level.

displayTable(data = torus_mapC[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 46 0 0 n_cells: 46 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans
2 2116 1748 0.83 n_cells: 46 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 0% of the cells in level 1 and 83% of the cells in level 2 have quantization error below the threshold.
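The percentage reported in the compression summary is simply the fraction of cells whose quantization error falls below the `quant.err` threshold passed to `trainHVT`. The following sketch illustrates the calculation on hypothetical per-cell errors (it is not the package's internal code):

```r
# Hypothetical per-cell quantization errors for one level
quant_err <- c(0.05, 0.12, 0.08, 0.09, 0.15)
threshold <- 0.1  # the quant.err value used when training the map

below     <- sum(quant_err < threshold)
pct_below <- round(below / length(quant_err), 2)
pct_below  # proportion of cells below the threshold
```

With these toy numbers, 3 of the 5 cells fall below the threshold, giving 0.6.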

Let’s plot the Voronoi tessellation for layer 2 (map C).

plotHVT(torus_mapC,
        line.width = c(0.2,0.1), 
        color.vec = c("navyblue","steelblue"),
        centroid.size = 0.1,
        maxDepth = 2, 
        plot.type = '2Dhvt')

Figure 11: The Voronoi Tessellation for layer 2 (map C) shown for the 928 cells in the dataset ’torus’ at level 2

6.3 Heatmaps

Now let’s plot each feature for every cell at level two as a heatmap for better visualization.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus dataset, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). In each heatmap, the green shades highlight regions with higher values, while the indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations in, and relationships between, these features within the torus dataset.

  plotHVT(
  torus_mapC,
  child.level = 2,
  hmap.cols = "x",
  line.width = c(0.2,0.1),
  color.vec = c("navyblue","steelblue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 

Figure 12: The Voronoi Tessellation with the heat map overlaid for feature x in the ’torus’ dataset

  plotHVT(
  torus_mapC,
  child.level = 2,
  hmap.cols = "y",
  line.width = c(0.2,0.1),
  color.vec = c("navyblue","steelblue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 

Figure 13: The Voronoi Tessellation with the heat map overlaid for feature y in the ’torus’ dataset

  plotHVT(
  torus_mapC,
  child.level = 2,
  hmap.cols = "z",
  line.width = c(0.2,0.1),
  color.vec = c("navyblue","steelblue"),
  centroid.size = 0.1,
  plot.type = '2Dheatmap'
) 

Figure 14: The Voronoi Tessellation with the heat map overlaid for feature z in the ’torus’ dataset

We now have the set of maps (map A, map B & map C) that will be used for scoring, i.e., determining which map and cell each test record is assigned to.

7. Scoring

Now that we have built the model, let us score our testing dataset (containing 2400 data points) to determine which cell and layer each point belongs to.

The scoreLayeredHVT function is used to score the testing dataset against the trained set of maps. It takes as input a testing dataset and the set of maps (map A, map B, map C).

Now, let us understand the scoreLayeredHVT function.

scoreLayeredHVT(data,
                hvt_mapA,
                hvt_mapB,
                hvt_mapC,
                child.level = 1,
                mad.threshold = 0.2,
                normalize = TRUE,
                distance_metric="L1_Norm",
                error_metric="max",
                yVar)

Each of the parameters of the scoreLayeredHVT function is explained below.

Before that, note the approach of scoreLayeredHVT: it uses the scoreHVT function to score the test data against the results of trainHVT, which are referred to as a ‘map’ here. Hence scoreLayeredHVT scores the test dataset against maps A, B & C and then processes and merges the final output. The arguments used in scoreHVT are therefore important for smooth execution of the function.

When normalize is set to TRUE, the scoreHVT function has an inbuilt feature to standardize the testing dataset based on the mean and standard deviation of the training dataset from the trainHVT results.
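This standardization can be sketched as follows. The data frames `train` and `test` below are hypothetical stand-ins for the training and testing datasets; the point is only to show test data being centered and scaled with the training set’s statistics, as described above:

```r
# Hypothetical training and testing data with the same columns
train <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
test  <- data.frame(x = 2.5, y = 25)

# Mean and standard deviation come from the TRAINING data only
mu    <- sapply(train, mean)
sigma <- sapply(train, sd)

# Standardize the test data with the training statistics
test_scaled <- scale(test, center = mu, scale = sigma)
test_scaled
```

Here the test point coincides with the training means, so both scaled values are 0.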

The function scores based on the HVT maps - map A, map B and map C - constructed using the trainHVT function. For each test record, the function assigns that record to Layer1 or Layer2. Layer1 contains the cell IDs from map A, and Layer2 contains cell IDs from map B (the novelty map) and map C (the map without novelty).

Scoring Algorithm

The Scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the test dataset:

  1. Calculate the distance between the point and the centroid of all the cells in the first level.
  2. Find the cell whose centroid has minimum distance to the point.
  3. Check if the cell drills down further to form more cells.
  4. If it doesn’t, return the path. Otherwise, repeat steps 1 to 3 until we reach a level at which the cell doesn’t drill down further.

Note: The scoring algorithm will not work if any of the variables used to perform quantization are missing; no features should be removed from the testing dataset.
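The recursive descent described in steps 1-4 can be sketched as follows. This is an illustrative toy implementation, not the package’s internal code: `node` is a hypothetical list holding a matrix of centroids and a list of optional child nodes, and distances use the L2 norm for simplicity:

```r
# Sketch of the hierarchical scoring descent for a single test point.
score_point <- function(point, node, path = integer(0)) {
  # Step 1: distance from the point to every centroid at this level (L2)
  d <- apply(node$centroids, 1, function(cent) sqrt(sum((point - cent)^2)))
  # Step 2: pick the cell whose centroid is nearest
  i <- which.min(d)
  path <- c(path, i)
  # Steps 3-4: drill down if the winning cell has children, else return the path
  child <- node$children[[i]]
  if (is.null(child)) return(path)
  score_point(point, child, path)
}

# Toy two-level hierarchy: cell 1 of the root drills down into two child cells
leaf <- list(centroids = rbind(c(0, 0), c(1, 1)), children = list(NULL, NULL))
root <- list(centroids = rbind(c(0, 0), c(5, 5)), children = list(leaf, NULL))

score_point(c(0.9, 0.9), root)  # path of winning cell indices per level
```

For the point (0.9, 0.9), the descent picks cell 1 at level 1 and then cell 2 at level 2, returning the path 1, 2.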

validation_data <- torus_test
new_score <- scoreLayeredHVT(
    data=validation_data,
    hvt_mapA = torus_mapA,
    hvt_mapB = torus_mapB,
    hvt_mapC = torus_mapC,
    normalize = FALSE
  )

Let’s see which cell and layer each point belongs to and check the Mean Absolute Difference for each of the 2400 records. For the sake of brevity, we are only displaying the first 100 rows.

act_pred <- new_score[["actual_predictedTable"]]
rownames(act_pred) <- NULL
act_pred %>% head(100) %>% as.data.frame() %>% Table(scroll = TRUE)
Row.Number act_x act_y act_z Layer1.Cell.ID Layer2.Cell.ID pred_x pred_y pred_z diff
1 -2.6282 0.5656 -0.7253 A85 C153 -2.5258976 -0.5697529 -0.7072982 0.4185524
2 2.7471 -0.9987 -0.3848 A425 C1865 2.5077850 -1.4109928 -0.1993299 0.2790259
3 -2.4446 -1.6528 0.3097 A3 C64 -2.4619927 -1.3983722 0.3219391 0.0946865
4 -2.6487 -0.5745 0.7040 A41 C287 -2.0844505 -0.4682857 0.9330797 0.2998478
5 -0.2676 -1.0800 -0.4611 A157 C815 -0.1826176 -1.4024576 -0.7633130 0.2365510
6 -1.1130 -0.6516 -0.7040 A126 C695 -0.8306652 -0.6299318 -0.2497557 0.2527491
7 2.0288 1.9519 0.5790 A491 C2042 1.7210566 2.2261741 0.3547593 0.2687527
8 -2.4799 1.6863 -0.0470 A140 C331 -2.2995517 1.3755594 -0.5613184 0.3351357
9 -0.4105 -1.1610 -0.6398 A119 C815 -0.1826176 -1.4024576 -0.7633130 0.1976176
10 -0.2545 -1.6160 -0.9314 A83 C815 -0.1826176 -1.4024576 -0.7633130 0.1511706
11 1.1500 0.3945 -0.6205 A352 C1537 1.2572988 0.0921132 -0.6335630 0.1409162
12 -1.2557 -1.1369 0.9520 A67 C436 -1.1822271 -1.5123679 0.9261489 0.1582640
13 -0.5449 -2.6892 -0.6684 A43 C352 -0.8252530 -2.4340675 -0.7088662 0.1919839
14 2.9093 0.7222 -0.0697 A478 C2118 2.6593221 1.1768851 0.0397240 0.2713623
15 2.3205 1.2520 -0.7711 A476 C1908 1.8601725 1.2505847 -0.8798926 0.1901785
16 1.4772 -0.5194 -0.9008 A298 C1646 1.8050471 -0.8284412 -0.9447865 0.2269582
17 -1.3176 -2.6541 0.2690 A11 C88 -1.7876407 -2.2655926 0.0079136 0.3732115
18 1.0687 0.1211 -0.3812 A316 C1537 1.2572988 0.0921132 -0.6335630 0.1566495
19 -0.9632 0.3283 -0.1866 A195 C807 -0.9247605 0.4324310 -0.1556399 0.0578435
20 2.5616 0.4634 0.7976 A465 C1891 2.1489362 0.5766913 0.9229370 0.2170973
21 2.8473 -0.9303 -0.0955 A424 B7 2.8100750 -1.0120500 0.0204813 0.0783188
22 -0.5293 -0.8566 0.1173 A154 C880 -0.1846958 -1.0806849 0.3878845 0.2797579
23 -1.9898 -2.1766 0.3150 A2 C88 -1.7876407 -2.2655926 0.0079136 0.1994128
24 -0.8845 -1.2219 -0.8709 A105 C355 -1.5176280 -1.2759788 -0.9228856 0.2463975
25 0.1553 2.2566 0.9651 A405 C1607 0.2441140 2.6498128 0.6154320 0.2772316
26 2.4262 -0.6069 -0.8655 A383 C1646 1.8050471 -0.8284412 -0.9447865 0.3073269
27 -0.0667 -1.4627 -0.8444 A120 C815 -0.1826176 -1.4024576 -0.7633130 0.0857490
28 -0.0655 -1.3311 -0.7448 A151 C815 -0.1826176 -1.4024576 -0.7633130 0.0689961
29 1.9592 1.5104 0.8806 A458 C1753 1.4176163 1.2761653 0.9340511 0.2764232
30 1.2332 2.5452 0.5603 A479 C2042 1.7210566 2.2261741 0.3547593 0.3374744
31 -0.8720 0.4903 0.0287 A214 C807 -0.9247605 0.4324310 -0.1556399 0.0983231
32 0.2194 -1.7686 0.9760 A139 C1101 0.6990641 -1.6243336 0.9238327 0.2253659
33 1.5052 0.0445 -0.8694 A351 C1537 1.2572988 0.0921132 -0.6335630 0.1771171
34 -2.8410 -0.8651 0.2439 A17 C125 -2.8632007 0.0406280 0.1977916 0.3246790
35 1.3203 -2.5967 0.4077 A104 C1387 1.4533301 -2.4118663 0.3963452 0.1097396
36 -1.5648 1.5577 0.9781 A228 C412 -2.1320349 1.1370603 0.8097262 0.3854162
37 0.3589 -1.0419 -0.4400 A205 C1184 0.7356987 -0.9623697 -0.5815899 0.1993063
38 -0.2900 -2.0106 0.9995 A76 C602 -0.1333930 -2.4335626 0.7935524 0.2618390
39 0.5300 1.3668 0.8455 A374 C1320 0.2436560 1.4647407 0.8135907 0.1387313
40 1.0254 -0.6738 0.6344 A279 C1251 0.8259309 -0.6708288 0.3378642 0.1663254
41 -0.9306 0.3664 0.0154 A214 C807 -0.9247605 0.4324310 -0.1556399 0.0809702
42 2.3888 -1.0670 0.7875 A384 C1657 1.9231207 -1.1734554 0.8763315 0.2203221
43 -0.9830 -0.2043 -0.0897 A163 C695 -0.8306652 -0.6299318 -0.2497557 0.2460074
44 0.9499 0.3135 0.0261 A326 C1404 0.7957924 0.6327927 0.1192434 0.1888479
45 -1.8079 -1.4936 0.9386 A44 C436 -1.1822271 -1.5123679 0.9261489 0.2189640
46 1.8399 -1.9295 -0.7459 A245 C1362 1.3773650 -1.9334028 -0.8315864 0.1840414
47 -0.3304 -1.8481 0.9925 A76 C602 -0.1333930 -2.4335626 0.7935524 0.3271390
48 -2.2806 -1.8984 0.2536 A3 C64 -2.4619927 -1.3983722 0.3219391 0.2499199
49 -2.3323 1.7320 0.4252 A207 C412 -2.1320349 1.1370603 0.8097262 0.3932437
50 0.5520 0.8441 0.1308 A331 C1404 0.7957924 0.6327927 0.1192434 0.1555521
51 -0.9449 2.2273 0.9078 A289 C968 -0.8343712 2.0839820 0.8888732 0.0909246
52 0.2334 -1.4612 -0.8540 A132 C815 -0.1826176 -1.4024576 -0.7633130 0.1884824
53 2.7387 0.9703 0.4244 A481 C2118 2.6593221 1.1768851 0.0397240 0.2235463
54 0.3561 1.1619 -0.6199 A340 C1465 0.5898708 1.2696326 -0.7664024 0.1626686
55 1.7006 1.5569 -0.9522 A452 C1908 1.8601725 1.2505847 -0.8798926 0.1793984
56 1.7244 -0.5698 0.9829 A357 C1657 1.9231207 -1.1734554 0.8763315 0.3029815
57 0.9922 1.1438 -0.8741 A373 C1465 0.5898708 1.2696326 -0.7664024 0.2119531
58 -0.3022 -1.3611 0.7956 A130 C880 -0.1846958 -1.0806849 0.3878845 0.2685449
59 -0.9693 1.0602 0.8261 A236 C814 -1.0599848 0.9345948 0.7788200 0.0878567
60 1.1313 -0.3595 -0.5824 A294 C1537 1.2572988 0.0921132 -0.6335630 0.2095917
61 -0.7561 -2.5384 -0.7611 A31 C352 -0.8252530 -2.4340675 -0.7088662 0.0752397
62 2.3168 1.8924 0.1302 A499 C2118 2.6593221 1.1768851 0.0397240 0.3828377
63 1.2363 -2.6444 -0.3939 A108 C874 0.4550169 -2.6699721 -0.5138344 0.3089299
64 -1.3204 -0.6281 0.8430 A111 C609 -1.1912839 -0.1941707 0.6042769 0.2672562
65 1.3733 1.1877 0.9829 A409 C1753 1.4176163 1.2761653 0.9340511 0.0605435
66 1.0874 -0.1278 0.4251 A333 C1576 1.3876845 -0.0060711 0.7561144 0.2510093
67 2.1300 -1.2171 -0.8914 A301 C1646 1.8050471 -0.8284412 -0.9447865 0.2556661
68 1.6863 -0.5945 0.9773 A357 C1657 1.9231207 -1.1734554 0.8763315 0.3055815
69 0.8504 1.0927 -0.7882 A373 C1465 0.5898708 1.2696326 -0.7664024 0.1530865
70 0.3029 1.0731 0.4656 A336 C1320 0.2436560 1.4647407 0.8135907 0.2662918
71 -1.4724 1.1331 0.9899 A210 C814 -1.0599848 0.9345948 0.7788200 0.2740001
72 -0.5452 -1.2243 0.7514 A136 C880 -0.1846958 -1.0806849 0.3878845 0.2892116
73 -1.6866 2.1137 0.7101 A226 C968 -0.8343712 2.0839820 0.8888732 0.3535733
74 1.2012 -2.0386 -0.9305 A158 C1362 1.3773650 -1.9334028 -0.8315864 0.1267586
75 -0.2108 2.3579 0.9301 A405 C968 -0.8343712 2.0839820 0.8888732 0.3129054
76 -0.5982 1.3776 -0.8671 A265 C886 -0.9236587 1.3992335 -0.8902766 0.1234229
77 -0.2116 -1.0573 -0.3878 A157 C815 -0.1826176 -1.4024576 -0.7633130 0.2498843
78 -0.7802 -0.9000 -0.5880 A118 C695 -0.8306652 -0.6299318 -0.2497557 0.2195926
79 1.0850 -1.6815 1.0000 A196 C1101 0.6990641 -1.6243336 0.9238327 0.1730899
80 1.5563 0.1715 -0.9008 A351 C1537 1.2572988 0.0921132 -0.6335630 0.2152083
81 -0.3790 1.4273 0.8522 A318 C1320 0.2436560 1.4647407 0.8135907 0.2329020
82 -1.2769 -0.2633 0.7178 A122 C609 -1.1912839 -0.1941707 0.6042769 0.0894228
83 -1.6039 2.4566 0.3575 A257 C567 -1.5739055 2.3988776 -0.0018896 0.1490355
84 -0.9297 2.4281 -0.8000 A309 C1208 -0.4306005 2.5131337 -0.7020495 0.2273612
85 0.5324 -0.8526 0.1016 A220 C1251 0.8259309 -0.6708288 0.3378642 0.2371888
86 0.3928 1.5433 -0.9132 A362 C1465 0.5898708 1.2696326 -0.7664024 0.2058453
87 1.0031 0.3850 -0.3786 A327 C1537 1.2572988 0.0921132 -0.6335630 0.2673495
88 -0.7562 0.7889 -0.4207 A232 C807 -0.9247605 0.4324310 -0.1556399 0.2633632
89 -1.0870 -0.7523 -0.7350 A102 C695 -0.8306652 -0.6299318 -0.2497557 0.2879824
90 -1.8671 -0.8423 -0.9988 A59 C355 -1.5176280 -1.2759788 -0.9228856 0.2863551
91 0.8325 -0.9413 0.6689 A242 C1251 0.8259309 -0.6708288 0.3378642 0.2026920
92 -0.3355 0.9636 0.2005 A277 C1120 -0.1584637 1.0003434 -0.0305096 0.1482631
93 -1.0089 -0.6007 0.5639 A133 C609 -1.1912839 -0.1941707 0.6042769 0.2097634
94 1.7725 1.7153 -0.8845 A452 C1908 1.8601725 1.2505847 -0.8798926 0.1856651
95 0.5539 -0.8888 0.3037 A220 C1251 0.8259309 -0.6708288 0.3378642 0.1747221
96 0.8149 -2.6016 0.6874 A84 C1387 1.4533301 -2.4118663 0.3963452 0.3730729
97 0.1104 1.7654 -0.9729 A379 C1465 0.5898708 1.2696326 -0.7664024 0.3939119
98 1.0107 0.3118 0.3349 A326 C1404 0.7957924 0.6327927 0.1192434 0.2505190
99 2.2697 -0.3642 0.9543 A403 C1891 2.1489362 0.5766913 0.9229370 0.3643394
100 0.4983 -0.8672 -0.0185 A220 C1251 0.8259309 -0.6708288 0.3378642 0.2934554
hist(act_pred$diff, breaks = 30, col = "blue", main = "Mean Absolute Difference", xlab = "Difference")

Figure 16: Mean Absolute Difference
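The diff column in the table above is the mean absolute difference between a record’s actual and predicted coordinates. The sketch below reproduces the value for the first scored record using the numbers shown in the table:

```r
# Actual and predicted coordinates of record 1 from the scored table above
actual    <- c(x = -2.6282,     y =  0.5656,     z = -0.7253)
predicted <- c(x = -2.5258976,  y = -0.5697529,  z = -0.7072982)

# Mean absolute difference across the three features
mad_record <- mean(abs(actual - predicted))
round(mad_record, 7)  # matches the diff value 0.4185524 for row 1
```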

8. Executive Summary

9. References

  1. Topology Preserving Maps

  2. Vector Quantization

  3. K-means

  4. Sammon’s Projection

  5. Voronoi Tessellations