The HVT package is a collection of R functions to facilitate building topology preserving maps for rich multivariate data analysis, tending towards big data with a large number of rows. A collection of R functions for this typical workflow is organized below:
Data Compression: Vector Quantization (VQ), HVQ (Hierarchical Vector Quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective.
Data Projection: Dimension projection of the compressed cells to 1D, 2D and an interactive surface plot with Sammon's non-linear algorithm. This step creates the topology preserving map coordinates (also called embeddings) in the desired output dimension.
Tessellation: Create the cells required for object visualization using the Voronoi tessellation method; the package includes heatmap plots for Hierarchical Voronoi Tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map. Useful for semi-supervised tasks.
Scoring: Scoring data sets and recording their assignment using the map objects from the above steps, in a sequence of maps if required.
Temporal Analysis and Visualization: A collection of functions that leverages the capacity of the HVT package by analyzing time series data for its underlying patterns, calculating transition probabilities, and visualizing the flow of data over time.
Compression is a technique used to reduce data size while preserving its essential information, allowing for efficient storage and decompression to reconstruct the original data. Vector quantization (VQ) is a technique used in data compression to represent a set of data points with a smaller number of representative vectors. It achieves compression by exploiting redundancies or patterns in the data and replacing similar data points with representative vectors.
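As a toy illustration (not the package's own routine), stats::kmeans can serve as a vector quantizer: each point is replaced by the centroid (codeword) of its cluster.

# Minimal VQ sketch with stats::kmeans (illustrative only)
set.seed(1)
X   <- matrix(rnorm(200 * 2), ncol = 2)  # 200 points in 2 dimensions
km  <- kmeans(X, centers = 8)            # learn 8 codewords (the codebook)
X_q <- km$centers[km$cluster, ]          # replace each point by its codeword
nrow(unique(X_q))                        # the 200 rows now take only 8 distinct values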
This package offers several advantages for performing data compression as it is designed to handle high-dimensional data efficiently. It provides a hierarchical compression approach, allowing a multi-resolution representation of the data. The hierarchical structure enables efficient compression and storage of the data while preserving different levels of detail. HVT aims to preserve the topological structure of the data during compression. Spatial data with irregular shapes and complex structures in high-dimensional data can contain valuable information about relationships and patterns. HVT seeks to capture and retain these topological characteristics, enabling meaningful analysis and visualization. This package employs tessellation to divide the compressed data space into distinct cells or regions while preserving the topology of the original data. This means that the relationships and connectivity between data points are maintained in the compressed representation.
This package can perform vector quantization using the following algorithms:

k-means: the second and third steps (assignment and centroid update) are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(n).

k-medoids: the second and third steps (assignment and medoid update) are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(k * (n-k)^2).
These algorithms divide the dataset recursively into cells using the k-means or k-medoids algorithm. The maximum number of subsets is decided by setting n_cells to, say, five, in order to divide the dataset into a maximum of five subsets. These five subsets are further divided into five subsets (or fewer), resulting in a total of twenty-five (5*5) subsets. The recursion terminates when a cell either contains fewer than three data points or a stop criterion is reached. In this case, the stop criterion is met when the cell error exceeds the quantization threshold.
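The recursion just described can be sketched in a few lines. This is a simplified illustration assuming k-means, Euclidean distance and a max-distance cell error; the package's hvq function is the real implementation.

# Simplified sketch of hierarchical VQ: split a cell with k-means and recurse
# into any child cell whose quantization error exceeds the threshold.
recursive_vq <- function(X, n_cells = 5, quant_thresh = 0.1) {
  if (nrow(X) < 3) return(list(points = X))             # too few points: stop
  km <- kmeans(X, centers = min(n_cells, nrow(X) - 1))
  lapply(seq_len(nrow(km$centers)), function(i) {
    cell <- X[km$cluster == i, , drop = FALSE]
    qe   <- max(sqrt(rowSums(sweep(cell, 2, km$centers[i, ])^2)))  # max distance to centroid
    if (qe > quant_thresh) recursive_vq(cell, n_cells, quant_thresh)  # subdivide further
    else list(centroid = km$centers[i, ], qe = qe)      # threshold met: stop here
  })
}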
The steps for this method are as follows:
The stop criterion is when the quantization error of a cell satisfies one of the below conditions:
The quantization error for a cell is defined as follows:
\[QE = \max_i(||A-F_i||_{p})\]
where \(A\) is the centroid of the cell and \(F_i\) denotes the data points in the cell.
Let us try to understand quantization error with an example.
An example of a 2-dimensional VQ is shown above.
In the above image, we can see 5 cells with each cell containing a certain number of points. The centroid for each cell is shown in blue. These centroids are also known as codewords since they represent all the points in that cell. The set of all codewords is called a codebook.
Now we want to calculate the quantization error for each cell. For the sake of simplicity, let’s consider only one cell having centroid \(A\) and \(m\) data points \(F_i\) for calculating the quantization error.
For each point, we calculate the distance between the point and the centroid.
\[ d = ||A - F_i||_{p} \]
In the above equation, p = 1 means L1_Norm distance whereas p = 2 means L2_Norm distance. In the package, the L1_Norm distance is chosen by default. The user can pass either L1_Norm, L2_Norm or a custom function to calculate the distance between two points in n dimensions.
\[QE = \max_i(||A-F_i||_{p})\]
Now, we take the maximum calculated distance of all m points. This gives us the furthest distance of a point in the cell from the centroid, which we refer to as the Quantization Error. If the Quantization Error is higher than the given threshold, the centroid/codevector is not a good representation for the points in the cell. Now we can perform further vector quantization on these points and repeat the above steps.
Please note that the user can select mean, max or any custom function to calculate the Quantization Error. The custom function takes a vector of m values (where each value is the distance between a point in n dimensions and the cell centroid) and returns a single value, which is the Quantization Error for the cell.
If we select mean as the error metric, the above Quantization Error equation will look like this:
\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]
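A direct translation of these definitions into R, as a hedged sketch (the package's own distance and error functions are configurable as described above):

# Quantization error for one cell: A is the centroid, F_mat holds the
# cell's m points (one row per point), p picks the norm, agg the error metric.
quant_error <- function(A, F_mat, p = 1, agg = max) {
  d <- apply(F_mat, 1, function(f) sum(abs(A - f)^p)^(1/p))  # ||A - F_i||_p
  agg(d)                                                     # max (default) or mean
}
pts <- rbind(c(0, 0), c(1, 1), c(2, 0))  # a toy cell with m = 3 points
A   <- colMeans(pts)                     # its centroid
quant_error(A, pts, p = 1, agg = max)    # QE with the max error metric
quant_error(A, pts, p = 1, agg = mean)   # QE with the mean error metric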
Projection mainly involves converting data from its original form to a different space or coordinate system while preserving certain properties of it. By projecting data into a common coordinate system, spatial relationships, distances, areas, and other spatial attributes can be accurately measured and compared.
HVT performs projection as part of its workflow to visualize and explore high-dimensional data. The projection step in HVT involves mapping the compressed data, represented by the hierarchical structure of cells, onto a lower-dimensional space for visualization purposes, as human perception is more suited to interpreting information in lower-dimensional spaces. Users can zoom in/out, rotate, and explore different regions of the projected space to gain insights and understand the data from different perspectives.
Sammon’s projection is an algorithm used in this package to map a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.
Denote the distance between the \(i^{th}\) and \(j^{th}\) objects in the original space by \(d_{ij}^*\), and the distance between their projections by \(d_{ij}\). Sammon’s mapping aims to minimize the error function below, which is often referred to as Sammon’s stress or Sammon’s error:
\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]
The minimization can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations needs to be experimentally determined, and convergent solutions are not always guaranteed. Many implementations prefer to use the first principal components as a starting configuration.
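Since the package uses sammon from MASS (see the tessellation discussion later), the projection step can be reproduced stand-alone; a minimal sketch:

# Project 3D points to 2D by minimizing Sammon's stress defined above.
library(MASS)
set.seed(42)
X  <- matrix(rnorm(50 * 3), ncol = 3)      # 50 points in 3 dimensions
sp <- sammon(dist(X), k = 2, niter = 100)  # iterative minimization of E
plot(sp$points, xlab = "Sammon 1", ylab = "Sammon 2")
sp$stress                                  # final value of the error function E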
A Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there will be a corresponding region consisting of all points within proximity of that seed. These regions are called Voronoi cells. It is complementary to Delaunay triangulation, a geometrical algorithm used to create a triangulated mesh from a set of points in a plane which has the property that no data point lies within the circumcircle of any triangle in the triangulation. This property guarantees that the resulting cells in the tessellation do not overlap with each other.
By using Delaunay triangulation, HVT can achieve a partitioning of the data space into distinct and non-overlapping regions, which is crucial for accurately representing and analyzing the compressed data. Additionally, the use of Delaunay triangulation for tessellation ensures that the resulting cells have well-defined shapes, typically triangles in two dimensions or tetrahedra in three dimensions.
The hierarchical structure resulting from tessellation preserves the inherent structure and relationships within the data. It captures clusters, subclusters, and other patterns in the data, allowing for a more organized and interpretable representation. The hierarchical structure reduces redundancy and enables more compact representations.
Tessellate: Constructing Voronoi Tessellation
In this package, we use sammon from the package MASS to project higher dimensional data to a 2D space. The function hvq called from the trainHVT function returns hierarchical quantized data which will be the input for construction of the tessellations. The data is then represented in 2D coordinates and the tessellations are plotted using these coordinates as centroids. We use the package deldir for this purpose. The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, a transformation is performed on the 2D coordinates to get all the points within the parent tile. Tessellations are plotted using these transformed points as centroids. The lines in the tessellations are chopped in places so that they do not protrude outside the parent polygon. This is done for all the subsequent levels.
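For intuition, here is a hedged stand-alone sketch of what deldir does with a set of 2D centroids (the package wires this into plotHVT for you):

# Compute and draw a Voronoi tessellation from 20 seed points.
library(deldir)
set.seed(7)
x  <- runif(20); y <- runif(20)        # 20 seeds (e.g. projected centroids)
dd <- deldir(x, y)                     # Delaunay triangulation + Dirichlet tiles
plot(dd, wlines = "tess")              # draw only the Voronoi (tessellation) edges
points(x, y, pch = 21, bg = "black")   # overlay the generators/centroids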
Scoring refers to the process of estimating future values or outcomes based on existing data patterns. In the training process, a model is developed based on historical data or a training dataset, and this model is then used to score new, unseen data. The model captures the underlying patterns, trends, and relationships present in the training data, allowing it to pinpoint the cell of similar or related data points.
In this package, we use the scoreHVT function to score each point in the testing dataset.
Scoring Algorithm
The Scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the testing dataset:
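Setting the individual steps aside, the core of the method is nearest-centroid assignment. A minimal sketch for a single level, assuming the default L1_Norm (this is illustrative, not scoreHVT itself):

# Assign one test point to the nearest centroid and report its distance.
assign_cell <- function(point, centroids) {
  d <- apply(centroids, 1, function(ctr) sum(abs(point - ctr)))  # L1 distances
  list(cell = which.min(d), quant_error = min(d))                # nearest cell wins
}
centroids <- rbind(c(0, 0), c(5, 5), c(0, 5))  # toy level-1 codebook
assign_cell(c(4.2, 4.9), centroids)            # -> cell 2, with its distance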
This chunk verifies the installation of all the packages necessary to successfully run this vignette, installs them if missing, and attaches all the packages to the session environment.
<- c("dplyr", "kableExtra", "geozoo", "plotly", "purrr",
list.of.packages "data.table", "gridExtra","plyr","HVT")
<-list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
new.packages if (length(new.packages))
install.packages(new.packages, dependencies = TRUE, repos='https://cloud.r-project.org/')
invisible(lapply(list.of.packages, library, character.only = TRUE))
In this section we explore the capacity of the package to visualize multidimensional data by projecting it to two dimensions using Sammon’s projection, which is further used for scoring.
Data Understanding
First of all, let us see how to generate data for a torus. We are using the library geozoo for this purpose. Geo Zoo (stands for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. Geo Zoo contains regular or well-known objects, e.g. cube and sphere, and some abstract objects, e.g. Boy’s surface, Torus and Hyper-Torus.
Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
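For reference, such a torus can be written parametrically. With major radius \(R\) (the distance from the axis of revolution to the center of the tube) and minor radius \(r\) (the radius of the tube), a surface point is

\[(x,\,y,\,z) = \big((R + r\cos\theta)\cos\phi,\;(R + r\cos\theta)\sin\phi,\;r\sin\theta\big), \quad \theta,\phi \in [0, 2\pi)\]

The coordinate ranges in the sample displayed below are consistent with \(R = 2\) and \(r = 1\); we note this as an observation about the generated data rather than a documented geozoo default.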
Torus Dataset
The torus dataset includes the following columns:
Let’s explore the torus dataset containing 12000 points. For the sake of brevity we are displaying the first 6 rows.
set.seed(240)
# Here p represents dimension of object, n represents number of points
torus <- geozoo::torus(p = 3, n = 12000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x", "y", "z")
torus_df <- torus_df %>% round(4)
displayTable(head(torus_df))
x | y | z |
---|---|---|
-2.6282 | 0.5656 | -0.7253 |
-1.4179 | -0.8903 | 0.9455 |
-1.0308 | 1.1066 | -0.8731 |
1.8847 | 0.1895 | 0.9944 |
-1.9506 | -2.2507 | 0.2071 |
-1.4824 | 0.9229 | 0.9672 |
Let’s visualize the torus (donut) in 3D Space.
plot_ly(x = torus_df$x, y = torus_df$y, z = torus_df$z,
        type = 'scatter3d', mode = 'markers',
        marker = list(color = torus_df$z, colorscale = c('red', 'blue'),
                      showscale = TRUE, size = 3, colorbar = list(title = 'z'))) %>%
  layout(scene = list(xaxis = list(title = 'x'), yaxis = list(title = 'y'),
                      zaxis = list(title = 'z'),
                      aspectratio = list(x = 1, y = 1, z = 0.4)))
Now let’s have a look at the structure of the torus dataset.
str(torus_df)
## 'data.frame': 12000 obs. of 3 variables:
## $ x: num -2.63 -1.42 -1.03 1.88 -1.95 ...
## $ y: num 0.566 -0.89 1.107 0.19 -2.251 ...
## $ z: num -0.725 0.946 -0.873 0.994 0.207 ...
Data distribution
This section displays four objects.
Variable Histograms: The histogram distribution of all the features in the dataset.
Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.
Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.
Summary EDA: The table provides descriptive statistics for all the features in the dataset.
It uses an inbuilt function called edaPlots to display the above-mentioned four objects.
NOTE: The input dataset should be a data frame object and the columns should be only numeric type.
edaPlots(torus_df, output_type = 'summary', n_cols = 3)
edaPlots(torus_df, output_type = 'histogram', n_cols = 3)
edaPlots(torus_df, output_type = 'boxplot', n_cols = 3)
edaPlots(torus_df, output_type = 'correlation', n_cols = 3)
Train - Test Split
Let us split the torus dataset into train and test. We will randomly select 80% of the data as train and remaining as test.
smp_size <- floor(0.80 * nrow(torus_df))
set.seed(279)
train_ind <- sample(seq_len(nrow(torus_df)), size = smp_size)
torus_train <- torus_df[train_ind, ]
torus_test <- torus_df[-train_ind, ]
Training Dataset
Now, let’s have a look at the selected training dataset (containing 9600 data points). For the sake of brevity we are displaying the first six rows.
rownames(torus_train) <- NULL
displayTable(head(torus_train))
x | y | z |
---|---|---|
1.7958 | -0.4204 | -0.9878 |
0.7115 | -2.3528 | -0.8889 |
1.9285 | 1.2034 | 0.9620 |
1.0175 | 0.0344 | -0.1894 |
-0.2736 | 1.1298 | -0.5464 |
1.8976 | 2.2391 | 0.3545 |
Now let’s have a look at the structure of the training data.
str(torus_train)
## 'data.frame': 9600 obs. of 3 variables:
## $ x: num 1.796 0.712 1.929 1.018 -0.274 ...
## $ y: num -0.4204 -2.3528 1.2034 0.0344 1.1298 ...
## $ z: num -0.988 -0.889 0.962 -0.189 -0.546 ...
Data Distribution
edaPlots(torus_train, n_cols = 3, output_type = "summary")
edaPlots(torus_train, n_cols = 3, output_type = "histogram")
edaPlots(torus_train, n_cols = 3, output_type = "boxplot")
edaPlots(torus_train, n_cols = 3, output_type = "correlation")
Testing Dataset
Now, let’s have a look at the testing dataset (containing 2400 data points). For the sake of brevity we are displaying the first six rows.
rownames(torus_test) <- NULL
displayTable(head(torus_test))
x | y | z |
---|---|---|
-2.6282 | 0.5656 | -0.7253 |
2.7471 | -0.9987 | -0.3848 |
-2.4446 | -1.6528 | 0.3097 |
-2.6487 | -0.5745 | 0.7040 |
-0.2676 | -1.0800 | -0.4611 |
-1.1130 | -0.6516 | -0.7040 |
Now let’s have a look at the structure of the test data.
str(torus_test)
## 'data.frame': 2400 obs. of 3 variables:
## $ x: num -2.628 2.747 -2.445 -2.649 -0.268 ...
## $ y: num 0.566 -0.999 -1.653 -0.575 -1.08 ...
## $ z: num -0.725 -0.385 0.31 0.704 -0.461 ...
Data Distribution
edaPlots(torus_test,n_cols = 3, output_type = "summary")
edaPlots(torus_test,n_cols = 3, output_type = "histogram")
edaPlots(torus_test,n_cols = 3, output_type = "boxplot")
edaPlots(torus_test,n_cols = 3, output_type = "correlation")
Note: The steps of compression, projection, and tessellation are iteratively performed until a minimum compression rate of 80% is achieved. Once the desired compression is attained, the resulting model object is used for scoring using the scoreHVT() function.
The core function for compression in the workflow is hvq, which is called within the trainHVT function. We have a parameter called quantization error, which acts as a threshold and determines the number of levels in the hierarchy. It means that, if there are ‘n’ levels in the hierarchy, then all the clusters formed up to this level will have a quantization error equal to or greater than the threshold quantization error. The user can define the number of clusters in the first level of the hierarchy, and then each cluster in the first level is subdivided into the same number of clusters as there are in the first level. This process continues, and each group is divided into smaller clusters as long as the threshold quantization error is met. The output of this technique is hierarchically arranged vector quantized data.
However, let’s try to comprehend the trainHVT function first before moving on.
trainHVT(
  data,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  normalize = TRUE,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  dim_reduction_method = c("sammon", "tsne", "umap"),
  scale_summary = NA,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8,
  projection.scale,
  tsne_perplexity, tsne_theta, tsne_verbose,
  tsne_eta, tsne_max_iter,
  umap_n_neighbors, umap_min_dist
)
Each of the parameters of trainHVT function have been explained below:
data - A dataframe, with numeric columns (features), that will be used for training the model.

min_compression_perc - An integer, indicating the minimum compression percentage to be achieved for the dataset. It indicates the desired level of reduction in dataset size compared to its original size.

n_cells - An integer, indicating the number of cells per hierarchy (level). This parameter determines the granularity or level of detail in the hierarchical vector quantization.

depth - An integer, indicating the number of levels. A depth of 1 means no hierarchy (single level), while higher values indicate multiple levels (hierarchy).

quant.err - A number indicating the quantization error threshold. A cell will only break down into further cells if the quantization error of the cell is above the defined quantization error threshold.

normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, scales the values of all features to have a mean of 0 and a standard deviation of 1 (Z-score).

distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between a datapoint and its centroid.

error_metric - The error metric can be mean or max. max is selected by default. max will return the maximum of m values and mean will take the mean of m values, where each value is the distance between a point and the centroid of the cell.

quant_method - The quantization method can be kmeans or kmedoids. kmeans uses means (centroids) as cluster centers while kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.

dim_reduction_method - The dimensionality reduction method to be used. Options are 'sammon', 'tsne' and 'umap'. Default is 'sammon'.

scale_summary - A list with user-defined mean and standard deviation values for all the features in the dataset. Pass the scale summary when normalize is set to FALSE.

diagnose - A logical value indicating whether the user wants to perform diagnostics on the model. Default value is FALSE.

hvt_validation - A logical value indicating whether the user wants to hold out a validation set and find the mean absolute deviation of the validation points from the centroid. Default value is FALSE.

train_validation_split_ratio - A numeric value indicating the train-validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value is 0.8.

projection.scale - A number indicating the scale factor for the tessellations, so that the sub-tessellations are visible well enough. It helps in adjusting the visual representation of the hierarchy to make the sub-tessellations more visible. Default is 10.

tsne_perplexity - A numeric, balances the attention t-SNE gives to local and global aspects of the data. Lower values focus more on local structure, while higher values consider more global structure. It is recommended to be between 5 and 50. Default value is 30.

tsne_theta - A numeric, speed/accuracy trade-off parameter for the Barnes-Hut approximation. If set to 0, exact t-SNE is performed, which is slower. If set to greater than 0, an approximation is used, which speeds up the process but may reduce accuracy. Default value is 0.5.

tsne_eta (learning_rate) - A numeric, learning rate for t-SNE optimization. It determines the step size during optimization. If too low, the algorithm might get stuck in local minima; if too high, the solution may become unstable. Default value is 200.

tsne_max_iter - An integer, the maximum number of iterations for the optimization process. More iterations can improve results but increase computation time. Default value is 1000.

umap_n_neighbors - An integer, the size of the local neighborhood (in terms of the number of neighboring sample points) used for manifold approximation. It controls the balance between local and global structure in the data: smaller values focus on local structure, while larger values capture more global structure. Default value is 15.

umap_min_dist - A numeric, the minimum distance between points in the embedded space. It controls how tightly UMAP packs points together; lower values result in a more clustered embedding. Default value is 0.1.
The output of the trainHVT function (a list of 7 elements) is explained below, with an image attached for clear understanding.
NOTE: The attached image is a snapshot of the output list generated from iteration 1, which can be referred to later in this section.
The ‘1st element’ is a list containing information related to plotting tessellations. This information might include coordinates, boundaries, or other details necessary for visualizing the tessellations.
The ‘2nd element’ is a list containing information related to Sammon’s projection coordinates of the data points in the reduced-dimensional space.
The ‘3rd element’ is a list containing detailed information about the hierarchical vector quantized data along with a summary section containing the number of points, quantization error and the centroids for each cell in 2D.
The ‘4th element’ is a list that contains all the diagnostics information of the model when diagnose is set to TRUE. Otherwise NA.
The ‘5th element’ is a list that contains all the information required to generate a Mean Absolute Deviation (MAD) plot, if hvt_validation is set to TRUE. Otherwise NA.
The ‘6th element’ is a list containing detailed information about the hierarchical vector quantized data along with a summary section containing the number of points, quantization error and the centroids for each cell, which is the output of hvq.
The ‘7th element’ (model info) is a list that contains model generated timestamp, input parameters passed to the model, validation results and the dimensionality reduction evaluation metrics table.
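Once a model has been trained (e.g. hvt.torus from Iteration 1 below), the elements can be inspected by position; a hedged sketch:

length(hvt.torus)                   # the 7 elements described above
hvt.torus[[3]]$compression_summary  # 3rd element: compression summary
hvt.torus[[7]]                      # 7th element: model info (timestamp, inputs, metrics)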
We will use the trainHVT function to compress our data while preserving the essential features of the dataset. Our goal is to achieve data compression of at least 80%. In situations where the compression ratio does not meet the desired target, we can explore adjusting the model parameters as a potential solution. This involves making modifications to parameters such as the quantization error threshold or increasing the number of cells, and then rerunning the trainHVT function.
In our example we will iteratively increase the number of cells until the desired compression percentage is reached, instead of increasing the quantization threshold, because raising the threshold may reduce the level of detail captured in the data representation. A minimal sketch of such a loop is shown below.
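This iteration can be scripted; a minimal sketch, assuming (as in the compression summaries below) that the percentOfCellsBelowQuantizationErrorThreshold column holds a proportion:

# Keep growing n_cells until at least 80% of cells fall below quant.err.
n <- 100
repeat {
  fit <- trainHVT(torus_train, n_cells = n, depth = 1, quant.err = 0.1,
                  normalize = FALSE, distance_metric = "L2_Norm",
                  error_metric = "max", quant_method = "kmeans",
                  dim_reduction_method = "sammon")
  pct <- as.numeric(fit[[3]]$compression_summary$percentOfCellsBelowQuantizationErrorThreshold)
  if (pct >= 0.80) break  # desired compression reached
  n <- n + 200            # otherwise subdivide further (100 -> 300 -> 500)
}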
Iteration 1:
We will pass the below-mentioned model parameters along with the torus training dataset (containing 9600 datapoints) to the trainHVT function.
Model Parameters
hvt.torus <- trainHVT(
  torus_train,
  n_cells = 100,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L2_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon"
)
Initial stress : 0.01711
stress after 10 iters: 0.01391, magic = 0.500
stress after 20 iters: 0.01391, magic = 0.500
displayTable(hvt.torus[[3]]$compression_summary)
segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
---|---|---|---|---|
1 | 100 | 0 | 0 | n_cells: 100 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans |
As can be seen from the table above, none of the 100 cells has reached the quantization error threshold. Therefore, we can further subdivide the cells by increasing the n_cells parameter and then see if the desired compression (80%) is reached.
Let’s take a look at the 1D projection of this iteration. The output of hvq from the trainHVT function is passed to the plotHVT function, which applies Sammon’s 1D projection using the MASS package. The resulting 1D Sammon’s points are used to determine their corresponding cell IDs and are subsequently plotted in a plotly object.
plotHVT(hvt.torus, plot.type = '1D')
Iteration 2:
Let’s retry by increasing the n_cells parameter to 300 for the torus training dataset (containing 9600 datapoints).
Model Parameters
hvt.torus2 <- trainHVT(
  torus_train,
  n_cells = 300,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L2_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon"
)
Initial stress : 0.01813
stress after 10 iters: 0.01456, magic = 0.500
stress after 20 iters: 0.01455, magic = 0.500
displayTable(hvt.torus2[[3]]$compression_summary)
segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
---|---|---|---|---|
1 | 300 | 143 | 0.48 | n_cells: 300 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans |
It can be observed from the table above that only 143 cells out of 300, i.e. 48% of the cells, reached the quantization error threshold. Therefore, we can further subdivide the cells by increasing the n_cells parameter and then see if 80% compression is reached.
plotHVT(hvt.torus2, plot.type = '1D')
Iteration 3:
Since we are yet to achieve a compression of at least 80%, let’s try again by increasing the n_cells parameter to 500 for the torus training dataset (containing 9600 datapoints).
Model Parameters
set.seed(240)
hvt.torus3 <- trainHVT(
  torus_train,
  n_cells = 500,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L2_Norm",
  error_metric = "max",
  quant_method = "kmeans",
  dim_reduction_method = "sammon"
)
Initial stress : 0.01925
stress after 10 iters: 0.01527, magic = 0.500
stress after 20 iters: 0.01527, magic = 0.500
displayTable(hvt.torus3[[3]]$compression_summary)
segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
---|---|---|---|---|
1 | 500 | 448 | 0.9 | n_cells: 500 quant.err: 0.1 distance_metric: L2_Norm error_metric: max quant_method: kmeans |
By increasing the number of cells to 500, we were successfully able to compress 90% of the data, so we will not further subdivide the cells.
Having successfully compressed the data to 90% with the n_cells parameter set to 500, the next step involves performing data projection on the compressed data. In this step, the compressed data will be transformed and projected onto a lower-dimensional space to visualize and analyze the data in a more manageable form.
plotHVT(hvt.torus3, plot.type = '1D')
This section focuses on projecting Sammon’s dimensionality reduction from multiple dimensions to 2D. The following plots will have the centroids plotted in 2D space, with the x coordinate of the centroid points on the X-axis and the y coordinate on the Y-axis.
Now let’s try to understand the plotHVT function. The parameters have been explained in detail below:
plotHVT(hvt.results, line.width, color.vec, pch1, centroid.size,
        title, maxDepth, child.level, hmap.cols, centroid.color,
        quant.error.hmap, n_cells.hmap, cell_id,
        label.size, sepration_width, layer_opacity, dim_size,
        plot.type = '2Dhvt')
hvt.results - (1D/2Dproj/2Dhvt/2Dheatmap/surface_plot) A list obtained from the trainHVT function. This list provides an overview of the hierarchical vector quantized data, including diagnostics, tessellation details, Sammon’s projection coordinates, and model input information.

line.width - (2Dhvt/2Dheatmap) A vector indicating the line widths of the tessellation boundaries for each layer.

color.vec - (2Dhvt/2Dheatmap) A vector indicating the colors of the tessellation boundaries at each layer.

pch1 - (2Dhvt/2Dheatmap) Symbol used to plot the centroids, such as a solid circle, bullet, filled square or filled diamond (default = 21, i.e. filled circle).

centroid.size - (2Dhvt/2Dheatmap) A vector of sizes of the centroids for each level of tessellations.

centroid.color - (2Dhvt/2Dheatmap) A vector of colors of the centroids for each level of tessellations.

title - (2Dhvt) Set a title for the plot (default = NULL).

maxDepth - (2Dhvt) An integer indicating the number of levels.

cell_id - (2Dhvt) Logical. Indicates whether the plot should show Cell IDs for level 1 (default = FALSE).

child.level - (2Dheatmap/surface_plot) A number indicating the level for which the heat map is to be plotted.

hmap.cols - (2Dheatmap/surface_plot) A number or a character, the column number or column name from the dataset indicating the variable for which the heat map is to be plotted.

label.size - (2Dheatmap) The size by which the tessellation labels should be scaled (default = 0.5).

quant.error.hmap - (2Dheatmap) A number indicating the quantization error threshold.

n_cells.hmap - (2Dheatmap) An integer indicating the number of cells/clusters per hierarchy.

sepration_width - (surface_plot) An integer indicating the width between two levels.

layer_opacity - (surface_plot) A vector indicating the opacity of each layer/level.

dim_size - (surface_plot) An integer indicating the dimension size used to create the matrix for the plot.

plot.type - A character indicating which type of plot should be generated. Accepted entries are ‘1D’, ‘2Dproj’, ‘2Dhvt’, ‘2Dheatmap’ & ‘surface_plot’. Default value is ‘2Dhvt’.
Iteration 1:
Let’s see Sammon’s 2D projection onto a plane with n_cells set to 100 in the first iteration.
plotHVT(hvt.torus, plot.type = '2Dproj')
Iteration 2:
Let’s see Sammon’s 2D projection onto a plane with n_cells set to 300 in the second iteration.
plotHVT(hvt.torus2, plot.type = '2Dproj')
Iteration 3:
Let’s see Sammon’s 2D projection onto a plane with n_cells set to 500 in the third iteration.
plotHVT(hvt.torus3, plot.type = '2Dproj')
The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, a transformation is performed on the 2D coordinates to get all the points within the parent tile. Tessellations are plotted using these transformed points as centroids. plotHVT is the main function to plot the hierarchical Voronoi tessellation.
Iteration 1:
To enhance visualization, let’s generate a plot of the Voronoi tessellation for the first iteration where we set n_cells parameter as 100. This plot will provide a visual representation of the Voronoi regions corresponding to the data points, aiding in the analysis and understanding of the data distribution.
plotHVT(
  hvt.torus,
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.6,
  maxDepth = 1,
  plot.type = '2Dhvt'
)
Iteration 2:
Now, let’s plot the Voronoi tessellation for the second iteration where we set n_cells parameter to 300.
plotHVT(
  hvt.torus2,
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.6,
  maxDepth = 1,
  plot.type = '2Dhvt'
)
Iteration 3:
Now, let’s plot the Voronoi tessellation again, for the third iteration where we set n_cells parameter to 500.
plotHVT(
  hvt.torus3,
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.6,
  maxDepth = 1,
  plot.type = '2Dhvt'
)
From the presented plot, the inherent structure of the donut can be easily observed in the two-dimensional space.
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus data for better visualization and interpretation of data patterns and distributions.
The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). The sheer green shades highlight regions with higher coordinate values in each of the heatmaps, while the indigo shades indicate areas with the lowest coordinate values. By analyzing these heatmaps, we can gain insights into the variations and relationships between each of these features within the torus structure.
plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "n",
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "x",
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "y",
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "z",
  line.width = c(0.4),
  color.vec = c("black"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
Let’s try to comprehend the scoreHVT function before moving on.
scoreHVT(dataset,
         hvt.results.model,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         distance_metric,
         error_metric,
         yVar,
         analysis.plots,
         names.column)
The parameters for the function scoreHVT are explained below:

dataset - A dataframe containing the testing dataset. The dataframe should have all the variables (features) used for training.

hvt.results.model - A list obtained from the trainHVT function while performing hierarchical vector quantization on the training data. This list provides an overview of the hierarchical vector quantized data, including diagnostics, tessellation details, Sammon’s projection coordinates, and model input information.

child.level - A number indicating the depth for which the heat map is to be plotted. Each depth represents a different level of clustering or partitioning of the data.

mad.threshold - A numeric value indicating the permissible Mean Absolute Deviation, which is obtained from the Minimum Intra-centroid plot (when diagnose is set to TRUE in trainHVT). The mad.threshold value is important since it is used in anomaly detection. Default value is 0.2. NOTE: for a given datapoint, when the quantization error is above mad.threshold it is denoted as an anomaly, else not.

line.width - A vector indicating the line widths of the tessellation boundaries for each layer. (Optional parameter)

color.vec - A vector indicating the colors of the tessellation boundaries at each layer. (Optional parameter)

normalize - A logical value indicating if the dataset should be normalized. When set to TRUE, the data (testing dataset) is standardized by the mean and standard deviation of the training dataset referred from trainHVT(). When set to FALSE, the data is used as such without any changes.

distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). The metric is used when calculating the distance between each datapoint (in the test dataset) and the centroids obtained from the results of trainHVT. Default is L1_Norm.

error_metric - The error metric can be mean or max. max will return the maximum of m values and mean will take the mean of m values, where each value is the distance between the datapoint and the centroid of the cell. This helps in calculating the scored quantization error. Default value is max.

yVar - A character or a vector representing the name of the dependent variable(s).

The below arguments are used only when a character column can be mapped over the scored results. Since torus doesn’t have a character column, we are not using them in this vignette.

analysis.plots - A logical value to indicate whether to include the insight plots, which are useful in viewing the contents and clusters of cells. Default is FALSE.

names.column - The column of names of the datapoints, which will be displayed as the contents of the cell in ‘scoredPlotly’. Default is NULL.
Now that we have built the model, let us use our testing dataset (containing 2400 data points) to score which cell and which level each point belongs to.
set.seed(240)
scoring_torus <- scoreHVT(
  torus_test,
  hvt.torus3,
  child.level = 1,
  line.width = c(1.2),
  color.vec = c("black"),
  normalize = FALSE
)
Let’s see which cell and level each point belongs to and check the mean absolute difference for each of the 2400 records. For the sake of brevity, we will only show the first 100 rows.
displayTable(scoring_torus[["scoredPredictedData"]])
Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | centroidRadius | diff | anomalyFlag | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 7 | 1 | 85 | 0.1505 | 0.2454 | 0.0949 | 0 | -2.6282 | 0.5656 | -0.7253 |
1 | 1 | 102 | 1 | 425 | 0.1007 | 0.2379 | 0.1372 | 0 | 2.7471 | -0.9987 | -0.3848 |
1 | 1 | 71 | 1 | 3 | 0.1487 | 0.2729 | 0.1242 | 0 | -2.4446 | -1.6528 | 0.3097 |
1 | 1 | 246 | 1 | 41 | 0.0763 | 0.2745 | 0.1982 | 0 | -2.6487 | -0.5745 | 0.7040 |
1 | 1 | 416 | 1 | 157 | 0.0583 | 0.1715 | 0.1132 | 0 | -0.2676 | -1.0800 | -0.4611 |
1 | 1 | 67 | 1 | 126 | 0.0855 | 0.1855 | 0.1000 | 0 | -1.1130 | -0.6516 | -0.7040 |
1 | 1 | 4 | 1 | 491 | 0.0902 | 0.1616 | 0.0713 | 0 | 2.0288 | 1.9519 | 0.5790 |
1 | 1 | 328 | 1 | 140 | 0.0679 | 0.3151 | 0.2472 | 0 | -2.4799 | 1.6863 | -0.0470 |
1 | 1 | 326 | 1 | 119 | 0.0489 | 0.1505 | 0.1016 | 0 | -0.4105 | -1.1610 | -0.6398 |
1 | 1 | 265 | 1 | 83 | 0.1208 | 0.2513 | 0.1306 | 0 | -0.2545 | -1.6160 | -0.9314 |
1 | 1 | 163 | 1 | 352 | 0.064 | 0.2374 | 0.1734 | 0 | 1.1500 | 0.3945 | -0.6205 |
1 | 1 | 318 | 1 | 67 | 0.0984 | 0.2310 | 0.1326 | 0 | -1.2557 | -1.1369 | 0.9520 |
1 | 1 | 170 | 1 | 43 | 0.1599 | 0.2956 | 0.1357 | 0 | -0.5449 | -2.6892 | -0.6684 |
1 | 1 | 112 | 1 | 478 | 0.0759 | 0.2659 | 0.1899 | 0 | 2.9093 | 0.7222 | -0.0697 |
1 | 1 | 147 | 1 | 476 | 0.1152 | 0.2873 | 0.1721 | 0 | 2.3205 | 1.2520 | -0.7711 |
1 | 1 | 197 | 1 | 298 | 0.0023 | 0.1867 | 0.1844 | 0 | 1.4772 | -0.5194 | -0.9008 |
1 | 1 | 467 | 1 | 11 | 0.0324 | 0.3708 | 0.3384 | 0 | -1.3176 | -2.6541 | 0.2690 |
1 | 1 | 398 | 1 | 316 | 0.09 | 0.2136 | 0.1236 | 0 | 1.0687 | 0.1211 | -0.3812 |
1 | 1 | 331 | 1 | 195 | 0.0525 | 0.1924 | 0.1399 | 0 | -0.9632 | 0.3283 | -0.1866 |
1 | 1 | 20 | 1 | 465 | 0.0854 | 0.2132 | 0.1278 | 0 | 2.5616 | 0.4634 | 0.7976 |
1 | 1 | 44 | 1 | 424 | 0.0783 | 0.2492 | 0.1709 | 0 | 2.8473 | -0.9303 | -0.0955 |
1 | 1 | 123 | 1 | 154 | 0.0743 | 0.1830 | 0.1086 | 0 | -0.5293 | -0.8566 | 0.1173 |
1 | 1 | 382 | 1 | 2 | 0.059 | 0.2325 | 0.1735 | 0 | -1.9898 | -2.1766 | 0.3150 |
1 | 1 | 408 | 1 | 105 | 0.0706 | 0.1628 | 0.0922 | 0 | -0.8845 | -1.2219 | -0.8709 |
1 | 1 | 330 | 1 | 405 | 0.0876 | 0.2099 | 0.1223 | 0 | 0.1553 | 2.2566 | 0.9651 |
1 | 1 | 166 | 1 | 383 | 0.07 | 0.2762 | 0.2062 | 0 | 2.4262 | -0.6069 | -0.8655 |
1 | 1 | 487 | 1 | 120 | 0.0307 | 0.2063 | 0.1756 | 0 | -0.0667 | -1.4627 | -0.8444 |
1 | 1 | 201 | 1 | 151 | 0.0762 | 0.1549 | 0.0787 | 0 | -0.0655 | -1.3311 | -0.7448 |
1 | 1 | 9 | 1 | 458 | 0.0794 | 0.1979 | 0.1185 | 0 | 1.9592 | 1.5104 | 0.8806 |
1 | 1 | 17 | 1 | 479 | 0.0638 | 0.1883 | 0.1245 | 0 | 1.2332 | 2.5452 | 0.5603 |
1 | 1 | 92 | 1 | 214 | 0.0044 | 0.1470 | 0.1426 | 0 | -0.8720 | 0.4903 | 0.0287 |
1 | 1 | 23 | 1 | 139 | 0.0537 | 0.1956 | 0.1419 | 0 | 0.2194 | -1.7686 | 0.9760 |
1 | 1 | 343 | 1 | 351 | 0.0634 | 0.1937 | 0.1304 | 0 | 1.5052 | 0.0445 | -0.8694 |
1 | 1 | 125 | 1 | 17 | 0.1169 | 0.2607 | 0.1438 | 0 | -2.8410 | -0.8651 | 0.2439 |
1 | 1 | 38 | 1 | 104 | 0.0958 | 0.3065 | 0.2107 | 0 | 1.3203 | -2.5967 | 0.4077 |
1 | 1 | 98 | 1 | 228 | 0.0109 | 0.2360 | 0.2251 | 0 | -1.5648 | 1.5577 | 0.9781 |
1 | 1 | 150 | 1 | 205 | 0.057 | 0.1512 | 0.0943 | 0 | 0.3589 | -1.0419 | -0.4400 |
1 | 1 | 271 | 1 | 76 | 0.0856 | 0.2710 | 0.1853 | 0 | -0.2900 | -2.0106 | 0.9995 |
1 | 1 | 480 | 1 | 374 | 0.0788 | 0.1889 | 0.1101 | 0 | 0.5300 | 1.3668 | 0.8455 |
1 | 1 | 380 | 1 | 279 | 0.0311 | 0.2086 | 0.1776 | 0 | 1.0254 | -0.6738 | 0.6344 |
1 | 1 | 92 | 1 | 214 | 0.0609 | 0.1470 | 0.0861 | 0 | -0.9306 | 0.3664 | 0.0154 |
1 | 1 | 474 | 1 | 384 | 0.1146 | 0.1904 | 0.0758 | 0 | 2.3888 | -1.0670 | 0.7875 |
1 | 1 | 29 | 1 | 163 | 0.024 | 0.1409 | 0.1169 | 0 | -0.9830 | -0.2043 | -0.0897 |
1 | 1 | 255 | 1 | 326 | 0.0765 | 0.1339 | 0.0574 | 0 | 0.9499 | 0.3135 | 0.0261 |
1 | 1 | 168 | 1 | 44 | 0.0791 | 0.2581 | 0.1790 | 0 | -1.8079 | -1.4936 | 0.9386 |
1 | 1 | 39 | 1 | 245 | 0.1205 | 0.3291 | 0.2086 | 0 | 1.8399 | -1.9295 | -0.7459 |
1 | 1 | 271 | 1 | 76 | 0.0575 | 0.2710 | 0.2135 | 0 | -0.3304 | -1.8481 | 0.9925 |
1 | 1 | 71 | 1 | 3 | 0.046 | 0.2729 | 0.2269 | 0 | -2.2806 | -1.8984 | 0.2536 |
1 | 1 | 430 | 1 | 207 | 0.171 | 0.2216 | 0.0506 | 0 | -2.3323 | 1.7320 | 0.4252 |
1 | 1 | 90 | 1 | 331 | 0.0078 | 0.1431 | 0.1353 | 0 | 0.5520 | 0.8441 | 0.1308 |
1 | 1 | 189 | 1 | 289 | 0.0999 | 0.1932 | 0.0933 | 0 | -0.9449 | 2.2273 | 0.9078 |
1 | 1 | 117 | 1 | 132 | 0.0633 | 0.1705 | 0.1072 | 0 | 0.2334 | -1.4612 | -0.8540 |
1 | 1 | 105 | 1 | 481 | 0.0605 | 0.2114 | 0.1509 | 0 | 2.7387 | 0.9703 | 0.4244 |
1 | 1 | 351 | 1 | 340 | 0.0498 | 0.1443 | 0.0945 | 0 | 0.3561 | 1.1619 | -0.6199 |
1 | 1 | 222 | 1 | 452 | 0.0355 | 0.2440 | 0.2086 | 0 | 1.7006 | 1.5569 | -0.9522 |
1 | 1 | 42 | 1 | 357 | 0.0938 | 0.2262 | 0.1324 | 0 | 1.7244 | -0.5698 | 0.9829 |
1 | 1 | 235 | 1 | 373 | 0.0626 | 0.2439 | 0.1812 | 0 | 0.9922 | 1.1438 | -0.8741 |
1 | 1 | 483 | 1 | 130 | 0.0853 | 0.2312 | 0.1458 | 0 | -0.3022 | -1.3611 | 0.7956 |
1 | 1 | 113 | 1 | 236 | 0.0504 | 0.1672 | 0.1168 | 0 | -0.9693 | 1.0602 | 0.8261 |
1 | 1 | 110 | 1 | 294 | 0.0539 | 0.1904 | 0.1365 | 0 | 1.1313 | -0.3595 | -0.5824 |
1 | 1 | 28 | 1 | 31 | 0.0301 | 0.2992 | 0.2691 | 0 | -0.7561 | -2.5384 | -0.7611 |
1 | 1 | 427 | 1 | 499 | 0.1113 | 0.1932 | 0.0819 | 0 | 2.3168 | 1.8924 | 0.1302 |
1 | 1 | 35 | 1 | 108 | 0.1599 | 0.3358 | 0.1759 | 0 | 1.2363 | -2.6444 | -0.3939 |
1 | 1 | 297 | 1 | 111 | 0.054 | 0.2315 | 0.1775 | 0 | -1.3204 | -0.6281 | 0.8430 |
1 | 1 | 196 | 1 | 409 | 0.0662 | 0.2251 | 0.1589 | 0 | 1.3733 | 1.1877 | 0.9829 |
1 | 1 | 498 | 1 | 333 | 0.09 | 0.1568 | 0.0669 | 0 | 1.0874 | -0.1278 | 0.4251 |
1 | 1 | 423 | 1 | 301 | 0.0069 | 0.2620 | 0.2551 | 0 | 2.1300 | -1.2171 | -0.8914 |
1 | 1 | 42 | 1 | 357 | 0.1001 | 0.2262 | 0.1261 | 0 | 1.6863 | -0.5945 | 0.9773 |
1 | 1 | 235 | 1 | 373 | 0.0534 | 0.2439 | 0.1905 | 0 | 0.8504 | 1.0927 | -0.7882 |
1 | 1 | 447 | 1 | 336 | 0.0272 | 0.1517 | 0.1245 | 0 | 0.3029 | 1.0731 | 0.4656 |
1 | 1 | 451 | 1 | 210 | 0.0491 | 0.2516 | 0.2025 | 0 | -1.4724 | 1.1331 | 0.9899 |
1 | 1 | 81 | 1 | 136 | 0.0854 | 0.2252 | 0.1398 | 0 | -0.5452 | -1.2243 | 0.7514 |
1 | 1 | 68 | 1 | 226 | 0.1311 | 0.2226 | 0.0915 | 0 | -1.6866 | 2.1137 | 0.7101 |
1 | 1 | 55 | 1 | 158 | 0.0748 | 0.2447 | 0.1700 | 0 | 1.2012 | -2.0386 | -0.9305 |
1 | 1 | 330 | 1 | 405 | 0.0799 | 0.2099 | 0.1301 | 0 | -0.2108 | 2.3579 | 0.9301 |
1 | 1 | 340 | 1 | 265 | 0.068 | 0.1411 | 0.0730 | 0 | -0.5982 | 1.3776 | -0.8671 |
1 | 1 | 416 | 1 | 157 | 0.045 | 0.1715 | 0.1265 | 0 | -0.2116 | -1.0573 | -0.3878 |
1 | 1 | 205 | 1 | 118 | 0.0779 | 0.1690 | 0.0911 | 0 | -0.7802 | -0.9000 | -0.5880 |
1 | 1 | 186 | 1 | 196 | 0.0681 | 0.2310 | 0.1628 | 0 | 1.0850 | -1.6815 | 1.0000 |
1 | 1 | 343 | 1 | 351 | 0.0974 | 0.1937 | 0.0963 | 0 | 1.5563 | 0.1715 | -0.9008 |
1 | 1 | 308 | 1 | 318 | 0.0514 | 0.1653 | 0.1138 | 0 | -0.3790 | 1.4273 | 0.8522 |
1 | 1 | 434 | 1 | 122 | 0.0774 | 0.2156 | 0.1383 | 0 | -1.2769 | -0.2633 | 0.7178 |
1 | 1 | 495 | 1 | 257 | 0.0653 | 0.2059 | 0.1406 | 0 | -1.6039 | 2.4566 | 0.3575 |
1 | 1 | 217 | 1 | 309 | 0.0954 | 0.3022 | 0.2068 | 0 | -0.9297 | 2.4281 | -0.8000 |
1 | 1 | 53 | 1 | 220 | 0.0417 | 0.2250 | 0.1834 | 0 | 0.5324 | -0.8526 | 0.1016 |
1 | 1 | 254 | 1 | 362 | 0.0364 | 0.2087 | 0.1723 | 0 | 0.3928 | 1.5433 | -0.9132 |
1 | 1 | 250 | 1 | 327 | 0.0242 | 0.1993 | 0.1751 | 0 | 1.0031 | 0.3850 | -0.3786 |
1 | 1 | 192 | 1 | 232 | 0.0389 | 0.2020 | 0.1631 | 0 | -0.7562 | 0.7889 | -0.4207 |
1 | 1 | 204 | 1 | 102 | 0.0622 | 0.1680 | 0.1058 | 0 | -1.0870 | -0.7523 | -0.7350 |
1 | 1 | 54 | 1 | 59 | 0.0528 | 0.2048 | 0.1521 | 0 | -1.8671 | -0.8423 | -0.9988 |
1 | 1 | 25 | 1 | 242 | 0.0966 | 0.2225 | 0.1259 | 0 | 0.8325 | -0.9413 | 0.6689 |
1 | 1 | 464 | 1 | 277 | 0.044 | 0.2052 | 0.1612 | 0 | -0.3355 | 0.9636 | 0.2005 |
1 | 1 | 1 | 1 | 133 | 0.0797 | 0.2263 | 0.1466 | 0 | -1.0089 | -0.6007 | 0.5639 |
1 | 1 | 222 | 1 | 452 | 0.0723 | 0.2440 | 0.1717 | 0 | 1.7725 | 1.7153 | -0.8845 |
1 | 1 | 53 | 1 | 220 | 0.0784 | 0.2250 | 0.1467 | 0 | 0.5539 | -0.8888 | 0.3037 |
1 | 1 | 86 | 1 | 84 | 0.0997 | 0.3397 | 0.2400 | 0 | 0.8149 | -2.6016 | 0.6874 |
1 | 1 | 126 | 1 | 379 | 0.0833 | 0.2344 | 0.1510 | 0 | 0.1104 | 1.7654 | -0.9729 |
1 | 1 | 255 | 1 | 326 | 0.0555 | 0.1339 | 0.0784 | 0 | 1.0107 | 0.3118 | 0.3349 |
1 | 1 | 243 | 1 | 403 | 0.0874 | 0.2459 | 0.1585 | 0 | 2.2697 | -0.3642 | 0.9543 |
1 | 1 | 53 | 1 | 220 | 0.0655 | 0.2250 | 0.1596 | 0 | 0.4983 | -0.8672 | -0.0185 |
hist(scoring_torus[["actual_predictedTable"]]$diff, breaks = 20, col = "blue", main = "Mean Absolute Difference", xlab = "Difference",xlim = c(0,0.20), ylim = c(0,500))
Data Understanding
In this section, we will use the
Prices of Personal Computers
dataset. This dataset contains
6259 observations and 10 features. The dataset observes the price from
1993 to 1995 of 486 personal computers in the US. The variables are
price, speed, ram, screen, cd, etc. The dataset can be downloaded from
here.
In this example, we will compress this dataset by using hierarchical VQ via k-means and visualize the Voronoi Tessellation plots using Sammons projection. Later on, we will overlay all the variables as a heatmap to generate further insights.
Here, we load the data and store it in a variable computers.
computers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/HVT/master/vignettes/sample_dataset/Computers.csv")
Personal Computers Dataset
The Computers dataset includes the following columns:
Let’s explore the Personal Computers dataset (containing 6259 points). For the sake of brevity we are displaying the first six rows.
computers <- computers[,-1]
displayTable(head(computers))
price | speed | hd | ram | screen | cd | multi | premium | ads | trend |
---|---|---|---|---|---|---|---|---|---|
1499 | 25 | 80 | 4 | 14 | no | no | yes | 94 | 1 |
1795 | 33 | 85 | 2 | 14 | no | no | yes | 94 | 1 |
1595 | 25 | 170 | 4 | 15 | no | no | yes | 94 | 1 |
1849 | 25 | 170 | 8 | 14 | no | no | no | 94 | 1 |
3295 | 33 | 340 | 16 | 14 | no | no | yes | 94 | 1 |
3695 | 66 | 340 | 16 | 14 | no | no | yes | 94 | 1 |
Now let’s have a look at the structure of the dataset.
str(computers)
## 'data.frame': 6259 obs. of 10 variables:
## $ price : int 1499 1795 1595 1849 3295 3695 1720 1995 2225 2575 ...
## $ speed : int 25 33 25 25 33 66 25 50 50 50 ...
## $ hd : int 80 85 170 170 340 340 170 85 210 210 ...
## $ ram : int 4 2 4 8 16 16 4 2 8 4 ...
## $ screen : int 14 14 15 14 14 14 14 14 14 15 ...
## $ cd : chr "no" "no" "no" "no" ...
## $ multi : chr "no" "no" "no" "no" ...
## $ premium: chr "yes" "yes" "yes" "no" ...
## $ ads : int 94 94 94 94 94 94 94 94 94 94 ...
## $ trend : int 1 1 1 1 1 1 1 1 1 1 ...
Further processing will be carried out after removing the non-numeric columns from the dataset, since the distribution plots take only continuous variables, and k-means is not suitable for factor variables because the sample space for factor variables is discrete: a Euclidean distance function on such a space isn’t really meaningful. Hence, we will delete the factor variables (X, cd, multi, premium, trend) from our dataset.
computers <- computers %>% dplyr::select(-c(cd, multi, premium, trend))
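To see why a Euclidean distance over factor levels is not meaningful, note that factor codes are arbitrary labels, not quantities; a tiny illustration:

cd <- factor(c("no", "yes", "no"))
as.numeric(cd)        # 1 2 1 -- arbitrary integer codes for the levels
dist(as.numeric(cd))  # "distances" between these codes carry no real meaning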
Data Distribution
This section displays four objects.
Variable Histograms: The histogram distribution of all the features in the dataset.
Box Plots: Box plots for all the features in the dataset. These plots will display the median and Interquartile range of each column at a panel level.
Correlation Matrix: This calculates the Pearson correlation which is a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.
Summary EDA: The table provides descriptive statistics for all the features in the dataset.
It uses an inbuilt function called edaPlots to display the above-mentioned four objects.
NOTE: The input dataset should be a data frame object and the columns should be only numeric type.
edaPlots(computers, output_type = 'summary', n_cols = 6)
edaPlots(computers, output_type = 'histogram', n_cols = 6)
edaPlots(computers, output_type = 'boxplot', n_cols = 6)
edaPlots(computers, output_type = 'correlation', n_cols = 6)
Train - Test Split
Let us split the computers data into train and test. We will randomly select 80% of the data as train and remaining as test.
num_rows <- nrow(computers)
set.seed(123)
train_indices <- sample(1:num_rows, 0.8 * num_rows)
trainComputers <- computers[train_indices, ]
testComputers <- computers[-train_indices, ]
Training Dataset
Now, let’s have a look at the randomly selected training dataset (containing 5007 data points). For the sake of brevity we are displaying the first six rows.
trainComputers_data <- trainComputers %>% as.data.frame() %>% round(4)
trainComputers_data <- trainComputers_data %>% dplyr::select(price, speed, hd, ram, screen, ads)
row.names(trainComputers_data) <- NULL
displayTable(head(trainComputers_data))
price | speed | hd | ram | screen | ads |
---|---|---|---|---|---|
2799 | 50 | 230 | 8 | 15 | 216 |
2197 | 33 | 270 | 4 | 14 | 216 |
2744 | 50 | 340 | 8 | 17 | 275 |
2999 | 66 | 245 | 16 | 15 | 139 |
1974 | 33 | 200 | 4 | 14 | 248 |
2490 | 33 | 528 | 16 | 14 | 267 |
Now let’s have a look at the structure of the training dataset.
str(trainComputers_data)
## 'data.frame': 5007 obs. of 6 variables:
## $ price : num 2799 2197 2744 2999 1974 ...
## $ speed : num 50 33 50 66 33 33 66 33 25 50 ...
## $ hd : num 230 270 340 245 200 528 424 212 528 545 ...
## $ ram : num 8 4 8 16 4 16 16 4 16 4 ...
## $ screen: num 15 14 17 15 14 14 15 17 14 15 ...
## $ ads : num 216 216 275 139 248 267 259 298 307 158 ...
Data Distribution
edaPlots(trainComputers_data,output_type = 'summary', n_cols = 6)
edaPlots(trainComputers_data,output_type = 'histogram', n_cols = 6)
edaPlots(trainComputers_data,output_type = 'boxplot', n_cols = 6)
edaPlots(trainComputers_data,output_type = 'correlation', n_cols = 6)
Testing Dataset
Now, let’s have a look at the testing dataset (containing 1252 data points). For the sake of brevity we are displaying the first six rows.
testComputers_data <- testComputers %>% as.data.frame() %>% round(4)
testComputers_data <- testComputers_data %>% dplyr::select(price, speed, hd, ram, screen, ads)
rownames(testComputers_data) <- NULL
displayTable(head(testComputers_data))
price | speed | hd | ram | screen | ads |
---|---|---|---|---|---|
1595 | 25 | 170 | 4 | 15 | 94 |
1849 | 25 | 170 | 8 | 14 | 94 |
1720 | 25 | 170 | 4 | 14 | 94 |
2575 | 50 | 210 | 4 | 15 | 94 |
2195 | 33 | 170 | 8 | 15 | 94 |
2295 | 25 | 245 | 8 | 14 | 94 |
Now let’s have a look at the structure of the testing dataset.
str(testComputers_data)
## 'data.frame': 1252 obs. of 6 variables:
## $ price : num 1595 1849 1720 2575 2195 ...
## $ speed : num 25 25 25 50 33 25 50 33 66 50 ...
## $ hd : num 170 170 170 210 170 245 212 250 130 210 ...
## $ ram : num 4 8 4 4 8 8 8 4 4 4 ...
## $ screen: num 15 14 14 15 15 14 14 15 14 17 ...
## $ ads : num 94 94 94 94 94 94 94 94 94 94 ...
Data Distribution
edaPlots(testComputers_data, output_type = "summary", n_cols = 6)
edaPlots(testComputers_data, output_type = "histogram", n_cols = 6)
edaPlots(testComputers_data, output_type = "boxplot", n_cols = 6)
edaPlots(testComputers_data, output_type = "correlation", n_cols = 6)
As we are now familiar with the structure of the computers data, we will follow the steps below to get the scores for the Computers dataset.
For more detailed information on Data Compression please refer to section 7.1 of this vignette.
We will use the trainHVT function to compress our data while preserving the essential features of the dataset. Our goal is to achieve data compression of at least 80%. In situations where the compression ratio does not meet the desired target, we can explore adjusting the model parameters as a potential solution. This involves making modifications to parameters such as the quantization error threshold or increasing the number of cells, and then rerunning the trainHVT function.
We will pass the below-mentioned model parameters along with the computers training dataset (5007 data points) to the trainHVT function.
Model Parameters
hvt.results <- trainHVT(trainComputers,
                        n_cells = 300,
                        depth = 1,
                        quant.err = 0.2,
                        normalize = TRUE,
                        distance_metric = "L2_Norm",
                        error_metric = "max",
                        quant_method = "kmeans",
                        dim_reduction_method = "sammon")
Initial stress : 0.10815
stress after 0 iters: 0.10815
displayTable(hvt.results[[3]]$compression_summary)
segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
---|---|---|---|---|
1 | 300 | 274 | 0.91 | n_cells: 300 quant.err: 0.2 distance_metric: L2_Norm error_metric: max quant_method: kmeans |
As can be seen from the table above, 91% of the cells have reached the quantization error threshold. Since we were successfully able to attain the desired compression percentage, we will not further subdivide the cells.
hvt.results[[3]] gives us detailed information about the hierarchical vector quantized data. hvt.results[[3]][['summary']] gives a nice tabular output containing the number of points, quantization error and the centroids.
The datatable displayed below is the summary from hvt.results showing Cell.IDs, Centroids and Quantization Error for the 300 cells.
For the sake of brevity, we are displaying only the first 100 rows.
displayTable(data = hvt.results[[3]][['summary']])
Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | price | speed | hd | ram | screen | ads |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 6 | 126 | 0.0832 | 1.1495 | -0.0817 | -0.6096 | -0.7590 | -0.6741 | 0.6739 |
1 | 1 | 2 | 11 | 43 | 0.0725 | -1.7157 | -0.8905 | -0.2227 | -0.7590 | -0.6741 | -0.6568 |
1 | 1 | 3 | 17 | 244 | 0.1033 | 0.4081 | -0.8905 | 1.1953 | 1.3676 | 0.4307 | -0.2385 |
1 | 1 | 4 | 15 | 114 | 0.0692 | -0.3511 | -0.9666 | 0.0053 | -0.0501 | -0.6741 | 0.0452 |
1 | 1 | 5 | 9 | 127 | 0.1007 | -0.9816 | 0.3412 | -0.2850 | -0.0501 | -0.6741 | -0.0827 |
1 | 1 | 6 | 22 | 14 | 0.1775 | -0.0292 | 2.2969 | -0.9025 | -0.9040 | -0.5737 | 1.3182 |
1 | 1 | 7 | 18 | 64 | 0.0953 | -0.4638 | -0.0817 | -1.1286 | -1.0347 | -0.6741 | 0.2654 |
1 | 1 | 8 | 14 | 22 | 0.0638 | -1.5650 | -1.2439 | -0.8245 | -0.7590 | -0.6741 | 0.4892 |
1 | 1 | 9 | 26 | 31 | 0.0858 | -1.4013 | -0.9490 | -0.8283 | -0.7590 | -0.6741 | -0.9204 |
1 | 1 | 10 | 10 | 238 | 0.1198 | 0.6985 | 0.7223 | 0.3667 | 1.3676 | 0.4307 | -0.1255 |
1 | 1 | 11 | 13 | 33 | 0.0813 | -1.1456 | -0.0817 | -1.0728 | -0.9226 | -0.6741 | 0.9871 |
1 | 1 | 12 | 18 | 72 | 0.0947 | -1.3498 | -0.0817 | -0.4244 | -0.7590 | -0.6741 | -0.7623 |
1 | 1 | 13 | 12 | 23 | 0.0804 | -1.2597 | -1.1125 | -1.1037 | -0.8771 | -0.6741 | -0.6189 |
1 | 1 | 14 | 12 | 49 | 0.1584 | -1.5999 | -0.8905 | -0.2240 | -0.6999 | -0.6741 | 0.3278 |
1 | 1 | 15 | 33 | 39 | 0.0613 | -0.6547 | -1.2710 | -0.9103 | -0.7590 | -0.6741 | 0.8114 |
1 | 1 | 16 | 4 | 91 | 0.0784 | -0.3945 | 1.0006 | -0.8801 | -0.8476 | 0.4307 | 1.4649 |
1 | 1 | 17 | 14 | 61 | 0.0954 | -1.2577 | -0.0817 | -0.7500 | -0.7590 | -0.6741 | 0.3260 |
1 | 1 | 18 | 14 | 5 | 0.1289 | -0.2814 | -1.0536 | -0.7853 | -0.7590 | 2.6404 | 0.5286 |
1 | 1 | 19 | 15 | 71 | 0.1026 | -0.7068 | -0.9666 | -0.7665 | -0.7826 | 0.4307 | -0.2783 |
1 | 1 | 20 | 12 | 100 | 0.0812 | -0.7217 | 0.6795 | -0.6622 | -0.7590 | -0.6741 | -0.7970 |
1 | 1 | 21 | 7 | 142 | 0.0983 | 0.2426 | -0.0817 | -0.0457 | -0.0501 | -0.6741 | 1.5724 |
1 | 1 | 22 | 7 | 265 | 0.173 | 2.2325 | 2.2969 | 1.2264 | -0.0501 | -0.2006 | -0.1791 |
1 | 1 | 23 | 7 | 112 | 0.0679 | 0.2708 | -0.8905 | -0.6022 | -0.0501 | -0.6741 | 0.6198 |
1 | 1 | 24 | 22 | 249 | 0.1658 | -1.0422 | 2.2969 | 0.4558 | -0.4368 | -0.3728 | -2.3031 |
1 | 1 | 25 | 12 | 3 | 0.0869 | -1.4088 | -1.1125 | -1.1823 | -0.9953 | -0.6741 | 1.5724 |
1 | 1 | 26 | 14 | 225 | 0.1912 | 1.4389 | -0.4861 | -0.3396 | -0.6577 | 2.6404 | 0.5843 |
1 | 1 | 27 | 30 | 119 | 0.0725 | -0.0717 | -1.0300 | -0.0173 | -0.0501 | -0.6741 | 0.4960 |
1 | 1 | 28 | 12 | 272 | 0.2809 | 1.5393 | 0.8142 | 2.4158 | 1.3676 | 0.3387 | 0.5417 |
1 | 1 | 29 | 16 | 129 | 0.1197 | -0.0871 | 0.7330 | -0.0640 | -0.7590 | -0.6741 | 0.7330 |
1 | 1 | 30 | 31 | 196 | 0.0954 | 0.5010 | -1.0746 | 0.4297 | 1.3676 | -0.6741 | 0.6500 |
1 | 1 | 31 | 10 | 144 | 0.0842 | 0.3005 | -0.9285 | 0.2553 | -0.0501 | -0.6741 | -0.0664 |
1 | 1 | 32 | 8 | 266 | 0.1318 | 1.0445 | 0.7865 | 0.3081 | 1.3676 | 2.6404 | 0.4482 |
1 | 1 | 33 | 16 | 84 | 0.0772 | -0.0222 | -0.0817 | -0.8029 | -0.7590 | -0.6741 | 0.8255 |
1 | 1 | 34 | 30 | 37 | 0.0542 | -1.2821 | -0.8905 | -0.7734 | -0.7590 | -0.6741 | 0.6798 |
1 | 1 | 35 | 18 | 216 | 0.1517 | 1.2976 | 0.6795 | -0.5568 | -0.2864 | 0.4307 | -1.5779 |
1 | 1 | 36 | 28 | 75 | 0.078 | -0.3076 | 0.6795 | -1.1807 | -1.0881 | -0.6741 | 0.6438 |
1 | 1 | 37 | 43 | 51 | 0.0631 | -0.7740 | -0.8905 | -0.8003 | -0.7590 | -0.6741 | 0.7455 |
1 | 1 | 38 | 33 | 102 | 0.119 | 0.0573 | 0.7054 | -0.8029 | -0.7375 | -0.6741 | 0.9630 |
1 | 1 | 39 | 16 | 7 | 0.0876 | -1.3821 | -1.1045 | -1.0130 | -0.8919 | 0.4307 | 1.2775 |
1 | 1 | 40 | 17 | 161 | 0.1464 | -0.6917 | -0.5099 | 0.2878 | -0.0501 | 0.4307 | -0.4267 |
1 | 1 | 41 | 17 | 111 | 0.1437 | -0.3714 | -0.9128 | -0.7493 | -0.3420 | 0.4307 | -1.0760 |
1 | 1 | 42 | 12 | 124 | 0.093 | -0.7508 | 0.6795 | 0.0484 | -0.7590 | -0.6741 | -0.1293 |
1 | 1 | 43 | 34 | 290 | 0.1872 | 1.7014 | 0.3436 | 2.2860 | 2.7854 | 0.4307 | -0.3864 |
1 | 1 | 44 | 15 | 121 | 0.0992 | -0.5299 | -0.9158 | 0.0129 | -0.0501 | 0.4307 | 1.3869 |
1 | 1 | 45 | 9 | 288 | 0.2998 | 3.4028 | 0.2460 | 2.7794 | 1.0526 | -0.6741 | 0.4939 |
1 | 1 | 46 | 8 | 145 | 0.1393 | -0.0483 | 0.5843 | -0.8168 | -0.8033 | 0.4307 | -0.6895 |
1 | 1 | 47 | 17 | 166 | 0.121 | -0.3750 | 0.6795 | -0.3297 | -0.0501 | 0.4307 | 0.3633 |
1 | 1 | 48 | 17 | 208 | 0.1485 | -0.1409 | 0.6347 | -0.2527 | -0.4254 | 2.6404 | 0.1007 |
1 | 1 | 49 | 22 | 252 | 0.1058 | 1.7800 | 0.6795 | 0.3055 | 1.3676 | 0.4307 | 0.8330 |
1 | 1 | 50 | 21 | 18 | 0.129 | -1.3118 | -1.0173 | -0.6174 | -0.6915 | -0.6741 | 1.5724 |
1 | 1 | 51 | 15 | 15 | 0.0689 | -1.4077 | -0.8905 | -1.1688 | -0.9717 | -0.6741 | 0.9424 |
1 | 1 | 52 | 19 | 243 | 0.0944 | 1.1288 | 0.8597 | 0.4522 | 1.3676 | -0.6741 | 1.3227 |
1 | 1 | 53 | 19 | 246 | 0.1617 | 0.8772 | 0.7070 | 0.3034 | 1.3676 | 0.4307 | 1.3460 |
1 | 1 | 54 | 14 | 55 | 0.0797 | -0.2470 | -0.9176 | -1.1493 | -0.8096 | -0.6741 | 0.3827 |
1 | 1 | 55 | 31 | 164 | 0.1151 | -0.1897 | 0.6795 | 0.1754 | -0.0501 | -0.6741 | 0.0550 |
1 | 1 | 56 | 15 | 141 | 0.1559 | -0.7953 | 2.2969 | -0.3214 | -0.7590 | -0.4531 | -1.0177 |
1 | 1 | 57 | 26 | 56 | 0.1209 | -0.7953 | 0.7124 | -0.9909 | -0.9226 | -0.6741 | 1.1417 |
1 | 1 | 58 | 14 | 133 | 0.1236 | -0.9434 | 0.6795 | -0.4670 | -0.6577 | 0.4307 | -0.2867 |
1 | 1 | 59 | 38 | 211 | 0.1007 | 1.0173 | -0.9906 | 0.3538 | 1.3676 | -0.6741 | 0.2397 |
1 | 1 | 60 | 7 | 232 | 0.0958 | 1.1313 | -0.8905 | 0.2175 | 1.3676 | 0.4307 | 0.3068 |
1 | 1 | 61 | 20 | 206 | 0.0994 | 0.5939 | 2.2969 | 0.0519 | -0.0501 | -0.6741 | 1.4219 |
1 | 1 | 62 | 26 | 28 | 0.056 | -1.2826 | -0.8905 | -0.7955 | -0.7590 | -0.6741 | 1.0538 |
1 | 1 | 63 | 15 | 74 | 0.0711 | 0.0704 | -0.8905 | -0.8257 | -0.7590 | -0.6741 | 0.4198 |
1 | 1 | 64 | 5 | 226 | 0.0749 | 0.4022 | -1.1949 | 0.4567 | 1.3676 | 0.4307 | 1.2498 |
1 | 1 | 65 | 11 | 86 | 0.0641 | -0.4305 | -0.8905 | -0.7454 | -0.0501 | -0.6741 | 0.3332 |
1 | 1 | 66 | 19 | 139 | 0.0878 | 0.0991 | 0.6795 | -0.6574 | -0.7590 | 0.4307 | 0.6781 |
1 | 1 | 67 | 7 | 264 | 0.2455 | 1.4450 | -0.4419 | 3.7273 | -0.2527 | -0.3584 | 0.6909 |
1 | 1 | 68 | 18 | 77 | 0.1142 | -0.5185 | -0.0817 | -0.8005 | -0.6015 | -0.6741 | 0.8748 |
1 | 1 | 69 | 6 | 254 | 0.1479 | 2.7324 | 0.6795 | 0.2629 | -0.1683 | 0.4307 | -1.1679 |
1 | 1 | 70 | 11 | 41 | 0.1079 | -0.5691 | -0.8905 | -0.5903 | -0.7590 | -0.6741 | 1.5724 |
1 | 1 | 71 | 15 | 287 | 0.2267 | 3.0491 | 0.6795 | -0.1430 | 1.3676 | 2.6404 | 0.2361 |
1 | 1 | 72 | 11 | 48 | 0.0637 | -0.4996 | -0.0817 | -1.1927 | -1.1134 | -0.6741 | 0.9345 |
1 | 1 | 73 | 8 | 54 | 0.0879 | -0.7625 | -0.0817 | -1.1927 | -1.1134 | 0.4307 | 0.7406 |
1 | 1 | 74 | 18 | 4 | 0.0907 | -1.2061 | -0.9962 | -1.2509 | -1.1134 | -0.6741 | -1.4771 |
1 | 1 | 75 | 13 | 10 | 0.0514 | -1.2271 | -1.2710 | -1.1927 | -1.1134 | -0.6741 | 0.9488 |
1 | 1 | 76 | 34 | 257 | 0.2013 | 0.6796 | 0.5227 | 1.2354 | 1.3676 | 0.4307 | -0.5252 |
1 | 1 | 77 | 14 | 201 | 0.1077 | 1.2716 | 0.6795 | -0.2066 | -0.0501 | 0.4307 | 0.9415 |
1 | 1 | 78 | 6 | 132 | 0.1109 | -0.1093 | -0.8905 | 0.2662 | -0.2864 | -0.6741 | -0.8049 |
1 | 1 | 79 | 10 | 1 | 0.1191 | -0.3089 | -1.0807 | -0.8771 | -0.7590 | 2.6404 | 1.2820 |
1 | 1 | 80 | 21 | 279 | 0.125 | -0.0342 | 0.5707 | 2.4344 | 1.3676 | 0.4307 | -2.2014 |
1 | 1 | 81 | 9 | 250 | 0.1507 | 1.4273 | 0.5103 | -0.4378 | 1.3676 | 0.4307 | -1.1477 |
1 | 1 | 82 | 35 | 177 | 0.091 | 0.6594 | 0.6795 | 0.1118 | -0.0501 | -0.6741 | 0.6921 |
1 | 1 | 83 | 20 | 19 | 0.1176 | -1.2361 | -1.2710 | -1.1450 | -0.7236 | -0.6741 | 0.8485 |
1 | 1 | 84 | 29 | 241 | 0.1872 | 1.2303 | 0.2857 | -0.2317 | -0.0990 | 2.6404 | 0.6643 |
1 | 1 | 85 | 5 | 138 | 0.088 | 1.1552 | -0.8905 | -0.4331 | -0.0501 | -0.6741 | 0.5910 |
1 | 1 | 86 | 44 | 44 | 0.0834 | -0.3663 | -0.9424 | -0.9589 | -0.7590 | -0.6741 | -1.6440 |
1 | 1 | 87 | 28 | 122 | 0.0847 | -0.3282 | -0.9040 | 0.3224 | -0.0501 | -0.6741 | 0.5421 |
1 | 1 | 88 | 27 | 46 | 0.0536 | -1.1122 | -0.8905 | -0.8125 | -0.7590 | -0.6741 | 0.3241 |
1 | 1 | 89 | 9 | 90 | 0.1102 | -0.4522 | 0.6795 | -1.1927 | -1.1134 | 0.4307 | 0.5776 |
1 | 1 | 90 | 12 | 210 | 0.1319 | 1.3181 | 0.6795 | 0.2207 | -0.1092 | 0.4307 | 0.3132 |
1 | 1 | 91 | 8 | 65 | 0.1556 | -0.7103 | 0.0134 | -0.4915 | -0.6704 | -0.6741 | 1.5724 |
1 | 1 | 92 | 10 | 190 | 0.2455 | -0.1356 | -0.8905 | 2.3172 | -0.0501 | -0.5636 | -0.2748 |
1 | 1 | 93 | 23 | 255 | 0.1022 | 0.1663 | -0.0817 | 1.4595 | 1.3676 | 0.4307 | -0.8755 |
1 | 1 | 94 | 12 | 125 | 0.0907 | 0.1502 | 0.6795 | -0.7285 | -0.7590 | -0.6741 | -0.2750 |
1 | 1 | 95 | 2 | 214 | 0.0485 | -0.5521 | 0.6795 | 0.2603 | 1.3676 | -0.6741 | -1.2508 |
1 | 1 | 96 | 10 | 207 | 0.1303 | -1.0519 | 0.6795 | 0.2339 | -0.1919 | 0.4307 | -2.3263 |
1 | 1 | 97 | 10 | 294 | 0.1543 | 0.4109 | 1.4882 | 2.1107 | 1.3676 | 2.6404 | -2.3101 |
1 | 1 | 98 | 18 | 178 | 0.141 | -0.1626 | 0.5526 | 0.1169 | -0.0501 | 0.4307 | 1.5485 |
1 | 1 | 99 | 18 | 21 | 0.1218 | -1.0556 | -1.0173 | -1.1899 | -1.0937 | 0.4307 | 0.5724 |
1 | 1 | 100 | 14 | 123 | 0.1019 | -0.5915 | 0.6795 | -0.5457 | -0.7590 | 0.4307 | 0.7255 |
Now let us understand what each column in the above summary table means:
Segment.Level - Layer of the cell. In this case, we have performed vector quantization for depth 1, hence the segment level is 1.
Segment.Parent - Parent segment of the cell.
Segment.Child (Cell.Number) - The children of a particular cell. In this case, it is the total number of cells at which we achieved the defined compression percentage.
n - Number of points in each cell.
Cell.ID - Cell IDs are generated for the multivariate data using the 1-D Sammon's projection algorithm.
Quant.Error - Quantization error for each cell.
All the columns after this contain the centroids for each cell. Together they can also be called a codebook, which represents a collection of all centroids or codewords.
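Since the centroid columns form the codebook, they can be pulled out of the summary for downstream use. A minimal sketch, assuming the feature names shown in the summary table above:
# Extract the codebook: one row of centroid coordinates per cell
summary_df <- hvt.results[[3]][['summary']]
codebook <- summary_df[, c("price", "speed", "hd", "ram", "screen", "ads")]
head(codebook)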
plotHVT(hvt.results, plot.type = '1D')
For more detailed information on Data Projection please refer to section 7.2 of this vignette.
Let's visualize the Sammon's 2D projection onto a plane for n_cells set to 300.
plotHVT(hvt.results, plot.type = '2Dproj')
For more detailed information on voronoi tessellation please refer to section 7.3 of this vignette.
For better visualization, let's plot the Voronoi tessellation using the plotHVT function.
plotHVT(hvt.results,
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.01,
        maxDepth = 1,
        plot.type = '2Dhvt')
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the computers dataset for better visualization.
The heatmaps displayed below provide a visual representation of the spatial characteristics of the Computers data, allowing us to observe patterns and trends in the distribution of each of the features (price, speed, hd, ram, screen, ads). Green shades highlight regions with higher values in each heatmap, while indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations and relationships between these features within the Computers data.
plotHVT(hvt.results,
        child.level = 1,
        hmap.cols = "n",
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.03,
        plot.type = '2Dheatmap')
plotHVT(hvt.results,
        child.level = 1,
        hmap.cols = "price",
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.03,
        plot.type = '2Dheatmap')
plotHVT(hvt.results,
        child.level = 1,
        hmap.cols = "hd",
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.03,
        plot.type = '2Dheatmap')
plotHVT(hvt.results,
        child.level = 1,
        hmap.cols = "ram",
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.03,
        plot.type = '2Dheatmap')
plotHVT(hvt.results,
        child.level = 1,
        hmap.cols = "screen",
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.03,
        plot.type = '2Dheatmap')
plotHVT(hvt.results,
        child.level = 1,
        hmap.cols = "ads",
        line.width = c(0.2),
        color.vec = c("black"),
        centroid.size = 0.03,
        plot.type = '2Dheatmap')
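Since the six calls above differ only in hmap.cols, the same heatmaps can be generated more compactly with a loop. A minimal sketch (the call is wrapped in print() so each plot renders when run inside a loop):
# Loop over the plotted columns instead of repeating the call
for (feature in c("n", "price", "hd", "ram", "screen", "ads")) {
  print(plotHVT(hvt.results,
                child.level = 1,
                hmap.cols = feature,
                line.width = c(0.2),
                color.vec = c("black"),
                centroid.size = 0.03,
                plot.type = '2Dheatmap'))
}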
For more detailed information on scoring please refer to section 7.4 of this vignette.
Now that we have built the model, let us score the testing dataset (1,252 data points) to see which cell and level each point belongs to.
scoreHVT(dataset,
         hvt.results.model,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         seed,
         distance_metric,
         error_metric,
         yVar,
         analysis.plots,
         names.column)
The parameters for the function scoreHVT are explained below:
dataset - A dataframe containing the testing dataset. The dataframe should have all the variables (features) used for training.
hvt.results.model - A list obtained from the trainHVT function while performing hierarchical vector quantization on the training data. This list provides an overview of the hierarchical vector quantized data, including diagnostics, tessellation details, Sammon's projection coordinates, and model input information.
child.level - A number indicating the depth for which the heatmap is to be plotted. Each depth represents a different level of clustering or partitioning of the data.
mad.threshold - A numeric value indicating the permissible mean absolute deviation, obtained from the minimum intra-centroid plot (when diagnose is set to TRUE in trainHVT). The mad.threshold value is important since it is used in anomaly detection. The default value is 0.2. NOTE: for a given datapoint, when the quantization error is above mad.threshold, the point is flagged as an anomaly; otherwise it is not.
line.width - A vector indicating the line widths of the tessellation boundaries for each layer. (Optional parameter)
color.vec - A vector indicating the colors of the tessellation boundaries at each layer. (Optional parameter)
normalize - A logical value indicating whether the dataset should be normalized. When set to TRUE, the testing dataset is standardized by the mean and standard deviation of the training dataset, referred from trainHVT(). When set to FALSE, the data is used as is, without any changes.
distance_metric - The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). The metric is used when calculating the distance between each datapoint in the test dataset and the centroids obtained from the results of trainHVT. The default is L1_Norm.
error_metric - The error metric can be mean or max. max will return the maximum of the m values and mean will take the mean of the m values, where each value is the distance between a datapoint and the centroid of its cell. This helps in calculating the scored quantization error. The default value is max.
yVar - A character or a vector representing the name(s) of the dependent variable(s).
The arguments below are used only when a character column can be mapped over the scored results. Since the Computers dataset does not have such a character column, we are not using them in this vignette.
analysis.plots - A logical value to indicate whether to include the insight plots, which are useful in viewing the contents and clusters of cells. The default is FALSE.
names.column - The column of names of the datapoints, which will be displayed as the contents of the cell in 'scoredPlotly'. The default is NULL.
set.seed(240)
scoring_comp <- scoreHVT(testComputers,
                         hvt.results,
                         child.level = 1,
                         normalize = TRUE)
When normalize is set to TRUE while using scoreHVT, the function has an inbuilt feature to standardize the testing dataset based on the mean and standard deviation of the training dataset from the trainHVT results.
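Conceptually, this standardization is equivalent to scaling the test set by the training set's statistics. A minimal sketch for intuition only (scoreHVT performs this internally), assuming both data frames contain only the numeric columns shown in the str() output earlier:
# What normalize = TRUE does conceptually: scale test data by training stats
train_means <- colMeans(trainComputers)
train_sds <- apply(trainComputers, 2, sd)
testComputers_scaled <- scale(testComputers, center = train_means, scale = train_sds)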
Let’s see which cell and level each point belongs to and check the mean absolute difference of each of the 1252 records. For the sake of brevity, we will only show the first 100 rows.
displayTable(scoring_comp[["scoredPredictedData"]])
Segment.Level | Segment.Parent | Segment.Child | n | Cell.ID | Quant.Error | centroidRadius | diff | anomalyFlag | price | speed | hd | ram | screen | ads |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 111 | 1 | 11 | 0.2257 | 0.3909 | 0.1652 | 1 | -1.0786 | -1.2710 | -0.9473 | -0.7590 | 0.4307 | -1.7213 |
1 | 1 | 268 | 1 | 105 | 0.2245 | 0.5994 | 0.3749 | 1 | -0.6387 | -1.2710 | -0.9473 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 111 | 1 | 11 | 0.0776 | 0.3909 | 0.3133 | 0 | -0.8621 | -1.2710 | -0.9473 | -0.7590 | -0.6741 | -1.7213 |
1 | 1 | 225 | 1 | 170 | 0.2334 | 0.7934 | 0.5600 | 1 | 0.6187 | -0.0817 | -0.7914 | -0.7590 | 0.4307 | -1.7213 |
1 | 1 | 225 | 1 | 170 | 0.2325 | 0.7934 | 0.5609 | 1 | -0.0394 | -0.8905 | -0.9473 | -0.0501 | 0.4307 | -1.7213 |
1 | 1 | 268 | 1 | 105 | 0.1022 | 0.5994 | 0.4972 | 0 | 0.1338 | -1.2710 | -0.6551 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 103 | 1 | 152 | 0.1086 | 0.7989 | 0.6903 | 0 | 0.8334 | -0.0817 | -0.7836 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 41 | 1 | 111 | 0.2262 | 0.8620 | 0.6358 | 1 | -0.2126 | -0.8905 | -0.6356 | -0.7590 | 0.4307 | -1.7213 |
1 | 1 | 194 | 1 | 110 | 0.2096 | 0.5606 | 0.3511 | 1 | 0.9997 | 0.6795 | -1.1031 | -0.7590 | -0.6741 | -1.7213 |
1 | 1 | 188 | 1 | 256 | 0.2002 | 1.5526 | 1.3524 | 1 | 1.1382 | -0.0817 | -0.7914 | -0.7590 | 2.6404 | -1.7213 |
1 | 1 | 200 | 1 | 251 | 0.2889 | 0.8753 | 0.5863 | 1 | 3.0780 | -0.8905 | 0.1513 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 172 | 1 | 217 | 0.2376 | 1.0939 | 0.8563 | 1 | 1.5193 | -0.8905 | -0.2850 | 1.3676 | -0.6741 | -1.7213 |
1 | 1 | 188 | 1 | 256 | 0.3607 | 1.5526 | 1.1919 | 1 | 0.6533 | -0.8905 | -0.7914 | -0.0501 | 2.6404 | -1.7213 |
1 | 1 | 103 | 1 | 152 | 0.0442 | 0.7989 | 0.7548 | 0 | 0.3243 | -0.0817 | -0.7914 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 103 | 1 | 152 | 0.0384 | 0.7989 | 0.7606 | 0 | 0.3589 | -0.0817 | -0.7914 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 268 | 1 | 105 | 0.1122 | 0.5994 | 0.4872 | 0 | 0.4871 | -0.8905 | -0.7836 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 268 | 1 | 105 | 0.1687 | 0.5994 | 0.4306 | 0 | 0.8265 | -0.8905 | -0.6551 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 111 | 1 | 11 | 0.097 | 0.3909 | 0.2939 | 0 | -0.8119 | -1.2710 | -1.1420 | -0.7590 | -0.6741 | -1.7213 |
1 | 1 | 35 | 1 | 216 | 0.1167 | 0.9101 | 0.7934 | 0 | 1.3461 | 0.6795 | -0.2850 | -0.0501 | 0.4307 | -1.7213 |
1 | 1 | 86 | 1 | 44 | 0.2686 | 0.5002 | 0.2316 | 1 | -0.7322 | -0.8905 | -0.9473 | -0.7590 | 0.4307 | -1.7213 |
1 | 1 | 111 | 1 | 11 | 0.1021 | 0.3909 | 0.2888 | 0 | -0.9054 | -0.8905 | -0.9473 | -0.7590 | -0.6741 | -1.7213 |
1 | 1 | 103 | 1 | 152 | 0.1978 | 0.7989 | 0.6011 | 0 | 1.4309 | -0.0817 | -0.6551 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 111 | 1 | 11 | 0.2065 | 0.3909 | 0.1844 | 1 | -1.0197 | -1.2710 | -1.2979 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 268 | 1 | 105 | 0.1888 | 0.5994 | 0.4106 | 0 | 0.6533 | -1.2710 | -0.6551 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 268 | 1 | 105 | 0.1647 | 0.5994 | 0.4347 | 0 | -0.3789 | -0.8905 | -1.1420 | -0.0501 | -0.6741 | -1.7213 |
1 | 1 | 225 | 1 | 170 | 0.1675 | 0.7934 | 0.6259 | 0 | 0.3502 | -0.8905 | -0.9473 | -0.0501 | 0.4307 | -1.7213 |
1 | 1 | 172 | 1 | 217 | 0.1972 | 1.0939 | 0.8967 | 0 | 0.6533 | -1.2710 | -0.2850 | 1.3676 | -0.6741 | -1.7079 |
1 | 1 | 111 | 1 | 11 | 0.0999 | 0.3909 | 0.2910 | 0 | -0.9054 | -0.8905 | -0.9473 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 35 | 1 | 216 | 0.1485 | 0.9101 | 0.7616 | 0 | 1.3530 | 0.6795 | -0.3240 | -0.7590 | 0.4307 | -1.7079 |
1 | 1 | 188 | 1 | 256 | 0.2316 | 1.5526 | 1.3210 | 1 | 1.3461 | 0.6795 | -0.6356 | -0.0501 | 2.6404 | -1.7079 |
1 | 1 | 194 | 1 | 110 | 0.0744 | 0.5606 | 0.4862 | 0 | -0.1260 | 0.6795 | -0.9473 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 35 | 1 | 216 | 0.1144 | 0.9101 | 0.7956 | 0 | 1.3461 | 0.6795 | -0.2850 | -0.0501 | 0.4307 | -1.7079 |
1 | 1 | 111 | 1 | 11 | 0.0619 | 0.3909 | 0.3290 | 0 | -1.2448 | -1.2710 | -0.9473 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 136 | 1 | 179 | 0.1333 | 1.0527 | 0.9194 | 0 | 0.1857 | 0.6795 | -0.6356 | -0.0501 | -0.6741 | -1.7079 |
1 | 1 | 86 | 1 | 44 | 0.1349 | 0.5002 | 0.3653 | 0 | 0.0039 | -0.8905 | -0.6356 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 197 | 1 | 97 | 0.2249 | 0.8563 | 0.6314 | 1 | 1.3530 | -0.0817 | -0.9473 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 225 | 1 | 170 | 0.2302 | 0.7934 | 0.5632 | 1 | -0.0394 | -0.8905 | -0.9473 | -0.0501 | 0.4307 | -1.7079 |
1 | 1 | 35 | 1 | 216 | 0.2512 | 0.9101 | 0.6588 | 1 | 2.0458 | 0.6795 | -0.7135 | -0.7590 | 0.4307 | -1.7079 |
1 | 1 | 197 | 1 | 97 | 0.125 | 0.8563 | 0.7312 | 0 | 0.6602 | -0.0817 | -0.7914 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 136 | 1 | 179 | 0.068 | 1.0527 | 0.9847 | 0 | 0.9114 | 0.6795 | -0.6551 | -0.0501 | -0.6741 | -1.7079 |
1 | 1 | 120 | 1 | 212 | 0.2587 | 0.5996 | 0.3409 | 1 | 2.7316 | -0.8905 | 0.1513 | -0.0501 | -0.6741 | -1.7079 |
1 | 1 | 35 | 1 | 216 | 0.1776 | 0.9101 | 0.7324 | 0 | 0.9131 | 0.6795 | -0.6356 | -0.7590 | 0.4307 | -1.7079 |
1 | 1 | 248 | 1 | 81 | 0.1205 | 0.5102 | 0.3897 | 0 | 0.3139 | -0.0817 | -0.7836 | -0.7590 | -0.6741 | -1.7079 |
1 | 1 | 103 | 1 | 152 | 0.0801 | 0.7989 | 0.7188 | 0 | 0.7382 | -0.0817 | -0.6551 | -0.0501 | -0.6741 | -1.7079 |
1 | 1 | 136 | 1 | 179 | 0.0611 | 1.0527 | 0.9916 | 0 | 0.6187 | 0.6795 | -0.6356 | -0.0501 | -0.6741 | -1.7079 |
1 | 1 | 172 | 1 | 217 | 0.149 | 1.0939 | 0.9449 | 0 | 0.8265 | -0.8905 | -0.2850 | 1.3676 | -0.6741 | -1.7079 |
1 | 1 | 81 | 1 | 250 | 0.1589 | 0.9042 | 0.7453 | 0 | 1.3530 | 0.6795 | -0.6551 | 1.3676 | 0.4307 | -1.6406 |
1 | 1 | 188 | 1 | 256 | 0.2664 | 1.5526 | 1.2862 | 1 | 0.9824 | -0.8905 | -0.6356 | -0.0501 | 2.6404 | -1.6406 |
1 | 1 | 74 | 1 | 4 | 0.0707 | 0.5442 | 0.4735 | 0 | -1.0786 | -0.8905 | -1.2784 | -1.1134 | -0.6741 | -1.6406 |
1 | 1 | 120 | 1 | 212 | 0.2475 | 0.5996 | 0.3521 | 1 | 2.7316 | -0.8905 | 0.1513 | -0.0501 | -0.6741 | -1.6406 |
1 | 1 | 69 | 1 | 254 | 0.1398 | 0.8873 | 0.7475 | 0 | 2.9048 | 0.6795 | 0.3383 | -0.0501 | 0.4307 | -1.6406 |
1 | 1 | 188 | 1 | 256 | 0.3097 | 1.5526 | 1.2429 | 1 | 0.7226 | -0.8905 | -0.6356 | -0.0501 | 2.6404 | -1.6406 |
1 | 1 | 103 | 1 | 152 | 0.0663 | 0.7989 | 0.7326 | 0 | 0.6602 | -0.0817 | -0.7836 | -0.0501 | -0.6741 | -1.6406 |
1 | 1 | 136 | 1 | 179 | 0.0848 | 1.0527 | 0.9679 | 0 | 0.9131 | 0.6795 | -0.9473 | -0.0501 | -0.6741 | -1.6406 |
1 | 1 | 250 | 1 | 248 | 0.2187 | 1.1807 | 0.9619 | 1 | 1.1729 | -0.0817 | -0.2850 | 1.3676 | -0.6741 | -1.6406 |
1 | 1 | 86 | 1 | 44 | 0.0133 | 0.5002 | 0.4869 | 0 | -0.3789 | -0.8905 | -0.9473 | -0.7590 | -0.6741 | -1.6406 |
1 | 1 | 268 | 1 | 105 | 0.0838 | 0.5994 | 0.5155 | 0 | -0.0394 | -1.2710 | -0.6551 | -0.0501 | -0.6741 | -1.6406 |
1 | 1 | 111 | 1 | 11 | 0.2143 | 0.3909 | 0.1766 | 1 | -1.0786 | -1.2710 | -0.9473 | -0.7590 | 0.4307 | -1.6406 |
1 | 1 | 136 | 1 | 179 | 0.0499 | 1.0527 | 1.0028 | 0 | 0.6187 | 0.6795 | -0.6356 | -0.0501 | -0.6741 | -1.6406 |
1 | 1 | 81 | 1 | 250 | 0.2208 | 0.9042 | 0.6835 | 1 | 1.8726 | 0.6795 | -0.6551 | 1.3676 | 0.4307 | -1.6406 |
1 | 1 | 136 | 1 | 179 | 0.0436 | 1.0527 | 1.0091 | 0 | 0.8334 | 0.6795 | -0.7798 | -0.0501 | -0.6741 | -1.6406 |
1 | 1 | 111 | 1 | 11 | 0.0662 | 0.3909 | 0.3247 | 0 | -0.8621 | -1.2710 | -0.9473 | -0.7590 | -0.6741 | -1.6406 |
1 | 1 | 86 | 1 | 44 | 0.0664 | 0.5002 | 0.4338 | 0 | -0.3858 | -0.8905 | -0.6356 | -0.7590 | -0.6741 | -1.6406 |
1 | 1 | 248 | 1 | 81 | 0.0839 | 0.5102 | 0.4262 | 0 | 0.1407 | -0.0817 | -0.7836 | -0.7590 | -0.6741 | -1.6406 |
1 | 1 | 35 | 1 | 216 | 0.1848 | 0.9101 | 0.7253 | 0 | 0.5667 | 0.6795 | -0.6356 | -0.0501 | 0.4307 | -1.6406 |
1 | 1 | 103 | 1 | 152 | 0.0669 | 0.7989 | 0.7321 | 0 | 0.1164 | -0.0817 | -0.6356 | -0.0501 | -0.6741 | -1.5331 |
1 | 1 | 120 | 1 | 212 | 0.2296 | 0.5996 | 0.3700 | 1 | 2.7316 | -0.8905 | 0.1513 | -0.0501 | -0.6741 | -1.5331 |
1 | 1 | 103 | 1 | 152 | 0.0763 | 0.7989 | 0.7226 | 0 | 0.6602 | -0.0817 | -0.7836 | -0.0501 | -0.6741 | -1.5331 |
1 | 1 | 250 | 1 | 248 | 0.2008 | 1.1807 | 0.9798 | 1 | 1.1729 | -0.0817 | -0.2850 | 1.3676 | -0.6741 | -1.5331 |
1 | 1 | 194 | 1 | 110 | 0.0309 | 0.5606 | 0.5298 | 0 | -0.0394 | 0.6795 | -0.9473 | -0.7590 | -0.6741 | -1.5331 |
1 | 1 | 74 | 1 | 4 | 0.0802 | 0.5442 | 0.4640 | 0 | -0.9140 | -0.8905 | -1.2784 | -1.1134 | -0.6741 | -1.5331 |
1 | 1 | 250 | 1 | 248 | 0.1177 | 1.1807 | 1.0630 | 0 | 1.5106 | 0.6795 | -0.2850 | 1.3676 | -0.6741 | -1.5331 |
1 | 1 | 225 | 1 | 170 | 0.1999 | 0.7934 | 0.5935 | 0 | 0.6602 | -0.8905 | -0.6746 | -0.7590 | 0.4307 | -1.5331 |
1 | 1 | 248 | 1 | 81 | 0.0572 | 0.5102 | 0.4530 | 0 | -0.0394 | -0.0817 | -1.1031 | -0.7590 | -0.6741 | -1.5331 |
1 | 1 | 225 | 1 | 170 | 0.1418 | 0.7934 | 0.6516 | 0 | 0.6533 | -0.0817 | -0.6356 | -0.0501 | 0.4307 | -1.5331 |
1 | 1 | 225 | 1 | 170 | 0.1048 | 0.7934 | 0.6886 | 0 | 0.5148 | -0.8905 | -0.6356 | -0.0501 | 0.4307 | -1.5331 |
1 | 1 | 194 | 1 | 110 | 0.0655 | 0.5606 | 0.4951 | 0 | -0.2473 | 0.6795 | -0.9473 | -0.7590 | -0.6741 | -1.5331 |
1 | 1 | 225 | 1 | 170 | 0.2716 | 0.7934 | 0.5218 | 1 | 1.0066 | -0.0817 | -0.6746 | -0.7590 | 0.4307 | -1.5331 |
1 | 1 | 268 | 1 | 105 | 0.0478 | 0.5994 | 0.5516 | 0 | 0.1338 | -0.8905 | -0.6551 | -0.0501 | -0.6741 | -1.5331 |
1 | 1 | 74 | 1 | 4 | 0.0688 | 0.5442 | 0.4754 | 0 | -1.2604 | -1.2710 | -1.2784 | -1.1134 | -0.6741 | -1.5331 |
1 | 1 | 74 | 1 | 4 | 0.0962 | 0.5442 | 0.4480 | 0 | -1.4249 | -1.2710 | -1.2784 | -1.1134 | -0.6741 | -1.5331 |
1 | 1 | 197 | 1 | 97 | 0.163 | 0.8563 | 0.6933 | 0 | 0.4005 | -0.8905 | -0.7135 | -0.7590 | -0.6741 | -1.5331 |
1 | 1 | 259 | 1 | 53 | 0.0343 | 0.4085 | 0.3742 | 0 | -0.5521 | -0.8905 | -0.7836 | -0.7590 | -0.6741 | -1.1163 |
1 | 1 | 172 | 1 | 217 | 0.1227 | 1.0939 | 0.9713 | 0 | 1.1642 | -0.8905 | 0.1513 | 1.3676 | -0.6741 | -1.1163 |
1 | 1 | 41 | 1 | 111 | 0.1383 | 0.8620 | 0.7237 | 0 | -0.0481 | -0.8905 | -0.7759 | -0.7590 | 0.4307 | -1.1163 |
1 | 1 | 249 | 1 | 89 | 0.109 | 0.4151 | 0.3061 | 0 | -0.3858 | -0.8905 | -0.6356 | -0.0501 | -0.6741 | -1.1163 |
1 | 1 | 225 | 1 | 170 | 0.1834 | 0.7934 | 0.6100 | 0 | 0.4801 | -0.0817 | -0.6356 | -0.0501 | 0.4307 | -1.1163 |
1 | 1 | 103 | 1 | 152 | 0.1017 | 0.7989 | 0.6972 | 0 | 0.3243 | -0.0817 | -0.6356 | -0.0501 | -0.6741 | -1.1163 |
1 | 1 | 134 | 1 | 136 | 0.1277 | 0.5248 | 0.3970 | 0 | 0.1338 | -1.2710 | -0.2850 | -0.0501 | -0.6741 | -1.1163 |
1 | 1 | 188 | 1 | 256 | 0.253 | 1.5526 | 1.2996 | 1 | 1.0170 | -0.8905 | -0.6356 | -0.0501 | 2.6404 | -1.1163 |
1 | 1 | 128 | 1 | 233 | 0.1628 | 0.8728 | 0.7100 | 0 | 0.6602 | -0.8905 | -0.6551 | 1.3676 | 0.4307 | -1.1163 |
1 | 1 | 134 | 1 | 136 | 0.1201 | 0.5248 | 0.4046 | 0 | 0.3069 | -0.8905 | -0.6356 | -0.0501 | -0.6741 | -1.1163 |
1 | 1 | 291 | 1 | 194 | 0.1523 | 0.8878 | 0.7355 | 0 | 0.3589 | 0.6795 | -0.6356 | -0.0501 | 0.4307 | -1.1163 |
1 | 1 | 273 | 1 | 158 | 0.135 | 0.7643 | 0.6294 | 0 | 0.5148 | -0.8905 | -0.6356 | -0.0501 | 0.4307 | -1.1163 |
1 | 1 | 200 | 1 | 251 | 0.2491 | 0.8753 | 0.6262 | 1 | 2.2120 | 0.6795 | 0.3383 | -0.0501 | -0.6741 | -1.1163 |
1 | 1 | 249 | 1 | 89 | 0.1205 | 0.4151 | 0.2945 | 0 | -0.3165 | -0.8905 | -0.6356 | -0.0501 | -0.6741 | -1.1163 |
1 | 1 | 247 | 1 | 175 | 0.3124 | 0.8540 | 0.5417 | 1 | 0.6602 | -0.0817 | -0.0318 | -0.7590 | 0.4307 | -1.1163 |
1 | 1 | 205 | 1 | 95 | 0.0848 | 0.6402 | 0.5554 | 0 | -0.0394 | -0.0817 | -0.7759 | -0.7590 | -0.6741 | -1.1163 |
1 | 1 | 247 | 1 | 175 | 0.1941 | 0.8540 | 0.6600 | 0 | 0.8178 | -0.0817 | -0.2850 | -0.0501 | 0.4307 | -1.1163 |
1 | 1 | 81 | 1 | 250 | 0.2171 | 0.9042 | 0.6872 | 1 | 1.3374 | -0.0817 | 0.1513 | 1.3676 | 0.4307 | -1.1163 |
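Because each scored record carries an anomalyFlag (1 when its quantization error exceeds mad.threshold, 0 otherwise), the flagged anomalies can be tallied directly. A minimal sketch:
# Tally anomalies among the scored records
scored <- scoring_comp[["scoredPredictedData"]]
table(scored$anomalyFlag)   # 0 = within threshold, 1 = anomaly
mean(scored$anomalyFlag)    # proportion of records flagged as anomalous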
hist(scoring_comp[["actual_predictedTable"]]$diff, breaks = 20, col = "blue", main = "Mean Absolute Difference", xlab = "Difference", xlim = c(0, 0.6), ylim = c(0, 250))
Example I: HVT with the Torus dataset
We have considered the torus dataset for multidimensional data visualization using Sammon's projection.
We have randomly selected 9,600 datapoints for training and the remaining datapoints for validation.
Our goal is to achieve a data compression of at least 80%.
We constructed a compressed HVT map (hvt.torus) by applying trainHVT() to the torus dataset with the parameters n_cells = 100, quant.err = 0.1, and depth = 1. Upon analyzing the compression summary, we found that none of the 100 cells had a quantization error below the threshold.
We created another compressed HVT map (hvt.torus2) by applying trainHVT() to the torus dataset, this time adjusting the parameters to n_cells = 300, quant.err = 0.1, and depth = 1. After examining the compression summary, we discovered that 48% of the cells reached the quantization error threshold.
Once again, we generated a compressed HVT map (hvt.torus3) by applying trainHVT() to the torus dataset, with the parameters set to n_cells = 500, quant.err = 0.1, and depth = 1. Upon analyzing the compression summary, we found that 90% of the cells reached the quantization error threshold, and we can clearly visualize the 3D torus (donut) in 2D space.
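A minimal sketch of this third run, assuming the 9,600-point training split is named torus_train (the normalization and metric arguments mirror the Computers example and are assumptions here):
# Third attempt: 500 cells brings ~90% of cells under the error threshold
hvt.torus3 <- trainHVT(torus_train,   # assumed object name for the training split
                       n_cells = 500,
                       depth = 1,
                       quant.err = 0.1,
                       normalize = TRUE,           # assumed, as in the Computers example
                       distance_metric = "L2_Norm",
                       error_metric = "max",
                       quant_method = "kmeans")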
Example II: HVT with the Personal Computer dataset
We have considered the Computers dataset for multidimensional data visualization using Sammon's projection.
We have randomly selected 80% of the datapoints for training and the rest for validation.
Our goal is to achieve a data compression of at least 80%.
We constructed a compressed HVT map by applying trainHVT() to the training dataset with n_cells set to 300 and quant.err set to 0.2, and we were able to attain a compression of 91%.
We then plotted the Voronoi tessellation with the heatmap overlaid for all the features in the Computers dataset for better visualization.
Next, we passed the validation dataset, along with the HVT map obtained from trainHVT(), to scoreHVT() to see which cell and level each point belongs to.
Pricing Segmentation - The package can be used to discover groups of similar customers based on customer spend patterns and to understand the price sensitivity of customers.
Market Segmentation - The package can be helpful in market segmentation, where we have to identify micro and macro segments. The method used in this package can do both kinds of segmentation in one go.
Anomaly Detection - This method can help us categorize system behavior over time and find anomalies when there are changes in the system, e.g. finding fraudulent claims in healthcare insurance.
The package can help us understand the underlying structure of the data. Suppose we want to analyze a curved surface such as a sphere or a vase; we can approximate it by many small low-order polygons in the form of tessellations using this package.
In biology, Voronoi diagrams are used to model a number of different biological structures, including cells and bone microarchitecture.
Using the base idea of System Dynamics, these diagrams can also be used to depict customer state changes over a period of time.