Curse of dimensionality¶

Based on this article : http://www.edupristine.com/blog/curse-dimensionality

In [1]:

import numpy as np
import pylab as pl
%matplotlib inline

Sampling¶

This the sampling example of the original article.

[Original image](http://content.edupristine.com/images/blogs/curse-of-dimensionality_1.png)

Let's say we want to sample a target percentage ($p$) of our points. If our points are in $n$ dimensions, we have to sample $x$ percent of the range along each dimension to achieve overall $p$ sampling rate.

$$x^n = p$$

So if we want to find $x$ for a given dimension

$$x = p^\frac{1}{n}$$

Let's plot this

In [2]:

# we want to sample 1% of our points
p = 0.01
n = np.arange(1, 50)

x = p**(1.0 / n)

In [3]:

pl.title('Percentage of range to sample along each axis\nto achieve 1% overall sampling rate')
pl.plot(n, x)
pl.xlabel('dimensionality')
pl.ylabel('percentage of range')

Out[3]:

<matplotlib.text.Text at 0x7fed0b4fe0d0>

Distance¶

This is the distance example.

[Original image](http://content.edupristine.com/images/blogs/curse-of-dimensionality_2.png)

(Image linked from the article)

Given a square (or a cube or hypercube), we want to find what volume of the hypercube is covered by the biggest possible hypersphere. Since many algorithms rely on euclidean distance to define nearness, this gives an idea of how many points will fall in our sphere if we choose the biggest possible radius. This should be high for our algorithms to work as intended.

For example, if you're in 2D and you have points in the ranges [0, 10], [0, 10], you would except that setting a distance threshold of 10 will include most of the points (obviously missing the points in the corner). If we increase the dimensionality, the volume of the "corners" will increase proportionally to our sphere's volume.

The volume of an hypersphere of dimension n can be computed recursively :

$$V_2(R) = \pi R^2$$$$V_3(R) = \frac{4}{3} \pi R^3$$$$V_n(R) = \frac{2\pi R^2}{n} V_{n-2}(R)$$

In [4]:

# volume of n-spheres of radius R
dims = np.arange(15)
R = 1.0
S_n = [1,
       2 * R,
       np.pi * R**2,
       4.0/3.0 * np.pi * R**3]
for n in dims[4:]:
    S_n.append((2 * np.pi * R**2 / float(n)) * S_n[n - 2])
S_n = np.array(S_n)

In [5]:

# volume of hypercubes of edge length 2*R
C_n = np.array([(2*R)**n for n in dims])

In [6]:

pl.figure(figsize=(12, 3))

ax1 = pl.subplot(111)
ax1.set_title('Volume as a function of dimensionality\nfor R=%f' % R)
ax1.plot(dims, S_n, c='g', label='n-sphere volume')
ax2 = ax1.twinx()
ax2.plot(dims, C_n, c='b', label='enclosing n-cube volume')
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax1.legend(h1+h2, l1+l2, loc=(0.0, 0.9), fontsize=10)

Out[6]:

<matplotlib.legend.Legend at 0x7fed0b2a5790>

Interestingly, the volume of the n-sphere decreases after a number of dimensions (this depends on its radius). This is explained on Wikipedia

In [7]:

pl.figure(figsize=(12, 3))

ax1 = pl.subplot(111)
ax1.set_title('Ratio of n-sphere to n-cube volume\n for  R=%f' % R)
ax1.plot(dims, S_n/C_n)

Out[7]:

[<matplotlib.lines.Line2D at 0x7fed0b11ab50>]