# Visualizing the true extent of the curse of dimensionality | by Florin Andrei | Medium

## Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features

Consider a dataset, made of some number of observations, each observation having N features. If you convert all features to a numeric representation, you can say that each observation is a point in an N-dimensional space.

When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large; this could happen, for example, if you're creating many features via one-hot encoding, etc. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow larger than what you would expect.
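To see how quickly one-hot encoding inflates the feature count, here is a small sketch; the column names and cardinalities are made up for illustration:

```python
import pandas as pd

# Hypothetical dataset: 3 categorical columns with many distinct values each
df = pd.DataFrame({
    "city": [f"city_{i}" for i in range(100)],       # 100 categories
    "product": [f"product_{i % 50}" for i in range(100)],  # 50 categories
    "channel": [f"channel_{i % 20}" for i in range(100)],  # 20 categories
})

# one-hot encoding: one column per (feature, category) pair
encoded = pd.get_dummies(df)
print(df.shape[1], "->", encoded.shape[1])  # 3 -> 170
```

Three categorical columns become 170 numeric features; with more columns or higher cardinality, N grows even faster.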

The phenomenon is real. As the number of dimensions N grows, and all else stays the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points really does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced as a consequence of these changes.
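To make the growth of distances concrete, here is a small NumPy sketch (the sample sizes and dimensions are arbitrary choices, not from the article) that estimates the mean distance between random points in the unit cube as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

mean_dist = {}
for d in (2, 10, 100, 1000):
    # pairs of points drawn uniformly at random from the unit cube [0, 1]^d
    a = rng.random((10_000, d))
    b = rng.random((10_000, d))
    mean_dist[d] = np.linalg.norm(a - b, axis=1).mean()
    print(f"d={d:4d}  mean distance = {mean_dist[d]:.2f}")
```

The mean distance keeps growing (roughly like the square root of d), even though every coordinate stays confined to [0, 1].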

Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient: you need more trials to cover the same intervals along any axis. A subtle effect impacts any model based on distance or vicinity: **as the number of dimensions grows to very large values, if you consider any point among your observations, all the other points appear to be far away and nearly equidistant**. Since these models rely on distance to do their job, the leveling out of distance differences makes their job much harder; e.g. clustering does not work as well if all points appear to be nearly equidistant.
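The near-equidistance effect can be checked with a short sketch (again with arbitrary sample sizes, not the author's code): pick a reference point and measure how much farther the farthest neighbor is than the nearest one:

```python
import numpy as np

rng = np.random.default_rng(0)

contrast = {}
for d in (2, 10, 100, 1000):
    points = rng.random((2_000, d))
    # distances from one reference point (the first) to all the others
    dist = np.linalg.norm(points[1:] - points[0], axis=1)
    # relative contrast: how much farther the farthest neighbor is than the nearest
    contrast[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast = {contrast[d]:.2f}")
```

In low dimensions the nearest neighbor is many times closer than the farthest point; in very high dimensions the contrast collapses toward zero, which is exactly what hurts distance-based models.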

For all these reasons, and more, techniques such as PCA, LDA, etc. have been created: to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.
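As a quick illustration of that idea (on synthetic data, not from the article), PCA can recover a small number of meaningful directions from a nominally high-dimensional dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# synthetic data: 500 observations, 100 features, but only 5 true degrees of freedom
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 100)) + 0.01 * rng.normal(size=(500, 100))

pca = PCA(n_components=0.99)  # keep enough components to explain 99% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape[1], "->", X_reduced.shape[1])  # most of the signal fits in a few dims
```

The 100 nominal features collapse to a handful of components, because that is all the information the data actually contains.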

It is hard to grasp intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely hard to visualize, so let's do some simple 2D visualizations to support our intuition. There is a geometric basis for why dimensionality can become a problem, and this is what we will visualize here. If you have not seen this before, the results might be surprising: the geometry of high-dimensional spaces is far more complex than the usual intuition suggests.

Consider a square of size 1, centered at the origin. In the square, you inscribe a circle.

That's the setup in 2 dimensions. Now think of the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as "sphere" and "cube", no matter how many dimensions they have.

The volume of the cube is fixed; it is always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?
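For reference, the volume of the sphere is also known in closed form, V_d(r) = π^(d/2) r^d / Γ(d/2 + 1), which gives exact values to check the experiment against (this helper is not part of the original article):

```python
from math import gamma, pi

def sphere_volume(d: int, r: float = 0.5) -> float:
    """Exact volume of a d-dimensional ball of radius r."""
    return pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in (2, 3, 5, 10, 20):
    print(f"d={d:2d}  volume = {sphere_volume(d):.8f}")
```

For d=2 this gives π/4 ≈ 0.785, for d=3 it gives π/6 ≈ 0.524, and by d=20 the volume has already dropped below 10⁻⁷.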

Let's answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly but randomly inside the cube. For each point we calculate its distance to the origin; if that distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.

If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better when the total number of points is large.
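As a quick sanity check in 2 dimensions (a small NumPy sketch, separate from the PyTorch code below), the estimated ratio should approach π/4 ≈ 0.785, the area of a circle of radius 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# points uniform in the unit square centered at the origin
points = rng.random((n, 2)) - 0.5
inside = np.linalg.norm(points, axis=1) <= 0.5

print(inside.mean())   # estimated area of the inscribed circle
print(np.pi / 4)       # exact value
```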

In other words, the ratio `inside_points / total_points` will approximate the volume of the sphere.

The code is fairly straightforward. Since we need many points, explicit loops must be avoided. We could use NumPy, but it's CPU-only and single-threaded, so it would be slow. Possible alternatives: CuPy (GPU), JAX (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch, but the NumPy code would look almost identical.

If you follow the nested `torch` statements: we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The `ratio` array will end up containing the volume of the sphere for various numbers of dimensions.

The tunable parameters are set for a GPU with 24 GB of memory; adjust them if your hardware is different.

```python
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = "cpu"

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )
    # clean up memory
    torch.cuda.empty_cache()
```

Let's visualize the results:
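One minimal way to plot the `ratio` array would be something like the following sketch (assuming Matplotlib; the sample values here are the known exact volumes for d = 1 through 5, rounded, standing in for a real run):

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove for on-screen display
import matplotlib.pyplot as plt

# stand-in values: exact sphere volumes for d = 1..5 (a real run fills `ratio`)
ratio = np.array([1.0, 0.7854, 0.5236, 0.3084, 0.1645])

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(ratio) + 1), ratio, marker="o")
ax.set_xlabel("number of dimensions")
ax.set_ylabel("volume of the sphere")
fig.savefig("sphere_volume.png", dpi=150)
```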