The problem briefly is that: Given two sets of K dimensional points find the N nearest points (using Euclidean distance or any other measurement) to a K dimensional point X.
The Naive Solution:
Go through the list, find the distance between X and all the points in the two sets, get the smallest N elements in the distances list... Simple enough.
As part of our assignment we were given a dataset of 100,000 points which proved this algorithm to be very slow.
My Naive WRONG Solution:
I thought that it was easy to index the points in a way that makes it efficient to find the nearest points to a given point. Multiple Dimensional Sorting... It was totally wrong because two points might have the same values of m dimensions but the values of the other (K-m) dimensions put them far away from each other.
Possible Indexing Solutions:
There are different ways to index K Dimensional Points based on how near they are to each other. I will mention here KD-Trees.
K dimensional trees is a binary tree that is based on space partitioning. It's used to index multi-dimensional data.
The figure represents a simple 3d-tree. The dimensions are X,Y and Z. Constructing such a tree is much similar to a conventional binary search tree but differs in only that splitting mechanism uses a different dimension at each level and keeps iterating on them in the same order.
The root point in the figure splits the space into two partitions X>5 and X<5. On the second level of the tree the X>5 partition is split into two partitions Y>3 and Y<3 and so on.
So as you can see by navigating through the tree you get a good approximation of the nearest points to the point you're searching for. But because the point you're searching for might be on the edge of one of the partitions a normal binary search will not always yield the correct the real nearest neighbour so re-checking is needed.
To do that re-checking the algorithm saves the point obtained by a normal binary search as the current best. Then it iterates on the nodes already visisted in the tree to check for two things:
- If any of those nodes' distance is smaller than the current best.
- If there exists a partition (branch) that might contain points with smaller distances. That's done by comparing the distance between the point we're searching for and the splitting point on the adjacent partition (the sibling of the node currently being examined) and the current best distance.