How does random forest define the Proximity (Similarity) between observations?

vishrut-singhal · 5 June 2021 16:41

Random Forest defines proximity between two data points in the following way:

Initialize all proximities to zero.
For each tree, pass every case down the tree.
If case i and case j end up in the same terminal node (leaf), increase their proximity prox(i, j) by one.
Accumulate over all trees in the random forest and normalize by dividing by the number of trees, so that each case has a proximity of 1 with itself.

The result is a proximity matrix, i.e., a square, symmetric matrix with 1 on the diagonal and values between 0 and 1 off the diagonal. A proximity close to 1 means two observations are “alike”; the closer the proximity is to 0, the more dissimilar they are.
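
The counting procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not any library's implementation: the function name `proximity_matrix` and the toy `leaf_ids` data are made up for the example, and it assumes you already know which terminal node each case reaches in each tree. The counts are divided by the number of trees, which is what makes the diagonal equal 1.

```python
def proximity_matrix(leaf_ids):
    """Build a proximity matrix from per-tree leaf assignments.

    leaf_ids[t][i] is the id of the terminal node that case i
    reaches in tree t (hypothetical input format for this sketch).
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])

    # Step 1: initialize all proximities to zero.
    prox = [[0.0] * n_cases for _ in range(n_cases)]

    # Steps 2-3: for each tree, add one for every pair of cases
    # that share a terminal node (a case always shares with itself).
    for tree_leaves in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if tree_leaves[i] == tree_leaves[j]:
                    prox[i][j] += 1.0

    # Step 4: normalize the accumulated counts by the number of trees.
    return [[count / n_trees for count in row] for row in prox]


# Made-up leaf assignments: 4 trees (rows), 3 cases (columns).
leaf_ids = [
    [0, 0, 1],
    [2, 2, 2],
    [5, 4, 4],
    [1, 1, 3],
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print(row)
```

Here cases 0 and 1 share a leaf in three of the four trees, so their proximity is 0.75, while cases 0 and 2 share a leaf in only one tree and get 0.25. With a fitted scikit-learn forest, the leaf assignments could presumably come from the forest's `apply(X)` method, which returns one leaf index per sample and tree (transposed relative to the layout assumed here).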