How does Random Forest define the proximity (similarity) between observations?

Random Forest defines proximity between two data points in the following way:

  • Initialize proximities to zeroes.
  • For each tree, run all the cases down the tree.
  • If case i and case j both end up in the same terminal (leaf) node, then the proximity prox(i, j) between i and j increases by one.
  • Accumulate over all trees in the Random Forest and normalize by dividing by the number of trees, so that a pair sharing a terminal node in every tree has proximity 1.

Finally, it creates a proximity matrix, i.e., a symmetric square matrix with 1 on the diagonal and values between 0 and 1 in the off-diagonal positions. Proximities close to 1 indicate that two observations are "alike"; conversely, the closer a proximity is to 0, the more dissimilar the two cases are.
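
The procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `proximity_matrix` and the `leaves` table are hypothetical, and the leaf ids are made up for the example. It assumes each case's terminal node in each tree is already known, and it divides the counts by the number of trees, which is what puts 1 on the diagonal.

```python
def proximity_matrix(leaves):
    """Build a proximity matrix from terminal-node assignments.

    leaves[i][t] is the id of the terminal node that case i reaches in tree t.
    """
    n_cases = len(leaves)
    n_trees = len(leaves[0])

    # Initialize all proximities to zero.
    prox = [[0.0] * n_cases for _ in range(n_cases)]

    # For each tree, add one for every pair of cases sharing a terminal node.
    for t in range(n_trees):
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i][t] == leaves[j][t]:
                    prox[i][j] += 1.0

    # Normalize the accumulated counts by the number of trees.
    return [[count / n_trees for count in row] for row in prox]


# Hypothetical leaf assignments: 4 cases (rows) run down 3 trees (columns).
leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 4, 3],
    [3, 5, 6],
]

prox = proximity_matrix(leaves)
for row in prox:
    print(["%.2f" % value for value in row])
```

Here cases 0 and 1 share a terminal node in two of the three trees, so their proximity is 2/3, while case 3 never lands with any other case and has proximity 0 to all of them. With scikit-learn, a table shaped like `leaves` can be obtained from a fitted forest's `apply(X)` method, which returns the leaf index of each sample in each tree.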