What are different distance measures used in Clustering?

board-infinity · 7 October 2022 06:53

Classifying observations into groups requires a method for computing the distance, or (dis)similarity, between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. Clustering groups objects that are similar to each other, so it depends on being able to decide whether two items are similar or dissimilar in their properties. In the clustering setting, a distance measure is a function that quantifies how far apart two objects are; a similarity measure is its counterpart and quantifies how alike they are.
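To make the idea of a distance matrix concrete, here is a minimal Python sketch. The function name `distance_matrix` and the sample observations are illustrative assumptions, not part of any particular clustering library. The pairwise distance defaults to the standard library's `math.dist` (straight-line distance, available from Python 3.8), but any function of two points could be passed instead.

```python
import math

def distance_matrix(points, dist=math.dist):
    """Return an n x n list of lists holding all pairwise distances.

    `dist` is any function that takes two points and returns a number;
    it defaults to the standard library's straight-line `math.dist`.
    """
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            matrix[i][j] = d
            matrix[j][i] = d  # distances are symmetric, so mirror the entry
    return matrix

# Hypothetical sample observations (2-D points).
observations = [(0, 0), (3, 4), (6, 8)]
for row in distance_matrix(observations):
    print(row)
```

Each row i of the result holds the distances from observation i to every other observation, and the diagonal is zero because every observation is at distance zero from itself.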

Most clustering approaches use a distance measure to assess the similarities or differences between a pair of objects. The most popular distance measures are:

1. Euclidean Distance: Euclidean distance is the traditional metric for geometric problems. It is the ordinary straight-line distance between two points. In a plane with P at (x1, y1) and Q at (x2, y2), the Euclidean distance between P and Q = sqrt((x1 - x2)^2 + (y1 - y2)^2).
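A minimal Python sketch of the Euclidean distance follows. The function name `euclidean_distance` and the sample points are illustrative assumptions. It takes the square root of the sum of squared coordinate differences and works for any number of dimensions.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Coordinate differences are 3 and 4, so this is a 3-4-5 right triangle.
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```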

2. Manhattan Distance: This is the sum of the absolute differences between the corresponding coordinates of two points.

Suppose we have two points P and Q. To determine the Manhattan distance between them, we add up how far apart they are along the X-axis and how far apart they are along the Y-axis.

In a plane, let P be at coordinate (x1, y1) and Q at (x2, y2).

Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
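The formula above can be sketched in Python as follows. The function name `manhattan_distance` and the sample points are illustrative assumptions. The same sum of absolute differences extends to any number of coordinates.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

# |1 - 4| + |2 - 6| = 3 + 4
print(manhattan_distance((1, 2), (4, 6)))  # 7
```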

3. Jaccard Index: The Jaccard index measures the similarity of two sets of items as the size of their intersection divided by the size of their union. The corresponding Jaccard distance is 1 minus the Jaccard index.
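A minimal Python sketch for set-valued data follows. The function names and the sample baskets are illustrative assumptions. The index is the size of the intersection divided by the size of the union, and the distance is taken as 1 minus the index. Returning 1.0 for two empty sets is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_index(a, b):
    """Similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Dissimilarity derived from the index: 1 - Jaccard index."""
    return 1.0 - jaccard_index(a, b)

# Hypothetical sample data: two shopping baskets.
basket_1 = {"bread", "milk", "eggs"}
basket_2 = {"milk", "eggs", "butter"}

# 2 shared items out of 4 distinct items overall.
print(jaccard_index(basket_1, basket_2))     # 0.5
print(jaccard_distance(basket_1, basket_2))  # 0.5
```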

4. Minkowski Distance:

It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as:

(x1, x2, ..., xN)

Consider two points P1 and P2:

P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
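As a sketch of how this generalization looks in code: the Minkowski distance of order p is commonly defined as (|X1 - Y1|^p + |X2 - Y2|^p + ... + |XN - YN|^p)^(1/p), so p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The function name `minkowski_distance`, the parameter name `p`, and the sample points below are illustrative assumptions.

```python
def minkowski_distance(p1, p2, p=2):
    """Minkowski distance of order p between two N-dimensional points."""
    if len(p1) != len(p2):
        raise ValueError("points must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1 for the result to be a metric")
    # Raise each absolute difference to the power p, sum, take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(p1, p2)) ** (1 / p)

# Hypothetical sample points in 2-D.
P1 = (1, 2)
P2 = (4, 6)
print(minkowski_distance(P1, P2, p=1))  # matches the Manhattan distance
print(minkowski_distance(P1, P2, p=2))  # matches the Euclidean distance
```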