What is Clustering in Data Science?

Clustering is the unsupervised grouping of patterns: the algorithm is given no labels and organizes items by similarity alone. It is used in a variety of fields, often as a step in exploratory data analysis, to find groups of similar items within a data set.

Clustering is used in several ways:

Marketing
Marketers discover distinct groups in their customer bases and then use this knowledge to develop targeted marketing programs.

Recognizing objects
When you conduct a Google Photos search, the suggested images are probably picked using cluster analysis, which searches for similarities in colors, objects, and so on.

Detecting fraud
Clustering is used to detect cybercrime such as credit card fraud.

Clustering is a set of methods for grouping observations (e.g., people, corporations, rates, or whatever you are studying) based on their variables so that the observations within each group are similar to one another.

There is a huge variety of methods, most of which fall into two broad categories:

  1. K-means, where you set the number of groups in advance and let the computer assign each observation to a group
  2. Agglomerative, where you start with every observation in its own group and gradually combine the closest groups
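
To make the first category concrete, here is a minimal sketch of K-means in plain Python. It is an illustration under stated assumptions, not a production implementation: the customer data is made up, the function name kmeans is hypothetical, and the initial centers are simply the first k points (real libraries pick starting centers more carefully).

```python
import math

def kmeans(points, k, iterations=100):
    """Basic K-means on a list of equal-length numeric tuples."""
    # Assumption: use the first k points as starting centers.
    centers = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed group, so we are done
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a group is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Made-up customer data: (visits per month, average spend)
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
labels, centers = kmeans(customers, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

You choose k=2, and the loop alternates between assigning points to the nearest center and recomputing the centers until the assignments stop changing.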

There is also the opposite of agglomerative clustering, known as divisive clustering (start with all the observations together and gradually split them), but it is computationally extremely expensive and rarely used.
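
The agglomerative approach can be sketched in a few lines as well. This version assumes single linkage (the distance between two groups is the distance between their closest members), which is only one of several possible merge rules. The function name agglomerative and the data are hypothetical, and the brute-force search is meant for tiny examples only.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest groups until n_clusters remain (single linkage)."""
    # Start with every observation in its own group (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two groups.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of groups with the smallest linkage distance.
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        # Combine group b into group a (b > a, so index a stays valid).
        merged = clusters.pop(b)
        clusters[a].extend(merged)
    return [sorted(c) for c in clusters]

# Same made-up data as a K-means example might use: (visits, average spend)
observations = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
groups = agglomerative(observations, n_clusters=2)
print(groups)  # [[0, 1, 2], [3, 4, 5]]
```

Stopping at a target number of groups is a simplification made here. In practice the full sequence of merges is often kept so that you can decide afterwards how many groups to use.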