How will you define the number of clusters in a clustering algorithm?
- Run a clustering algorithm (e.g., k-means) for different values of k, for instance k = 1 to 10.
- For each k, calculate the total within-cluster sum of squares (WSS).
- Plot WSS against the number of clusters k.
- The location of a bend (the knee, or elbow) in the plot is generally taken as an indicator of the appropriate number of clusters.
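The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy data from `make_blobs`, the helper names, and the deterministic farthest-first initialization are all assumptions made for the example, and the WSS values are printed rather than plotted.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Farthest-first initialization (an assumption of this sketch, chosen for determinism)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest centroid is farthest away.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns the final centroids."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def total_wss(points, centroids):
    """Total within-cluster sum of squares: each point against its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def make_blobs(centers, n_per_center=30, spread=0.5, seed=0):
    """Hypothetical toy data: Gaussian blobs around the given 2-D centers."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, spread), rng.gauss(cy, spread))
        for cx, cy in centers
        for _ in range(n_per_center)
    ]

points = make_blobs([(0, 0), (10, 0), (5, 9)])
wss_by_k = {k: total_wss(points, kmeans(points, k)) for k in range(1, 11)}
for k, wss in wss_by_k.items():
    print(f"k={k:2d}  wss={wss:10.2f}")
```

On this toy data, which is built from three well-separated blobs, WSS should fall steeply up to k = 3 and change little afterwards, which is the bend the method looks for. On real data the bend is often less pronounced and reading it is a judgment call.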
Before clustering a dataset, you may first check whether it is clusterable at all. A group of methods, known as clustering tendency assessment, indicates whether the data contain meaningful cluster structure.
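The text does not name a specific tendency test. One commonly used option is the Hopkins statistic, sketched here as an assumption. It compares nearest-neighbor distances from uniformly random locations (u) with those from sampled data points (w). This variant uses plain distances and the convention H = sum(u) / (sum(u) + sum(w)), so values near 0.5 suggest no structure and values near 1 suggest clustered data. Other formulations raise the distances to a power or invert the ratio. The function names and toy data are hypothetical.

```python
import math
import random

def nearest_distance(point, data, skip_index=None):
    """Distance from `point` to its nearest neighbor in `data`, optionally skipping one index."""
    return min(math.dist(point, q) for i, q in enumerate(data) if i != skip_index)

def hopkins(data, sample_size, seed=0):
    """Hopkins statistic with plain distances (one of several conventions)."""
    rng = random.Random(seed)
    dims = list(zip(*data))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    # w: nearest-neighbor distances for a random sample of real data points.
    sample_idx = rng.sample(range(len(data)), sample_size)
    w = [nearest_distance(data[i], data, skip_index=i) for i in sample_idx]
    # u: nearest-data-point distances for uniform random locations in the bounding box.
    uniform = [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(sample_size)
    ]
    u = [nearest_distance(p, data) for p in uniform]
    return sum(u) / (sum(u) + sum(w))

# Hypothetical toy data: three tight blobs versus uniform noise.
rng = random.Random(1)
clustered = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (10, 0), (5, 9)]
    for _ in range(30)
]
noise = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(90)]

h_clustered = hopkins(clustered, sample_size=20)
h_noise = hopkins(noise, sample_size=20)
print(f"clustered: {h_clustered:.2f}  uniform noise: {h_noise:.2f}")
```

Because the statistic depends on a random sample, it is usually computed several times with different seeds and averaged before drawing a conclusion.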
After this step, another group of methods can be used to determine the optimal number of clusters. This approach is called relative clustering validation. It consists of running a clustering algorithm on the same data while varying its parameters (e.g., the number of clusters). After each run, the quality of the model is measured with a clustering validation metric (e.g., the silhouette index). Once all runs are complete, the best model in terms of quality is chosen.
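As a rough sketch of this procedure, the following runs k-means for several values of k, scores each result with the mean silhouette, and keeps the best k. The k-means helper, its farthest-first initialization, and the toy data are assumptions made for illustration. The silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster.

```python
import math
import random

def kmeans_labels(points, k, max_iter=100):
    """Lloyd's algorithm with farthest-first initialization; returns a cluster label per point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette over all points; requires at least two non-empty clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # The distance from p to itself is 0, so dividing by len(own) - 1
        # averages over the other members only.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three Gaussian blobs.
rng = random.Random(0)
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (10, 0), (5, 9)]
    for _ in range(30)
]

# The silhouette is undefined for a single cluster, so the search starts at k = 2.
silhouette_by_k = {k: mean_silhouette(points, kmeans_labels(points, k)) for k in range(2, 7)}
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k, "best k:", best_k)
```

The same loop works with any other validation index or clustering parameter. Only the scoring function and the set of candidate settings change.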