-
Unsupervised learning is mainly used for two purposes:
- Data transformation
- Clustering
-
PCA (Principal Component Analysis)
-
NMF (Non-negative Matrix Factorization)
-
Clustering
-
k-means algorithm:
- Can be used for vector quantization: each point is represented by the cluster center it is assigned to.
- Unlike PCA, which can use at most as many components as there are input dimensions, k-means can use more clusters than there are input dimensions.
- Dividing scattered points into 10 clusters is equivalent to encoding each point as a 10-dimensional one-hot vector.
- For example: {0,0,0,1,0,0,0,0,0,0}
- Alternatively, each point can be represented by its distances to the 10 cluster centers, one distance per dimension.
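The two encodings above can be sketched with scikit-learn as follows. This is a minimal illustration, not taken from the notes: the toy data from `make_blobs`, the sample count, and the variable names are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration): 300 points in only 2 dimensions.
X, _ = make_blobs(n_samples=300, centers=10, n_features=2, random_state=0)

# 10 clusters, which is more than the 2 input dimensions.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Cluster index of the nearest center for each point, shape (300,).
labels = kmeans.predict(X)

# One-hot representation: row i has a single 1 in column labels[i].
one_hot = np.eye(10)[labels]

# Distance representation: row i holds the distances from point i
# to each of the 10 cluster centers.
distances = kmeans.transform(X)

print(one_hot.shape, distances.shape)
```

Either array can then be used as a 10-dimensional feature representation of the original 2-dimensional points.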
-
Agglomerative Clustering
-
Evaluation:
-
Evaluation with ground truth: metrics like Adjusted Rand Index (ARI) are used.
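As a small sketch of how ARI behaves, the example below uses `adjusted_rand_score` from scikit-learn on hand-made label lists. The label lists are hypothetical. The point is that ARI compares groupings, not label names, so a clustering that matches the ground truth up to renaming scores 1.0.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth labels for six points.
true_labels = [0, 0, 0, 1, 1, 1]

# Same grouping as the ground truth, but with the cluster names swapped.
same_grouping = [1, 1, 1, 0, 0, 0]

# One point placed in the wrong cluster.
one_mistake = [0, 0, 1, 1, 1, 1]

ari_same = adjusted_rand_score(true_labels, same_grouping)
ari_mistake = adjusted_rand_score(true_labels, one_mistake)

print(ari_same, ari_mistake)
```

This is why plain accuracy is not suitable here: cluster indices are arbitrary, so only the partition itself should be compared.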
-
However, if ground truth is available, supervised learning can be applied.
-
Evaluation without ground truth: metrics like Silhouette Coefficient are used.
- However, a good score does not guarantee that the clusters are meaningful, so a human still needs to inspect the result visually.
- Unlike supervised learning, where metrics like R-squared can validate a model automatically, clustering evaluation relies on human assessment, which makes it challenging.
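A minimal sketch of evaluation without ground truth, assuming scikit-learn's `silhouette_score` and toy data from `make_blobs` (the data, the candidate values of `k`, and the variable names are illustrative assumptions). The score is computed from the data and the predicted labels only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (assumed for illustration); the true labels are discarded.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

# Silhouette coefficient for several candidate numbers of clusters.
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```

The score ranges from -1 to 1 and rewards compact, well-separated clusters. It says nothing about whether the clusters correspond to anything useful, so the result still has to be checked by eye.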
-
Getting Started with Machine Learning in Python
(From the book “Mastering Information Science”)
- Assumption: points that belong to the same cluster have the same label.