Note of data science training EP 10: Cluster – collecting and clustering
prev: Note of data science training EP 9: NetworkX – Map of Marauder in real world
One of the classic problems for data scientists is clustering, or grouping. For example, we may need to sort 100 customers into groups by lifestyle, e.g. bookworms, sports fans, and shoppers. How can we do that?
For this problem, scikit-learn provides the module sklearn.cluster.
This time, we generate a dataset with make_blobs from the scikit-learn datasets module.
Try a simple scatter graph, and there are actually 3 groups, aren't there?
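As a minimal sketch of this step, the dataset and the scatter graph could be produced like this. The number of dots, the number of features, and the random seed are assumptions for illustration, not the settings of the original training.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Assumed settings: 300 dots in 2 dimensions scattered around 3 centers.
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Plot the dots without any colour coding, to see the groups by eye.
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.xlabel("feature 0")
plt.ylabel("feature 1")
plt.show()
```

make_blobs also returns the true group of each dot in y, but we ignore it here because a clustering model has to find the groups on its own.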
DBSCAN stands for “Density-Based Spatial Clustering of Applications with Noise”. It works like this.
- Give x as a distance.
- Give y as a minimum number of dots. A dot that has at least y dots (itself included) within radius x is a core point.
- Find other dots within radius x of a core point. If there are any, they join the group of that core point, and the group keeps growing through any core points among them.
- Finish when every dot has been assigned to a group or marked as noise.
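To make the idea of a core point concrete, here is a small NumPy sketch that only performs the counting step: how many dots lie within distance x of each dot, and which dots reach the minimum y. It is not the full DBSCAN algorithm, and the data and the values of eps and min_samples are assumed for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Assumed toy data: 200 dots around 3 centers.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

eps = 1.0        # the distance x (assumed value)
min_samples = 5  # the number y (assumed value)

# Pairwise Euclidean distances between all dots (fine for small data).
diff = X[:, None, :] - X[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Count the dots within eps of each dot, the dot itself included.
neighbor_counts = (distances <= eps).sum(axis=1)

# A dot is a core point when its neighbourhood holds at least min_samples dots.
is_core = neighbor_counts >= min_samples
print(f"{is_core.sum()} core points out of {len(X)} dots")
```

Dots that are neither core points nor within eps of a core point are the ones DBSCAN reports as noise.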
Now we start by creating a DBSCAN object with 2 parameters:
- eps (epsilon) as the distance x
- min_samples as the minimum number of dots, the number y
After that, we call .fit_predict(), which returns the group label of each dot; the same labels are stored in .labels_.
Here we use pd.unique() to check all the groups found by the model.
Try changing min_samples and we can see how the result differs.
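A minimal sketch of these steps is below. The dataset, the column names, and the values of eps and min_samples are assumptions for illustration; the names dataframe and clustering simply follow the scoring snippet later in this note.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed toy data, wrapped in a DataFrame with hypothetical column names.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

# eps is the distance x, min_samples is the number y (assumed values).
clustering = DBSCAN(eps=1.0, min_samples=5)
labels = clustering.fit_predict(dataframe)

# List the distinct group labels; -1 is the label DBSCAN gives to noise.
print(pd.unique(labels))
```

Rerunning this with a different eps or min_samples is a quick way to see how sensitive the grouping is to these two parameters.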
Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
K-means is the popular one as it is easy to use: it only requires the number of groups.
First, we ask for 3 groups and we get 3 groups.
We can use .cluster_centers_ to find the center of each group.
Let’s try to find 5 groups.
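Both attempts could look like the sketch below. The dataset, n_init, and random_state are assumptions added so the example is self-contained and repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

centers = {}
for k in (3, 5):
    # n_clusters is the only parameter K-means really needs.
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X)
    # One row per group, one column per feature.
    centers[k] = kmeans.cluster_centers_
    print(f"{k} groups, centers:")
    print(centers[k])
```

K-means always returns as many groups as we ask for, even when the data does not naturally contain that many, so the choice of the number is up to us.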
Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
The last one is OPTICS, standing for “Ordering Points To Identify the Clustering Structure”. This is similar to DBSCAN but does not require a fixed epsilon. It is suited to large datasets, with a longer run time as the trade-off.
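A minimal usage sketch, assuming the same kind of toy data as before and an assumed min_samples of 10. Note that no eps is passed.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed toy data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Only min_samples is set here (assumed value); no epsilon is required.
optics = OPTICS(min_samples=10)
labels = optics.fit_predict(X)

# As with DBSCAN, -1 is the label for noise.
print(np.unique(labels))
```

The number of groups found depends on the data and on min_samples, so it may not match what we see by eye.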
Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html
Now it’s assessment time. There are 3 main scores for clustering models.
- Silhouette score
Compares the distances within a cluster with the distances to the nearest other cluster. Best at 1 and worst at -1.
- Davies-Bouldin score
Compares the dispersion of each cluster with the distances between clusters. Best at 0; the higher, the worse.
- Calinski-Harabasz score
Finds the ratio of between-cluster dispersion to within-cluster dispersion. The higher, the better.
```python
from sklearn import metrics

# Silhouette score
metrics.silhouette_score(dataframe, clustering.labels_)

# Davies-Bouldin score
metrics.davies_bouldin_score(dataframe, clustering.labels_)

# Calinski-Harabasz score
metrics.calinski_harabasz_score(dataframe, clustering.labels_)
```
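The calls above need an existing dataframe and a fitted clustering object. As a self-contained sketch, we could compare K-means with 3 and 5 groups on all three scores; the dataset and the K-means settings are assumptions for illustration.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: 300 dots around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in (3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        "silhouette": metrics.silhouette_score(X, labels),                # closer to 1 is better
        "davies_bouldin": metrics.davies_bouldin_score(X, labels),        # closer to 0 is better
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels),  # higher is better
    }
    print(k, scores[k])
```

Reading the three scores side by side for each number of groups is one simple way to decide which grouping fits the data better.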
Hope this is useful, as grouping problems are very common in many industries.
Let’s see what’s next.
next: Note of data science training EP 11: NLP & Spacy – Languages are borderless