prev: Note of data science training EP 9: NetworkX – Map of Marauder in real world

One of the classic problems for data scientists is clustering, or grouping. For example, we may need to sort 100 customers by lifestyle into groups such as bookworms, sports fans, and shoppers. How can we do that?

# Clustering

For this problem, scikit-learn provides the module `sklearn.cluster`.

# Preparing

This time, we build a dataset with `make_blobs` from scikit-learn's `sklearn.datasets` module.

Try a simple scatter plot: there are actually 3 groups, aren’t there?
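Here is a minimal sketch of the preparation step. The sample count, the spread, the random seed, and the column names `x` and `y` are assumptions for illustration, not values from the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs

# Assumed settings: 300 points around 3 centers, fixed seed for repeatability.
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Hypothetical column names "x" and "y" for the two generated features.
dataframe = pd.DataFrame(X, columns=["x", "y"])

# Simple scatter plot. In a notebook the figure renders automatically;
# in a script, call plt.show() afterwards.
fig, ax = plt.subplots()
ax.scatter(dataframe["x"], dataframe["y"], s=10)
ax.set_xlabel("x")
ax.set_ylabel("y")
```

`make_blobs` also returns the true group of each point (`y_true` here), which we ignore while clustering.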

# DBSCAN

DBSCAN stands for “**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise”. It works like this:

- Give *x* as a distance.
- Pick *y* dots and find the core point among those dots.
- Find other dots within radius *x* from the core point of the *y* dots. If there are any, create a group, then update the core point of the group.
- Finish when every dot has its own group or is marked as noise.
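The steps above can be sketched in plain NumPy. This is a simplified illustration of the general DBSCAN idea, not scikit-learn's implementation. The function name `simple_dbscan` is hypothetical, `eps` plays the role of the distance *x*, and `min_samples` plays the role of the number *y*.

```python
import numpy as np

def simple_dbscan(points, eps, min_samples):
    """Label each point with a group id; -1 means noise."""
    n = len(points)
    # Pairwise distances; neighbors[i] lists the points within eps of point i
    # (the point itself included).
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    group = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new group outward from this unlabeled core point.
        labels[i] = group
        frontier = [i]
        while frontier:
            current = frontier.pop()
            if not is_core[current]:
                continue  # border points join a group but do not extend it
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = group
                    frontier.append(j)
        group += 1
    return labels

# Two tight groups of three dots each, plus one far-away dot.
points = np.array([[0, 0], [0, 0.1], [0.1, 0],
                   [5, 5], [5, 5.1], [5.1, 5],
                   [10, 0]])
labels = simple_dbscan(points, eps=0.5, min_samples=3)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

The lone dot has no neighbors within `eps`, so it never joins a group and keeps the noise label -1.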

Now we start by creating a DBSCAN object with 2 parameters:

- `eps` (epsilon) as the distance *x*
- `min_samples` as the minimum number of dots, or the number *y*

After that, we use `.fit_predict()` and the result is in `.labels_`.

Here we use `pd.unique()` to check all groups in the model.
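A minimal sketch of these steps follows. The data is regenerated with assumed `make_blobs` settings so the block runs on its own, and the `eps` and `min_samples` values are only examples.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed data: 300 points around 3 centers (not the original notebook's data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
dataframe = pd.DataFrame(X, columns=["x", "y"])

# Example parameter values; tune them for your own data.
clustering = DBSCAN(eps=0.5, min_samples=5)
predicted = clustering.fit_predict(dataframe)

# fit_predict() returns the same labels that are stored in .labels_
groups = pd.unique(clustering.labels_)
print(groups)  # a label of -1, if present, marks noise points
```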

Change `eps` and `min_samples` and we can see how the result differs.
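One way to compare settings is to count the groups and noise points for a few combinations. The parameter pairs below are arbitrary examples on assumed `make_blobs` data, so the numbers you see will depend on your own dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed data: 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

results = {}
for eps, min_samples in [(0.2, 5), (0.5, 5), (1.0, 5), (0.5, 20)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_groups = len(set(labels.tolist()) - {-1})  # -1 is noise, not a group
    n_noise = int(np.sum(labels == -1))
    results[(eps, min_samples)] = (n_groups, n_noise)
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_groups} groups, {n_noise} noise points")
```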

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

# K-means

K-means is a popular choice because it is easy to use. It only requires the number of groups.

First, we ask for 3 groups, and we get 3 groups.

Use `.cluster_centers_` to find the center of each group.
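A minimal sketch with 3 groups, again on assumed `make_blobs` data. `n_init` and `random_state` are set only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed data: 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# One row per group, one column per feature.
print(kmeans.cluster_centers_)
```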

Let’s try to find 5 groups.
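As a sketch, only `n_clusters` needs to change. The data below is an assumption (3 generated blobs), so asking for 5 groups forces K-means to split some of them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed data: 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

kmeans5 = KMeans(n_clusters=5, n_init=10, random_state=42)
labels5 = kmeans5.fit_predict(X)

# How many points landed in each of the 5 groups.
counts = np.bincount(labels5, minlength=5)
print(counts)
```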

Interesting.

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

# OPTICS

The last one is OPTICS, which stands for “**O**rdering **P**oints **T**o **I**dentify the **C**lustering **S**tructure”. It is similar to DBSCAN but does not require epsilon. It suits large datasets, with a longer run time as the trade-off.

Try changing `min_samples`.
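A minimal sketch of OPTICS with a few `min_samples` values. The values and the generated data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Assumed data: 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

summary = {}
for min_samples in (5, 20, 50):
    optics = OPTICS(min_samples=min_samples)  # no eps needed
    labels = optics.fit_predict(X)
    n_groups = len(set(labels.tolist()) - {-1})  # -1 is noise, not a group
    n_noise = int(np.sum(labels == -1))
    summary[min_samples] = (n_groups, n_noise)
    print(f"min_samples={min_samples}: {n_groups} groups, {n_noise} noise points")
```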

Reference link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html

# Metrics measurement

Now it’s assessment time. There are 3 main scores for the clustering models.

**Silhouette score**

Compares each point's distance to its own cluster with its distance to the nearest other cluster. Best at 1 and worst at -1.

**Davies-Bouldin score**

Compares the dispersion within each cluster with the distance between clusters. Best at 0; the higher, the worse.

**Calinski-Harabasz score**

The ratio of between-cluster dispersion to within-cluster dispersion. The higher, the better.

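```
from sklearn import metrics

# `dataframe` holds the features and `clustering` is a fitted model
# (DBSCAN, KMeans, or OPTICS) from the sections above.

# Silhouette score: best at 1, worst at -1
metrics.silhouette_score(dataframe, clustering.labels_)
# Davies-Bouldin score: best at 0, higher is worse
metrics.davies_bouldin_score(dataframe, clustering.labels_)
# Calinski-Harabasz score: higher is better
metrics.calinski_harabasz_score(dataframe, clustering.labels_)
```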
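To see the scores side by side, here is a self-contained sketch that compares K-means with 3 and 5 groups. The data and parameter values are assumptions. Note that all three scores need at least 2 distinct labels, and that for DBSCAN or OPTICS the noise label -1 is treated as just another group.

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed data: 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

scores = {}
for k in (3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = {
        "silhouette": metrics.silhouette_score(X, labels),
        "davies_bouldin": metrics.davies_bouldin_score(X, labels),
        "calinski_harabasz": metrics.calinski_harabasz_score(X, labels),
    }
    print(k, scores[k])
```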

Hope this is useful, as grouping problems are very common in many industries.

Let’s see what’s next.

See ya~

next: Note of data science training EP 11: NLP & Spacy – Languages are borderless