Implementing six clustering quality metrics in Python (Rand index, mutual information, silhouette coefficient)

1 Clustering quality in R: the silhouette coefficient

The SSE is too large. At k = 6 the SSE drops considerably, while the average silhouette coefficient remains very high, only slightly below its value at k = 2. Therefore, k = 6 is a reasonable choice for the number of clusters.
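The selection procedure above can be sketched in Python. This is a minimal sketch, assuming synthetic 6-cluster blob data since the original dataset is not shown: SSE (KMeans inertia) keeps falling as k grows, while the average silhouette points at the natural cluster count.

```python
# Compare SSE and mean silhouette over a range of k.
# make_blobs with 6 centers is an assumption standing in for the original data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=6, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=1, n_init=10).fit(X)
    sse = km.inertia_                      # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette over all samples
    print(f"k={k}  SSE={sse:8.1f}  silhouette={sil:.3f}")
```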

2 Clustering quality in Python

This section covers the adjusted Rand index, mutual-information-based scores, homogeneity/completeness/V-measure (which depend on the number of clusters), the Fowlkes-Mallows score, the silhouette coefficient, and the Calinski-Harabasz index.

1.1 Adjusted Rand index

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...
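Two handy properties of the adjusted Rand index, sketched below: it ignores the actual label values (only the grouping of samples matters), and it is symmetric in its two arguments.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Renaming the clusters leaves a perfect match at 1.0.
perfect = metrics.adjusted_rand_score(labels_true, [1, 1, 1, 0, 0, 0])
print(perfect)  # 1.0

# Swapping the arguments gives the same score.
forward = metrics.adjusted_rand_score(labels_true, labels_pred)
backward = metrics.adjusted_rand_score(labels_pred, labels_true)
print(forward, backward)
```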

1.2 Mutual Information based scores

Two different normalized versions of this measure are available, Normalized Mutual Information (NMI) and Adjusted Mutual Information (AMI). NMI is often used in the literature, while AMI was proposed more recently and is normalized against chance:

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)
0.22504...
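As a sketch of the distinction, both scores can be computed side by side on the same labelings; because AMI is corrected for chance, it comes out lower than NMI on this example.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

nmi = metrics.normalized_mutual_info_score(labels_true, labels_pred)
ami = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
print(f"NMI={nmi:.4f}  AMI={ami:.4f}")  # AMI < NMI
```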

1.3 Homogeneity, completeness and V-measure

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.homogeneity_score(labels_true, labels_pred)
0.66...
>>> metrics.completeness_score(labels_true, labels_pred)
0.42...
>>> metrics.v_measure_score(labels_true, labels_pred)
0.51...
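With the default weighting (beta = 1), the V-measure is the harmonic mean of homogeneity and completeness, which can be checked directly against the three scores above:

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)
print(h, c, v)
print(2 * h * c / (h + c))  # harmonic mean, equals v
```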

1.4 Fowlkes-Mallows scores

The Fowlkes-Mallows score FMI is defined as the geometric mean of
the pairwise precision and recall:

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)
0.47140...
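The geometric-mean definition above can be verified by counting sample pairs by hand: precision and recall over pairs of samples, then their geometric mean.

```python
from itertools import combinations
from math import sqrt

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

tp = fp = fn = 0
for i, j in combinations(range(len(labels_true)), 2):
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_true and same_pred:
        tp += 1  # pair grouped together in both labelings
    elif same_pred:
        fp += 1  # together only in the prediction
    elif same_true:
        fn += 1  # together only in the ground truth

precision = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)     # 2 / 6
fmi = sqrt(precision * recall)
print(fmi)  # agrees with metrics.fowlkes_mallows_score: 0.4714...
```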

1.5 Silhouette Coefficient

>>> import numpy as np
>>> from sklearn import datasets, metrics
>>> from sklearn.cluster import KMeans
>>> X = datasets.load_iris().data  # iris features, as in the scikit-learn docs
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
0.55...

1.6 Calinski-Harabasz Index

In other words, the smaller the covariance within clusters and the larger the covariance between clusters, the higher the Calinski-Harabasz score.

In scikit-learn, the Calinski-Harabasz index is implemented as metrics.calinski_harabaz_score (renamed to metrics.calinski_harabasz_score in later releases).

>>> import numpy as np
>>> from sklearn import datasets, metrics
>>> from sklearn.cluster import KMeans
>>> X = datasets.load_iris().data  # iris features, as in the scikit-learn docs
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabaz_score(X, labels)
560.39...
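The ratio described above can also be computed by hand: the trace of the between-cluster scatter over the trace of the within-cluster scatter, each scaled by its degrees of freedom. A sketch on the same iris clustering (load_iris is an assumption matching the docs example):

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

X = datasets.load_iris().data
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X).labels_

n = len(X)
clusters = np.unique(labels)
k = len(clusters)
overall_mean = X.mean(axis=0)

within = 0.0   # tr(W): squared distances to each cluster's own mean
between = 0.0  # tr(B): weighted squared distances of cluster means to the overall mean
for c in clusters:
    Xc = X[labels == c]
    mean_c = Xc.mean(axis=0)
    within += ((Xc - mean_c) ** 2).sum()
    between += len(Xc) * ((mean_c - overall_mean) ** 2).sum()

ch = (between / (k - 1)) / (within / (n - k))
print(ch)  # matches the library's Calinski-Harabasz score
```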