
Python Machine Learning Hand-Written Algorithm Series: kmeans Clustering

Date: 2020-08-24 13:40:39

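From machine learning to kmeans

Clustering is a form of unsupervised learning. It resembles classification in supervised learning in that both assign samples to groups. The difference is that classification works with labels and clustering does not. Put differently, classification has a y, while clustering has only X. Clustering can therefore group samples only according to the features in X itself.

For example, there is a Chinese idiom, 物以类聚，人以群分: things of a kind come together, and people fall into groups. We can divide people into men and women; here the grouping is based on an attribute of the people themselves, namely gender. Gender is already known and does not need to be computed by a formula.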

The problem

The dataset used here is the digits dataset that ships with sklearn.

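Each image consists of 8 * 8 = 64 pixels, and each pixel takes an integer value from 0 to 16. After reducing the dimensionality with PCA and normalizing, we get the following dataset:

This preprocessing is the same as in sklearn's example code. I will swap out sklearn's kmeans algorithm and use my own kmeans implementation for the clustering.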
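The preprocessing described above could look roughly like the sketch below. The original does not say how many components were kept or which normalization was used, so two principal components (the later code works on 2-D points) and min-max scaling to [0, 1] are both assumptions made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Load the digit images as flat 64-dimensional vectors
digits = load_digits()
X = digits.data

# Assumption: keep 2 principal components so the points can be plotted
reduced = PCA(n_components=2).fit_transform(X)

# Assumption: min-max normalization so every feature lies in [0, 1]
mins = reduced.min(axis=0)
maxs = reduced.max(axis=0)
points = (reduced - mins) / (maxs - mins)

print(points.shape)
```

Scaling to [0, 1] fits well with a random initialization that draws centroids from the same range.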

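How the kmeans algorithm works

In Chinese, kmeans is called k均值 (k-means), but in everyday conversation we just say kmeans. I personally object to translating these algorithm names into Chinese, because it only adds to our burden. The k refers to splitting the dataset into k groups. The means refers to averaging all the samples in the same group (also called a cluster) to obtain their centroid.

The algorithm performs clustering by alternating between the following two steps:

1. Compute the centroid of each group by taking the mean of its samples.
2. Given the centroids, compute the distance from each sample to every centroid and decide which group the sample belongs to.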
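The two steps can also be written as formulas. The notation is introduced here and is not from the original: x_i is the i-th sample, \mu_j the centroid of group j, c_i the group assigned to sample i, and C_j the set of samples currently in group j.

```latex
% Update step: each centroid becomes the mean of the samples assigned to it
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{i \in C_j} x_i,
\qquad C_j = \{\, i : c_i = j \,\}

% Assignment step: each sample goes to its nearest centroid (Euclidean distance)
c_i = \arg\min_{j \in \{1, \dots, k\}} \lVert x_i - \mu_j \rVert_2
```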

Hand-written implementation

Initialize the centroids. First, randomly initialize k centroids.

import numpy as np

def init_centroids(k, n_features):
    # Draw k centroids uniformly at random from [0, 1) in every feature
    return np.random.random(k * n_features).reshape((k, n_features))
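Uniform random centroids can land far from any data, which may leave some groups empty. A common alternative, sketched here, is to pick k distinct samples as the starting centroids. This is a swapped-in technique, not the author's method, and the function name `init_centroids_from_points` and its `seed` parameter are hypothetical.

```python
import numpy as np

def init_centroids_from_points(points, k, seed=None):
    # Hypothetical helper: use k distinct samples as the initial centroids
    rng = np.random.default_rng(seed)
    # Sample k row indices without replacement so no centroid is duplicated
    chosen = rng.choice(len(points), size=k, replace=False)
    # Copy so that later updates do not modify the data array
    return points[chosen].copy()

demo_points = np.arange(20.0).reshape(10, 2)
demo_centroids = init_centroids_from_points(demo_points, 3, seed=0)
print(demo_centroids)
```

Because every starting centroid coincides with a sample, each group has at least one member after the first assignment, provided the chosen samples are distinct points.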

Next, compute the centroid of each group by taking the mean.

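def update_centroids(points, centroid_index):
    # The number of groups is inferred from the largest group index in use
    k = max(centroid_index) + 1
    # One row per group, one column per feature
    new_centroids = np.zeros((k, points.shape[1]))
    for i in range(k):
        # Mean of all points assigned to group i (NaN if the group is empty)
        new_centroids[i] = points[centroid_index == i].mean(axis=0)
    return new_centroids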
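Taking the mean of an empty group produces NaN. One possible way to guard against that, sketched below, is to keep the previous centroid whenever a group has no members. The function name `update_centroids_safe` and the extra `old_centroids` argument are hypothetical and not part of the original code.

```python
import numpy as np

def update_centroids_safe(points, labels, old_centroids):
    # Start from the previous centroids so empty groups keep their position
    new_centroids = old_centroids.astype(float)
    for j in range(len(old_centroids)):
        members = points[labels == j]
        # Only recompute the mean when the group actually has samples
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

demo_points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
demo_labels = np.array([0, 0, 2, 2])  # group 1 is empty
demo_old = np.array([[1.0, 1.0], [9.0, 9.0], [3.0, 3.0]])
demo_new = update_centroids_safe(demo_points, demo_labels, demo_old)
print(demo_new)
```

This variant always returns as many centroids as it was given, regardless of which group indices are in use.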

Then, given the centroids, compute the distance from each sample to every centroid and decide which group the sample belongs to.

def distance(pointA, pointB):
    # Euclidean distance between two 2-D points
    return np.sqrt((pointA[0] - pointB[0]) ** 2 + (pointA[1] - pointB[1]) ** 2)

def belongs2(point, centroids):
    # Return the index of the centroid closest to the given point
    index = 0
    min_distance = np.inf
    for i in range(len(centroids)):
        d = distance(point, centroids[i])
        if d < min_distance:
            min_distance = d
            index = i
    return index

def update_index(points, centroids):
    # Assign every point to its nearest centroid
    n_samples = len(points)
    new_indeces = np.zeros(n_samples, dtype=int)
    for i, point in enumerate(points):
        new_indeces[i] = belongs2(point, centroids)
    return new_indeces
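The nested Python loops above are easy to follow but slow on larger datasets. The same assignment step can be expressed with NumPy broadcasting, as in the following sketch. The name `assign_labels` is hypothetical, and squared distances are used because they select the same nearest centroid as plain distances.

```python
import numpy as np

def assign_labels(points, centroids):
    # Pairwise differences, shape (n_samples, n_centroids, n_features)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Squared Euclidean distance from every point to every centroid
    sq_dists = (diffs ** 2).sum(axis=2)
    # Index of the nearest centroid for each point
    return sq_dists.argmin(axis=1)

demo_points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
demo_centroids = np.array([[1.0, 1.0], [3.0, 3.0]])
demo_labels = assign_labels(demo_points, demo_centroids)
print(demo_labels)
```

Unlike the 2-D `distance` function, this version works for any number of features.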

Finally, assemble these pieces into the kmeans algorithm. The input is the full set of samples, i.e. X, although I call it points here.

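def my_kmeans(points):
    # Start from 10 random centroids in 2 dimensions (one per digit)
    centroids = init_centroids(10, 2)
    indeces = update_index(points, centroids)
    old_indeces = indeces
    for i in range(1000):
        # Alternate the two steps: recompute centroids, then reassign points
        centroids = update_centroids(points, indeces)
        indeces = update_index(points, centroids)
        # Stop as soon as the assignment no longer changes
        if np.array_equal(indeces, old_indeces):
            print('converge', i)
            break
        else:
            old_indeces = indeces
    return centroids, indeces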
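Since the starting centroids are random, different runs can converge to different groupings. A common way to compare runs is the within-group sum of squared distances, which sklearn's `KMeans` exposes as the `inertia_` attribute. The helper below is a minimal sketch with a hypothetical name, `inertia`, and it assumes `labels` is an integer array.

```python
import numpy as np

def inertia(points, centroids, labels):
    # Difference between each point and the centroid of its own group
    diffs = points - centroids[labels]
    # Sum of squared distances over all points (lower means tighter groups)
    return float((diffs ** 2).sum())

demo_points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
demo_centroids = np.array([[0.0, 0.5], [4.0, 4.5]])
demo_labels = np.array([0, 0, 1, 1])
demo_inertia = inertia(demo_points, demo_centroids, demo_labels)
print(demo_inertia)
```

Running the algorithm several times and keeping the run with the lowest value is the idea behind the `n_init` parameter in sklearn.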

Applying the kmeans algorithm

Feed the dimension-reduced data into the kmeans algorithm to obtain the centroids and the indeces, i.e. which centroid each point belongs to.

centroids, indeces = my_kmeans(points)
centroids

The clustering result is plotted below:

Using the kmeans algorithm from sklearn

sklearn's kmeans lives under sklearn.cluster. To use it, simply call the algorithm.

from sklearn.cluster import KMeans

n_digits = 10
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(points)
centroids = kmeans.cluster_centers_

The result is plotted below as well:

Compared with our own algorithm, the results are consistent.
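Group numbering is arbitrary, so group 3 in one run may be group 7 in another even when the partitions are identical. One way to check agreement numerically while ignoring the numbering is the adjusted Rand index, available as `sklearn.metrics.adjusted_rand_score`. The two label arrays below are made-up examples, not the article's actual results.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels from two runs: same partition, different numbering
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])

# A score of 1.0 means the two partitions are identical up to renaming
score = adjusted_rand_score(labels_a, labels_b)
print(score)
```

Applied to the article's setting, the two arguments would be `indeces` from `my_kmeans` and `kmeans.labels_` from sklearn.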