This is the first mini-project I'm working on in Python: an implementation of k-means. I'm planning to parallelize it as soon as I've written a good serial version.
Code description:
Below you will find the working code. First I describe it; you can skip this part if you want:
- generate_clusters: generates k random clusters with n points in d dimensions. Each point is assigned to one cluster, at a distance of at most deviation from its center.
- CLUMPY.__init__: the user has to specify the number of clusters (i.e. k) and can optionally give a file containing the dataset to clusterize (elements are separated by delimiter). Otherwise, the user must specify the number of points n and their dimension d: in this case, generate_clusters is called. iterations is the limit on the number of iterations of the algorithm. The constructor initializes the most important arrays, i.e. points, centroids, class, clusters_size and clusters_sum (read the comments to know what they mean).
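As a quick illustration of the two constructor paths, here is a hypothetical usage sketch (the module name clumpy and the file data.csv are made up, not part of the code below):

# hypothetical usage sketch of the two constructor paths
from clumpy import CLUMPY          # assuming the class below is saved in clumpy.py

cl_from_file = CLUMPY(k=3, file="data.csv", delimiter=",")   # cluster an existing dataset
cl_generated = CLUMPY(k=3, n=10000, d=3)                     # generate 10000 synthetic points in 3 dimensions

cl_from_file.cluster()
cl_generated.cluster()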
Design choices:
I could have used other data structures, but I decided to use only numpy arrays to get the best efficiency.
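As a rough illustration of why I want to stay with numpy arrays: the whole assignment step could in principle be computed with broadcasting instead of Python loops. This is only a sketch of that idea, not what assign_datapoints below currently does:

import numpy as np

def assign_vectorized(points, centroids):
    # sketch of a fully vectorized assignment step (not the implementation used below)
    # points: (n, d) array, centroids: (k, d) array
    diff = points[:, None, :] - centroids[None, :, :]      # shape (n, k, d)
    sq_distances = np.einsum('nkd,nkd->nk', diff, diff)    # squared distances, shape (n, k)
    labels = sq_distances.argmin(axis=1)                   # closest centroid for each point
    min_sq_dist = sq_distances[np.arange(points.shape[0]), labels]
    return labels, min_sq_dist

With the labels from such a sketch, clusters_size and clusters_sum could be rebuilt in one shot with np.bincount and np.add.at instead of being updated point by point.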
Seeds are chosen randomly among the data points. I'm planning to implement the k-means++ initialization, which has been shown to give better results.
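What I have in mind for k-means++ is roughly the following standalone sketch (not wired into the class yet; it uses numpy's Generator API and the function name is just a placeholder):

import numpy as np

def kmeanspp_seeds(points, k, seed=None):
    # sketch of k-means++ seeding: each new seed is a data point sampled with
    # probability proportional to its squared distance from the closest seed so far
    rng = np.random.default_rng(seed)
    n, d = points.shape
    centroids = np.empty((k, d))
    centroids[0] = points[rng.integers(n)]                  # first seed: uniform random point
    sq_dist = ((points - centroids[0]) ** 2).sum(axis=1)
    for i in range(1, k):
        centroids[i] = points[rng.choice(n, p=sq_dist / sq_dist.sum())]
        sq_dist = np.minimum(sq_dist, ((points - centroids[i]) ** 2).sum(axis=1))
    return centroids

The returned array would simply replace the random-index seeding currently done in __init__.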
Stopping conditions are:
- A number of iterations equal to iterations has been performed.
- The centroids do not change between two iterations.
- The energy goes below a given threshold (todo; a sketch of what I mean follows this list).
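To make the todo concrete, this is the kind of check I have in mind for the energy condition (a hypothetical helper; the threshold value and the previous-total bookkeeping are not in the code yet):

def energy_converged(previous_total, energy_per_cluster, threshold=1e-4):
    # energy_per_cluster plays the role of self.__energy; previous_total is the
    # total energy of the previous iteration (None before the first iteration)
    total = energy_per_cluster.sum()
    return previous_total is not None and abs(previous_total - total) < threshold

cluster() would keep the total from the previous iteration and break out of the loop as soon as this returns True.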
Code:
import numpy as np
import random
import math
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.decomposition import PCA as sklearnPCA
def generate_clusters(k, n, d, max_value=30000, deviation=10000):
    # generate n random points in d dimensions with elements in [-deviation, deviation]
    points = np.random.uniform(low=-deviation, high=deviation, size=(n, d))
    # generate k centers in d dimensions with elements in [0, max_value]
    centers = np.random.random((k, d))*max_value
    # generate clusters: for each point, randomly select a center and add it to the point
    print(points, centers)
    for i, point in enumerate(points):
        points[i] += random.choice(centers)
    return points
class CLUMPY:

    def __init__(self, k, file=None, n=None, d=None, delimiter=None, iterations=100):
        self.__file = file              # input file
        self.__k = k                    # number of clusters
        self.__iterations = iterations  # number of iterations
        self.__colors = \
            cm.rainbow(np.linspace(0, 1, self.__k))  # colors[i] = color of the i-th cluster
        if file:
            # if file is specified, read points from file
            print("Reading {}...".format(file))
            self.__points = np.loadtxt(file, delimiter=delimiter)  # data points
        else:
            # otherwise generate n clusterized points in d dimensions
            if not n or not d:
                raise ValueError("missing n={} or d={}".format(n, d))
            self.__n = n
            self.__d = d
            print("Generating {} random points in {} dimensions...".format(n, d))
            self.__points = generate_clusters(k, n, d)
        self.__d = self.__points.shape[1]  # points dimensions
        self.__n = self.__points.shape[0]  # number of data points
        self.__centroids = \
            np.empty((k, self.__d))        # centroids[i] = i-th centroid vector
        # class[i] = j : the i-th data point is assigned to the j-th cluster
        self.__class = np.full(self.__n, -1, dtype=np.int16)
        # energy[i] : energy of the i-th cluster
        self.__energy = np.zeros(k)
        # distances[i] : squared distance between the i-th point and its assigned centroid
        self.__distances = np.zeros(self.__n)
        self.__clusters_size = np.zeros(self.__k, dtype=np.int32)  # number of points assigned to each cluster
        self.__clusters_sum = np.zeros((k, self.__d))               # sum of all vectors assigned to each cluster
        # sanity checks
        if self.__n < k:
            raise ValueError("Number of clusters k={} is greater than the number of data points n={}".format(k, self.__n))
        if self.__d < 2:
            raise ValueError("data points must have at least two dimensions")
        print("{} points in {} dimensions.".format(self.__n, self.__d))
        print("Generating seeds...")
        # generate k random indexes
        random_indexes = list(range(self.__n))
        random.shuffle(random_indexes)
        # we decide centroids by randomly picking up data points
        for i in range(k):
            self.__centroids[i] = self.__points[random_indexes[i]]
        self.plot()
    def assign_datapoints(self):
        # for each datapoint
        for i, point in enumerate(self.__points):
            min_distance_index = float('nan')
            min_distance = math.inf
            # for each centroid
            for j, centroid in enumerate(self.__centroids):
                # compute the euclidean distance between the i-th point and the j-th centroid
                d = np.linalg.norm(point - centroid)
                if d < min_distance:
                    min_distance_index = j
                    min_distance = d
            # update cluster assignment and distances
            if not math.isnan(min_distance_index):
                if self.__class[i] != -1:
                    self.__clusters_size[self.__class[i]] -= 1
                    self.__clusters_sum[self.__class[i]] -= point
                    self.__energy[self.__class[i]] -= self.__distances[i]
                self.__class[i] = min_distance_index
                self.__clusters_size[min_distance_index] += 1
                self.__clusters_sum[min_distance_index] += point
                self.__distances[i] = min_distance ** 2
                self.__energy[min_distance_index] += self.__distances[i]
    def plot(self):
        print("Plotting...")
        points = np.concatenate((self.__centroids, self.__points), axis=0)
        if self.__d > 2:
            point_norm = (points - points.min())/(points.max() - points.min())
            pca = sklearnPCA(n_components=2)  # 2-dimensional PCA
            points = np.array(pca.fit_transform(point_norm))
        for i, (X, Y) in enumerate(points):
            if i < self.__k:
                plt.scatter(X, Y, c=self.__colors[i], s=100, marker="^")
            else:
                plt.scatter(X, Y, c=self.__colors[self.__class[i-self.__k]])
        plt.show()
    def cluster(self):
        for iteration in range(self.__iterations):
            print("iteration", iteration)
            # update each assignment: if no point changed its cluster, then we have reached the optimum
            self.assign_datapoints()
            # update centroids
            centroids_unchanged = True
            for i in range(self.__k):
                new_centroid = self.__clusters_sum[i] / self.__clusters_size[i]
                centroids_unchanged = centroids_unchanged and np.array_equal(new_centroid, self.__centroids[i])
                self.__centroids[i] = new_centroid
            if centroids_unchanged:
                print("Centroids unchanged, terminating...")
                break
            # self.plot()
        else:
            print("All iterations are finished")
        self.plot()
if __name__ == "__main__":
    clumpy = CLUMPY(k=5, d=64, n=30000, delimiter=",")
    clumpy.cluster()
Example result:
Produced by running the __main__ block above.
