Storing a K-means model in R

K-means clustering is quick and dirty and generally provides some interesting results. However, the built-in kmeans() function in R is limited: it returns the centroids, but it has no predict() method for assigning unseen data to those centroids. That’s where flexclust comes in.

Flexclust is a package designed around K-centroid cluster analysis. Its most important function is kcca(), short for K-centroids cluster analysis.

First, let’s load the packages.

library(flexclust)
library(dummies)

Let’s say you have a data frame (dt) that contains numeric data and factors. K-means only works on numeric input, so you’ll want to convert all factors to binary dummy variables.

dt <- dummy.data.frame(dt, dummy.classes='factor')
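
If you can’t or don’t want to use the dummies package, base R’s model.matrix() can do a similar conversion. This is a swapped-in alternative, not what the rest of this post uses, and the toy data frame dt_demo is made up for illustration. One caveat I’m fairly sure of: with the intercept removed, only the first factor gets a column for every level; any further factors get one column fewer.

```r
# Toy data frame (hypothetical): one numeric column and one factor
dt_demo <- data.frame(
  age   = c(23, 35, 41, 29),
  color = factor(c("red", "blue", "red", "green"))
)

# "~ . - 1" uses all columns and drops the intercept,
# so the factor is expanded into one 0/1 column per level
mx_demo <- model.matrix(~ . - 1, data = dt_demo)

print(colnames(mx_demo))
print(mx_demo)
```

The result is already a numeric matrix, so it could go straight into scale().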

Next, we convert the data frame to a matrix. There are multiple ways to do this. To make sure that all variables are treated as equally important, I also center and scale the data (and so should you).

mx <- data.matrix(dt)
mx_scaled <- scale(mx)
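
To see what scale() actually does, here is a small sketch on made-up data (mx_demo is hypothetical). It checks that every column ends up with mean 0 and standard deviation 1, and it looks for constant columns, which would turn into NaN because their standard deviation is 0.

```r
set.seed(1)

# Toy matrix (hypothetical) with two columns on very different scales
mx_demo <- cbind(
  a = rnorm(50, mean = 10, sd = 2),
  b = runif(50, min = 0, max = 100)
)
mx_demo_scaled <- scale(mx_demo)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(mx_demo_scaled), 10))
print(apply(mx_demo_scaled, 2, sd))

# A column with zero variance would become NaN, so check before scaling
constant_cols <- apply(mx_demo, 2, sd) == 0
print(any(constant_cols))
```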

Then I train the model with five clusters and store it in a kModel variable.

kModel <- kcca(mx_scaled, 5, family = kccaFamily('kmeans'))
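
Once the model is trained, you’ll probably want to look inside it. The sketch below uses random toy data (mx_demo and kModel_demo are made-up names) and the parameters() and clusters() accessors that get attached together with flexclust. Since K-means starts from random centroids, I set a seed so the result can be reproduced.

```r
library(flexclust)

set.seed(42)  # K-means starts from random centroids, so fix the seed

# Toy data (hypothetical): 100 rows, 2 columns, centered and scaled
mx_demo <- scale(matrix(rnorm(200), ncol = 2))

kModel_demo <- kcca(mx_demo, k = 3, family = kccaFamily("kmeans"))

centroids <- parameters(kModel_demo)  # one row per cluster
labels    <- clusters(kModel_demo)    # cluster index for every training row

print(centroids)
print(table(labels))
```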

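Now suppose new, unseen data arrives in a data frame (dt2) that has gone through the same dummy conversion and has the same columns as dt. We need to scale it with the same parameters as the old data. The scale() function returns a matrix, and that matrix carries two attributes: scaled:center (the column means) and scaled:scale (the column standard deviations). You can use these as parameters to scale your new data.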

mx2 <- data.matrix(dt2)
mx2_scaled <- scale(mx2, center = attr(mx_scaled, "scaled:center"), scale = attr(mx_scaled, "scaled:scale"))
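
If you want to convince yourself that this reuses the training parameters, here is a small base R sketch on made-up data (train and new are hypothetical). It also checks that the new data has the same columns in the same order, which is easy to get wrong when factor levels differ between data sets.

```r
set.seed(7)

# Toy training and new data (hypothetical)
train <- cbind(x = rnorm(100, 50, 10), y = rnorm(100, 5, 1))
new   <- cbind(x = rnorm(10, 50, 10),  y = rnorm(10, 5, 1))

train_scaled <- scale(train)
center <- attr(train_scaled, "scaled:center")  # training column means
spread <- attr(train_scaled, "scaled:scale")   # training column sds

# The new data must have the same columns in the same order
stopifnot(identical(colnames(new), names(center)))

new_scaled <- scale(new, center = center, scale = spread)

# Same thing by hand: subtract training means, divide by training sds
by_hand <- sweep(sweep(new, 2, center, "-"), 2, spread, "/")
print(all.equal(by_hand, new_scaled, check.attributes = FALSE))
```

Note that the new data will usually not have mean 0 and standard deviation 1 after this step, and that is fine: it is expressed in the units of the training data.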

Finally, you can use the predict() function to assign each row of your new data to the nearest centroid from your first data set.

predict(kModel, newdata = mx2_scaled)
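
If you want the model to survive beyond the current R session, one option is to write it to disk with base R’s saveRDS(). This is a sketch under my own assumptions, not something flexclust requires: the bundle list, the toy data and the temporary file path are all made up. I store the scaling parameters together with the model, because you need both to predict later.

```r
library(flexclust)

set.seed(42)

# Toy training data (hypothetical), centered and scaled
train <- scale(matrix(rnorm(200), ncol = 2))
kModel_demo <- kcca(train, k = 3, family = kccaFamily("kmeans"))

# Keep the scaling parameters next to the model
bundle <- list(
  model  = kModel_demo,
  center = attr(train, "scaled:center"),
  scale  = attr(train, "scaled:scale")
)
path <- tempfile(fileext = ".rds")  # hypothetical location
saveRDS(bundle, path)

# Later, possibly in another session (flexclust must be loaded there too)
restored   <- readRDS(path)
new_raw    <- matrix(rnorm(20), ncol = 2)
new_scaled <- scale(new_raw, center = restored$center, scale = restored$scale)
pred       <- predict(restored$model, newdata = new_scaled)

print(pred)
```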

By the way, if you’re having trouble understanding some of the code and concepts, I can highly recommend “An Introduction to Statistical Learning: with Applications in R”, which is the must-have data science bible. If you simply need an introduction to R, and less to the data science part, I can absolutely recommend the R book by Richard Cotton. Hope it helps!

Great success!
