Best Practice

Prime hints for running a data project in R

I’ve been asked more and more for hints and best practices when working with R. It can be a daunting task, depending on how deep or specialised you want to be. So I tried to keep it as balanced as I could and mentioned points that have definitely helped me over the last couple of years. Finally, there’s lots (and I mean, LOTS) of good advice out there that you should definitely check out - see some examples in the Quick Reference section below.

Cluster Validation In Unsupervised Machine Learning

In the previous post I showed several methods that can be used to determine the optimal number of clusters in your data - this number often needs to be specified before the actual clustering algorithm can run. Once it has run, however, there’s no guarantee that the resulting clusters are stable and reliable. In this post I’ll show a couple of tests for cluster validation that can be easily run in R.
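One common validation approach is bootstrap-based stability assessment. The sketch below uses `clusterboot()` from the `fpc` package on the built-in `iris` data; the dataset, seed, and `k = 3` are illustrative assumptions, not necessarily the setup used in the post.

```r
# Minimal sketch: bootstrap stability of k-means clusters via fpc::clusterboot.
# Assumes the 'fpc' package is installed; iris and k = 3 are example choices.
library(fpc)

set.seed(42)
dat <- scale(iris[, 1:4])

# Resample the data B times, re-cluster each resample, and measure how
# reliably each original cluster is recovered (Jaccard similarity).
cb <- clusterboot(dat, B = 50, clustermethod = kmeansCBI, krange = 3)

# Mean Jaccard stability per cluster; values above roughly 0.75 are
# conventionally read as indicating a stable cluster.
cb$bootmean
```

A cluster with a low mean Jaccard value tends to dissolve under resampling, which is a warning sign even when the clustering algorithm itself converged happily.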

Determining the optimal number of clusters in your dataset

Recently, I worked a bit with cluster analysis: the common method in unsupervised learning that uses datasets without labeled responses to draw inferences. I wanted to put my notes together and write it all down before I forget, thus the blog post. To start, I’ll tackle multiple approaches to determining the number of clusters in your data.

QUICK INTRO

Clustering algorithms aim to uncover the structure of your data and assign a cluster/segment to each data point based on the input data.
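One of the simplest approaches to picking the number of clusters is the elbow method: run the algorithm for a range of k and look for the point where the within-cluster sum of squares stops dropping sharply. A minimal sketch with base R’s `kmeans()`, using `iris` purely as example data:

```r
# Minimal sketch of the elbow method; iris and the k = 1..10 range are
# illustrative assumptions, not necessarily the exact setup from the post.
set.seed(42)
dat <- scale(iris[, 1:4])

# Total within-cluster sum of squares for each candidate k
wss <- sapply(1:10, function(k) {
  kmeans(dat, centers = k, nstart = 25)$tot.withinss
})

# The 'elbow' is where adding another cluster stops paying off much
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Setting `nstart` to a value like 25 reruns k-means from multiple random starts per k, which makes the WSS curve less noisy than a single run would be.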