Statistical Learning Report



Data overview



Grey scale








Logistic regression



k-nearest neighbors (KNN):
K-nearest neighbors is a simple algorithm that stores all available cases and classifies a new case by its similarity to the stored ones, measured with a distance function: the new case is assigned the majority class among its k closest neighbors. KNN can be used for both classification and regression problems, although in industry it is more widely used for classification.
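The idea of storing the cases and voting among the nearest ones can be sketched in a few lines of base R. This is only an illustrative sketch, not the implementation used later in the report: the helper knn_predict(), its argument names, and the toy data are all made up for this example, and Euclidean distance is assumed as the similarity measure.

```r
# Minimal k-nearest-neighbour classifier (illustrative sketch only).
# knn_predict() and its arguments are hypothetical names introduced here.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every stored training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; on a tie, the label that table() lists first wins
  names(which.max(table(nearest)))
}

# Toy data: two well-separated groups in two dimensions (made-up values)
train_x <- matrix(c(1.0, 1.1,
                    1.2, 0.9,
                    0.8, 1.0,
                    5.0, 5.2,
                    5.1, 4.9,
                    4.8, 5.0), ncol = 2, byrow = TRUE)
train_y <- c("a", "a", "a", "b", "b", "b")

pred_low  <- knn_predict(train_x, train_y, c(1.0, 1.0), k = 3)
pred_high <- knn_predict(train_x, train_y, c(5.0, 5.0), k = 3)
```

Note that nothing is fitted in advance: all the work happens when a new point is classified, which is why prediction cost grows with the number of stored cases.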

kNN Algorithm – Pros and Cons
Pros: The algorithm is non-parametric: it makes no prior assumptions about the distribution of the underlying data. It is also simple, effective, and easy to implement, which has made it popular.

Cons: kNN has drawn a lot of flak for being extremely simple. On closer inspection it does not really build a model, since no abstraction process is involved. Training is very fast because the data is stored verbatim (hence the term lazy learner), but prediction is slow, because every new case must be compared against the stored data, and the method offers little insight into how the features relate to the classes. Time must therefore be invested in data preparation (especially handling missing data and categorical features) to obtain robust results.

Let’s walk through building a kNN model in R. Each line of code is explained as we go.

Step 1- Preparing and exploring the data:

Normalizing numeric data:

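This step is of paramount importance because the variables may be measured on different scales, and a variable with a larger range would otherwise dominate the distance calculation. The best practice is to normalize the data so that all values lie on a common scale. The function below applies min-max normalization, which maps each variable to the range [0, 1].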

normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x))) }

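Once the function is defined, we apply it to the 100 numeric features in the data set. The name of the source table is missing from the original listing, so dt is used below as a placeholder.

knn_n <- as.data.table(lapply(dt[, 1:100], normalize))  # dt is a placeholder name; as.data.table() requires library(data.table)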
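As a quick check of what min-max normalization does, the sketch below applies the same formula to a small made-up vector. The values are invented for illustration; the second call shows a caveat worth knowing about, on the assumption that some columns could be constant.

```r
# Same min-max formula as in the report, repeated so the example is self-contained
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Made-up values on a 0-255 scale
x <- c(0, 51, 102, 255)
scaled <- normalize(x)   # smallest value maps to 0, largest to 1

# Caveat: a constant column has max(x) == min(x), so the result is NaN (0 / 0)
constant <- normalize(c(7, 7, 7))
```

If constant columns can occur in the data, they would need to be dropped or handled separately before the distances are computed.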

Step 2- Data collection


dim(knn_n)
[1] 4218 100


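cols <- names(knn_n)  # assumption: cols is not defined in the original; taken here to be all feature columns

knn_n[, (cols) := lapply(.SD, as.double), .SDcols = cols]  # := updates the columns in place

dt_knn_n <- cbind(labels, knn_n)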


dim(dt_knn_n)
[1] 4218 101

Creating training and test data set:

The kNN algorithm is applied to a training data set and its predictions are verified on a held-out test data set. We therefore split the data at random in a 90:10 ratio into training and test sets, respectively.

sub <- sample(nrow(dt_knn_n), floor(nrow(dt_knn_n) * 0.9))

train_knn <- dt_knn_n[sub, ]

test_knn <- dt_knn_n[-sub, ]


dim(train_knn)
[1] 3796 101


dim(test_knn)
[1] 422 101
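Because sample() draws at random, the split above changes on every run. A minimal sketch of a repeatable 90:10 split is shown below, assuming a fixed seed is acceptable; the seed value and the toy table are made up for illustration and are not part of the original analysis.

```r
set.seed(42)  # arbitrary seed, assumed here only to make the split repeatable

# Toy stand-in for the real table: 50 rows, a label column and two features (made-up data)
toy <- data.frame(labels = rep(c("a", "b"), each = 25),
                  f1 = runif(50),
                  f2 = runif(50))

n_train <- floor(nrow(toy) * 0.9)   # number of training rows
idx <- sample(nrow(toy), n_train)   # row indices drawn without replacement

train_toy <- toy[idx, ]    # 90% of the rows
test_toy  <- toy[-idx, ]   # the remaining 10%
```

Negative indexing with -idx guarantees that no row ends up in both sets.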

knn_train_labels <- train_knn[, labels]

knn_test_labels <- test_knn[, labels]


length(knn_train_labels)
[1] 3796

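Step 3- Training a model on data and building the prediction model:

The knn() function from the ‘class’ package is used for classification, so the package must be installed and loaded first. Because kNN is a lazy learner, there is no separate training step: knn() takes the training data, the test data, and the training labels together, finds the k nearest training neighbors of each test case using Euclidean distance, and returns the majority class among them. Here k is a user-specified number.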
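For reference, a single call to knn() might look like the sketch below. It uses R's built-in iris data rather than the report's data set, and the seed, the 135-row split, and k = 5 are arbitrary choices made for this illustration.

```r
library(class)  # recommended package that ships with standard R installations

set.seed(1)                       # arbitrary seed so the random split is repeatable
idx <- sample(nrow(iris), 135)    # 90% of the 150 iris rows

train_x <- iris[idx, 1:4]         # four numeric features
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]      # factor of true classes
test_y  <- iris$Species[-idx]

# Classify every test row by majority vote among its 5 nearest training rows
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
acc  <- mean(pred == test_y)      # share of test rows classified correctly
```

The return value is a factor of predicted classes, one per test row, so it can be compared directly with the true labels.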

Let’s see how the accuracy changes with the value of k. We can use a for loop to run the algorithm for k = 1 to 20 and record the accuracy on the test set for each value.



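library(class)

features <- setdiff(names(dt_knn_n), "labels")  # assumption: features is not defined in the original; taken to be every column except the labels

accuracy <- rep(0, 20)

k <- 1:20

for (x in k) {
  # classify each test row by majority vote among its x nearest training rows
  prediction <- knn(train = train_knn[, features, with = FALSE], test = test_knn[, features, with = FALSE], cl = knn_train_labels, k = x)
  # share of test rows whose predicted label matches the true label
  accuracy[x] <- mean(prediction == knn_test_labels)
}

plot(k, accuracy, pch = 16, col = 'royalblue2', cex = 1.5, main = 'Accuracy vs. k', type = 'b'); box(lwd = 2)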
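Once the accuracies are recorded, the best-performing k can also be read off programmatically instead of from the plot. The sketch below illustrates this on R's built-in iris data, not on the report's data set; the seed and the split size are arbitrary assumptions.

```r
library(class)

set.seed(1)                       # arbitrary seed: the split and knn() tie-breaking are random
idx <- sample(nrow(iris), 135)    # 90:10 split of the 150 iris rows

train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
test_y  <- iris$Species[-idx]

k <- 1:20
# Test-set accuracy for each candidate k
accuracy <- vapply(k, function(x) {
  pred <- knn(train_x, test_x, cl = train_y, k = x)
  mean(pred == test_y)
}, numeric(1))

best_k <- k[which.max(accuracy)]  # which.max() returns the first maximum, i.e. the smallest k among ties
```

With a test set this small, several values of k can tie, so the selected value should be treated as indicative rather than definitive.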
