Statistical Learning Report

Introduction

Methods

Data overview

Preprocessing

Cropping/resizing

Grey scale

Vectorisation

Normalisation

PCA

Modelling

KNN

SVM

RF

Logistic regression

Testing

Cross-validation

k-nearest neighbors (KNN): KNN is a simple algorithm that stores all available cases and classifies a new case by its similarity to them, measured with a distance function. It can be used for both classification and regression, although in industry it is used mostly for classification.
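The idea can be sketched in a few lines of base R. This is only an illustration on made-up points, not the implementation used in this report; the helper name knn_predict_one and the toy data are assumptions introduced here.

```r
# Minimal kNN classifier in base R (illustrative sketch only).
# knn_predict_one is a hypothetical helper, not part of any package.
knn_predict_one <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every stored training row
  diffs <- sweep(as.matrix(train_x), 2, query)
  d <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows, followed by a majority vote
  nearest <- train_y[order(d)[seq_len(k)]]
  names(which.max(table(nearest)))
}

# Toy data: three points near the origin (class A), three near (5, 5) (class B)
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("A", "A", "A", "B", "B", "B")

pred_near_origin <- knn_predict_one(train_x, train_y, c(0.5, 0.5), k = 3)
pred_far <- knn_predict_one(train_x, train_y, c(5.5, 5.5), k = 3)
```

In this sketch a tied vote goes to the label that sorts first, whereas knn() from the 'class' package breaks ties at random.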

kNN Algorithm – Pros and Cons

Pros: The algorithm is non-parametric and makes no prior assumption about the underlying data. It is also simple, effective, and easy to implement, which explains its popularity.

Cons: kNN has drawn a lot of flak for being so simple. It does not build a model, since no abstraction process is involved. Training is fast because the data is stored verbatim (hence the name "lazy learner"), but prediction is slow, since every new case must be compared against all stored cases, and the result offers little insight into how the features relate to the classes. Obtaining a robust classifier therefore requires investing time in data preparation, especially the treatment of missing data and categorical features.

Let's go through the process of building a kNN classifier in R. Each line of code is explained below.

Step 1- Preparing and exploring the data:

load("imgveclab.Rdata")

Normalizing numeric data:

This step matters because kNN is distance-based: a variable measured on a larger scale would otherwise dominate the distance calculation. The best practice is to normalize the data so that all variables share a common scale.

normalize <- function(x) {
  # Min-max scaling: maps the values of x onto [0, 1]
  return((x - min(x)) / (max(x) - min(x)))
}

With the function defined, we apply it to each of the 100 numeric feature columns:

knn_n <- as.data.frame(lapply(dt[,1:100], normalize))
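As a small illustration of what min-max scaling does, the sketch below applies the same formula to a made-up two-column data frame. The toy values and the normalize_safe variant are assumptions added here, not part of the report's pipeline; the variant only shows one possible way to handle a constant column, where max(x) - min(x) is zero and the plain formula returns NaN.

```r
# Same min-max formula as in the report
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy data with two very different scales (made-up values)
toy <- data.frame(pixel = c(0, 128, 255), ratio = c(0.1, 0.5, 0.9))
toy_n <- as.data.frame(lapply(toy, normalize))

# After scaling, every column spans exactly 0 to 1
col_min <- sapply(toy_n, min)
col_max <- sapply(toy_n, max)

# A constant column divides 0 by 0 and yields NaN
constant_result <- normalize(c(3, 3, 3))

# Hypothetical guard: map a constant column to all zeros instead
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}
safe_result <- normalize_safe(c(3, 3, 3))
```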

Step 2- Data collection

library(data.table)

knn_n <- as.data.table(knn_n)

dim(knn_n)

[1] 4218 100

cols=names(knn_n)

knn_n[, (cols) := lapply(.SD, as.double), .SDcols = cols]

dtknnn <- (cbind(labels,knn_n))

dim(dtknnn)

[1] 4218 101

Creating training and test data set:

The kNN algorithm is applied to the training data set and the results are verified on the test data set. To do this, we randomly split the data set 90:10 into training and test sets.

sub <- sample(nrow(dtknnn), floor(nrow(dtknnn) * 0.9))

trainknn <- dtknnn[sub, ]

testknn <- dtknnn[-sub, ]

dim(trainknn)

[1] 3796 101

dim(testknn)

[1] 422 101

knntrainlabels <- trainknn[, labels]

knntestlabels <- testknn[, labels]

length(knntrainlabels)

[1] 3796
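The split sizes above follow directly from the 90:10 ratio. The sketch below reproduces the arithmetic on row indices only; the call to set.seed() and the seed value are assumptions added here so that the same rows are drawn on every run, which the report's own code does not do.

```r
# Reproducible 90:10 split of row indices (sketch; the seed is arbitrary)
set.seed(42)

n <- 4218                                  # number of rows reported above
sub <- sample(n, floor(n * 0.9))           # row indices for the training set
test_idx <- setdiff(seq_len(n), sub)       # the remaining rows form the test set

n_train <- length(sub)
n_test <- length(test_idx)
```

Because sample() draws without replacement by default, the training and test indices never overlap.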

Step 3– Training a model on data and building the prediction model:

The knn() function, provided by the 'class' package, identifies the k nearest neighbors of each test observation using Euclidean distance and assigns the majority class among them; k is a user-specified number. The package must be installed and loaded first.

Since the choice of k affects accuracy, we use a for loop to measure how the algorithm performs on the test set for k = 1 to 20.

library(class)

features=cols[grepl("V",cols)]

accuracy <- rep(0, 20)

k <- 1:20

for (x in k) {
  prediction <- knn(train = trainknn[, features, with = FALSE], test = testknn[, features, with = FALSE], cl = knntrainlabels, k = x)
  accuracy[x] <- mean(prediction == knntestlabels)
}

plot(k, accuracy, pch = 16, col = 'royalblue2', cex = 1.5, main = 'Accuracy vs. k', type = 'b'); box(lwd = 2)
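Since the image data is not bundled with this text, the same sweep over k can be tried on R's built-in iris data. This is a sketch under assumptions: the iris data, the seed, and the names k_values, acc and best_k are introduced here for illustration and say nothing about the results on the image data.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed so the illustration is repeatable
idx <- sample(nrow(iris), floor(nrow(iris) * 0.9))
train_x <- iris[idx, 1:4]
test_x <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
test_y <- iris$Species[-idx]

# Test accuracy for each candidate k
k_values <- 1:20
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})

# First k that reaches the highest accuracy
best_k <- k_values[which.max(acc)]
```

Picking k by its accuracy on the test set tends to give an optimistic estimate; choosing k by cross-validation on the training set alone would be a more cautious design.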


Step 4– Evaluate the model performance:

We have built the classifier, but we also need to check how well the predicted values in knntestpred match the known values in knntestlabels. To do this, we use the CrossTable() function from the 'gmodels' package, here with k = 5.

library(gmodels)

knntestpred <- knn(train = trainknn[,features,with=F], test = testknn[,features,with=F], cl = knntrainlabels, k=5)

CrossTable(x= knntestlabels, y= knntestpred, prop.chisq=FALSE)
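Each cell of the CrossTable() output stacks a count, its share of the row total, its share of the column total, and its share of the whole table. As a check on how to read it, the sketch below recomputes the row and table proportions for the first row of the output (true class A) from its counts; the variable names are illustrative only.

```r
# Counts for true class A, copied from the first row of the CrossTable output
row_A <- c(A = 99, B = 6, C = 8, Point = 17, V = 4)
n_total <- 422                                   # total observations in the table

row_total <- sum(row_A)                          # 134 images of class A
row_prop <- round(row_A / row_total, 3)          # "N / Row Total" line
table_prop <- round(row_A / n_total, 3)          # "N / Table Total" line
row_share <- round(row_total / n_total, 3)       # share of class A in the test set
```

The first entry of row_prop, 0.739, is the fraction of class A test images that were predicted as A.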

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  422

              | knntestpred
knntestlabels |         A |         B |         C |     Point |         V | Row Total |
--------------|-----------|-----------|-----------|-----------|-----------|-----------|
            A |        99 |         6 |         8 |        17 |         4 |       134 |
              |     0.739 |     0.045 |     0.060 |     0.127 |     0.030 |     0.318 |
              |     0.707 |     0.158 |     0.148 |     0.110 |     0.114 |           |
              |     0.235 |     0.014 |     0.019 |     0.040 |     0.009 |           |
---------------- ----------- ----------- ----------- ----------- ----------- -----------