Preprocessing:
- The categorical variables cabin_flown and type_traveller were converted using the get_dummies() method of the pandas library, which one-hot encodes each category.
- The NaN column created by one-hot encoding cabin_flown was renamed to avoid a name conflict with type_traveller.
- Missing values were set to -99999 so that the algorithm can identify them as missing values.
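The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: the toy rows, the overall_rating column, the helper name encode and the "_missing" suffix are all assumptions made for the example; only cabin_flown, type_traveller, get_dummies() and the -99999 sentinel come from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the two categorical column names come from the notes.
df = pd.DataFrame({
    "cabin_flown": ["Economy", "Business Class", np.nan, "First Class"],
    "type_traveller": ["Solo Leisure", np.nan, "Couple Leisure", "Family Leisure"],
    "overall_rating": [7.0, np.nan, 3.0, 9.0],  # assumed numeric feature
})


def encode(frame, column):
    """One-hot encode `column` and give its missing-value indicator a unique name."""
    dummies = pd.get_dummies(frame[column], dummy_na=True)
    # With dummy_na=True the indicator column is labelled NaN. Renaming it
    # keeps the two encoded variables from both producing a column called NaN.
    dummies.columns = [
        f"{column}_missing" if pd.isna(name) else name for name in dummies.columns
    ]
    return dummies.astype(int)


encoded = pd.concat(
    [
        df.drop(columns=["cabin_flown", "type_traveller"]),
        encode(df, "cabin_flown"),
        encode(df, "type_traveller"),
    ],
    axis=1,
)

# Remaining numeric gaps get the sentinel value described in the notes.
encoded = encoded.fillna(-99999)
print(encoded)
```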
Strategy for selection:
Cross-validation: To select the best model, the train_test_split method from sklearn.cross_validation was used to split the training data into training and test sets, with 70% of the data used for training and 30% used for testing.
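A 70/30 hold-out split of this kind might look like the sketch below. The random data and the random_state value are placeholders, not the project data. Note that the sklearn.cross_validation module was removed in scikit-learn 0.20; the same function is available from sklearn.model_selection, which is what the example imports.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the encoded feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# test_size=0.3 holds out 30% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(len(X_train), len(X_test))
```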
Metrics:
Metrics such as accuracy score, Cohen's kappa and roc_auc_score were calculated for all the models, namely naive Bayes, decision tree classification and KNeighborsClassifier, over multiple cross-validation splits. KNeighborsClassifier was found to have the maximum accuracy and was therefore selected.
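As an illustration of how the three metrics could be computed for each candidate model, here is a hedged sketch on synthetic data. It assumes a binary target (so that roc_auc_score can use the positive-class probability) and uses GaussianNB as the naive Bayes variant; the notes do not state which variant was used, so both are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "naive Bayes": GaussianNB(),  # assumed variant
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    # ROC AUC is computed from a score for the positive class, not from hard labels.
    positive_score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, predicted),
        "kappa": cohen_kappa_score(y_test, predicted),
        "roc_auc": roc_auc_score(y_test, positive_score),
    }

for name, scores in results.items():
    print(name, scores)
```

Repeating the loop with different random_state values for the split gives a rough picture of how stable the ranking of the models is.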
To select the ideal value of k, cross-validation was used again: accuracy was plotted against the value of k, and k = 14 gave the maximum accuracy across multiple cross-validation runs.
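The search over k can be sketched as below. The synthetic data, the range of 1 to 30 and the 5-fold setting are assumptions for illustration, so the best k found here will not necessarily be 14.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the real features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

k_values = list(range(1, 31))  # assumed search range
mean_accuracy = []
for k in k_values:
    # Mean accuracy over 5 folds for a KNN classifier with k neighbours.
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5, scoring="accuracy"
    )
    mean_accuracy.append(scores.mean())

best_k = k_values[int(np.argmax(mean_accuracy))]
print("best k:", best_k)

# Plotting k_values against mean_accuracy (for example with matplotlib)
# gives the accuracy-versus-k curve described in the text.
```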
Boosting and scaling of the data were tried but did not yield any improvement in accuracy.
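One way to check the effect of scaling is to compare cross-validated accuracy with and without a scaler in front of the classifier. The sketch below uses StandardScaler and synthetic data; the notes do not say which scaler was tried, so that choice is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

plain = KNeighborsClassifier(n_neighbors=14)
# Putting the scaler in a pipeline fits it on each training fold only.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=14))

plain_accuracy = cross_val_score(plain, X, y, cv=5, scoring="accuracy").mean()
scaled_accuracy = cross_val_score(scaled, X, y, cv=5, scoring="accuracy").mean()

print("without scaling:", plain_accuracy)
print("with scaling:", scaled_accuracy)
```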
Techniques for increasing accuracy:
- During the testing phase, wifi_connectivity_rating and ground_service_rating were dropped from the features used for classification, because they had many missing values and were consequently hurting accuracy.
- Initially all the missing values were set to 0. The next strategy tried was imputing the missing values (using Imputer from sklearn) with the mean and the median of the column. However, for KNN this decreased the performance of the algorithm, because voting among the nearest neighbors was skewed by the identical values filled into the missing entries.
- Finally, all the missing values were set to -99999 so that the algorithm can easily identify them as missing values. This resulted in improved performance.
- cabin_flown was then used for classification. As this field holds categorical values rather than numerical attributes, one-hot encoding with the pandas.get_dummies() method was used. This produced a vectorized version of the attribute, which was used for classification. The vector had a field called NaN, which was set to 1 for all data points for which cabin_flown was missing.
- As with cabin_flown, type_traveller was added to improve the performance of the algorithm.
- Scaling and normalization were also tried, but they had a negative effect on performance and hence were not used in the final prediction.
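The missing-value strategies listed above can be compared side by side with a sketch like the following. The toy values and the overall_rating column are hypothetical. The example uses SimpleImputer from sklearn.impute, which replaced the older Imputer class in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy ratings; the first two column names come from the notes.
ratings = pd.DataFrame({
    "wifi_connectivity_rating": [np.nan, np.nan, 4.0, np.nan],
    "ground_service_rating": [np.nan, 2.0, np.nan, np.nan],
    "overall_rating": [7.0, np.nan, 3.0, 9.0],  # assumed numeric feature
})

# Drop the two sparsely populated columns before classification.
features = ratings.drop(
    columns=["wifi_connectivity_rating", "ground_service_rating"]
)

# Strategy 1: fill gaps with 0.
zero_filled = features.fillna(0)

# Strategy 2: impute with the column mean or median.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(features),
    columns=features.columns,
)
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(features),
    columns=features.columns,
)

# Strategy 3: fill gaps with a sentinel far outside the valid range.
sentinel_filled = features.fillna(-99999)

print(sentinel_filled)
```

With a distance-based classifier such as KNN, the sentinel places rows with a missing value far from rows with a real value in that column, whereas mean or median imputation makes them look like typical rows.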