Figure 3: Proposed Solution

3.2 Data Set

For our proposed work, we selected the lung cancer data set from the UC Irvine Machine Learning Repository. This data set contains patients' diagnosis records based on clinical evidence.

3.3 Tool & Language

In this research work, we use the Python language in a Jupyter notebook. Python is a high-level programming language that is easy to use and understand and allows concise code.

3.4 Data Pre-Processing

This is a very important step in which the data are cleaned of missing and duplicated values. Where the data contain missing or duplicated values, we remove the duplicates and apply imputation methods to deal with the missing entries.
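As a minimal sketch of this step (using pandas with hypothetical values, not the actual lung cancer records), duplicate rows can be dropped and missing numeric values imputed with the column mean:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value and one duplicate row
df = pd.DataFrame({
    "age":    [63, 55, np.nan, 55],
    "smoker": ["yes", "no", "yes", "no"],
})

# Remove exact duplicate rows
df = df.drop_duplicates()

# Impute remaining missing ages with the column mean
df["age"] = df["age"].fillna(df["age"].mean())
```

Other imputation strategies (median, most frequent) follow the same pattern.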

3.5 Label Encoding

This step converts categorical values to numerical ones. Because machine learning algorithms cannot process categorical data directly, we use label encoding to convert the categorical labels in the data set into numerical features.
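As a brief sketch (with hypothetical "yes"/"no" labels), scikit-learn's LabelEncoder maps each distinct category to an integer:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["yes", "no", "yes", "yes"]  # hypothetical categorical column

le = LabelEncoder()
encoded = le.fit_transform(labels)  # categories are sorted: "no" -> 0, "yes" -> 1
```

The fitted encoder keeps the mapping in `le.classes_`, so predictions can later be decoded back to the original labels with `le.inverse_transform`.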

3.6 Data Visualization

Data visualization is the graphical presentation of data using plots, charts, bar graphs, and histograms. It is important for understanding the data and making better decisions from it.
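A minimal example of such a plot (with hypothetical age values, using matplotlib) is a histogram of a numeric feature:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

ages = [45, 52, 63, 58, 47, 70, 66, 55]  # hypothetical patient ages

fig, ax = plt.subplots()
ax.hist(ages, bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Count")
fig.savefig("age_hist.png")
```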

3.7 Data Splitting

Data splitting divides the data set into a training set and a testing set. Before splitting, the dependent (target) class is separated from the independent (feature) columns.
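This step can be sketched with scikit-learn's train_test_split (toy arrays here, not the actual data set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # independent features (toy data)
y = np.array([0, 1] * 5)          # dependent target class

# Hold out 30% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```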

3.8 Feature Scaling

In this feature engineering step, we normalize the data so that the features in our proposed work are on comparable scales, which is more suitable for the machine learning algorithms.
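One common way to do this, sketched here with a toy matrix, is standardization with scikit-learn's StandardScaler, which rescales each feature to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column: mean 0, std 1
```

In practice the scaler is fitted on the training set only and then applied to the test set, so no information leaks from the test data.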

3.9 Supervised Algorithms

Since this is a classification problem, we use supervised machine learning models for classification and prediction. Supervised classification is appropriate for a categorical target variable. In this research work, we use two types of advanced supervised models.

3.9.1 Random Forest Classifier

The random forest algorithm is an ensemble of decision trees whose individual predictions are combined by majority voting. It is very fast compared to many other classifiers.
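A minimal usage sketch (on synthetic data rather than the lung cancer set) looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data standing in for the real features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

The `n_estimators` parameter controls how many decision trees are grown and averaged.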

3.9.2 XGBoost Classifier

XGBoost is one of the most popular algorithms in machine learning, and it has gained particular attention in recent years. It is very fast, especially on large data sets, and its accuracy, speed, and scalability have made it a mature choice in data science competitions and a favorite tool of many data scientists. It is a powerful algorithm that can handle a variety of irregular data, which makes it important when both good accuracy and fast execution are required.

3.9.3 Advantages of Random Forest and XGboost Model

Random forest and XGBoost are both powerful supervised learning models, and both are used for classification problems. Random forest trains quickly relative to many other algorithms and works well for classification. XGBoost is generally faster still and is well suited to big data analysis. Both are simple to apply and fast in terms of execution time.

3.9.4 Model Optimization

Model optimization adapts the classification algorithm by taking an unequal voting strategy into account, weighting each vote by its performance rather than treating every prediction equally, as the original classification algorithm does. In this research, we try to estimate the best hyper-parameters for the classification problem in order to improve model speed and performance. The classifier below will be optimized.
classifier = XGBClassifier(objective='binary:logistic', colsample_bytree=0.3, learning_rate=0.05, max_depth=10, alpha=1, n_estimators=1000)
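Such a hyper-parameter search can be sketched with scikit-learn's GridSearchCV. The example below is a minimal, self-contained illustration that uses a RandomForestClassifier on synthetic data as a stand-in; the same pattern applies to the XGBoost classifier and its parameters above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real feature matrix and labels
X, y = make_classification(n_samples=150, random_state=0)

# Candidate hyper-parameter values to try
param_grid = {"max_depth": [3, 10], "n_estimators": [50, 100]}

# Exhaustively evaluate every combination with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best = search.best_params_  # best-scoring combination
```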

3.10 Classification and Prediction

After applying classification, we generate the classification results for our data and predict labels for new data based on the existing data set.

3.11 Performance Measure

This is a very important step, in which we measure the performance of our proposed model and calculate its accuracy, that is, how accurate the model and the results of our proposed work are. In this step we use a confusion matrix to calculate the performance of the model; performance is reported through the precision, recall, and F1 score of each model.
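As a small sketch (with hypothetical true and predicted labels, not the actual experiment results), these metrics are computed from the confusion matrix as follows:

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]  # hypothetical model predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```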

3.12 Evaluate and Validate

In this step we evaluate and verify all results, finding any errors in our proposed solution and correcting them.

3.13 Comparison

Finally, we compare our results, proposed solution, model performance, and accuracy scores with existing research studies.