Figure 3: Proposed Solution
3.2 Data Set
For our proposed work, we selected a lung cancer data set from the UC Irvine Machine
Learning Repository. This data set contains patients' diagnosis records
based on clinical evidence.
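As a minimal sketch of loading this data set with pandas (the file name used here is an assumption for illustration, not the repository's actual file name):

import pandas as pd

# Load the lung cancer data set into a DataFrame (file name is assumed).
data = pd.read_csv("lung_cancer.csv")
print(data.shape)
print(data.head())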
3.3 Tool & Language
In this research work, we use the Python language in a Jupyter notebook.
Python is a high-level programming language that is easy to use and
understand and allows concise code.
3.4 Data Pre-Processing
This is a very important step: cleaning the data of missing and duplicated
values. If the data contains missing or duplicated values, we apply
imputation and de-duplication methods to deal with them.
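A minimal sketch of this step, assuming the data set has already been loaded into a pandas DataFrame named data:

# Remove exact duplicate rows.
data = data.drop_duplicates()

# Impute missing values: numeric columns with the median,
# other (categorical) columns with the most frequent value.
for col in data.columns:
    if data[col].dtype.kind in "biufc":
        data[col] = data[col].fillna(data[col].median())
    else:
        data[col] = data[col].fillna(data[col].mode()[0])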
3.5 Label Encoding
This step converts categorical values to numerical values, because machine
learning algorithms cannot process categorical data directly. We therefore
use label encoding to convert the categorical columns of the data set into
numerical features.
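A minimal sketch using scikit-learn's LabelEncoder (applying it to every text column is an assumption about the data layout):

from sklearn.preprocessing import LabelEncoder

# Encode every categorical (object) column as integer labels.
encoder = LabelEncoder()
for col in data.select_dtypes(include="object").columns:
    data[col] = encoder.fit_transform(data[col])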
3.6 Data Visualization
It is a graphical presentation of data using plots, charts, bars, and
histograms. It is important for understanding the data and making better
decisions from it.
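For example, the class distribution can be plotted with matplotlib (the target column name LUNG_CANCER is an assumption for illustration):

import matplotlib.pyplot as plt

# Bar chart of the class distribution (target column name is assumed).
data["LUNG_CANCER"].value_counts().plot(kind="bar")
plt.xlabel("Class")
plt.ylabel("Number of patients")
plt.title("Class distribution")
plt.show()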
3.7 Data Splitting
Data splitting is used to divide the data set into training and testing
sets, and to separate the independent features from the dependent (target)
class.
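A minimal sketch using scikit-learn's train_test_split (the target column name and the 80/20 split ratio are assumptions):

from sklearn.model_selection import train_test_split

# Separate the independent features from the dependent (target) class.
X = data.drop(columns=["LUNG_CANCER"])
y = data["LUNG_CANCER"]

# 80% training data, 20% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)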
3.8 Feature Scaling
In this feature-engineering step, we normalize the feature values so that
they are in a form suitable for the machine learning algorithms.
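A minimal sketch using scikit-learn's StandardScaler, fitted on the training split only and then applied to both splits:

from sklearn.preprocessing import StandardScaler

# Standardize features to zero mean and unit variance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)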
3.9 Supervised Algorithms
Since this is a classification problem, we use supervised machine learning
models for classification and prediction; supervised classification is
suited to a categorical target. In this research work, we use two advanced
supervised models.
3.9.1 Random Forest Classifier
A random forest algorithm is an ensemble (combination) of decision trees.
It is very fast compared to many other classifiers.
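A minimal sketch of fitting a random forest with scikit-learn (the number of trees is an assumed value):

from sklearn.ensemble import RandomForestClassifier

# An ensemble of 100 decision trees (n_estimators is an assumed value).
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)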
3.9.2 XGBoost Classifier
XGBoost is often described as the queen of machine learning. It is one of
the most popular algorithms and has gained popularity recently because it
is very fast, especially on large data sets. Its accuracy, speed, and
scalability have made it a mature model for data science competitions, and
the algorithm has become the ultimate weapon of many data scientists. It
is a classic algorithm, powerful enough to handle a variety of irregular
data, which makes it important for achieving both good accuracy and fast
execution.
3.9.3 Advantages of the Random Forest and XGBoost Models
Random forest and XGBoost are both powerful supervised learning models, and
both are used for classification problems. Random forest trains and
predicts much faster than many other algorithms and works well for
classification. XGBoost is called the queen of machine learning because it
is in turn much faster than random forest and is well suited to big data
analysis. Both are simple to use and fast in terms of execution time.
3.9.4 Model Optimization
Model optimization adapts the classification algorithm, for example by
using an unequal (weighted) voting strategy in which each prediction
probability is weighted by performance, instead of the original
classifier's equal weighting. In this research, we estimate the best
hyper-parameters for the classification problem to improve model speed and
performance. The classifier configuration below is the one that will be
optimized.
from xgboost import XGBClassifier

# Binary classification objective; the remaining hyper-parameters are tuned.
Classifier = XGBClassifier(objective='binary:logistic', colsample_bytree=0.3,
                           learning_rate=0.05, max_depth=10, alpha=1,
                           n_estimators=1000)
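One way to estimate such hyper-parameters is a grid search; the sketch below uses scikit-learn's GridSearchCV, and the candidate grid shown is an assumption rather than the exact grid used in this work:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Candidate hyper-parameter values (the grid itself is an assumption).
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [5, 10],
    "n_estimators": [500, 1000],
}
search = GridSearchCV(XGBClassifier(objective="binary:logistic"),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)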
3.10 Classification and Prediction
After applying the classifiers, we generate classification results on our
data and use the trained models to predict labels for new data based on
the existing data set.
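A minimal sketch of this step for the optimized XGBoost classifier defined above:

# Train on the training split and predict labels for the unseen test split.
Classifier.fit(X_train, y_train)
y_pred = Classifier.predict(X_test)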
3.11 Performance Measure
This is a very important step for measuring the performance of our proposed
model and calculating its accuracy, i.e., how accurate our model and our
proposed results are. In this step we use a confusion matrix to calculate
the performance of the model, and we report the precision, recall, and F1
score of each model.
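A minimal sketch of computing these measures with scikit-learn:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

# Compare predicted labels with the true test labels.
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1 score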
3.12 Evaluate and Validate
In this step we evaluate and verify all results, finding and correcting any
errors in our proposed solution.
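One common way to verify that the results are stable is cross-validation; the sketch below uses 5-fold cross-validation (the number of folds is an assumption):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation accuracy of the optimized classifier.
scores = cross_val_score(Classifier, X, y, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean())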
3.13 Comparison
Finally, we compare our results, proposed solution, model performance, and
accuracy score with those of existing research studies.