Preprocessing :
- When the data is loaded into the IDE, the attributes have to be converted to the appropriate data types, such as numeric or nominal, based on the attribute information.
- Data standardization is done.
- Class imbalance is handled using SMOTE (Synthetic Minority Oversampling Technique). For each minority-class record, SMOTE uses the k-nearest neighbors algorithm to find the most similar minority-class records, and it oversamples the minority class by generating synthetic records interpolated between the record and those neighbors.
- Three attributes (weight, payer_code, medical specialty) with more than 50% missing values are omitted; the remaining attributes have less than 3% missing values. Instead of imputing the missing values (which may not be accurate), records with missing values are deleted, since the dataset is large.
- Unimportant features such as patient number, encounter ID, and examide are dropped, since they have either a unique value per record or no variance.
- Feature engineering: the number of lab procedures (lab tests performed during the encounter) and the number of procedures (procedures other than lab tests) are combined to give the total number of procedures.
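The cleaning steps above can be sketched in a few lines of dependency-free Python. This is only an illustration on a toy table, not the code used for this project: the `?` missing-value marker, the column names (`encounter_id`, `num_lab_procedures`, `num_procedures`, `total_procedures`), and the sample rows are all assumptions made for the example. The sketch also keeps the two original counts next to the combined feature, which the notes above do not specify.

```python
MISSING = "?"  # assumed marker for a missing value


def missing_fraction(rows, column):
    """Fraction of rows whose value in `column` is the missing marker."""
    return sum(1 for r in rows if r[column] == MISSING) / len(rows)


def preprocess(rows, id_columns, max_missing=0.5):
    """Drop sparse and identifier columns, delete incomplete records,
    and add a combined procedure count."""
    columns = list(rows[0].keys())
    # Attributes with more than `max_missing` missing values are omitted.
    sparse = {c for c in columns if missing_fraction(rows, c) > max_missing}
    # Identifier-like attributes carry no predictive information.
    dropped = sparse | set(id_columns)
    kept = [c for c in columns if c not in dropped]

    cleaned = []
    for r in rows:
        rec = {c: r[c] for c in kept}
        # Records that still contain a missing value are deleted, not imputed.
        if MISSING in rec.values():
            continue
        # Feature engineering: lab procedures + other procedures.
        rec["total_procedures"] = rec["num_lab_procedures"] + rec["num_procedures"]
        cleaned.append(rec)
    return cleaned


# Toy rows, invented for the example.
rows = [
    {"encounter_id": 1, "weight": "?", "race": "Caucasian",
     "num_lab_procedures": 41, "num_procedures": 0},
    {"encounter_id": 2, "weight": "?", "race": "?",
     "num_lab_procedures": 59, "num_procedures": 2},
    {"encounter_id": 3, "weight": "[75-100)", "race": "AfricanAmerican",
     "num_lab_procedures": 11, "num_procedures": 5},
]
cleaned = preprocess(rows, id_columns=["encounter_id"])
```

In this toy table `weight` is missing in two of three rows, so the column is dropped, while the second record is deleted because its `race` value is missing.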
Approach :
There are three approaches for this data.
Approach – 1 :
Class imbalance is handled using SMOTE, which oversamples the minority class.
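To make the oversampling step concrete, here is a minimal sketch of the core SMOTE idea in plain Python, assuming purely numeric features and Euclidean distance. The function name `smote`, its parameters, and the sample points are hypothetical; this is not the implementation used in this project, and a tested library implementation would normally be preferred.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate `n_synthetic` points by interpolating between minority-class
    points and their k nearest minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class points, invented for the example.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the line segment between an existing minority record and one of its neighbours, so the new records stay inside the region already occupied by the minority class.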
Logistic Regression:
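Logistic regression is a statistical method for analyzing a data set in which one or more independent variables determine a categorical (typically binary) outcome. It models the probability of the outcome as a logistic function of a weighted sum of the independent variables.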
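As a rough illustration of how such a model is fitted, the sketch below trains a binary logistic regression with batch gradient descent on the log loss. The helper names (`fit_logistic`, `predict_proba`), the learning rate, the epoch count, and the one-feature toy data are assumptions for the example and do not reflect the settings used in this project.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Estimated probability that record `x` belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights `w` and intercept `b` by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # derivative of the loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Toy data, invented for the example: the outcome switches from 0 to 1 as x grows.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
```

Calling `predict_proba([0.0], w, b)` and `predict_proba([5.0], w, b)` then gives a probability below and above 0.5 respectively, which is how the fitted model separates the two classes.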