How many patients will suffer from diabetes?
Let’s predict which people will develop diabetes based on their health records.
Linear regression is suited to a continuous y and is a poor fit for categorical outcomes. This is where logistic regression is used. Its fitted curve always stays between 0 and 1, so the model is non-linear in the predicted probability (it is linear only in the log-odds). It can be used for both binomial and multinomial outcomes, but it is mainly used for binomial data.
Beyond the basic binary (binomial) case, there are two further types of logistic regression:
- Ordinal logistic regression
- Multinomial logistic regression
In R, the function glm() with family = binomial is used to build a logistic regression model. Typical applications of logistic regression include spam detection, marketing, and banking.
Logistic regression works by estimating the likelihood of an outcome with the sigmoid function. It gives the probability of the target variable, and this probability always lies between 0 and 1.
ln(p / (1 - p)) = b0 + b1x
Solving for p gives the sigmoid curve: p = 1 / (1 + e^(-(b0 + b1x)))
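To make the link between the log-odds equation and the sigmoid concrete, here is a minimal base R sketch. The coefficients b0 and b1 and the x values are made up purely for illustration; they do not come from the diabetes data.

```r
# Sigmoid (inverse logit): maps any real number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Logit (log-odds): the inverse of the sigmoid
logit <- function(p) log(p / (1 - p))

# Hypothetical coefficients, chosen only for illustration
b0 <- -1
b1 <- 0.5

x <- c(-10, 0, 2, 10)
p <- sigmoid(b0 + b1 * x)
print(p)  # every value lies strictly between 0 and 1

# Applying the logit recovers the linear predictor b0 + b1 * x
print(all.equal(logit(p), b0 + b1 * x))

# Draw the S-shaped curve
curve(sigmoid(x), from = -6, to = 6, xlab = "b0 + b1x", ylab = "p")
```

Base R also ships these two functions as plogis() (sigmoid) and qlogis() (logit), so you rarely need to define them yourself.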
After building the model, we need to evaluate it. There are different techniques for evaluating the model, such as:
- Akaike Information Criterion (AIC)
- Null deviance and Residual deviance
- Confusion matrix
- ROC – AUC
AIC is an important indicator of model fit. It follows the rule: the smaller, the better. AIC penalizes an increasing number of coefficients in the model. In other words, adding more variables lowers AIC only if they improve the fit enough to outweigh the penalty. This helps to avoid overfitting.
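The penalty can be seen by computing AIC by hand. The sketch below uses simulated data (not the diabetes data set), and the variable names x1 and noise are assumptions made for the example. It relies on the standard definition AIC = 2k - 2 * log-likelihood, where k is the number of estimated coefficients.

```r
set.seed(42)  # simulated data, for illustration only

n <- 200
x1 <- rnorm(n)
noise <- rnorm(n)  # a variable unrelated to the outcome
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))

fit_small <- glm(y ~ x1, family = binomial)
fit_big   <- glm(y ~ x1 + noise, family = binomial)

# AIC = 2k - 2 * log-likelihood, with k = number of estimated coefficients
k <- length(coef(fit_small))
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit_small))

# The hand-computed value should match R's built-in AIC()
print(c(manual = manual_aic, builtin = AIC(fit_small)))

# Compare the two models: the extra coefficient costs 2 AIC points,
# which is only recovered if the new variable improves the fit enough
print(c(small = AIC(fit_small), big = AIC(fit_big)))
```

Because the extra variable can never make the likelihood worse, the larger model's AIC exceeds the smaller one's by at most 2; a useless variable typically ends up close to that limit.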
Null Deviance and Residual Deviance:
Null deviance is calculated from the model with no features, i.e., only intercept. The null model predicts class via a constant probability.
Residual deviance is calculated from the model with all the features. By analogy with linear regression, think of residual deviance as the residual sum of squares (RSS) and null deviance as the total sum of squares (TSS).
The larger the difference between null and residual deviance, the better the model.
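Both deviances are stored on the fitted glm object. The following sketch uses simulated data (an assumption, not the article's data set) and also fits the intercept-only model explicitly to show where the null deviance comes from.

```r
set.seed(1)  # simulated data, for illustration only

n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 1.5 * x))

fit  <- glm(y ~ x, family = binomial)  # model with the feature
null <- glm(y ~ 1, family = binomial)  # intercept-only (null) model

print(fit$null.deviance)  # deviance with no features
print(fit$deviance)       # residual deviance with the feature

# The deviance of the intercept-only model is the null deviance
print(all.equal(deviance(null), fit$null.deviance))

# The null model predicts one constant probability: the share of 1s in y
print(c(null_prob = unname(fitted(null)[1]), mean_y = mean(y)))

# Deviance-based pseudo R-squared, the analogue of 1 - RSS / TSS
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance
print(pseudo_r2)
```

The pseudo R-squared here is only a rough summary in the spirit of the RSS/TSS analogy; it should not be read as "variance explained" in the linear regression sense.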
The confusion matrix is a crucial tool commonly used to evaluate classification models. It tabulates predicted classes against actual classes and is used to calculate the true positive rate (sensitivity) and the true negative rate (specificity).
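As a small worked example, the sketch below builds a confusion matrix with base R's table() and derives sensitivity and specificity from it. The actual and predicted vectors are hypothetical values invented for the example.

```r
# Hypothetical actual and predicted classes (1 = diabetic, 0 = not diabetic)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Fixing the factor levels keeps the table 2 x 2 even if a class is never predicted
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))
print(cm)

TP <- cm["1", "1"]  # predicted 1, actually 1
TN <- cm["0", "0"]  # predicted 0, actually 0
FP <- cm["1", "0"]  # predicted 1, actually 0
FN <- cm["0", "1"]  # predicted 0, actually 1

sensitivity <- TP / (TP + FN)  # true positive rate
specificity <- TN / (TN + FP)  # true negative rate
accuracy    <- (TP + TN) / sum(cm)

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```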
The ROC curve shows the performance of a classification model across all threshold values; each point on the curve corresponds to one user-defined threshold. The Area Under the Curve (AUC), also referred to as the index of accuracy (A) or concordance index, summarizes the ROC curve in a single number. The higher the area, the better the model. The ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis).
In this plot, our aim is to push the red curve toward the top-left corner (a true positive rate of 1) and maximize the area under the curve. The higher the curve, the better the model. The yellow line represents the ROC curve at the 0.5 threshold. At this point, sensitivity = specificity.
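The ROC curve and its AUC can be computed by hand in base R, which makes the idea easier to follow than a packaged function. This is a minimal sketch with hypothetical labels and predicted probabilities; it sweeps the threshold, collects the true and false positive rates, and integrates with the trapezoidal rule.

```r
# Hypothetical labels and predicted probabilities, for illustration only
actual <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
prob   <- c(0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10)

# Sweep the threshold from high to low; each threshold gives one ROC point
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) sum(prob >= t & actual == 1) / sum(actual == 1))
fpr <- sapply(thresholds, function(t) sum(prob >= t & actual == 0) / sum(actual == 0))

# Area under the curve via the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)

# Concordance view of AUC: the share of (positive, negative) pairs
# in which the positive case receives the higher probability
concordance <- mean(outer(prob[actual == 1], prob[actual == 0], ">"))
print(concordance)

# Plot the ROC curve; the dashed diagonal is a model no better than chance (AUC = 0.5)
plot(fpr, tpr, type = "l", col = "red",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2)
```

The two printed numbers agree here because there are no tied probabilities, which is why AUC is also described as a concordance index.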
Now, let’s move on to a practical demo using a diabetes data set to predict whether a patient will get diabetes or not. In this article, I’m mainly concentrating on the logistic regression model and the ROC-AUC curve. The rest is already explained in the previous article (what is linear regression).
Checking for null values in the data set
The above function is used to check the data set for null (NA) values.
We can see that the age column has some NA values; we can also see this visually with the missmap() function from the Amelia library. Now, impute the null values (imputing is the technique of filling in NA values; there are different techniques and functions for it).
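One common way to run this check in base R is sketched below. The toy data frame and its column names (age, glucose, diabetes) are assumptions for the example and may not match the real data set.

```r
# Hypothetical toy data; the column names are assumptions, not the article's data set
df <- data.frame(
  age      = c(50, NA, 32, 21, NA, 30),
  glucose  = c(148, 85, 183, 89, 137, 116),
  diabetes = c(1, 0, 1, 0, 1, 0)
)

# Is there any NA at all?
print(anyNA(df))

# Number of NA values per column
na_counts <- colSums(is.na(df))
print(na_counts)

# Row positions of the missing ages
print(which(is.na(df$age)))

# Optional visual check with Amelia's missmap(), only if the package is installed
if (requireNamespace("Amelia", quietly = TRUE)) {
  Amelia::missmap(df)
}
```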
Here, I am taking the mean of all the non-missing values of age and replacing the NA values with it.
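Mean imputation takes only two lines in base R. The sketch below uses a hypothetical age column rather than the real data.

```r
# Hypothetical age column with missing values
df <- data.frame(age = c(50, NA, 32, 21, NA, 30))

# Mean of the non-missing ages (na.rm = TRUE skips the NA values)
age_mean <- mean(df$age, na.rm = TRUE)

# Replace every NA in age with that mean
df$age[is.na(df$age)] <- age_mean

print(df$age)
print(anyNA(df$age))  # no NA values should remain
```

If the data is split into training and test sets, it is usually safer to compute the mean on the training set only and reuse it for the test set.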
glm() is the function used for logistic regression.
Here, I am building the model with the significant variables and predicting on the test data. Logistic regression gives a probability, so I am taking 0.5 as the threshold: a probability above 0.5 is classified as 1, otherwise as 0.
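The whole step can be sketched as follows. Since the original data is not reproduced here, the sketch simulates a stand-in data set; the column names (glucose, age, diabetes), the coefficients used to simulate it, and the 70/30 split are all assumptions for illustration.

```r
set.seed(123)  # simulated stand-in for the diabetes data; column names are assumptions

n <- 300
dat <- data.frame(
  glucose = rnorm(n, mean = 120, sd = 30),
  age     = round(runif(n, 21, 70))
)
dat$diabetes <- rbinom(n, 1, plogis(-8 + 0.05 * dat$glucose + 0.03 * dat$age))

# 70/30 train/test split
train_idx <- sample(n, round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# family = binomial is what makes glm() fit a logistic regression
model <- glm(diabetes ~ glucose + age, data = train, family = binomial)
print(summary(model)$coefficients)  # the last column holds the p-values

# type = "response" returns probabilities instead of log-odds
prob <- predict(model, newdata = test, type = "response")

# Apply the 0.5 threshold: above 0.5 becomes 1, otherwise 0
pred <- ifelse(prob > 0.5, 1, 0)

print(table(Predicted = pred, Actual = test$diabetes))
print(mean(pred == test$diabetes))  # accuracy on the test data
```

The 0.5 threshold is a convention rather than a rule; a different cut-off trades sensitivity against specificity.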
ROC-AUC Curve:
ROC stands for Receiver Operating Characteristic and AUC stands for Area Under the Curve.
The higher the AUC, the better the performance of the model.
Get the data set and source code from the link below.