Logistic Regression — Part III — Titanic Disaster Survival Prediction

In this article we will explore the Titanic dataset using Logistic Regression and classification metrics.

Let's see how to do logistic regression in Python with LogisticRegression() from sklearn.

I have taken the Titanic data set from Kaggle. https://www.kaggle.com/c/titanic/data

import pandas as pd

# Keep the raw files untouched and work on copies
raw = pd.read_csv('train.csv')
df = raw.copy()
raw_test = pd.read_csv('test.csv')
test = raw_test.copy()
df.head()

I have skipped the data pre-processing steps here, except for encoding. A separate post will cover data pre-processing in detail. The steps shown are:

  • Remove Cabin, since most of its values are null and a column that would have to be mostly imputed carries little information.
  • Encode the categorical columns. The same encoding is applied to the validation dataset (test.csv).
  • Drop the original columns and concatenate the encoded columns.
#1=> Drop the columns that are not used as features
df = df.drop(['Cabin', 'PassengerId', 'Name', 'Ticket'], axis=1)
test = test.drop(['Cabin', 'PassengerId', 'Name', 'Ticket'], axis=1)

#2=>
df['SibSp']=df['SibSp'].astype('category')
df['Parch']=df['Parch'].astype('category')
test['SibSp']=test['SibSp'].astype('category')
test['Parch']=test['Parch'].astype('category')
encodedDF = pd.get_dummies(df[['Embarked','Sex','SibSp','Parch']])
testencoded = pd.get_dummies(test[['Embarked','Sex','SibSp','Parch']])

#3=>
df = df.drop(['Survived','Embarked','Sex','SibSp','Parch'],axis=1)
test = test.drop(['Embarked','Sex','SibSp','Parch'],axis=1)
X = pd.concat([encodedDF,df],axis=1)
Val = pd.concat([testencoded,test],axis=1)

y = raw['Survived']

X.head()
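
The encoding above is applied to train.csv and test.csv separately, so the two frames can end up with different dummy columns, and LogisticRegression does not accept missing values. The following is only a minimal sketch of one way to handle both, not the pre-processing used for the results in this post. The toy frames, the `_toy` names, and the choice of median imputation are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the train and validation frames (hypothetical values).
train_toy = pd.DataFrame({
    "Age": [22.0, float("nan"), 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
    "Parch": pd.Categorical([0, 1, 0, 2]),
})
val_toy = pd.DataFrame({
    "Age": [34.0, float("nan"), 62.0],
    "Fare": [7.83, float("nan"), 9.69],
    "Parch": pd.Categorical([0, 9, 1]),  # level 9 never appears in train_toy
})

X_toy = pd.get_dummies(train_toy)
Val_toy = pd.get_dummies(val_toy)

# Give the validation frame exactly the training columns:
# levels unseen in training are dropped, missing levels are filled with 0.
Val_toy = Val_toy.reindex(columns=X_toy.columns, fill_value=0)

# Impute numeric gaps with medians computed on the training data only.
medians = X_toy[["Age", "Fare"]].median()
X_toy = X_toy.fillna(medians)
Val_toy = Val_toy.fillna(medians)

print(X_toy.columns.tolist())
print(Val_toy)
```

Computing the medians on the training frame and reusing them for the validation frame keeps information from the validation data out of the model inputs.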

Model — Using LogisticRegression:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hold out 20% of the training data to evaluate the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

# Fit the model and inspect the learned parameters
log = LogisticRegression()
log.fit(X_train, y_train)
print("Intercept: ", log.intercept_)
print("Coefficients: ", log.coef_)

# Predicted class labels (0 or 1) for the held-out rows
predict = log.predict(X_test)

Finally, we predict the Survived values for the held-out test split (X_test) using the predict() method.
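
To make the link between the printed intercept and coefficients and the output of predict() concrete, here is a minimal sketch on synthetic data. The data from make_classification and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration, not the Titanic features. It recomputes the probabilities by passing the linear score through the sigmoid and compares them with what sklearn returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical, not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid turns it into P(class = 1).
z = X_demo @ model.coef_.ravel() + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

# sklearn's own probability for the positive class.
sk_prob = model.predict_proba(X_demo)[:, 1]

# predict() corresponds to a 0.5 threshold on that probability.
manual_pred = (manual_prob > 0.5).astype(int)

print(np.allclose(manual_prob, sk_prob))
print((manual_pred == model.predict(X_demo)).all())
```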

Metrics

Packages to import for Error Metrics:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve

1. Confusion Matrix

A confusion matrix tabulates the predicted classes against the actual classes. All the correct predictions fall on the main diagonal, and the off-diagonal cells count the misclassifications.
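
As a small illustration of how the matrix is laid out, here is a sketch with hand-made labels. The `y_true_toy` and `y_pred_toy` lists are assumptions for the example and are not the Titanic predictions. In sklearn, rows are the actual classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem the flattened matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

For the model in this post, the same call would be made with y_test and predict.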

2. Classification Accuracy

This metric measures the ratio of correct predictions to the total number of predictions. The higher the accuracy, the better the model performs on the evaluated data.

print(accuracy_score(y_test,predict))

Output: 0.8044692737430168
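
Accuracy is easier to judge next to a baseline. The sketch below uses hand-made labels (the `y_true_toy` and `y_pred_toy` arrays are assumptions, not the Titanic split) to show that accuracy_score is just the fraction of matching labels, and to compare it with always predicting the most frequent class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted labels.
y_true_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 0, 1, 0, 1])

acc = accuracy_score(y_true_toy, y_pred_toy)

# The same number computed by hand: matching labels / all labels.
manual_acc = (y_true_toy == y_pred_toy).mean()

# Baseline: always predict the most frequent class.
majority_class = np.bincount(y_true_toy).argmax()
baseline_acc = (y_true_toy == majority_class).mean()

print(acc, manual_acc, baseline_acc)
```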

3. ROC Curve & AUC Score

The ROC (Receiver Operating Characteristic) curve plots the true positive rate (y-axis) against the false positive rate (x-axis) as the classification threshold varies.

predict_proba(…) returns an array of class probabilities with one column per class. pred_prob[:, 1] selects the second column, which is the predicted probability of the positive class (Survived = 1).

import matplotlib.pyplot as plt

# Probability of the positive class for each held-out row
pred_prob = log.predict_proba(X_test)
probs = pred_prob[:, 1]

auc = roc_auc_score(y_test, probs)
print('AUC Score: %.2f' % auc)

# One (FPR, TPR) point per threshold; the dashed diagonal is a random classifier
fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot(fpr, tpr, color='orange', label='ROC')
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

AUC Score: 0.88

Figure: ROC Curve

We can see that the ROC curve is not bad: an AUC of 0.88 is well above the 0.5 of a random classifier. With more pre-processing, we may be able to increase the AUC score. To know more about the ROC curve, please visit Logistic Regression Part II — Cost Function & Error Metrics.
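
To see where the points on a ROC curve come from, here is a minimal sketch with four hand-made scores. The `y_true_toy` and `scores_toy` arrays and the helper `roc_point` are assumptions for illustration. Each threshold turns the scores into class labels, and those labels give one (false positive rate, true positive rate) pair.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(threshold):
    """Return (FPR, TPR) when scores >= threshold are labelled positive."""
    pred = scores_toy >= threshold
    tpr = (pred & (y_true_toy == 1)).sum() / (y_true_toy == 1).sum()
    fpr = (pred & (y_true_toy == 0)).sum() / (y_true_toy == 0).sum()
    return fpr, tpr

# roc_curve sweeps the thresholds for us; AUC is the area under those points.
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, scores_toy)
auc_toy = roc_auc_score(y_true_toy, scores_toy)

print(roc_point(0.5))
print(roc_point(0.35))
print(auc_toy)
```

Lowering the threshold from 0.5 to 0.35 catches both positives but also lets one negative through, which is the trade-off the curve visualizes.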

4. Classification Report

classification_report() gives a summary of metrics for each class.

The report lists Precision, Recall, and F1-Score separately for classes 0 and 1.
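
As a small sketch of what the report contains, the example below runs classification_report on hand-made labels (the `y_true_toy` and `y_pred_toy` lists are assumptions, not the Titanic predictions) and checks one entry by hand. Passing output_dict=True returns the same numbers as a dictionary, which is convenient for reading individual values.

```python
from sklearn.metrics import classification_report

# Hypothetical actual and predicted labels.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 1, 0, 1]

# Human-readable table with precision, recall, F1-score and support per class.
print(classification_report(y_true_toy, y_pred_toy))

# The same metrics as a nested dict keyed by class label.
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)

# Class 1 by hand: 2 true positives, 1 false positive, 1 false negative.
precision_1 = 2 / (2 + 1)
recall_1 = 2 / (2 + 1)

print(report["1"]["precision"], precision_1)
print(report["1"]["recall"], recall_1)
```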

Conclusion:

In this article we have seen how to predict a discrete (binary) outcome using Logistic Regression.

To know more about Logistic Regression:

  1. Logistic Regression Part I — Transformation of Linear to Logistic
  2. Logistic Regression Part II — Cost Function & Error Metrics

Please try Logistic Regression and drop your comments here.

Thank you! 👍

Like to support? Just click the heart icon ❤️.

Happy Programming!🎈

Asha Ponraj

Data science and Machine Learning enthusiast | Software Developer | Blog Writer
