Two weeks ago I wrote a post, Using R and Random Forest to predict Diabetes Risk. Since I am less experienced with Python for machine learning, and that data set worked out so nicely, I figured I would attempt the same analysis in Python.

First, we load the modules and functions we need.

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

The next step is the same as in R: load the data, clean it up a bit so scikit-learn can build a classification model, and then separate the features from the classification variable.

diabetesdf = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv")
gender_state = {'Male': 1, 'Female': 0}
yes_no_map = {'Yes': 1, 'No': 0}
diabetesdf['Gender'] = diabetesdf['Gender'].map(gender_state)
# Columns 2 through 15 hold the Yes/No symptom flags; map them to 1/0
for col in diabetesdf.columns[2:16]:
    diabetesdf[col] = diabetesdf[col].map(yes_no_map)
Y = diabetesdf['class'].values
X = diabetesdf.drop(labels = ['class'], axis =1)
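
One thing worth checking after a cleaning step like this: mapping a column through a dictionary returns NaN for any value that is not a key. Here is a minimal sketch on a toy frame. The rows, including the deliberately lowercase "yes", are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy rows that mimic the layout of the diabetes data (values invented).
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "yes"],  # last value is an intentional mismatch
})

yes_no_map = {"Yes": 1, "No": 0}
toy["Polyuria"] = toy["Polyuria"].map(yes_no_map)

# Series.map yields NaN for labels missing from the dictionary,
# so a null count flags unexpected values before modelling.
unmapped = toy["Polyuria"].isna().sum()
print(unmapped)
```

A count of zero on the real columns would suggest every label was covered by the mapping.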

As we did in R, we need to do a train/test split. I will use the same 80/20 proportion that I used in R, so that we can compare the accuracy.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 1)
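
To see what test_size=0.2 does to the row counts, here is a small sketch with placeholder arrays. It assumes 520 rows and 16 feature columns, which is consistent with the 104-row test set discussed in this post; the array contents are dummies and only the shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays: 520 rows is an assumption matching a 104-row test set.
X_demo = np.zeros((520, 16))
Y_demo = np.array(["Positive", "Negative"] * 260)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=1
)

# 20% of the rows are held out for testing, the rest are used for training.
print(len(X_tr), len(X_te))
```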

Finally, it’s time to make the model and quantify the accuracy. I will set n_estimators = 500, as that matches the default ntree of randomForest in R.

model = RandomForestClassifier(n_estimators = 500, random_state = 2)
model.fit(X_train, Y_train)
prediction_test = model.predict(X_test)
print("Accuracy = ", metrics.accuracy_score(Y_test, prediction_test))

Accuracy = 0.9807692307692307

We see an accuracy of 98.1%. This is a bit better than what we got in R (97.1%), but not by much. In fact, when I run the model in R repeatedly, I frequently get results of 96.2%, 97.1%, and 98.1%. With 104 observations in the test set, that means between 100 and 102 of them are predicted correctly. So it looks like we are getting comparable accuracy.
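
The link between those percentages and the counts is just a division by the test-set size. A quick check, using only the 104-row test set mentioned above:

```python
# Accuracy is the number of correct predictions divided by the test-set size.
test_size = 104
accuracies = {correct: correct / test_size for correct in (100, 101, 102)}

for correct, acc in accuracies.items():
    print(f"{correct}/{test_size} = {acc:.4f}")
```

Each additional correct prediction moves the accuracy by a little under one percentage point, so differences of this size between runs amount to one or two test rows.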

Now, last time I visualized an overall tree representing the total random forest. This time, let’s do a different sort of visualization. I adapted a function I found on analyseup that draws a bar plot of the features, ordered by importance.

feature_importance = np.array(model.feature_importances_)
feature_names = np.array(X.columns)
#Create a DataFrame using a Dictionary
data={'feature_names':feature_names,'feature_importance':feature_importance}
fi_df = pd.DataFrame(data)
#Sort the DataFrame in order of decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
#Define size of bar plot
plt.figure(figsize=(10,8))
#Plot Seaborn bar chart
sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
plt.title('RANDOM FOREST ' + 'FEATURE IMPORTANCE')
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')

[Figure: bar plot of random forest feature importance, with features sorted from most to least important]

Looking at this, it appears Polyuria (excess urination), Polydipsia (excess thirst or drinking), Age, and Gender are the most important features in predicting Diabetes risk, with other features following afterwards.
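
If you want the ranking as numbers rather than a picture, the same information fits in a sorted pandas Series. This is only a sketch on synthetic data: the make_classification inputs and the column names f0 to f5 are hypothetical stand-ins, not the diabetes features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names f0..f5 are hypothetical.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=50, random_state=2).fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first.
ranking = pd.Series(rf.feature_importances_, index=X_demo.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking.round(3))
```

The importances from a fitted forest sum to 1, so each value can be read as a share of the total.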

This was a fun exercise for me, and I look forward to leveraging Python more in my future posts and personal machine learning work.