Show the robustness of your classifier performance in a pooled ROC chart
In machine learning, one crucial rule is that you should not score your model on previously unseen data (aka your test set) until you are satisfied with your results using solely training data.
To show the performance and robustness of your model you can use multiple training and test sets inside your training data. To prevent confusion, we call the held-out part a validation set if it is part of the training data. Dividing the training data into multiple training and validation sets is called cross-validation. The ratio, size and number of sets depend on the cross-validation method and the size of your training set. The most common method is probably K-Fold, but depending on the size of the training set you might want to try Bootstrapping or Leave-One-Out. Each method has advantages and disadvantages, like an increased training or validation set size per fold. I will not go into detail; there are plenty of awesome articles on Medium on the topic.
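To make the idea of folds a bit more concrete, here is a minimal NumPy sketch of a plain shuffled k-fold split. The helper name `kfold_indices` and the toy sample count are my own assumptions for illustration; in practice the scikit-learn splitters used later in this article do this for you.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=0):
    """Hypothetical helper: yield (train_idx, val_idx) pairs of a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split also works when n_samples is not divisible by n_splits
    chunks = np.array_split(shuffled, n_splits)
    for k in range(n_splits):
        val_idx = chunks[k]
        train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != k])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, val_idx in splits:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

Every sample ends up in a validation set exactly once, which is the property that distinguishes K-Fold from Bootstrapping.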
So, we are using some sort of cross-validation with a classifier to train and validate the model more than once. This approach results in a series of scores. Probably the most common way to compare model performances is the ROC curve. It does not take class imbalance into account, which makes it useful for comparing against other models trained with different data in the same field of research. A great complement to the ROC curve is the precision-recall curve (PRC), which does take class imbalance into account and helps to judge the performance of different models trained with the same data. But again, there are already plenty of awesome articles on Medium on all kinds of metrics. To get a ROC curve you plot the true positive rate (TPR) against the false positive rate (FPR) over all decision thresholds. To summarise the performance of your model in a single number, you calculate the area under the ROC curve (AUC).
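If you want to see what happens under the hood, the following sketch computes the ROC points and the AUC by hand with NumPy. The function name `roc_points` and the four toy labels and scores are assumptions for illustration only, and the code assumes both classes are present; later in this article I simply rely on scikit-learn.

```python
import numpy as np

def roc_points(labels, scores):
    """Hypothetical helper: FPR and TPR at every distinct score threshold."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    # Sort samples from the highest to the lowest score
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    # Keep only the last position of each run of tied scores
    distinct = np.r_[np.where(np.diff(scores) != 0)[0], len(scores) - 1]
    tps = np.cumsum(labels == 1)[distinct]
    fps = np.cumsum(labels == 0)[distinct]
    # Prepend the origin so the curve starts at (0, 0)
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]
    return fpr, tpr

fpr_toy, tpr_toy = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Area under the curve via the trapezoidal rule
auc_toy = np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2)
print(auc_toy)  # 0.75 for this toy input
```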
Let's say we trained XGBoost classifiers in a 100 x 5-fold cross-validation and got 500 results. For each fold we have to extract the TPR (also known as sensitivity) and the FPR (also known as 1-specificity) and calculate the AUC. Based on this series of results you can give an approximate confidence band (in this article: the mean plus or minus two standard deviations) to show the robustness of your classifier.
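As a rough sketch of how such a series of scores can be summarised, the snippet below computes the mean, a two-standard-deviation band and an empirical 95% percentile interval. The AUC values are simulated stand-ins, not real results, and `summarise_scores` is a hypothetical helper. Keep in mind that repeated folds share training data, so I would treat both intervals as descriptive rather than as strict confidence intervals.

```python
import numpy as np

def summarise_scores(scores):
    """Hypothetical helper: mean, mean +/- 2 std band and 2.5-97.5 percentile range."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    spread = 2 * scores.std()
    low, high = np.percentile(scores, [2.5, 97.5])
    return {
        "mean": mean,
        "band": (mean - spread, mean + spread),
        "percentile": (low, high),
    }

# Simulated stand-in for 500 per-fold AUC values (NOT real model results)
rng = np.random.default_rng(101)
fake_aucs = np.clip(rng.normal(loc=0.85, scale=0.04, size=500), 0, 1)

summary = summarise_scores(fake_aucs)
print(summary)
```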
As this is specifically meant to show how to build a pooled ROC plot, I will not run a feature selection or optimise my parameters.
First of all we import some packages and load a data set:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from tqdm.notebook import tqdm
from sklearn.model_selection import RepeatedKFold
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
df = pd.read_csv(url, header=None)
There are a few missing values denoted as “?”; we have to remove them first:
for i in range(13):
    # Replace the '?' placeholder with NaN, then cast the column to float
    df[i] = df[i].apply(lambda x: np.nan if x == '?' else x)
    df[i] = df[i].astype(float)
df = df.dropna()
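An alternative that should give the same result is to let pandas treat the placeholder as missing while reading the file, via the `na_values` argument of `pd.read_csv`. The sketch below uses two made-up rows in an in-memory buffer instead of the real file, so the values are purely illustrative.

```python
import io
import pandas as pd

# Two made-up rows in a comma-separated layout, one containing a '?' entry
sample = io.StringIO("63.0,1.0,?,0\n67.0,1.0,3.0,2\n")

# '?' is parsed as NaN, so the affected column is numeric right away
toy = pd.read_csv(sample, header=None, na_values="?")
print(toy.dtypes)

# Drop the rows that contain a missing value
toy = toy.dropna()
print(toy)
```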
The Cleveland Heart Disease data set has a target that is encoded as 0–4, which we will binarize into class 0 for all targets encoded as 0 and class 1 for all targets encoded as 1–4.
def binarize(x):
    if x == 0:
        value = 0
    else:
        value = 1
    return value

df[13] = df[13].map(binarize)
Next, we define our features and the label:
X = df.drop(13, axis=1)
y = df[13]
Now we do a stratified split of the data to preserve a potential class imbalance:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101, stratify=y)
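To illustrate what stratification buys you, here is a small NumPy sketch that samples the test part separately within each class, so both parts keep the original class ratio. The helper `stratified_split` and the 80/20 toy labels are assumptions for illustration; it is not how scikit-learn implements `stratify`.

```python
import numpy as np

def stratified_split(y, test_size, seed=0):
    """Hypothetical helper: return (train_idx, test_idx), sampled per class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for cls in np.unique(y):
        # Shuffle the positions of this class and cut off the test share
        idx = rng.permutation(np.where(y == cls)[0])
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    return np.concatenate(train_parts), np.concatenate(test_parts)

# Toy labels with an 80/20 class imbalance
y_toy = np.array([0] * 80 + [1] * 20)
train_idx, test_idx = stratified_split(y_toy, test_size=0.3)

# The share of class 1 is the same in both parts
print(y_toy[train_idx].mean(), y_toy[test_idx].mean())
```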
We can now get the folds using our train set. I use a repeated k-fold to get more score results:
cv = RepeatedKFold(n_splits=5, n_repeats=100, random_state=101)
folds = [(train, test) for train, test in cv.split(X_train, y_train)]
Let's build a dictionary to collect our results in:
metrics = ['auc', 'fpr', 'tpr', 'thresholds']
results = {
    'train': {m: [] for m in metrics},
    'val'  : {m: [] for m in metrics},
    'test' : {m: [] for m in metrics}
}
To initialise XGBoost we have to choose some parameters:
params = {
    'objective'   : 'binary:logistic',
    'eval_metric' : 'logloss'
}
Now it is time to run our cross validation and save all scores to our dictionary:
dtest = xgb.DMatrix(X_test, label=y_test)
for train, test in tqdm(folds, total=len(folds)):
    dtrain = xgb.DMatrix(X_train.iloc[train, :], label=y_train.iloc[train])
    dval = xgb.DMatrix(X_train.iloc[test, :], label=y_train.iloc[test])
    model = xgb.train(
        dtrain = dtrain,
        params = params,
        evals = [(dtrain, 'train'), (dval, 'val')],
        num_boost_round = 1000,
        verbose_eval = False,
        early_stopping_rounds = 10,
    )
    # Score the model on the fold's training part, its validation part and the held-out test set
    sets = [dtrain, dval, dtest]
    for i, ds in enumerate(results.keys()):
        y_preds = model.predict(sets[i])
        labels = sets[i].get_label()
        fpr, tpr, thresholds = roc_curve(labels, y_preds)
        results[ds]['fpr'].append(fpr)
        results[ds]['tpr'].append(tpr)
        results[ds]['thresholds'].append(thresholds)
        results[ds]['auc'].append(roc_auc_score(labels, y_preds))
This is a fairly simple procedure. It is also possible to use feval inside the xgb.cv method to compute your scores in a custom function, but in my experience that is much slower and harder to debug.
Now that we have our results from the 500 cross-validation folds (100 repeats x 5 folds), we can plot our ROC curve:
kind = 'val'

c_fill = 'rgba(52, 152, 219, 0.2)'
c_line = 'rgba(52, 152, 219, 0.5)'
c_line_main = 'rgba(41, 128, 185, 1.0)'
c_grid = 'rgba(189, 195, 199, 0.5)'
c_annot = 'rgba(149, 165, 166, 0.5)'
c_highlight = 'rgba(192, 57, 43, 1.0)'

# Interpolate the curve of every fold onto a shared FPR grid
fpr_mean = np.linspace(0, 1, 100)
interp_tprs = []
for i in range(len(results[kind]['fpr'])):
    fpr = results[kind]['fpr'][i]
    tpr = results[kind]['tpr'][i]
    interp_tpr = np.interp(fpr_mean, fpr, tpr)
    interp_tpr[0] = 0.0
    interp_tprs.append(interp_tpr)
tpr_mean = np.mean(interp_tprs, axis=0)
tpr_mean[-1] = 1.0
# Band of two standard deviations around the mean curve
tpr_std = 2*np.std(interp_tprs, axis=0)
tpr_upper = np.clip(tpr_mean+tpr_std, 0, 1)
tpr_lower = tpr_mean-tpr_std
auc = np.mean(results[kind]['auc'])

fig = go.Figure([
    go.Scatter(
        x = fpr_mean,
        y = tpr_upper,
        line = dict(color=c_line, width=1),
        hoverinfo = "skip",
        showlegend = False,
        name = 'upper'),
    go.Scatter(
        x = fpr_mean,
        y = tpr_lower,
        fill = 'tonexty',
        fillcolor = c_fill,
        line = dict(color=c_line, width=1),
        hoverinfo = "skip",
        showlegend = False,
        name = 'lower'),
    go.Scatter(
        x = fpr_mean,
        y = tpr_mean,
        line = dict(color=c_line_main, width=2),
        hoverinfo = "skip",
        showlegend = True,
        name = f'AUC: {auc:.3f}')
])
fig.add_shape(
    type = 'line',
    line = dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)
fig.update_layout(
    template = 'plotly_white',
    title_x = 0.5,
    xaxis_title = "1 - Specificity",
    yaxis_title = "Sensitivity",
    width = 800,
    height = 800,
    legend = dict(
        yanchor="bottom",
        xanchor="right",
        x=0.95,
        y=0.01,
    )
)
fig.update_yaxes(
    range = [0, 1],
    gridcolor = c_grid,
    scaleanchor = "x",
    scaleratio = 1,
    linecolor = 'black')
fig.update_xaxes(
    range = [0, 1],
    gridcolor = c_grid,
    constrain = 'domain',
    linecolor = 'black')
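The pooling in the plotting code above hinges on one step: every fold returns a ROC curve with a different number of points, so each curve is interpolated onto the same FPR grid before averaging. Here is that step in isolation as a minimal sketch, using two made-up curves and a coarse grid of 11 points (both are assumptions for illustration).

```python
import numpy as np

# Two made-up ROC curves with a different number of points each
curves = [
    (np.array([0.0, 0.2, 1.0]), np.array([0.0, 0.6, 1.0])),
    (np.array([0.0, 0.1, 0.5, 1.0]), np.array([0.0, 0.4, 0.9, 1.0])),
]

# Shared FPR grid that every curve is resampled onto
grid = np.linspace(0, 1, 11)
interp = np.array([np.interp(grid, fpr, tpr) for fpr, tpr in curves])

# Point-wise mean and a band of two standard deviations
mean_tpr = interp.mean(axis=0)
band = 2 * interp.std(axis=0)
upper = np.clip(mean_tpr + band, 0, 1)
lower = np.clip(mean_tpr - band, 0, 1)

print(interp.shape)  # one row per curve, one column per grid point
```

Once all curves live on the same grid, the mean and the band are plain column-wise statistics.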
You could make the plotting code shorter by using plotly's toself filling method, but this way you are more flexible in terms of color or specific changes to the lower or upper boundaries. This is the result of the scores on the validation set inside our KFold procedure:
Once you have tuned your model, found some better features and optimised your parameters, you can go ahead and plot the same graph for your test data by changing kind = 'val' to kind = 'test' in the code above. Let's see how the models perform on our test set:
Of course you can use the same procedure to build a precision-recall curve (PRC) to inspect performance when the class imbalance is high, and save the feature importances of each fold to get an idea of the robustness of your features.
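As a starting point for the precision-recall variant, the sketch below computes precision and recall at every distinct threshold with NumPy. The helper `pr_points` and the toy inputs are assumptions for illustration, and the code assumes at least one positive label; scikit-learn's `precision_recall_curve` is the function I would reach for in practice.

```python
import numpy as np

def pr_points(labels, scores):
    """Hypothetical helper: precision and recall at every distinct score threshold."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    # Sort samples from the highest to the lowest score
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]
    # Keep only the last position of each run of tied scores
    distinct = np.r_[np.where(np.diff(scores) != 0)[0], len(scores) - 1]
    tps = np.cumsum(labels == 1)[distinct]
    fps = np.cumsum(labels == 0)[distinct]
    precision = tps / (tps + fps)
    recall = tps / tps[-1]
    return precision, recall

precision, recall = pr_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision, recall)
```

Recall never decreases along the thresholds, so it should be possible to pool the folds the same way as for the ROC curve: interpolate precision onto a shared recall grid with `np.interp`, then take the mean and standard deviation per grid point.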
Since we are using plotly to plot the results, the plot is interactive and could be visualised inside a streamlit app, for example.
I hope this helps some fellow data scientists to present the performance of their classifiers. Thanks for reading!