Performance Analysis of Out-of-Distribution Detection on Various Trained Neural Networks

How does model accuracy affect the supervisor?

Out-of-distribution detection by safety supervisors will be critical for machine learning-based perception. In automotive perception, we refer to this solution as a safety cage architecture. In this work, we show that out-of-distribution detection gets better as the supervised deep neural networks improve with training.

The work in this paper was driven by Jens Henriksson as part of the SMILE program. In SMILE, we investigate approaches to make ML-based automotive features dependable. The solution approach that we primarily work with is safety cage architectures, in this paper referred to as safety supervisors. The idea is to develop a reliable approach to out-of-distribution detection. How can we detect when front-facing dashcams capture images that do not resemble anything the models were trained on?

Our work was nominated for the Best Paper Award at the Euromicro Conference on Software Engineering and Advanced Applications (SEAA) 2019 – this was the second year in a row with awarded papers at SEAA!

Structured evaluation of supervisors

In a paper published at IEEE AI Testing earlier this year, we highlight the need for better evaluations of safety supervisors. We also present a family of metrics that we believe gives a solid picture of how a supervisor performs. The metrics are listed in the table below. In this paper, we put two supervisors to the test using these metrics: a simple baseline and an algorithm called OpenMax.

Metric   Description
AUROC    Area under the Receiver Operating Characteristic (ROC) curve. Captures how the true/false positive ratios change with varying thresholds.
AUPRC    Area under the Precision-Recall curve. Indicates the precision variation over increasing true positive rate.
TPR05    True positive rate at 5% false positive rate. Determines the rise of the ROC curve, with minimal FPR.
P95      Precision at 95% recall. Presents the accuracy when removing the majority of outliers.
FNR95    False negative rate at 95% false positive rate. Shows how many anomalies are missed by the supervisor.
CBPL     Coverage breakpoint at performance level. Measures how restrictive the supervisor has to be to regain similar accuracy as during training.
CBFAD    Coverage breakpoint at full anomaly detection. Gives the coverage level where the supervisor has caught all outliers.
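
A few of the metrics in the table can be sketched in plain Python. This is a minimal illustration, not the evaluation code used in the paper. It assumes every sample has an anomaly score where higher means "more likely an outlier", treats outliers as the positive class, and measures coverage over the combined set of inliers and outliers. The function names and the toy scores are made up for the example.

```python
def roc_points(scores, is_outlier):
    """Return (FPR, TPR) points obtained by sweeping the rejection threshold.

    Outliers are the positive class; a sample is flagged when its anomaly
    score is at or above the threshold.
    """
    pairs = sorted(zip(scores, is_outlier), key=lambda p: -p[0])
    n_pos = sum(1 for o in is_outlier if o)
    n_neg = len(is_outlier) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Samples with identical scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr=0.05):
    """Highest TPR reachable without exceeding max_fpr (TPR05 by default)."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


def cbfad(scores, is_outlier):
    """Coverage when the threshold is just strict enough to reject all outliers.

    Assumption: coverage is the fraction of all samples (inliers and
    outliers) that the supervisor still accepts.
    """
    lowest_outlier = min(s for s, o in zip(scores, is_outlier) if o)
    accepted = sum(1 for s in scores if s < lowest_outlier)
    return accepted / len(scores)


# Toy data: four inliers (0) and four outliers (1), invented for illustration.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

toy_points = roc_points(toy_scores, toy_labels)
print("AUROC:", auroc(toy_points))        # 0.9375
print("TPR05:", tpr_at_fpr(toy_points))   # 0.75
print("CBFAD:", cbfad(toy_scores, toy_labels))  # 0.375
```

On the toy data, one inlier scores higher than one outlier, which is why the AUROC falls short of 1 and why rejecting every outlier costs the supervisor more than half of its coverage.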

We evaluate the baseline and OpenMax on two established Deep Neural Networks (DNNs) for computer vision: VGG16 and DenseNet. We train the DNNs on CIFAR-10 and use the validation set of Tiny ImageNet as outliers. Each model is trained for 300 epochs with a batch size of 64 – more than enough to let each model converge. We mainly study the research question:

  • How does the outlier detection performance change with improved training of the DNNs under supervision?
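
To make the role of a supervisor concrete, here is a minimal sketch of a threshold-based supervisor wrapped around a classifier's output. It assumes the simple baseline scores an input by its maximum softmax probability, which is a common choice for such baselines but is not spelled out in this post. The function names and the threshold value are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(logits):
    """Assumed baseline score: 1 minus the maximum softmax probability.

    A confident prediction gives a score near 0, a flat one a score
    near 1 - 1/num_classes.
    """
    return 1.0 - max(softmax(logits))


def supervise(logits, threshold):
    """Return the predicted class index, or None if the input is rejected."""
    if anomaly_score(logits) > threshold:
        return None
    probs = softmax(logits)
    return probs.index(max(probs))


# A confident prediction is accepted, a flat one is rejected.
print(supervise([10.0, 0.0, 0.0], threshold=0.5))  # 0
print(supervise([1.0, 1.0, 1.0], threshold=0.5))   # None
```

Evaluating the supervisor after each training epoch then amounts to collecting these anomaly scores for the inlier and outlier sets at that epoch and computing the metrics on them.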

Better model, better supervisor

We found that training the model until it provides a high classification accuracy – in our case this meant training for almost 300 epochs – resulted in better out-of-distribution detection by the supervisor. Our results suggest a linear relationship between the model accuracy and the ability of the supervisor to detect out-of-distribution samples. We consider this the main finding of the study. The figure at the top of this page shows how the AUROC for both OpenMax and the baseline (and some additional variations, full details in the paper) increases as the supervised classification model improves.
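
One way to check for such a linear relationship is to fit a least-squares line to per-epoch pairs of model accuracy and supervisor AUROC and look at the correlation coefficient. The sketch below shows the calculation only. The numbers are placeholders invented for the example and are not results from the paper, and the helper name is hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Placeholder values only: one (accuracy, AUROC) pair per checkpoint.
accuracy = [0.60, 0.70, 0.80, 0.90]
auroc_per_epoch = [0.70, 0.75, 0.80, 0.85]

slope, intercept, r = linear_fit(accuracy, auroc_per_epoch)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

A correlation coefficient close to 1 would support the linear reading, while a clearly curved scatter plot would call for a different model.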

Secondary findings include that OpenMax outperforms the baseline. Also, there is a notable trade-off between achieving a high AUROC and a low FNR95 – how to balance the importance of the metrics depends on the application. Such trade-offs resemble the precision-recall trade-off I’ve worked with when developing tools based on information retrieval, such as ImpRec for change impact analysis.

Implications for Research

  • A first structured evaluation of safety supervisors for out-of-distribution detection using previously proposed metrics.
  • Our results suggest a linear relationship between model accuracy and supervisor performance.

Implications for Practice

  • Supervisors do a better job if the models are already good.
  • There will be inevitable trade-offs between the proposed metrics – developers have to optimize depending on the application.

Jens Henriksson, Christian Berger, Markus Borg, Lars Tornberg, Sankar Raman Sathyamoorthy, and Cristofer Englund. Performance Analysis of Out-of-Distribution Detection on Various Trained Neural Networks. In Proc. of the 45th Euromicro Conference on Software Engineering and Advanced Applications, 2019.


Abstract

Several areas have been improved with Deep Learning during the past years. For non-safety-related products, adoption of AI and ML is not an issue, whereas in safety-critical applications, robustness of such approaches is still an issue. A common challenge for Deep Neural Networks (DNNs) occurs when they are exposed to previously unseen out-of-distribution samples, for which DNNs can yield high-confidence predictions despite having no prior knowledge of the input. In this paper we analyse two supervisors on two well-known DNNs with varied training setups and find that the outlier detection performance improves with the quality of the training procedure. We analyse the performance of the supervisor after each epoch during the training cycle, to investigate supervisor performance as the accuracy converges. Understanding the relationship between training results and supervisor performance is valuable for improving the robustness of the model, and it indicates where more work has to be done to create generalized models for safety-critical applications.