Towards Structured Evaluation of Deep Neural Network Supervisors

This paper is based on discussions and experimental work in the SMILE project. In SMILE, we have proposed a safety cage architecture around deep learning networks – also known as a safety supervisor. To evaluate different supervisors, we need a good framework with test input and well-defined metrics. This work, part of Jens’ PhD studies, presents such a framework.

Supervisors that reject novel input

Deep learning has revolutionized many classification challenges in recent years – computer vision is probably the most well-known example. However, in any domain, the classifiers will sometimes make mistakes. If the mistakes happen when optimizing conversion rates in e-commerce, this is perhaps not so bad. If they happen in a perception system based on input from the dashcam of a car, they can be fatal. Based on workshops with industry, we propose complementing vision systems with a supervisor that detects when camera input does not resemble the training data. If the supervisor rejects the input, the classifier shall not try to make predictions.

Datasets with outliers

Supervisors can be implemented in different ways, and we have tried a handful of approaches in the SMILE project. To be able to recommend any approaches to our industry partners, we need reliable evaluations. To this end, we need datasets polluted with anomalies and meaningful evaluation metrics. Starting with the data, we propose four datasets of increasing complexity.

  • MNIST handwritten digits polluted by Omniglot (28×28 pixels)
  • CIFAR-10 polluted by Tiny ImageNet (32×32 pixels)
  • Highway driving in the simulator Pro-SiVIC polluted by fog and urban driving (752×480 pixels)
  • DR(eye)VE images in normal driving conditions polluted by rain and night driving (1920×1080 pixels)

Evaluation plots and metrics for supervisors

Our framework only assumes that the output from the supervisor is a single value between 0 and 1. Apart from that, the supervisors evaluated using our framework could be implemented in any way. First, we propose characterizing supervisors using four plots:

  • ROC curve to determine the discriminative capability of the supervisor with respect to positive/negative examples.
  • Precision-recall curve to better illustrate performance for imbalanced datasets.
  • Anomaly score distributions of the supervisor algorithm for the in- and outlier data (see example below).
  • Risk-coverage curve, showing the trade-off between prediction failure rate (risk) and the fraction of covered data samples, i.e., how classifier performance varies with increasingly aggressive rejection levels of the supervisor.
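The risk-coverage trade-off can be sketched in a few lines: for each retention level, keep the samples the supervisor considers least anomalous and measure the classifier's error rate on them. The function and data below are illustrative, not the paper's implementation.

```python
def risk_coverage(anomaly_scores, classifier_correct):
    """Return (coverage, risk) pairs from low to full coverage.

    anomaly_scores: supervisor outputs in [0, 1], higher = more anomalous.
    classifier_correct: 1 if the classifier was right on that sample, else 0.
    """
    n = len(anomaly_scores)
    # Accept samples in order of increasing anomaly score.
    order = sorted(range(n), key=lambda i: anomaly_scores[i])
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 1 - classifier_correct[i]
        points.append((k / n, errors / k))  # (coverage, risk at this coverage)
    return points

scores  = [0.1, 0.2, 0.4, 0.7, 0.9]  # toy supervisor scores
correct = [1,   1,   1,   0,   0]    # classifier fails on the anomalous inputs
curve = risk_coverage(scores, correct)
# Full coverage carries the full error rate; rejecting the two most
# anomalous samples drives the risk to zero in this toy example.
```

A supervisor with good discriminative capability yields a curve where risk stays low until coverage approaches 1.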

To enable comparison between different supervisors, a set of simple metrics is helpful. We propose seven metrics that provide insights into the plots above. The metrics capture different key characteristics of the curves – the ones we plan to report in future work on safety supervisors.

  • Area under ROC curve
  • Area under precision-recall curve
  • True positive rate at 5% false positive rate (TPR05)
  • Precision at 95% recall (P95)
  • False negative rate at 95% false positive rate (FNR95)
  • Coverage breakpoint at performance level (CBPL)
  • Coverage breakpoint at full anomaly detection (CBFAD)
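As a minimal sketch of how two of these metrics fall out of the supervisor's anomaly scores, the snippet below computes the ROC points by thresholding, the area under the ROC curve via the trapezoidal rule, and TPR05. Names and data are illustrative assumptions, not the paper's code; outliers are treated as the positive class.

```python
def roc_points(scores_out, scores_in):
    """(FPR, TPR) pairs over all thresholds; outliers are positives."""
    thresholds = sorted(set(scores_out + scores_in), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_out) / len(scores_out)
        fpr = sum(s >= t for s in scores_in) / len(scores_in)
        pts.append((fpr, tpr))
    return pts

def auroc(pts):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def tpr_at_fpr(pts, max_fpr=0.05):
    """Highest TPR reachable while FPR stays at or below max_fpr (TPR05)."""
    return max(tpr for fpr, tpr in pts if fpr <= max_fpr)

outliers = [0.9, 0.8, 0.7, 0.4]   # supervisor scores on anomalous input
inliers  = [0.3, 0.2, 0.1, 0.05]  # supervisor scores on training-like input
pts = roc_points(outliers, inliers)
# Perfect separation of the toy scores gives AUROC = 1.0 and TPR05 = 1.0.
```

The remaining metrics follow the same pattern: fix one axis of a curve (recall, false positive rate, performance level) and read off the other.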

Implications for research

  • A proposed framework for evaluation of supervisors that reject outliers the classifier has not been trained for.
  • Four plots and seven metrics that support insights when comparing supervisors.
  • Four datasets with well-defined outliers to assess supervisors in future work.

Implications for practice

  • The framework can help developers of perception systems find an appropriate supervisor for their context.

Citation details

Jens Henriksson, Christian Berger, Markus Borg, Lars Tornberg, Cristofer Englund, Sankar Raman Sathyamoorthy, and Stig Ursing. Towards Structured Evaluation of Deep Neural Network Supervisors, In Proc. of the 1st IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 27-34, 2019. (link, preprint)

Abstract

Deep Neural Networks (DNN) have improved the quality of several non-safety related products in the past years. However, before DNNs should be deployed to safety-critical applications, their robustness needs to be systematically analyzed. A common challenge for DNNs occurs when input is dissimilar to the training set, which might lead to high confidence predictions despite a lack of proper knowledge of the input. Several previous studies have proposed to complement DNNs with a supervisor that detects when inputs are outside the scope of the network. Most of these supervisors, however, are developed and tested for a selected scenario using a specific performance metric. In this work, we emphasize the need to assess and compare the performance of supervisors in a structured way. We present a framework constituted by four datasets organized in six test cases combined with seven evaluation metrics. The test cases provide varying complexity and include data from publicly available sources as well as a novel dataset consisting of images from simulated driving scenarios. The latter we plan to make publicly available. Our framework can be used to support DNN supervisor evaluation, which in turn could be used to motivate development, validation, and deployment of DNNs in safety-critical applications.