This paper is based on discussions and experimental work in the SMILE project. In SMILE, we have proposed a safety cage architecture around deep learning networks, also referred to as a safety supervisor. To evaluate different supervisors, we need a good framework with test input and well-defined metrics. This work, part of Jens' PhD studies, presents such a framework.
Supervisors that reject novel input
Deep learning has revolutionized many classification challenges in recent years; computer vision is probably the most well-known example. However, in any domain, classifiers will sometimes make mistakes. If the mistakes happen when optimizing conversion rates in e-commerce, this is perhaps not so bad. If they happen in a perception system fed by the dashcam of a car, they can be fatal. Based on workshops with industry, we propose complementing vision systems with a supervisor that detects when camera input does not resemble the training data. If the supervisor rejects the input, the classifier shall not try to make predictions.
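As a rough illustration of this gating idea, the sketch below wraps a classifier behind a supervisor. It is a minimal, hypothetical example assuming the supervisor returns an anomaly score in [0, 1]; the names `guarded_predict`, `supervisor.score`, and `classifier.predict` are placeholders, not the SMILE implementation.

```python
# Hypothetical sketch: a supervisor acting as a safety cage around a classifier.
# Assumes the supervisor returns an anomaly score in [0, 1], higher = more novel input.

def guarded_predict(image, supervisor, classifier, threshold=0.5):
    score = supervisor.score(image)      # anomaly score for the incoming frame
    if score > threshold:
        return None                      # reject: input too dissimilar to the training data
    return classifier.predict(image)     # input accepted, let the DNN predict
```

The threshold controls how aggressively input is rejected, which is exactly the trade-off captured by the risk-coverage curve discussed below.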
Datasets with outliers
Supervisors can be implemented in different ways, and we have tried a handful of approaches in the SMILE project. To be able to recommend any approach to our industry partners, we need reliable evaluations. To this end, we need datasets polluted with anomalies and meaningful evaluation metrics. Starting with the data, we propose four datasets of increasing complexity.
Evaluation plots and metrics for supervisors
Our framework only assumes that the output from the supervisor is a single value between 0 and 1. Apart from that, the supervisors evaluated using our framework could be implemented in any way. First, we propose characterizing supervisors using four plots:
- ROC curve to determine the discriminative capability of the supervisor with respect to positive/negative examples.
- Precision-recall curve to better illustrate performance for imbalanced datasets.
- Anomaly score distributions of the supervisor for the in- and outlier data (see the example below).
- Risk-coverage curve, showing the trade-off between prediction failure rate (risk) and the fraction of covered data samples, i.e., how the classifier's performance varies as the supervisor rejects input more or less aggressively (a computation sketch of the four plots follows below).
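As a concrete but hedged sketch, all four plots can be produced from nothing more than the supervisor's anomaly scores, the outlier labels, and the classifier's per-sample correctness. The snippet below uses synthetic stand-in data (all array contents are illustrative, not results from the paper) together with scikit-learn and Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

# Synthetic stand-in data (illustrative only): 900 inliers and 100 outliers.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.beta(2, 5, 900), rng.beta(5, 2, 100)])        # supervisor output in [0, 1]
is_outlier = np.concatenate([np.zeros(900, int), np.ones(100, int)])       # 1 = outlier
correct = np.concatenate([rng.random(900) < 0.95, rng.random(100) < 0.3])  # classifier correct?

# 1) ROC curve: how well the score separates inliers from outliers.
fpr, tpr, _ = roc_curve(is_outlier, scores)

# 2) Precision-recall curve: more informative when outliers are rare.
prec, rec, _ = precision_recall_curve(is_outlier, scores)

# 3) + 4) Risk-coverage curve: accept the least anomalous samples first and
#    track the classifier's failure rate (risk) on the covered subset.
order = np.argsort(scores)                              # most "normal" samples first
coverage = np.arange(1, len(scores) + 1) / len(scores)
risk = np.cumsum(~correct[order]) / np.arange(1, len(scores) + 1)

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].plot(fpr, tpr); ax[0, 0].set_title("ROC")
ax[0, 1].plot(rec, prec); ax[0, 1].set_title("Precision-recall")
ax[1, 0].hist(scores[is_outlier == 0], bins=30, alpha=0.6, label="inliers")
ax[1, 0].hist(scores[is_outlier == 1], bins=30, alpha=0.6, label="outliers")
ax[1, 0].set_title("Score distributions"); ax[1, 0].legend()
ax[1, 1].plot(coverage, risk); ax[1, 1].set_title("Risk-coverage")
plt.tight_layout(); plt.show()
```

The risk-coverage curve is obtained by sorting samples by increasing anomaly score and recording the failure rate on the accepted subset as coverage grows.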
To enable comparison between different supervisors, a set of simple metrics is helpful. We propose seven metrics that provide insights into the plots above; a small computation sketch follows the metric list. The metrics capture key characteristics of the curves and are the ones we plan to report in future work on safety supervisors.
- Area under ROC curve
- Area under precision-recall curve
- True positive rate at 5% false positive rate (TPR05)
- Precision at 95% recall (P95)
- False negative rate at 95% false positive rate (FNR95)
- Coverage breakpoint at performance level (CBPL)
- Coverage breakpoint at full anomaly detection (CBFAD)
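Below is a hedged sketch of how the area- and threshold-based metrics could be computed from the same scores and labels. The function name is illustrative and the formulas follow the definitions as listed above; CBPL and CBFAD are left out since they are read off the risk-coverage curve using the classifier-performance definitions detailed in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve, precision_recall_curve

def supervisor_metrics(is_outlier, scores):
    """Illustrative scalar metrics for a supervisor (definitions as listed above).

    `is_outlier` holds 1 for outliers, `scores` are anomaly scores in [0, 1].
    CBPL and CBFAD are read from the risk-coverage curve and are not sketched here.
    """
    fpr, tpr, _ = roc_curve(is_outlier, scores)
    prec, rec, _ = precision_recall_curve(is_outlier, scores)

    return {
        "AUROC": roc_auc_score(is_outlier, scores),
        "AUPRC": average_precision_score(is_outlier, scores),
        # True positive rate at the threshold where 5% of inliers are falsely rejected.
        "TPR05": np.interp(0.05, fpr, tpr),
        # Precision at the threshold where 95% of the outliers are detected
        # (recall is returned in decreasing order, so reverse it for interpolation).
        "P95": np.interp(0.95, rec[::-1], prec[::-1]),
        # False negative rate at 95% false positive rate, as stated in the list above.
        "FNR95": 1.0 - np.interp(0.95, fpr, tpr),
    }
```

Called as `supervisor_metrics(is_outlier, scores)` with the arrays from the previous sketch, it returns a dictionary with the five area- and threshold-based metrics.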
Implications for research
- A proposed framework for evaluating supervisors that reject outliers, i.e., input the classifier has not been trained for.
- Four plots and seven metrics that support insights when comparing supervisors.
- Four datasets with well-defined outliers to assess supervisors in future work.
Implications for practice
- The framework can help developers of perception systems find an appropriate supervisor for their context.
Citation details
Jens Henriksson, Christian Berger, Markus Borg, Lars Tornberg, Cristofer Englund, Sankar Raman Sathyamoorthy, and Stig Ursing. Towards Structured Evaluation of Deep Neural Network Supervisors, In Proc. of the 1st IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 27-34, 2019. (link, preprint)
Abstract
Deep Neural Networks (DNN) have improved the quality of several non-safety related products in the past years. However, before DNNs should be deployed to safety-critical applications, their robustness needs to be systematically analyzed. A common challenge for DNNs occurs when input is dissimilar to the training set, which might lead to high confidence predictions without proper knowledge of the input. Several previous studies have proposed to complement DNNs with a supervisor that detects when inputs are outside the scope of the network. Most of these supervisors, however, are developed and tested for a selected scenario using a specific performance metric. In this work, we emphasize the need to assess and compare the performance of supervisors in a structured way. We present a framework constituted by four datasets organized in six test cases combined with seven evaluation metrics. The test cases provide varying complexity and include data from publicly available sources as well as a novel dataset consisting of images from simulated driving scenarios. The latter we plan to make publicly available. Our framework can be used to support DNN supervisor evaluation, which in turn could be used to motivate development, validation, and deployment of DNNs in safety-critical applications.