V&V for ML in the literature

Autonomous vehicles are on their way to public roads. Relying on deep learning for environmental perception, they are supposed to be sufficiently trained to generalize to new traffic situations. We know how to test source code, but how can we test a component that instead has been trained? This paper reports findings from a literature review and a workshop series with industry partners.

This paper is the main outcome from the first SMILE project: Safety analysis and verification/validation of MachIne LEarning based systems – a 6-month pilot study funded by Vinnova’s FFI program. The project was coordinated by RISE Viktoria and involved automotive companies working on active safety development: Volvo Cars, Volvo Trucks, and QRTech. A first version of this paper, reporting experience from SMILE workshops, was rejected by IEEE Transactions on Intelligent Transportation Systems… and the critique was fair. We made big changes for the resubmission to the newly started open access Journal of Automotive Software Engineering. I stepped up as first author and merged the previous manuscript with a systematic literature review and a survey conducted as part of an MSc thesis project at BTH. The manuscript was substantially improved and our paper ended up as the first publication in the new journal!

Deep learning needs safety

Autonomous driving is obviously one of the really hot topics in research nowadays. Several contributions have made self-driving cars realistic, including computer vision using deep learning, i.e., training classifiers to enable awareness of elements in the surrounding traffic. But using machine learning (ML) to train models is very different from how current automotive safety standards prescribe systems engineering. Human engineers no longer explicitly describe all system behavior in source code; instead, ML relies on enormous amounts of (manually annotated) historical data. How can we build a safety case, i.e., a structured argumentation that a system is sufficiently safe in a given operational environment, when we have released control by replacing coding with training? We must have an answer to this before letting vehicles out on public roads. This is what the SMILE project was about.

Exploratory work with a little bit of everything

This paper wraps up most of the activities we completed in the SMILE project. The figure below shows how we combined different approaches to learn about challenges and possible solutions. We started by reading relevant papers to get a first understanding of the existing body of research (cf. A). A subset of the papers was used to seed a systematic literature review based on snowballing (cf. C) as described by Wohlin. Both of our literature studies were validated by requesting feedback from industry practitioners.

The “ad hoc” search was validated through a series of workshops (cf. B) conducted with representatives from the Swedish automotive industry – combined with a semi-structured elicitation of challenges related to V&V of ML. The systematic search was completed later and was instead validated by a short survey with industry practitioners distributed via LinkedIn.

From the literature: Challenges in V&V of ML

The systematic literature review resulted in a set of 64 papers related to V&V of ML. For a paper to be included, we required the paper to:

describe engineering of an ML-based system
in the context of autonomous cyber-physical systems, and
address V&V or safety analysis

When analyzing the 64 papers, we separated challenges and solutions. Our first idea was to map the extracted data to schemes presented in previous work, but we didn’t find that valuable – instead we inductively created new schemes. The table below shows the seven identified categories of challenges.

State-space explosion	Challenges related to the very large size of the input space.
Robustness	Issues related to operation in the presence of invalid inputs or stressful environmental conditions.
Systems engineering	Challenges related to integration or co-engineering of ML-based and conventional components.
Transparency	Challenges originating in the black-box nature of the ML system.
Requirements specification	Problems related to specifying expectations on the learning behavior.
Test specification	Issues related to designing test cases for ML-based systems, e.g., nondeterministic output.
Adversarial attacks	Threats related to antagonistic attacks on ML-based systems, e.g., adversarial examples.

The figure below shows how the survey respondents weighted the importance of the seven categories – by providing answers between 1 and 5. Robustness stands out as the respondents’ most important challenge. However, all seven categories are perceived as important. If any challenge is less pressing to the respondents, it’s adversarial attacks.

From the literature: Solutions for V&V of ML

We did the same type of inductive creation of categories for solution proposals, resulting in five overarching categories:

Formal methods	Approaches to mathematically prove that some specification holds.
Control theory	Verification of learning behavior based on automatic control and self-adaptive systems.
Probabilistic methods	Statistical approaches such as uncertainty calculation, Bayesian analysis, and confidence intervals.
Test case design	Approaches to create effective test cases, e.g., using genetic algorithms or procedural generation.
Process guidelines	Guidelines supporting work processes, e.g., covering training data collection or testing strategies.
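
To make the “Probabilistic methods” category a bit more concrete, here is a minimal sketch of one such statistical approach: a confidence interval for the failure rate of an ML component observed over a number of test cases. The Wilson score interval is my choice for illustration and is not taken from the reviewed papers; the function name and parameters are hypothetical.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple:
    """Return a (lower, upper) Wilson score interval for a failure rate.

    Assumption: test cases are independent and identically distributed.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need trials > 0 and 0 <= failures <= trials")
    p = failures / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point rounding at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical numbers: 2 misclassifications in 1000 test cases.
lower, upper = wilson_interval(failures=2, trials=1000)
```

Note that even with zero observed failures the upper bound stays above zero, which is one way to see why testing alone rarely suffices to argue for very low failure rates.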

Again we requested the respondents of the industrial survey to provide feedback. We asked them to assess how promising they perceived solution proposals from the five categories, as shown below. The results indicate that engineers in industry primarily see simulated test cases as a promising direction for future work.

Solution proposals to tackle V&V for ML.

From the literature: Mapping solutions and challenges

The figure below shows the distribution of challenges and solutions in the literature. We find that challenges related to transparency and state-space explosion have been discussed the most in previous work. The distribution of work across the solution categories is more even, but we notice that formal methods, control theory, and probabilistic methods have been proposed more frequently.

Mapping solution proposals and targeted challenges.

The figure also shows a mapping between solution proposal categories and challenge categories. Some of the papers in the literature propose a solution to address challenges belonging to a specific category. The width of each connection in the figure illustrates the number of such papers. None of the proposed solutions address challenges related to the categories “Requirements specification” or “Systems engineering” – thus indicating a research gap. Furthermore, “Transparency” is the challenge category that has been addressed the most in the papers, followed by “State-space explosion”.

Related work in the aerospace domain

One subset of papers identified in the literature review deserves special attention. Almost 20 years ago, the aerospace industry, at least in the defense and space sectors, was actively publishing papers on using machine learning for advanced flight controllers. Just as the automotive industry today wonders how to argue that a trained deep neural network is safe, NASA and the US Air Force discussed how to make sure ML-based flight controllers were safe – this was difficult long before deep learning became trendy.

We found two books that capture most of the ideas from this research period. Taylor edited a book in 2006 that collected experiences for V&V of neural networks technology in a project sponsored by the NASA Goddard Space Flight Center. Taylor concluded that the V&V techniques available at the time must evolve to tackle the novelty of ML and reports five areas that need to be augmented:

Configuration management must track all additional design elements, for example, the training data, the network architecture, and the learning algorithms.
Requirements need to specify novel adaptive behavior, including control requirements (how to acquire and act on knowledge) and knowledge requirements (what knowledge should be acquired).
Design specifications must capture design choices related to novel design elements such as training data, network architecture, and activation functions.
Development lifecycles for neural networks are highly iterative and last until some quantitative goal has been reached. Traditional waterfall software development is not feasible, and V&V must be an integral part rather than an add-on.
Testing needs to evolve to address novel requirements. Structure testing should determine whether the network architecture is better at learning according to the control requirements than alternative architectures. Knowledge testing should verify that the neural network has learned what was specified in the knowledge requirements.
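
As an illustration of the configuration management point above, the following is a minimal sketch of how the additional design elements could be tracked together. It is not taken from Taylor’s book; the class name, field names, and hashing scheme are all assumptions made for this example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable


def hash_training_data(samples: Iterable[bytes]) -> str:
    """Hash the training samples in order (hypothetical helper)."""
    digest = hashlib.sha256()
    for sample in samples:
        # Length prefix keeps sample boundaries unambiguous.
        digest.update(len(sample).to_bytes(8, "big"))
        digest.update(sample)
    return digest.hexdigest()


@dataclass(frozen=True)
class ModelConfiguration:
    """Hypothetical record of the design elements that define a trained model."""
    training_data_sha256: str
    network_architecture: str
    learning_algorithm: str

    def fingerprint(self) -> str:
        # A stable identifier that changes if any tracked element changes.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


config = ModelConfiguration(
    training_data_sha256=hash_training_data([b"frame-001", b"frame-002"]),
    network_architecture="cnn-5-layers",  # placeholder description
    learning_algorithm="sgd-lr-0.01",     # placeholder description
)
```

The fingerprint can be stored next to test results so that a V&V verdict is always tied to one specific combination of data, architecture, and learning algorithm.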

The second book that collected experiences on V&V of safety-critical neural networks, also funded by NASA, was edited by Schumann and Liu in 2010. While the book primarily surveys the use of neural networks in high-assurance systems, parts of the discussion are focused on V&V – and the overall conclusion is that V&V must evolve. The authors propose organizing solution proposals into seven categories, i.e., approaches that 1) separate ML algorithms from conventional source code, 2) analyze the network architecture, 3) consider ML as function approximators, 4) tackle the opaqueness of ML, 5) assess the characteristics of the learning algorithm, 6) analyze the selection and quality of training data, and 7) provide means for online monitoring of ML. We note that the five categories we inductively identified are quite different.

Challenge elicitation from industry workshops

Apart from the literature review, we conducted six workshops with industry partners. We explored key questions that must be answered to enable engineering of safety-critical automotive systems with deep learning components. Three subareas emerged during the workshops.

First, the concept of robustness was discussed at each workshop. The issues of false positives and false negatives are obvious. We argue that future research is needed on how to specify and verify acceptable levels of ML robustness. During the workshops, robustness was often discussed as an elusive quality attribute interpreted as “something you can trust” – rather than the IEEE definition: “the degree to which a component can function correctly in the presence of invalid inputs or stressful environmental conditions”.

Second, the interplay between deep learning components and conventional software is unexplored. ML components will be part of an automotive system that also consists of conventional hardware and software components. How do we best integrate deep learning? Some workshop participants advocated a “safety cage” concept, i.e., encapsulating deep learning by a supervisor that continuously monitors the input to the ML component. The envisioned safety cage should perform novelty detection and alert when input does not belong within the training region. The idea is similar to Varshney’s concept of safe fail for ML. After the workshops, we continued working on this topic.
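
The safety cage idea can be sketched in a few lines. This is a toy illustration only, not the architecture discussed in the workshops: the nearest-neighbor distance check stands in for a real novelty detector, and the class name, threshold, and model are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class SafetyCage:
    """Hypothetical supervisor that monitors the input to an ML component."""
    training_inputs: Sequence[Sequence[float]]
    threshold: float  # assumed maximum accepted distance to the training data

    def is_novel(self, x: Sequence[float]) -> bool:
        # Novelty detection: distance to the nearest training sample.
        nearest = min(math.dist(x, t) for t in self.training_inputs)
        return nearest > self.threshold

    def guarded_predict(
        self, model: Callable[[Sequence[float]], str], x: Sequence[float]
    ) -> Optional[str]:
        # Return None as an alert instead of trusting the model on novel input.
        if self.is_novel(x):
            return None
        return model(x)


cage = SafetyCage(training_inputs=[(0.0, 0.0), (1.0, 1.0)], threshold=0.5)


def toy_model(x: Sequence[float]) -> str:
    return "pedestrian"  # stand-in for a trained classifier


inside = cage.guarded_predict(toy_model, (0.1, 0.1))   # close to training data
outside = cage.guarded_predict(toy_model, (5.0, 5.0))  # far from training data
```

In a real system, the alert would trigger a fallback behavior in the conventional part of the system rather than just returning None.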

Third, V&V of deep learning components is a pressing challenge. V&V is a cornerstone in safety certification, but it is unclear how to develop a safety case around applications that rely on deep learning. The corresponding safety standards are still under development – ISO 26262 is, as we know, not very helpful. Ideas from the workshop participants included: 1) formal notation for requirements related to functional safety with deep learning, 2) a tool-chain and framework tailored to lifecycle management of systems with ML components, and 3) methods for test case generation for deep learning – simulated test cases were a major discussion topic.

Implications for Research

Most ML research showcases applications, while development on ML V&V is lagging behind.
Industry mainly stresses robustness challenges, whereas academic research most often addresses state-space explosion and the lack of ML transparency.
Industry practitioners perceive simulated test cases as very promising – more so than formal methods.

Implications for Practice

The gap between ML practice and ISO 26262 calls for novel standards rather than incremental updates.
Cross-domain knowledge transfer from aerospace V&V engineers to the automotive domain appears promising – neural networks have been used in flight controllers.
Systems-based safety approaches are encouraged by practitioners, including safety cages and simulated test cases.

Markus Borg, Cristofer Englund, Krzysztof Wnuk, Boris Duran, Christoffer Levandowski, Shenjian Gao, Yanwen Tan, Henrik Kaijser, Henrik Lönn, and Jonas Törnqvist. Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry, Journal of Automotive Software Engineering, 1(1), pp. 1-19, 2019. (open access)

Abstract

Deep neural networks (DNNs) will emerge as a cornerstone in automotive software engineering. However, developing systems with DNNs introduces novel challenges for safety assessments. This paper reviews the state-of-the-art in verification and validation of safety-critical systems that rely on machine learning. Furthermore, we report from a workshop series on DNNs for perception with automotive experts in Sweden, confirming that ISO 26262 largely contravenes the nature of DNNs. We recommend aerospace-to-automotive knowledge transfer and systems-based safety approaches, for example, safety cage architectures and simulated system test cases.
