Human Test Scenario Bias in Autonomous Vehicle Validation

Human Test Scenario Bias:
Machine Learning perceives the world differently than you do. That means your intuition is not necessarily a good source for test planning.


Simulation-based testing (including especially closed-course testing of real vehicles) can suffer from a test planning bias. The problem is that a test plan is often made according to human perception of the scenario being tested. For example, a test scenario might be “child crossing in a painted cross-walk.” Details of the test scenario might explore various corner cases involving child clothing, size, weather conditions, scene clutter, and so on.

Commonly test scenarios map to a human-interpretable taxonomy of the system and environmental state space. However, autonomy systems might have a different internal state space representation than humans, meaning that they classify the world in ways that differ from how humans do so. This in turn can lead to a situation in which a human believes apparent complete coverage via a testing plan has been achieved, while in reality significant aspects of the autonomy system have not been tested.

As a hypothetical example, the autonomy system might have deduced that a human’s shirt color is a predictor of whether that human will step into a street because of accidental correlations in a training data set. But the test plan might not specify shirt color as a variable, because test designers did not realize it was a relevant autonomy feature for pedestrian motion prediction. That means that a human-designed vision test for an autonomous vehicle can be an excellent starting point to make sure that obstacles, pedestrians, and other objects can be detected by the system. But, more is required.

Machine-learning based systems are known to be vulnerable to learning bias that is not recognized by human testers, at least initially. Some such failures have been quite dramatic (e.g. Grush 2015). Thus, simplistic tests such as having an average body size white male in neutral summer clothing cross a street to test pedestrian avoidance do not demonstrate a robust safety capability. Rather, such tests tend to demonstrate a minimum performance capability.

Interpreting the results of human-constructed test designs, including humans interpreting why a particular on-road scenario failed, are also subject to human test scenario bias. A credible safety argument that relies upon human-constructed tests or human interpretation of root cause analysis in claiming that test failures have been fixed should address this pitfall.

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)

Safety Argument Consideration for Public Road Testing of Autonomous Vehicles

Beth Osyk and I are presenting our paper at SAE WCX today on how to argue sufficient road test safety for self-driving car technology.

See below slideshare or follow this link for presentation slides.

Preprint of paper here:
https://users.ece.cmu.edu/~koopman/pubs/koopman19_TestingSafetyCase_SAEWCX.pdf

Abstract:
Autonomous vehicle (AV) developers test extensively on public roads, potentially putting other road users at risk. A safety case for human supervision of road testing could improve safety transparency. A credible safety case should include: (1) the supervisor must be alert and able to respond to an autonomy failure in a timely manner, (2) the supervisor must adequately manage autonomy failures, and (3) the autonomy failure profile must be compatible with effective human supervision.\

Human supervisors and autonomous test vehicles form a combined human-autonomy system, with the total rate of observed failures including the product of the autonomy failure rate and the rate of unsuccessful failure mitigation by the supervisor. A difficulty is that human ability varies in a nonlinear way with autonomy failure rates, counter-intuitively making it more difficult for a supervisor to assure safety as autonomy maturity improves. Thus, road testing safety cases must account for both the expected failures during testing and the practical effectiveness of human supervisors given that failure profile. This paper outlines a high level safety case that identifies key factors for credibly arguing the safety of an onroad AV test program. A similar approach could be used to analyze potential safety issues for high capability semiautonomous production vehicles.






Nondeterministic Behavior and Legibility in Autonomous Vehicle Validation

Nondeterministic Behavior and Legibility:
How do you know your autonomous vehicle passed the test for the right reason? What if it just got lucky, or is gaming the test?



The nature of the algorithms used by autonomy systems creates problems for modelling and testing that go beyond typical safety critical software. Some autonomy algorithms, such as randomized path planning, are inherently non-deterministic. Others can be brittle, failing dramatically with subtle variations in data, such as perception false negatives induced by adversarial attacks (Szegedy at al. 2013) or false negatives induced by slight image degradation due to haze or defocus (Pezzementi et al. 2018).


A related issue is over-fitting to the test, in which an autonomy system over-fits and learns how to beat a fixed test. By analogy, this is the pitfall of the system cheating by having memorized the correct answers. A proposed way to deal with this risk is by randomly varying aspects of test cases. 

In such a fuzzing or variable testing approach it is important to randomly vary all relevant aspects of a problem. For example, varying geometries for traffic situations can be helpful, but probably does not address potential over-fitting for perception algorithms that perform object classification.

The use of potentially non-deterministic test scenarios combined with non-deterministic system behaviors and opaque system designs means it is difficult to know whether a system has passed a test, because there is no single correct answer. Rather, there must be some algorithmic way to determine whether a particular system response is acceptable or not, making that test oracle algorithm safety critical.

Moreover, it is possible that a system has passed a particular test by chance. For example, a pedestrian might be avoided due to a properly functioning detection and avoidance algorithm. But a pedestrian might also be avoided merely because a random path planner by chance picked a path that did not intersect the pedestrian, or responded to a completely unrelated aspect of the environment that caused it to pick a fortuitously safe path. Similarly, a pedestrian might be detected in one image, but undetected in another that differs in ways that are essentially imperceptible to a human.

It is unclear if resolving this issue requires solving the difficult problem of explainable AI (Gunning 2018). As a minimum, a credible safety argument will need to address the problem of how plans to test vehicles with less than a statistically valid amount of real-world exposure data can avoid these pitfalls. It seems likely that a credible argument will also have to establish that each type of test has been passed due to safe operation of the system rather than simply by chance (Koopman & Wagner 2018).

(This is an excerpt of our SSS 2019 paper:  Koopman, P., Kane, A. & Black, J., "Credible Autonomy Safety Argumentation," Safety-Critical Systems Symposium, Bristol UK, Feb. 2019.  Read the full text here)
  • Gunning, D. (2018), Explainable Artificial Intelligence (XAI), Defense Advanced Research Projects Agency, https://www.darpa.mil/program/explainable-artificial-intelligence (accessed October 27, 2018).
  • Koopman, P. & Wagner, M., (2018) "Toward a Framework for Highly Automated Vehicle Safety Validation," SAE World Congress, 2018. SAE-2018-01-1071.
  • Pezzementi, Z., Tabor, T., Yim, S., Chang, J., Drozd, B., Guttendorf, D., Wagner, M., Koopman, P., "Putting image manipulations in context: robustness testing for safe perception," IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Aug. 2018.
  • Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fer-gus, R. (2013) "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).