Not a Number: Identifying Instance Features for Capability-Oriented Evaluation

Ryan Burnell, John Burden, Danaja Rutar, Konstantinos Voudouris, Lucy Cheke, José Hernández-Orallo

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 2827-2835. https://doi.org/10.24963/ijcai.2022/392

In AI evaluation, performance is often calculated by averaging across a range of instances. But to fully understand the capabilities of an AI system, we need to understand the factors that cause its pattern of successes and failures. In this paper, we present a new methodology for identifying and building informative instance features that provide explanatory and predictive power for analysing the behaviour of AI systems more robustly. The methodology builds on relevant features that should relate monotonically to success, and represents patterns of performance in a new type of plot known as an ‘agent characteristic grid’. We illustrate this methodology with the Animal-AI competition as a representative example of how existing competitions and benchmarks in AI can be revisited, even when evaluation data is sparse. Agents with the same average performance can show very different patterns of performance at the instance level. With this methodology, these patterns can be visualised, explained, and predicted, progressing towards capability-oriented evaluation rather than reliance on a less informative average performance score.
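The core idea of a feature that "relates monotonically with success" can be sketched in a few lines: bin instances by a candidate feature (e.g. a difficulty-like attribute), compute the success rate within each bin, and check whether those rates fall monotonically as the feature grows. The following is a minimal illustration in plain Python; the function names and the toy data are hypothetical, not the authors' implementation:

```python
def success_by_feature_bin(features, successes, n_bins=5):
    """Aggregate binary success over equal-width bins of one instance feature.

    If the feature is informative in the sense described above, the
    per-bin success rates should vary (ideally monotonically) across bins.
    """
    lo, hi = min(features), max(features)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    counts = [0] * n_bins
    wins = [0] * n_bins
    for x, s in zip(features, successes):
        b = min(int((x - lo) / width), n_bins - 1)  # top edge goes to last bin
        counts[b] += 1
        wins[b] += s
    return [w / c if c else None for w, c in zip(wins, counts)]


def is_monotone_non_increasing(rates):
    """True if the success rate never rises with the feature (difficulty-like)."""
    r = [x for x in rates if x is not None]
    return all(a >= b for a, b in zip(r, r[1:]))


# Toy data: an agent that succeeds only on easy instances (feature < 0.5)
features = [i / 99 for i in range(100)]
successes = [1 if x < 0.5 else 0 for x in features]
rates = success_by_feature_bin(features, successes)
```

A feature passing such a monotonicity check for each agent is a candidate axis of an agent characteristic grid, where per-bin success rates of several agents can then be compared side by side.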
Keywords:
Machine Learning: Evaluation
AI Ethics, Trust, Fairness: Explainability and Interpretability
AI Ethics, Trust, Fairness: Safety & Robustness
Machine Learning: Experimental Methodology