A General Taylor Framework for Unifying and Revisiting Attribution Methods

Abstract

Attribution methods provide an insight into the decision-making process of machine learning models, especially deep neural networks, by assigning contribution scores to each individual feature. However, the attribution problem has not been well-defined, which lacks a unified guideline to the contribution assignment process. Furthermore, existing attribution methods often built upon various empirical intuitions and heuristics. There still lacks a general theoretical framework that not only can offer a good description of the attribution problem, but also can be applied to unifying and revisiting existing attribution methods. To bridge the gap, in this paper, we propose a Taylor attribution framework, which models the attribution problem as how to decide individual payoffs in a coalition. Then, we reformulate fourteen mainstream attribution methods into the Taylor framework and analyze these attribution methods in terms of rationale, fidelity, and limitation in the framework. Moreover, we establish three principles for a good attribution in the Taylor attribution framework, i.e., low approximation error, correct Taylor contribution assignment, and unbiased baseline selection. Finally, we empirically validate the Taylor reformulations and reveal a positive correlation between the attribution performance and the number of principles followed by the attribution method via benchmarking on real-world datasets.

Publication
arxiv