Interpreting machine learning models

Link to original paper: Here
Authors: Scott M. Lundberg, Su-In Lee
Publication year: 2017

The idea in 2 lines

The authors propose using a method created by Lloyd Shapley, winner of the 2012 Nobel Prize in economics, to interpret a machine learning model’s decisions by measuring each feature’s importance.

Some context before reading the paper

The trust barrier

The improvement in machine learning model performance over the past decade has broadened the range of everyday situations where such models can be used. We often think of autonomous cars, voice assistants and Netflix recommendations as the most common applications, but many others have emerged. Machine learning is now helping research (e.g. protein folding), and is even starting to bring actual value to “traditional” companies (examples here).

As its use keeps growing, machine learning has hit a new obstacle that may be even harder to overcome than performance or computing power issues: trust. After several high-profile failures (racist cameras, sexist recruitment software, extremist bots…), this weakness of machine learning models is now well known.

Whole fields of machine learning research are now dedicated to making ML models more trustworthy. For instance:

  • Adversarial learning: cyber attacks directed at machine learning models, and how to prevent them

The adversarial approach is not limited to tricking and protecting models: for instance, it also inspired the whole Generative Adversarial Networks (GAN) architecture, which has now enabled many new interesting applications (e.g. NVIDIA’s famous model to generate faces that do not exist, which was later transposed to chemicals, cats, horses - with mixed results - and art).

  • Interpretability: understanding the model’s reasoning and the factors that influenced the final decisions.

The article you are about to read stems from this second research field. It considers interpretability through the lens of feature importance, i.e. how we can find out which aspects of the input data most affected the final decision.

The taxonomy to classify interpretability methods is vast and constantly changing, but the two main characteristics of the SHAP method are:

  • a black box approach: it is fully model-agnostic, i.e. no information about the model is required (it could work with neural networks, random forests, and pretty much anything else)
  • local interpretability: by default, it provides information about a specific decision of the model
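To make these two properties concrete, here is a minimal sketch of an exact Shapley value computation for a single prediction. It is a brute-force illustration, not the approximation proposed in the paper: `toy_model`, `x` and `baseline` are hypothetical, and replacing "absent" features with a baseline value is a simplifying assumption. Note that the function only ever calls `predict` (black box) and explains one input at a time (local).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction, using only calls to `predict`."""
    n = len(x)
    features = range(n)

    def value(subset):
        # Features in `subset` keep their real value; the others are replaced
        # by the baseline value (an assumption made for this sketch).
        z = [x[i] if i in subset else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for a coalition of `size` features that excludes i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model: one additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline. The loop visits every subset of features, so the cost grows exponentially with the number of features, which is why practical methods rely on approximations.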

The limits of model interpretation

Since most interpretability methods rely on simple approximations of complex models, the explanations they generate depend heavily on the assumptions they make.

For instance, this article shows how different approaches will result in different explanations.
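The same effect can be seen within a single method. The sketch below, which assumes a hypothetical two-feature `model` and two arbitrary reference inputs, computes exact Shapley values twice and only changes the reference point used to represent a "missing" feature. It is an illustration of the sensitivity to assumptions, not a reproduction of any result from the paper.

```python
def two_feature_shapley(predict, x, baseline):
    """Exact Shapley values for a 2-feature model, relative to a baseline input."""
    (x1, x2), (b1, b2) = x, baseline
    # Average the marginal contribution of each feature over both orderings.
    phi1 = 0.5 * ((predict(x1, b2) - predict(b1, b2))
                  + (predict(x1, x2) - predict(b1, x2)))
    phi2 = 0.5 * ((predict(b1, x2) - predict(b1, b2))
                  + (predict(x1, x2) - predict(x1, b2)))
    return phi1, phi2


def model(a, b):
    # Hypothetical model: a pure interaction between the two features.
    return a * b


x = (2.0, 3.0)
zero_ref = two_feature_shapley(model, x, (0.0, 0.0))
ones_ref = two_feature_shapley(model, x, (1.0, 1.0))
print(zero_ref, ones_ref)
```

With the all-zeros reference the two features receive equal credit, while with the all-ones reference the second feature receives more. The model and the input are identical in both cases; only the assumption about what "missing" means has changed.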

Moreover, feature-based approaches have one major limitation: they assume that the features actually have meaning. This assumption can fall short, e.g. in computer vision (1 feature = 1 pixel), or when destructive transformations (such as PCA or t-SNE) have been applied to improve model performance.

Ready to learn more? Dive in!