Explainable Image Classification with Reduced Overconfidence for Tissue Characterisation

Alfie Roddan¹, Chi Xu¹, Serine Ajlouni², Irini Kakaletri³, Patra Charalampaki²,⁴,
Stamatia Giannarou¹

¹The Hamlyn Centre for Robotic Surgery, Imperial College London, UK
²Medical Faculty, University Witten Herdecke, Germany
³Medical Faculty, Rheinische Friedrich Wilhelms University of Bonn, Germany
⁴Department of Neurosurgery, Cologne Medical Center, Cologne, Germany

agr21@ic.ac.uk

Abstract

The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For image classification models, pixel attribution (PA) methods are popular to infer explainability. However, overconfidence in a deep learning model’s predictions translates to overconfidence in pixel attribution. In this paper, we propose the first approach which incorporates risk estimation into a pixel attribution method for improved image classification explainability. The proposed method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps. This volume is used, for the first time, to generate a pixel-wise distribution of PA values. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate the pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet verifies that our improved explainability method outperforms the state-of-the-art.

Keywords: Explainability, Uncertainty, MC Dropout, ADCC

1 Introduction

When using a Machine Learning (ML) model during intraoperative tissue characterisation, it is vital that the surgeon trusts the output predictions of the model; otherwise the model is rendered useless [9]. For the surgeon to trust the output predictions of the model, the model must be able to explain itself [2]. One form of explainability in the image classification domain is pixel attribution (PA) mapping. PA maps aim to highlight the "most important" pixels to the classification. PA maps can be used to visually highlight whether a model is poorly extracting semantic features [32] and/or whether the model is misinformed due to spurious correlations within the data that it was trained on [16]. To efficiently process image data, these methods mainly rely on Convolutional Neural Networks (CNNs) and achieve state-of-the-art (SOTA) performance. One of the first PA methods proposed for CNNs was class activation maps (CAM) [33]. CAM uses one forward pass of the model to find the channels in the last convolutional layer that contributed most to the prediction. One of CAM’s limitations is its reliance on global average pooling (GAP) [20] after the last CNN layer, as it dramatically reduces the number of architectures that can use CAM. To improve on this, Grad-CAM [29] generalises to all CNN architectures which are differentiable from the output logit layer to the chosen CNN layer. However, Grad-CAM often lacks sharpness in object localisation, as noted and improved on in Grad-CAM++ [6] and SmoothGrad-CAM++ [23]. These extensions of Grad-CAM have good semantic feature localisation but they cannot be deployed for use in surgery [5]. Both Score-CAM [31] and Recipro-CAM [5] also generalise to all CNN architectures but are deployable. Score-CAM improves on object localisation within the visual PA map without losing the class-specific capabilities of Grad-CAM by masking out regions of the image and measuring the change in the output score.
This is similar to perturbation methods like RISE [25], LIME [27] and other perturbation techniques [32, 3]. On the other hand, Recipro-CAM focuses on the speed of PA map computation whilst maintaining comparable SOTA performance. By utilising the CNN’s receptive field, Recipro-CAM generates a number of spatial masks and then measures the effect on the output score much like Score-CAM.

Despite being fast, easy to deploy and able to localise semantic features, the above methods rely on the overconfident predictions of the underlying model. Deep learning (DL) models trained with empirical risk minimisation (ERM) are overconfident in prediction [13] and vulnerable to adversarial attacks [14]. Bayesian Neural Networks (BNNs) [22] bring improved regularisation and output uncertainty estimates. Unfortunately, the non-linearity and number of variables within NNs make Bayesian inference a computationally intensive task. For this reason, variational methods [19, 15] are used to approximate Bayesian inference. More recently, the variational method Bayes by Backprop [4] used Dropout [18] to approximate Bayesian inference. Dropout is a regularisation technique which has also been noted to improve salient feature extraction. Although Bayes by Backprop is not overconfident, it often fails to scale to the complex architectures of SOTA models. To improve on this lack of generalisability, another variational method called Monte Carlo (MC) Dropout [13] proposes that a model trained with Dropout is equivalent to a probabilistic deep Gaussian process [7, 12]. With this assumption, an estimated output distribution is computed after a number of forward passes with Dropout have been applied. This output distribution is used in practice to indicate risk (uncalibrated variation) in the model’s predictions. Using Dropout to perturb a model is a computationally cheap method of model averaging [18]. It is worth noting, though, that this method’s validity as a Bayesian inference approximation was later questioned [11]. However, this does not affect the use of this method for risk estimation. So far, model explainability and risk estimation have mostly been used separately to assess models’ suitability for surgical applications.
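As an illustration of the MC Dropout idea described above, the following minimal sketch repeats a stochastic forward pass T times and summarises the resulting output distribution. The toy linear model, the function names and the input values are hypothetical and chosen only to keep the example self-contained; they are not the CNNs used in this paper.

```python
import random
import statistics

def dropout_forward(x, weights, p, rng):
    """One stochastic forward pass of a toy linear model with inverted Dropout on its inputs."""
    keep = 1.0 - p
    total = 0.0
    for xi, wi in zip(x, weights):
        # Each input unit is kept with probability (1 - p) and rescaled by 1 / (1 - p).
        if rng.random() < keep:
            total += wi * xi / keep
    return total

def mc_dropout_predict(x, weights, p=0.2, T=10, seed=0):
    """Run T forward passes with Dropout enabled and return mean, spread and raw samples."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, p, rng) for _ in range(T)]
    return statistics.fmean(samples), statistics.pstdev(samples), samples

x = [1.0, 2.0, 3.0, 4.0]          # hypothetical input features
weights = [0.5, -0.25, 0.1, 0.3]  # hypothetical model weights
mean, spread, samples = mc_dropout_predict(x, weights, p=0.2, T=10)
print(f"mean output = {mean:.3f}, spread = {spread:.3f}")
```

The spread of the sampled outputs plays the role of the (uncalibrated) risk indicator; with p = 0 the passes become deterministic and the spread collapses to zero.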

In this paper, we propose the first approach which incorporates risk estimation into a PA method. A classification model is trained with Dropout and a PA method is used to generate a PA map. At test time, the classification model is employed with Dropout enabled. In this work, we propose to repeat this process for a number of iterations, creating a volume of PA maps. This volume is used, for the first time, to generate a pixel-wise distribution of PA values from which we can infer risk. More specifically, we introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate the pixel-wise risk of this enhanced PA map. This provides an improved explanation of the model’s prediction by clearly presenting to the surgeon which salient areas to trust in the model’s enhanced PA map. In this work, we focus on the explainability of the classification of brain tumours using probe-based Confocal Laser Endomicroscopy (pCLE) data but also demonstrate generalisation by evaluating on natural scenes. Performance evaluation on pCLE data shows that our improved explainability method outperforms the SOTA.

2 Methodology

Figure 1: Outline of the proposed method. A PA volume is generated using T forward passes of a CNN model with Dropout applied.

The aim of the proposed method is to produce an improved PA map of a classification model while providing risk estimation of the model’s explainability, further aiding decision making during intraoperative tissue characterisation.

In our method, any CNN classification model trained with Dropout can be used. Let $\hat{\bm{Y}}$ be the output logits of the CNN model, where Dropout is enabled at test time, with input image $\bm{X}\in\mathbb{R}^{height\times width\times channels}$ . Any PA method can be used to generate a PA map from the output logits, $S=f_{s}(\hat{\bm{Y}})\in\mathbb{R}^{height\times width}$ , where $f_{s}(.)$ is the PA method. We propose to repeat the above process for $T$ iterations to create a volume of PA maps $\bm{S}=\{S_{1},...,S_{T}\}\in\mathbb{R}^{height\times width\times T}$ . We show this visually in Supplementary D. A visual representation of how the volume is generated is shown in Fig. 1. The aim is to use this volume to generate a pixel-wise distribution of PA values from which we can infer risk. To achieve this, we compute the expectation and variance values of the volume along the third dimension as:

\begin{aligned}
\mathbb{E}(\bm{S}_{i,j})&\approx\frac{1}{T}\sum^{T}_{t=1}f_{s}(\hat{\bm{Y}}_{t})_{i,j}\\
Var(\bm{S}_{i,j})&\approx\frac{1}{T}\sum^{T}_{t=1}f_{s}(\hat{\bm{Y}}_{t})_{i,j}^{\top}f_{s}(\hat{\bm{Y}}_{t})_{i,j}-\mathbb{E}(\bm{S}_{i,j})^{\top}\mathbb{E}(\bm{S}_{i,j}),
\end{aligned}

(1)

where $i,j$ represent the pixel’s row and column coordinates, respectively. The expectation $\mathbb{E}(\bm{S}_{i,j})$ of each pixel $(i,j)$ is used to generate an enhanced PA map of size $height\times width$ . The intuition is that the above distribution of PA values can produce a less noisy and less overconfident estimate of a pixel’s contribution to the final explainability map than a single estimate.
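The pixel-wise expectation and variance in Eq. (1) can be sketched in plain Python as follows. This is a minimal illustration that assumes the T PA maps have already been computed; the helper name `pa_volume_statistics` and the toy 2×2 volume are hypothetical, and a practical implementation would use tensor operations instead of nested lists.

```python
def pa_volume_statistics(volume):
    """Pixel-wise mean and variance of a volume given as a list of T PA maps (height x width)."""
    T = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    mean = [[0.0] * width for _ in range(height)]
    var = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            values = [volume[t][i][j] for t in range(T)]
            m = sum(values) / T
            second_moment = sum(v * v for v in values) / T
            mean[i][j] = m
            # E[S^2] - E[S]^2, clamped at zero to absorb floating-point round-off.
            var[i][j] = max(0.0, second_moment - m * m)
    return mean, var

# Hypothetical volume of T = 3 PA maps of size 2 x 2.
volume = [
    [[0.2, 0.8], [0.0, 0.5]],
    [[0.4, 0.8], [0.0, 0.7]],
    [[0.6, 0.8], [0.0, 0.9]],
]
enhanced_map, variance_map = pa_volume_statistics(volume)
print(enhanced_map)
print(variance_map)
```

Here `enhanced_map` corresponds to the enhanced PA map and `variance_map` to the pixel-wise variance; a pixel whose value never changes across passes, such as the top-right one, has zero variance.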

Advancing SOTA explainability methods, our method also estimates the risk of the enhanced PA map generated above. For the risk estimation, it is important to consider that different pixels in the PA map correspond to different semantic features which contribute differently to the output logits. This causes the pixel-wise distributions (and therefore the expectation and variance values) to have different scales. For this purpose, the coefficient of variation (CV) is used to estimate pixel-wise risk, as it allows us to compare pixel-wise variances despite their different scales. This is mathematically defined as:

S^{cv}_{i,j}=\frac{\sqrt{Var(\bm{S}_{i,j})}}{\mathbb{E}(\bm{S}_{i,j})}=\frac{std(\bm{S}_{i,j})}{\mathbb{E}(\bm{S}_{i,j})}.

(2)
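A minimal sketch of Eq. (2), assuming the pixel-wise mean and variance maps are already available as nested lists. The small constant `eps` is an assumption added here to avoid division by zero where the expected PA value is zero; it is not part of Eq. (2).

```python
import math

def coefficient_of_variation_map(mean_map, var_map, eps=1e-8):
    """Pixel-wise CV = std / mean; eps (an assumption) guards against zero-mean pixels."""
    return [
        [math.sqrt(v) / (m + eps) for m, v in zip(mean_row, var_row)]
        for mean_row, var_row in zip(mean_map, var_map)
    ]

# Hypothetical 1 x 3 maps: same variance, different means, plus a constant pixel.
mean_map = [[0.4, 0.8, 0.8]]
var_map = [[0.04, 0.04, 0.0]]
cv_map = coefficient_of_variation_map(mean_map, var_map)
print(cv_map)
```

The first two pixels have the same variance but different expected values, so their CV differs, which is the scale normalisation motivating the use of CV.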

Our proposed method improves ADCC and allows visualisation of both the explainability of the classification model (provided by the enhanced PA map) and the pixel-wise risk of this map (provided by the CV map). For instance, salient areas on the PA map should not be trusted unless the CV values are low. An example of the enhanced PA and risk maps generated with the proposed method is shown in Figure 2. This shows that the proposed method not only improves explainability but also provides associated risk information, which improves trustworthiness.
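One way to read the rule "salient areas should not be trusted unless the CV values are low" is as a joint threshold on the two maps. The sketch below is only an illustration: the function name and both thresholds are hypothetical and are not values used in this work.

```python
def trusted_salient_mask(pa_map, cv_map, pa_threshold=0.5, cv_threshold=0.25):
    """Mark pixels that are both salient (high PA value) and low-risk (low CV).

    Both thresholds are illustrative assumptions.
    """
    return [
        [pa >= pa_threshold and cv <= cv_threshold for pa, cv in zip(pa_row, cv_row)]
        for pa_row, cv_row in zip(pa_map, cv_map)
    ]

pa_map = [[0.9, 0.8], [0.1, 0.6]]
cv_map = [[0.05, 0.60], [0.10, 0.20]]
mask = trusted_salient_mask(pa_map, cv_map)
print(mask)  # [[True, False], [False, True]]
```

The top-right pixel is salient but has a high CV, so it is flagged as untrusted, whereas the bottom-left pixel is simply not salient.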

3 Experiments and Analysis

3.0.1 Datasets

The developed explainability framework has been validated on an in vivo and ex vivo pCLE dataset of meningioma, glioblastoma and metastases of an invasive ductal carcinoma (IDC) collected at Anonymous Hospital. The Cellvizio© by Mauna Kea Technologies, Paris, France has been used in combination with the mini laser probe CystoFlex© UHD-R. The distinguishing characteristic of the meningioma is the psammoma body with concentric circles that show various degrees of calcification. Regarding glioblastomas, the pCLE images allow for the visualisation of the characteristic hypercellularity, evidence of irregular nuclei with mitotic activities or multinuclear appearance with irregular cell shape. When examining metastases of an IDC, the tumour presents as egg-shaped cells with uniform, evenly spaced nuclei. Our dataset includes 38 meningioma videos, 24 glioblastoma and 6 IDC. Each pCLE video represents one tumour type and corresponds to a different patient. The data has been curated to remove noisy images and similar frames. This resulted in a training dataset of 2500 frames per class (7500 frames in total) and a testing dataset of the same size. The dataset is split into a training and testing subset, with the division done on the patient level. To show generalisation to other domains, the proposed method was also evaluated on the ImageNet [8] database, comprised of 1000 classes of natural scene images.

3.0.2 Implementation

To implement the DL models we use the open-source framework PyTorch [24], an NVIDIA GeForce RTX 3090 graphics card for parallel computation and a 12th Gen Intel(R) Core(TM) i9-12900K CPU (using 16 threads for latency experiments). To show our method generalises across domains we train and test on both pCLE data and ImageNet. For the pCLE data we train two lightweight models: ResNet-18 [17] and MobileNetV2 [28]. ResNet-18 was trained with a learning rate of 0.001, whereas MobileNetV2 had a learning rate of 0.01. Both were trained using the Adam-W [21] optimiser with a learning rate of 0.001 and weight decay of 0.01. For ImageNet we train a ResNet-50 with a learning rate of 0.1, using the Stochastic Gradient Descent optimiser [30] and the Step learning rate scheduler with a step size of 30 and gamma of 0.1. The ResNet-50 model was trained using distributed training on 3 × NVIDIA RTX A5000 graphics cards; evaluation was done on the NVIDIA GeForce RTX 3090 graphics card using the trained weights. All models were trained from scratch with a Dropout probability of 0.2 and a batch size of 256. At test time, we set $T=10$ . PA methods were implemented with the help of TorchCAM [10]; Recipro-CAM was implemented using the authors’ source code.

3.0.3 Evaluation Metrics

Evaluating a PA method is not a trivial task as a PA map may not need to be in line with what a human deems "reasonable" [1]. Segmentation scores like intersection over union (IoU) may be used with caution to compare thresholded PA maps to ground truth maps with annotated salient regions. By doing so, we can measure how informed the model is about a particular class. To quantify how misinformed a model is, we can estimate its average drop [6]:

AverageDrop(f_{s},\hat{\bm{Y}},\bm{X})=100\times\frac{max(0,\hat{\bm{Y}}(\bm{X})-\hat{\bm{Y}}({\hat{\bm{X}}}))}{\hat{\bm{Y}}(\bm{X})},

(3)

where $\hat{\bm{X}}=\bm{X}\odot f_{s}(\hat{\bm{Y}}(\bm{X}))$ . The above equation measures the effect on the output score of the classification model if we only include the pixels which the PA method scored highly. A low average drop is desired.
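Eq. (3) can be sketched for a single image as below, with the dataset-level value taken as the mean over images. The function names are hypothetical, the class scores are assumed to be positive (e.g. softmax probabilities), and the image is treated as single-channel for brevity.

```python
def mask_image(image, pa_map):
    """Element-wise product X ⊙ S: keep each pixel in proportion to its PA value."""
    return [[x * s for x, s in zip(x_row, s_row)] for x_row, s_row in zip(image, pa_map)]

def average_drop(score_full, score_masked):
    """Percentage drop in class score when only the highlighted pixels are kept (Eq. 3)."""
    if score_full <= 0:
        raise ValueError("score_full must be positive")
    return 100.0 * max(0.0, score_full - score_masked) / score_full

def mean_average_drop(score_pairs):
    """Average of per-image drops over (score_full, score_masked) pairs."""
    return sum(average_drop(full, masked) for full, masked in score_pairs) / len(score_pairs)

# Hypothetical class scores before and after masking two images.
pairs = [(0.9, 0.6), (0.5, 0.7)]
print(average_drop(*pairs[0]))   # about 33.3: the score fell by a third
print(average_drop(*pairs[1]))   # 0.0: an increase in score is clipped to zero
print(mean_average_drop(pairs))
```

The `max(0, ...)` term means that a masked image scoring higher than the original contributes no drop at all.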

As average drop was found not to be sufficient on its own, the unified metric ADCC [26] has been introduced, which is the harmonic mean of average drop, coherency and complexity, defined as:

ADCC(f_{s}(\hat{\bm{Y}}))=3\left(\frac{1}{Coherency(f_{s}(\hat{\bm{Y}}))}+\frac{1}{1-Complexity(f_{s}(\hat{\bm{Y}}))}+\frac{1}{1-AverageDrop(f_{s},\hat{\bm{Y}},\bm{X})}\right)^{-1}.

(4)
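A minimal sketch of the ADCC computation for a single image follows. It assumes the standard three-term harmonic mean (factor 3 in the numerator) and that coherency, complexity and average drop have been rescaled from percentages to [0, 1]; the function name and the handling of degenerate inputs are assumptions.

```python
def adcc(coherency, complexity, average_drop):
    """Harmonic mean of coherency, (1 - complexity) and (1 - average drop), all in [0, 1]."""
    terms = (coherency, 1.0 - complexity, 1.0 - average_drop)
    # Assumption: if any term is non-positive the harmonic mean is taken to be zero.
    if any(t <= 0.0 for t in terms):
        return 0.0
    return 3.0 / sum(1.0 / t for t in terms)

print(adcc(1.0, 0.0, 0.0))        # 1.0: perfect on all three components
print(adcc(0.90, 0.323, 0.059))   # hypothetical per-image component values
```

Because the harmonic mean is dominated by its smallest term, a PA map cannot reach a high ADCC by excelling on only one component. Note that averaging per-image ADCC values over a dataset generally differs from applying the formula to dataset-averaged components.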

Coherency is the Pearson Correlation Coefficient which ensures that the remaining pixels after dropping are still important, defined as:

Coherency(f_{s}(\hat{\bm{Y}}))=100\times\frac{Cov(f_{s}(\hat{\bm{Y}}({\hat{\bm{X}}})),f_{s}(\hat{\bm{Y}}))}{\sigma(f_{s}(\hat{\bm{Y}}({\hat{\bm{X}}})))\sigma(f_{s}(\hat{\bm{Y}}))},

(5)

where $Cov(.,.)$ is the covariance. A higher coherency is better. Complexity is the L1 norm of the output PA map.

Complexity(f_{s}(\hat{\bm{Y}}))=100\times||f_{s}(\hat{\bm{Y}})||_{1}.

(6)

Complexity is used to measure how cluttered a PA map is. For a good PA map, complexity should be low. As has been shown in the literature, the metrics in Eq. (3), (5) and (6) cannot be used individually to evaluate a PA method [26]. In contrast, ADCC combined with computation time gives us a reliable overall metric of how a PA method is performing.
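Coherency and complexity (Eq. (5) and (6)) can be sketched as follows. The function names are hypothetical. Two assumptions are made: values are returned in [0, 1] instead of being scaled by 100, and the L1 norm is divided by the number of pixels so that complexity stays in [0, 1] for PA maps normalised to that range.

```python
import math

def _flatten(pa_map):
    return [v for row in pa_map for v in row]

def coherency(map_masked, map_original):
    """Pearson correlation between the PA map of the masked image and the original PA map."""
    a, b = _flatten(map_masked), _flatten(map_original)
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0.0 or std_b == 0.0:
        return 0.0  # assumption: a constant map is treated as uncorrelated
    return cov / (std_a * std_b)

def complexity(pa_map):
    """L1 norm of the PA map, divided by the pixel count (normalisation is an assumption)."""
    values = _flatten(pa_map)
    return sum(abs(v) for v in values) / len(values)

original = [[0.0, 1.0], [0.5, 0.5]]
print(coherency(original, original))   # identical maps are perfectly coherent
print(complexity(original))            # 0.5
```

A sparse map that highlights few pixels yields a low complexity, while a map that lights up the whole image approaches 1.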

Table 1: ADCC vs Latency Study for ResNet-18 and MobileNetV2 on pCLE Dataset. Latency (ms) is the time to compute one PA map using a batch size of one.

ResNet18
	Original		Proposed
PA Method	ADCC $\uparrow$	Latency $\downarrow$	ADCC $\uparrow$	Latency $\downarrow$
Grad-CAM	76.6	5.6	77.7	9.7
Grad-CAM++	76.2	5.5	78.2	10.5
SmoothGradCAM++	74.8	70.7	75.6	103.9
Score-CAM	80.5	121.4	80.5	1267.2
Recipro-CAM	66.5	3.55	75.9	36.4
MobileNetV2
	Original		Proposed
PA Method	ADCC $\uparrow$	Latency $\downarrow$	ADCC $\uparrow$	Latency $\downarrow$
Grad-CAM	29.3	8.8	48.0	12.5
Grad-CAM++	37.8	8.7	59.5	13.4
SmoothGradCAM++	24.5	71.3	37.1	88.8
Score-CAM	43.9	315.1	43.9	3154.3
Recipro-CAM	33.3	5.8	55.5	65.6

Table 2: ADCC vs Latency Study for ResNet-50 on the ImageNet Dataset. Latency (ms) is the time to compute one PA map using a batch size of one.

ResNet50
	Original		Proposed
PA Method	ADCC $\uparrow$	Latency $\downarrow$	ADCC $\uparrow$	Latency $\downarrow$
Grad-CAM	67.9	11.3	72.4	25.4
Grad-CAM++	67.6	11.3	72.1	26.3
SmoothGradCAM++	64.6	84.6	71.8	134.0
Score-CAM	74.3	134.0	74.3	14147.9
Recipro-CAM	63.0	7.5	70.9	86.6

3.1 T Study

A parameter search was performed to find the optimal value of $T$ . As shown in Supplementary C, there is a positive correlation between ADCC and the value of $T$ . As $T$ increases, latency also increases, since each PA map requires $T$ forward passes. We found the optimal tradeoff of ADCC against latency to be $T=10$ . The raw values of ADCC against $T$ are in Supplementary C.
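One possible way to formalise such a tradeoff is to pick the smallest $T$ whose ADCC is within a tolerance of the best observed ADCC. This selection rule is an illustrative assumption and not necessarily the criterion used in this study. The values below are the Grad-CAM ResNet18 column of Table 5.

```python
def smallest_sufficient_T(adcc_by_T, tolerance=0.0):
    """Smallest T whose ADCC is within `tolerance` of the best ADCC (illustrative rule)."""
    best = max(adcc_by_T.values())
    return min(T for T, score in adcc_by_T.items() if score >= best - tolerance)

# ADCC of Grad-CAM with ResNet18 on pCLE data for each T (Table 5).
grad_cam_resnet18 = {
    1: 76.9, 2: 77.1, 3: 77.1, 4: 77.2, 5: 77.5, 6: 77.6, 7: 77.6, 8: 77.6,
    9: 77.5, 10: 77.7, 20: 77.5, 40: 77.6, 80: 77.7, 90: 77.7, 100: 77.7,
}
print(smallest_sufficient_T(grad_cam_resnet18))  # 10
```

Under this rule, $T=10$ is the first setting that reaches the best ADCC in this column, so larger values of $T$ would only add latency.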

3.1.1 Performance Evaluation

In a model’s explanation we consider five metrics of performance: speed, usability, generalisability, trustworthiness and ability to localise semantic features. The proposed method has been compared to combinations of ResNet18 and MobileNetV2 with SOTA PA methods on both medical and natural scenes datasets. At test time, Dropout is not enabled for these standard methods (only for the proposed method). In Table 1, we show that our method outperforms all the compared CNN-PA method combinations on ADCC apart from Score-CAM, for which ADCC is unchanged. Much like our method, Score-CAM makes multiple forward passes on perturbed inputs, which makes it less susceptible to overconfidence. Table 2 shows that the proposed method generalises to the natural scenes domain. We believe that the better performance of our method is due to the random dropping of features during Dropout at test time, which helps to suppress noise in the estimated enhanced PA map. The combination of Recipro-CAM with our proposed method improves performance (increases ADCC) at the expense of increased computational complexity. We believe that this could be reduced using a batched implementation of Recipro-CAM. We attribute the slowdown in SmoothGradCAM++ when Dropout is applied at test time to the perturbations it adds on top of the PA method. Our validation study shows that Grad-CAM, Grad-CAM++ and Recipro-CAM are often leading in terms of speed, as expected from the literature.

In Fig. 2, we can see that our method includes more regions (top part) of the image and is slightly sharper in localisation, both of which would help the coherency and average drop metrics. Risk estimations from Eq. (2) are also displayed and provide an added visualisation for a surgeon to evaluate both the model and the model’s explanation. As can be seen, areas of low CV match the areas of high PA values, which shows that the proposed explainability method is precise. During intraoperative surgery a surgeon can visualise an assistive DL model’s explanation in order to assess or compare which pixels were found to be most relevant to the classification. Whilst a PA method does not need to highlight a salient region, it does need to provide a fair and precise PA map for the surgeon to use. Our PA method reduces overconfidence and also provides an added visualisation of relative precision, improving on SOTA PA methods.

4 Conclusion

In this work we have introduced the first combination of risk estimation with a PA method. Using our proposed framework we not only improve the ADCC performance of all the tested SOTA PA methods apart from Score-CAM, but also produce an estimation of risk on the output PA values. The proposed method can clearly present to the surgeon the areas of the explainability map that are more trustworthy. With this work we hope to encourage trust between the surgeon and DL models by reducing overconfidence. For future work, we plan to deploy the proposed framework for use in surgery.

4.0.1 Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council (EP/T51780X/1) and Intel R&D UK. Dr Giannarou is supported by the Royal Society (URF\R\201014).

References

[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity Checks for Saliency Maps.
[2] Julia Amann, Alessandro Blasimme, Effy Vayena, Dietmar Frey, and Vince I. Madai. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC, 20(1), 12 2020.
[3] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. Towards better understanding of gradient-based attribution methods for Deep Neural Networks. 11 2017.
[4] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Networks. 5 2015.
[5] Seok-Yong Byun and Wonju Lee. Recipro-CAM: Gradient-free reciprocal class activation map. 9 2022.
[6] Aditya Chattopadhyay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. 10 2017.
[7] Andreas C. Damianou and Neil D. Lawrence. Deep Gaussian Processes. 11 2012.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE CVPR. IEEE, 2009.
[9] William K. Diprose, Nicholas Buist, Ning Hua, Quentin Thurier, George Shand, and Reece Robinson. Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator. Journal of the American Medical Informatics Association, 27(4), 4 2020.
[10] François-Guillaume Fernandez. TorchCAM: class activation explorer, 2020.
[11] Loic Le Folgoc, Vasileios Baltatzis, Sujal Desai, Anand Devaraj, Sam Ellis, Octavio E. Martinez Manzanera, Arjun Nair, Huaqi Qiu, Julia Schnabel, and Ben Glocker. Is MC Dropout Bayesian? 10 2021.
[12] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Appendix. 6 2015.
[13] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. 6 2015.
[14] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples. 12 2014.
[15] Alex Graves. Practical Variational Inference for Neural Networks.
[16] Misgina Tsighe Hagos, Kathleen M. Curran, and Brian Mac Namee. Identifying Spurious Correlations and Correcting them with an Explanation-based Learning. 11 2022.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. 12 2015.
[18] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. 7 2012.
[19] Geoffrey E. Hinton and Drew van Camp. Keeping neural networks simple by minimizing the description length of the weights. pages 5–13, 1993.
[20] Min Lin, Qiang Chen, and Shuicheng Yan. Network In Network. 12 2013.
[21] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. 11 2017.
[22] Radford M. Neal. Bayesian Learning for Neural Networks. 118, 1996.
[23] Daniel Omeiza, Skyler Speakman, Celia Cintas, and Komminist Weldemariam. Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models. 8 2019.
[24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[25] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized Input Sampling for Explanation of Black-box Models. 6 2018.
[26] Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis. 4 2021.
[27] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. 2 2016.
[28] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. 1 2018.
[29] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Proceedings of the IEEE International Conference on Computer Vision, 2017-October:618–626, 12 2017.
[30] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. 30th International Conference on Machine Learning, ICML 2013, pages 1139–1147, 01 2013.
[31] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. 10 2019.
[32] Matthew D Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks. 11 2013.
[33] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning Deep Features for Discriminative Localization.

Supplementary Material

Appendix A

Table 3: Performance evaluation study on pCLE data based on the ADCC and latency metrics. Coh is Coherency, Comp is Complexity and AD is Average Drop; they are reported for completeness. Latency (ms) is the average time to compute one PA map using a batch size of one.

Architecture	PA method	Coh $\uparrow$	Comp $\downarrow$	AD $\downarrow$	ADCC $\uparrow$	Latency(ms) $\downarrow$
ResNet18	Original
	Grad-CAM	90.1	32.7	10.1	76.6	5.6
	Grad-CAM++	90.6	33.1	10.6	76.2	5.5
	SmoothGradCAM++	88.3	27.6	14.3	74.8	70.7
	Score-CAM	90.0	32.3	5.9	80.5	121.4
	Recipro-CAM	91.0	41.2	10.0	72.8	3.5
	Proposed method
	Grad-CAM	92.5	34.2	11.9	77.7	9.7
	Grad-CAM++	93.2	32.5	12.6	78.2	10.5
	SmoothGradCAM++	92.2	30.6	17.2	75.6	103.9
	Score-CAM	90.0	32.3	5.9	80.5	1267.2
	Recipro-CAM	92.1	37.8	11.8	75.9	36.4
MobileNetV2	Original
	Grad-CAM	89.5	21.3	73.8	29.3	8.8
	Grad-CAM++	86.2	30.0	66.9	37.8	8.7
	SmoothGradCAM++	77.7	18.1	76.2	24.5	71.3
	Score-CAM	62.5	33.9	56.3	43.9	315.1
	Recipro-CAM	85.8	32.3	67.1	35.8	5.8
	Proposed method
	Grad-CAM	89.5	27.1	59.3	48.0	12.5
	Grad-CAM++	90.7	35.9	41.7	59.5	13.4
	SmoothGradCAM++	88.8	22.0	71.3	37.1	88.8
	Score-CAM	62.5	33.9	56.3	43.9	3154.3
	Recipro-CAM	90.2	33.8	48.6	55.5	65.6

Appendix B

Table 4: Performance evaluation study on ImageNet based on the ADCC and latency metrics. Coh is Coherency, Comp is Complexity and AD is Average Drop; they are reported for completeness. Latency (ms) is the average time to compute one PA map using a batch size of one.

Architecture	PA method	Coh $\uparrow$	Comp $\downarrow$	AD $\downarrow$	ADCC $\uparrow$	Latency(ms) $\downarrow$
ResNet50	Original
	Grad-CAM	98.1	34.0	30.0	67.9	11.3
	Grad-CAM++	98.3	34.8	30.0	67.6	11.28
	SmoothGradCAM++	97.3	34.8	34.8	64.6	84.6
	Score-CAM	98.2	34.8	21.6	74.3	134.0
	Recipro-CAM	97.5	27.8	39.9	63.0	7.4
	Proposed method
	Grad-CAM	97.7	35.2	22.3	72.4	25.4
	Grad-CAM++	97.7	35.8	22.6	72.1	26.3
	SmoothGradCAM++	97.9	36.5	23.1	71.8	134.0
	Score-CAM	98.2	34.8	21.6	74.3	14147.9
	Recipro-CAM	97.3	29.5	28.8	70.9	86.6

Appendix C

Table 5: T vs ADCC Study for ResNet-18 and MobileNetV2 on pCLE data.

	PA Method’s ADCC
	ResNet18		MobileNetV2
T	Grad-CAM	ReciproCAM	Grad-CAM	ReciproCAM
1	76.9	75.1	45	53.2
2	77.1	75.3	47.2	53.9
3	77.1	75.6	46.6	55.1
4	77.2	75.9	46.9	54.9
5	77.5	75.6	47.9	55.4
6	77.6	75.9	47.3	55.2
7	77.6	75.9	47.9	55.3
8	77.6	75.8	48	55.3
9	77.5	75.8	47.8	55.5
10	77.7	75.9	48	55.5
20	77.5	75.9	47.9	55.8
40	77.6	75.9	48.2	56.1
80	77.7	76.1	48	56.1
90	77.7	76.1	48.1	55.8
100	77.7	76.1	48.4	55.8