Interpretability in Video-based Human Action Recognition: Saliency Maps and GradCAM in 3D Convolutional Neural Networks.

Interpretability plays a vital role in understanding complex deep learning models by providing transparency and insights. It addresses the black-box nature of these models, aids in human-in-the-loop systems, enhances model development, and supports education. However, existing interpretability algorithms in computer vision primarily target images, leaving a gap for video applications. In this article, we emphasize the importance of interpretability in video-based Human Action Recognition (HAR). We extend existing 2D interpretability techniques to the 3D domain, specifically focusing on saliency maps and gradient-weighted class activation maps. The proposed interpretability system is then employed in the analysis of a well-known HAR dataset to better understand action recognition in videos.


Introduction
Deep learning has revolutionized the field of Artificial Intelligence (AI), enabling remarkable advancements in various applications, including Human Action Recognition (HAR). However, the inherent complexity of deep neural networks often renders them opaque "black boxes" [1], leaving users and stakeholders in the dark about how these models arrive at their decisions. To address this challenge, the concept of interpretability has emerged as a critical component of deep learning [2].
Interpretability refers to the intrinsic properties of a deep model that measure the degree to which its inference results are predictable or understandable to human beings [3]. By employing specific algorithms designed to clarify and uncover the decision-making mechanisms of deep models [4,5,6], we can evaluate the interpretability of a model. Several propagation-based and gradient-based algorithms have been explored in the literature, along with evaluation metrics for interpretable systems [7].
Hence, interpretability enables us to illuminate these black-box models, granting us a deeper understanding of how they make decisions.
HAR involves the detection and interpretation of human actions from video or sensor data, with applications ranging from surveillance to healthcare and sports analytics [8]. In particular, video-based HAR typically requires an understanding of both the spatial information in each video frame and the temporal information of the entire sequence. The state of the art in HAR previously relied on Convolutional Neural Networks (CNNs) [9,10,11] until vision transformers emerged [12]. However, vision transformers often require extensive data to compensate for the lack of the inductive biases inherent to CNNs [13].
Despite HAR being a widely explored area in computer vision, there is limited research on interpretable HAR models [14]. Previous work focused on learning interpretable spatiotemporal representations from 3D skeleton data [15]. Meng et al. [16] proposed an interpretable spatial-temporal attention mechanism for video action recognition based on a convolutional Long Short-Term Memory (LSTM) model. Pan et al. [17] introduced a novel interpretability-focused framework that efficiently reduces redundancy in vision transformers while maintaining human-understandable processes. Nevertheless, the literature on HAR interpretability remains scarce. Additionally, traditional visual interpretability algorithms were originally intended to be used with images [18]. In the context of video-based HAR, an interpretable model must be able to identify and localize the exact spatial and temporal features within the video frames that contribute most to classifying a given video into a specific category [19].
In this paper, we aim to provide an interpretability method for HAR, driven by the need for approaches to interpret the decision-making process of video-based AI systems. We propose a simple and efficient pipeline to evaluate the interpretability of a CNN-based HAR model by extending existing 2D interpretability techniques to the 3D domain. To demonstrate the proposed pipeline, we make use of the Inflated Inception 3D (I3D) architecture introduced by Carreira et al. [11] and extract saliency maps and gradient-weighted class activation maps from its 3D convolutional layers.

Data Material
The Human Motion DataBase (HMDB-51) [20] is an action recognition dataset comprising approximately 7000 video clips categorized into 51 distinct action classes, with each class containing a minimum of 101 clips. These video clips were sourced from various online platforms; however, certain quality criteria had to be met for inclusion in the dataset, such as a single action per clip, a minimum height of 60 pixels for the main actor, a minimum contrast level, a minimum clip length of 1 second, and acceptable compression artifacts. Since the original video sources used to extract the action clips varied in size and frame rate, standardization procedures were employed to ensure dataset consistency. All video frames were scaled to a height of 240 pixels while preserving the original aspect ratio. Furthermore, the frame rate of all clips was converted to a uniform 30 frames per second.
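The same standardization can be approximated with ffmpeg; the following is a minimal sketch under the assumption that ffmpeg is installed, not the tooling actually used by the dataset authors:

```python
import subprocess

def standardize_clip(src: str, dst: str) -> None:
    """Rescale a clip to 240 px height (keeping aspect ratio) at 30 fps."""
    subprocess.run(
        ["ffmpeg", "-i", src,
         "-vf", "scale=-2:240",  # -2 picks an even width that preserves aspect ratio
         "-r", "30",             # resample to a uniform 30 frames per second
         dst],
        check=True,
    )
```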

Methodology
An overview of our approach is illustrated in Figure 1. We employ the Inflated 3D CNN (I3D) architecture, originally introduced by Carreira and Zisserman [11]. This architecture is a 3D inflated version of the popular 2D CNN Inception v1 [21]. The authors proposed integrating RGB video frames with optical flow for action recognition, an approach that revealed the advantages of pre-trained 2D CNNs and the efficiency of transfer learning in the context of 3D CNNs. In our research, we prioritize efficiency by excluding optical flow estimation and relying exclusively on RGB frames, which aligns better with real-world practicality. Moreover, visualizing interpretability algorithms is more straightforward with RGB frames than with optical flow frames.
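For experimentation, a publicly available pre-trained 3D CNN can serve as a starting point. The sketch below loads an I3D variant from the PyTorchVideo model zoo; note that this particular port uses a ResNet rather than the Inception v1 backbone, so it is a stand-in rather than the exact architecture used here:

```python
import torch

# An off-the-shelf I3D variant (ResNet backbone) from the PyTorchVideo model zoo.
i3d = torch.hub.load("facebookresearch/pytorchvideo", "i3d_r50", pretrained=True)
i3d.eval()

# 3D CNNs of this kind consume clips shaped (batch, channels, frames, height, width).
```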

Saliency Maps
Saliency maps were first introduced by Simonyan et al. [22], offering a visualization technique for generating class-specific saliency maps from images using a single backpropagation pass in a CNN. Assuming a linear model, the score for a class $c$ computed by the classification layer is represented as

$$y_c = \sum_{x, y} w_c(x, y)\, I(x, y) + b_c, \quad (1)$$

where $I(x, y)$ represents the image input with indices $x$ and $y$ for the width and height dimensions, while $w_c$ and $b_c$ are the weight vector and the bias of the model. In this linear model, the magnitude of the elements of $w_c$ indicates the importance of the corresponding pixels of $I(x, y)$ for class $c$. In deep CNNs, where the class score is a highly non-linear function of the input, $y_c$ can be approximated as a linear function near a specific image $I_0$ using the first-order Taylor expansion

$$y_c \approx w^\top I + b, \quad \text{with} \quad w = \left. \frac{\partial y_c}{\partial I} \right|_{I_0}. \quad (2)$$

The image-specific class saliency $S_c$ for a particular class, defined as the magnitude of the derivative of this linear approximation, indicates the pixels that need minimal change to have the most impact on the class score:

$$S_c(x, y) = \left| \left. \frac{\partial y_c}{\partial I(x, y)} \right|_{I_0} \right|. \quad (3)$$
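As a minimal PyTorch sketch of Eq. (3), the saliency map can be obtained with a single backward pass; the function name and tensor layout are our own assumptions, not the original authors' code:

```python
import torch

def saliency_map(model, image, target_class):
    """Single-backprop saliency: magnitude of the class-score gradient w.r.t. the input."""
    model.eval()
    image = image.detach().clone().requires_grad_(True)  # shape (1, C, H, W)
    scores = model(image)                                # class scores before the softmax
    scores[0, target_class].backward()                   # one backpropagation pass
    # Take the maximum gradient magnitude across color channels, one value per pixel.
    return image.grad.abs().max(dim=1).values[0]         # shape (H, W)
```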

Gradient-weighted Class Activation Map
In 2017, Selvaraju et al. introduced a visual explanation technique called Gradient-weighted Class Activation Map (GradCAM) [23]. This method leverages gradients associated with a target class to generate a coarse localization map that highlights the significant regions within an image for predicting that class.
The motivation for this work arises from the belief that the last convolutional layers in a CNN strike a balance between high-level semantics and detailed spatial information. GradCAM extracts gradient information from the last convolutional layer to understand neuron importance in making class-related decisions. This is done by computing the gradient of the class score $y_c$ (before the softmax) with respect to the $K$ feature maps $A_k(x', y')$ of the last convolutional layer. These gradients are global-average-pooled to obtain the neuron importance weights $\alpha_k^c$:

$$\alpha_k^c = \frac{1}{Z} \sum_{x'} \sum_{y'} \frac{\partial y_c}{\partial A_k(x', y')}, \quad (4)$$

where indices $x'$ and $y'$ refer to the width and height dimensions of the 2D convolutional feature maps, and $Z$ is the number of spatial locations. Then, a weighted combination of the forward feature maps followed by a ReLU operation is performed:

$$L^c_{\mathrm{GradCAM}} = \mathrm{ReLU}\left( \sum_{k} \alpha_k^c A_k \right). \quad (5)$$

The application of ReLU to the linear combination of maps serves the purpose of focusing solely on features that have a positive impact on the class of interest.
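A compact PyTorch sketch of Eqs. (4)-(5) using forward and backward hooks; the function structure and names are our own illustration, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """GradCAM: weight feature maps by global-average-pooled gradients, then ReLU."""
    store = {}
    fh = target_layer.register_forward_hook(lambda m, i, o: store.update(acts=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grads=go[0]))
    model.eval()
    model(image)[0, target_class].backward()
    fh.remove(); bh.remove()
    alpha = store["grads"].mean(dim=(2, 3), keepdim=True)  # Eq. (4): pool over x', y'
    cam = F.relu((alpha * store["acts"]).sum(dim=1))       # Eq. (5): weighted sum + ReLU
    return cam[0]  # coarse map with the spatial size of the feature maps
```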

Adaption from 2D to 3D
Our adaptation of 2D saliency maps and GradCAM to the 3D domain is motivated by the research conducted by Lin et al. [24]. In the video domain, we define the video input $V(f, x, y)$, where indices $f$, $x$, and $y$ denote the frame number, width, and height of the input video (we omit the channel dimension for simplicity). We define the video-specific class saliency of a 3D CNN analogously to Eq. (3):

$$S_c(f, x, y) = \left| \left. \frac{\partial y_c}{\partial V(f, x, y)} \right|_{V_0} \right|. \quad (6)$$

To enhance visualization, we follow a specific preprocessing method for the saliency map. We apply a logarithmic scale and perform a normalization step to ensure that the values fall within the $[0, 1]$ range. Additionally, we intensify the relevant pixels to make them more prominent for visualization purposes.
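A sketch of Eq. (6) together with the described preprocessing; the `log1p` scaling and the channel reduction are illustrative choices on our part:

```python
import torch

def video_saliency(model, video, target_class, eps=1e-8):
    """3D saliency map for a video tensor of shape (1, C, F, H, W)."""
    model.eval()
    video = video.detach().clone().requires_grad_(True)
    model(video)[0, target_class].backward()
    sal = video.grad.abs().max(dim=1).values[0]   # Eq. (6), reduced over channels -> (F, H, W)
    sal = torch.log1p(sal)                        # logarithmic scale for visualization
    return (sal - sal.min()) / (sal.max() - sal.min() + eps)  # normalize to [0, 1]
```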
For GradCAM, the procedure is similar to Eq. (5). The key concept is to account for the temporal dimension when computing the global-average-pooled gradients, making the transition from a 2D input to a 3D input:

$$\alpha_k^c = \frac{1}{Z} \sum_{f''} \sum_{x''} \sum_{y''} \frac{\partial y_c}{\partial A_k(f'', x'', y'')}, \quad (7)$$

where indices $f''$, $x''$, and $y''$ refer to the frame, width, and height dimensions of the 3D convolutional feature maps, and $Z$ is the number of spatiotemporal locations. Since the resulting coarse maps have the same dimensions as the feature maps at the last convolutional layer, a spline interpolation step is necessary to match the dimensions of the original video, denoted as $F \times X \times Y$ in Figure 1.
Finally, we perform a normalization step, scaling the GradCAM values to fall within the range [0, 1].
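A corresponding sketch of Eq. (7) and the upsampling step; we use PyTorch's trilinear interpolation as a stand-in for the spline interpolation:

```python
import torch
import torch.nn.functional as F

def grad_cam_3d(activations, gradients, out_size, eps=1e-8):
    """3D GradCAM from feature maps and gradients of shape (1, K, F'', X'', Y'')."""
    alpha = gradients.mean(dim=(2, 3, 4), keepdim=True)           # Eq. (7): pool over f'', x'', y''
    cam = F.relu((alpha * activations).sum(dim=1, keepdim=True))  # weighted sum + ReLU
    # Upsample the coarse map to the original video size F x X x Y.
    cam = F.interpolate(cam, size=out_size, mode="trilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + eps)      # scale to [0, 1]
```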

Implementation Details
To implement the 3D adaptation of saliency maps and GradCAM, we initially train the I3D architecture on the HMDB-51 dataset. To do this, we replace the top layer of the I3D architecture with a new one containing random weights and an appropriate number of output classes corresponding to the actions in the dataset. We perform transfer learning from ImageNet and Kinetics, fine-tuning the model on the HMDB-51 dataset without freezing any layers, and utilize cross-entropy [25] as the loss function.
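As a hedged sketch (continuing the earlier loading example), the head replacement might look as follows; the attribute name `logits` and the 1024-dimensional feature size follow common I3D ports and are assumptions, not the reference implementation:

```python
import torch.nn as nn

NUM_CLASSES = 51  # one output per HMDB-51 action class

def replace_head(i3d: nn.Module, in_features: int = 1024) -> nn.Module:
    """Swap the classifier for a randomly initialized 51-way head.

    Assumes an I3D port whose classifier is a 1x1x1 3D convolution exposed
    as `logits`; attribute names and feature sizes vary across implementations.
    """
    i3d.logits = nn.Conv3d(in_features, NUM_CLASSES, kernel_size=1)
    return i3d

criterion = nn.CrossEntropyLoss()  # cross-entropy loss for fine-tuning
# No layers are frozen: every parameter remains trainable.
```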
While our primary intention is not to reproduce the exact results obtained by Carreira et al. [11], we opted for a similar training procedure and set of hyperparameters to achieve comparable performance.
The training process consists of a total of 120 epochs with early stopping. For optimization, we use Stochastic Gradient Descent with a learning rate of 0.001 and a momentum of 0.9. Our training infrastructure comprises three 32 GB Tesla V100 GPUs and two 16 GB Tesla P100 GPUs, and we use a batch size of 16 per GPU.
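The optimization setup translates directly into code; in the sketch below, the early-stopping patience and the `train_one_epoch`/`validate` helpers are illustrative assumptions:

```python
import torch

optimizer = torch.optim.SGD(i3d.parameters(), lr=0.001, momentum=0.9)

best_loss, patience, stale = float("inf"), 10, 0   # patience value is an assumption
for epoch in range(120):
    train_one_epoch(i3d, optimizer, criterion)     # hypothetical training helper
    val_loss = validate(i3d, criterion)            # hypothetical validation helper
    if val_loss < best_loss:
        best_loss, stale = val_loss, 0
    else:
        stale += 1
        if stale >= patience:
            break                                  # early stopping
```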
Data augmentation is performed during training, encompassing both spatial and temporal aspects. We first resize the smaller video dimension to 256 pixels and then randomly crop a 224 × 224 patch. To obtain a sequence of 64 consecutive frames, we select a starting frame early enough in the video; videos with fewer frames are looped to meet the model's input requirements. Additionally, we apply random left-right flipping to each video during training. During testing, the model is applied convolutionally to the entire video, extracting 224 × 224 center crops, and the predictions are averaged.
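A sketch of the training-time augmentation for a decoded video tensor of shape (C, F, H, W); the tensor layout and function structure are our own assumptions:

```python
import random
import torch
import torch.nn.functional as F

def augment_clip(video, clip_len=64, crop=224, short_side=256):
    """Spatial and temporal training augmentation for a (C, F, H, W) video tensor."""
    # Loop short videos along time until enough frames are available.
    while video.shape[1] < clip_len:
        video = torch.cat([video, video], dim=1)
    # Pick a start frame early enough to yield 64 consecutive frames.
    start = random.randint(0, video.shape[1] - clip_len)
    clip = video[:, start:start + clip_len]
    # Resize so the smaller spatial dimension becomes 256.
    _, _, h, w = clip.shape
    scale = short_side / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    clip = F.interpolate(clip.permute(1, 0, 2, 3), size=(new_h, new_w),
                         mode="bilinear", align_corners=False).permute(1, 0, 2, 3)
    # Random 224 x 224 spatial crop.
    top = random.randint(0, new_h - crop)
    left = random.randint(0, new_w - crop)
    clip = clip[:, :, top:top + crop, left:left + crop]
    # Random left-right flip.
    if random.random() < 0.5:
        clip = torch.flip(clip, dims=[3])
    return clip
```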

Experiments
While there are several widely recognized datasets for video-based HAR, such as Kinetics [26], Sports-1M [27], and UCF101 [28], we have opted to employ HMDB-51 in this work. This decision is motivated by attributes of the dataset that are particularly relevant from an interpretability perspective. Firstly, HMDB-51 contains a relatively small number of samples, simplifying the training process. Secondly, it presents a notably higher level of difficulty than other datasets. This increased complexity may arise not only from the limited training data in HMDB-51 but also because the dataset was intentionally designed to be challenging. For instance, many video clips feature different actions within the same scene (e.g., "draw sword" examples are extracted from the same videos as "sword" and "sword exercise").
The HMDB-51 database includes three distinct splits for training and testing. These splits were carefully constructed so that clips from the same video are never used for both training and testing. We maintain the standard balance of 70 training and 30 testing clips per class in each split. We benchmark our model by computing the mean accuracy over these three standard train/test splits, following the methodology employed by Carreira et al. [11].
In our interpretability evaluation, we focus on a single data split, specifically split 1, to assess the model's interpretability. Additionally, we analyze the confusion matrix on the corresponding test set. The saliency map and GradCAM are obtained from the last convolutional layer of the top inception block in the I3D architecture.

Results and Discussion
Our model achieves an average accuracy of 70.8% on the HMDB-51 dataset across the three splits, which is comparable to the 74.8% reported in [11] when using only RGB frames. The confusion matrix of our model's performance is visualized in Figure 2. As depicted, the overall performance of the model is relatively good. However, the confusion matrix also highlights a key aspect of this dataset: the similarities between certain actions. The model evidently encounters challenges when classifying groups of similar actions, such as draw_sword, sword (fight), and sword_exercise; sit and stand; or drink and eat. Other actions are inherently more challenging to distinguish due to their general nature, such as fall_floor or pick. Conversely, certain actions are distinct and easily recognizable, such as climb, golf, or ride_bike.
For our interpretability evaluation, we manually chose video clip samples based on the model's performance in each category.
We selected dribble and situp as accurately classified samples, and fall_floor and sword_exercise as poorly classified samples, and generated the saliency maps and GradCAM along with the original video. As illustrated in Figure 3, the saliency map primarily focuses on pixel attribution, accentuating those pixels that have the most significant influence on the model's output. However, this saliency map can be challenging to interpret and is highly sensitive to perturbations. In contrast, GradCAM generates a coarse localization map that highlights important regions of the input. While it offers ease of interpretation, this comes at the cost of not capturing fine-grained details.
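For reference, the overlaid GradCAM frames can be produced with a simple matplotlib blend; the colormap and transparency are our own choices:

```python
import matplotlib.pyplot as plt

def show_overlay(frame, cam, alpha=0.5):
    """Blend a [0, 1] GradCAM map over an RGB frame."""
    plt.imshow(frame)                         # original video frame
    plt.imshow(cam, cmap="jet", alpha=alpha)  # semi-transparent heatmap
    plt.axis("off")
    plt.show()
```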
According to these visualizations, the model demonstrates the capability to identify the most significant spatiotemporal regions within the video clips. These regions relate not only to the human position or the presence of specific objects closely associated with particular actions but also to the motion and transitions between consecutive frames. An example of this can be observed in the dribble and situp categories, where GradCAM emphasizes the spatial regions of interest more prominently in frames that capture specific movements related to these actions. However, it is fundamental to note that the model's ability to detect the relevant spatiotemporal region of interest does not necessarily translate into accurate classification across all categories. This becomes apparent when examining the misclassified categories: in these examples, the model still effectively identifies the appropriate region of interest. Thus, we can conclude that the model excels in locating relevant spatiotemporal regions in video clips but faces difficulties in correctly distinguishing and classifying certain actions. The complexity of detecting these actions can be attributed to two primary reasons. Firstly, the limited amount of training data. Secondly, as illustrated in the fall_floor example, videos in HMDB-51 shorter than 64 frames are looped to meet the model's input requirements. This looping disrupts the smooth transition between frames, which can confuse the model.
It is also worth emphasizing that the HMDB-51 dataset features only one action per video clip, with most actions taking place in distinct scenes and being relatively distinguishable from one another. Nevertheless, real-world applications present unique challenges, and the differences between actions can be extremely subtle. For instance, in domains like Hand Gesture Recognition or Sign Language Recognition, the ability to detect even minor hand or finger movements is of paramount importance. Moreover, there are scenarios where the action of interest occurs within a highly restricted spatial region of the video. This requires the model to focus on a small portion of pixels, which resembles the challenges often encountered in weakly-supervised HAR. While certain approaches can enhance the model's performance, as demonstrated by Carreira et al. [11] through the use of optical flow and the two-stream architecture, these improvements often entail a significant increase in computational requirements.

Conclusion
In this work, we extend well-known interpretability algorithms from their original 2D context to the 3D domain, allowing us to gain a deeper understanding of the spatiotemporal information that video-based Human Action Recognition models use for prediction. The adaptation of 3D saliency maps and GradCAM demonstrates their utility in interpreting the decision-making process of video-based AI systems. Despite the limited scope of our tests, involving only one architecture and one dataset, we expect our adaptation to be compatible with any 3D CNN and video data.
While interpretability algorithms are not designed to directly improve the performance of deep learning models, they can provide insights into which parts of an input contribute to a specific prediction.Their primary goal is to increase transparency, explainability, and trust in the model rather than optimize its performance.
In future work, we will explore more complex and sophisticated datasets to determine the extent to which 3D CNN models can replicate human behavior in predicting more challenging actions.Moreover, we will investigate alternative interpretability approaches to further enhance our understanding of deep learning models in the context of video-based Human Action Recognition.

Figure 1: Methodology overview. RGB frames of video clips from the HMDB-51 dataset are used to train the model. Two gradient-based interpretability algorithms adapted to the 3D domain are proposed: 1) visualization of saliency maps, generated by computing the gradient of the class score $y_c$ w.r.t. the input $V$; 2) visualization of gradient-weighted class activation maps, generated by computing the gradient of the class score w.r.t. the feature maps $A_k$ in the last convolutional layer of the I3D architecture.

Figure 2: Confusion matrix depicting the performance of our model on the HMDB-51 test set (split 1). The model exhibits strong overall performance, yet it seems to struggle with some specific actions.

Figure 3: Illustration of equally distributed frames over time from video sequences in the categories of dribble, fall_floor, situp, and sword_exercise. For each action, we display the frames corresponding to the original video (first row), the saliency map (second row), and the overlaid GradCAM (third row), all aligned in time, accompanied by the predicted label for the video sample. For visualization purposes, we applied some preprocessing to the saliency map.