
Neural Surveillance:
Unveiling the Enigma of Latent Dynamics Evolution through Live-Update Visualization

Xianglin Yang
School of Computing
National University of Singapore
[email protected]
Jin Song Dong
School of Computing
National University of Singapore
[email protected]
Abstract

Monitoring the training of neural networks is essential for identifying potential data anomalies, enabling timely interventions and conserving significant computational resources. Apart from commonly used metrics such as loss and validation accuracy, the hidden representations give more insight into the model's progression. To this end, we introduce SentryCam, an automated, real-time visualization tool that reveals the progression of hidden representations during training. Our results show that this visualization offers a more comprehensive view of the learning dynamics than basic metrics such as loss and accuracy across various datasets. Furthermore, we show that SentryCam facilitates detailed analyses, such as task transfer and catastrophic forgetting, in a continual learning setting. The code is available at https://github.com/xianglinyang/SentryCam.

1 Introduction

Understanding the inner working mechanisms of deep neural networks (DNNs) helps people gain trust in the decisions given by DNNs, detect and debug failures and attacks, and provides guidance on model selection. The explainability of DNNs has become more important given that these models are deployed in many safety-critical applications, including healthcare and self-driving cars.

A natural way to dive into the inner working mechanism is to track the internal progression of DNNs, typically through loss and validation accuracy. However, monitoring a single global metric such as loss is not the only option. Another internal quantity, the hidden activations, is often overlooked. In particular, monitoring the evolution of hidden representations could yield substantial benefits in several areas: i) enabling the early detection of model deficiencies (e.g., poor generalization or catastrophic forgetting), thus preventing unnecessary wastage of training time and resources; ii) illuminating the path for developing new algorithms; iii) aiding in choosing the most suitable models for specific tasks based on their performance and learning characteristics over time.

However, analyzing extremely high-dimensional objects in real time or near real time poses new challenges. A recent line of research has focused on visualizing the internal evolution of DNNs by dimension reduction [24, 23, 4]. Instead of highlighting important input elements for model prediction after training, as other explainable methods do, these methods track the progression of hidden activations as a trajectory plot in a two-dimensional space, providing more informative insight than single global metrics like loss and validation accuracy. While such approaches are effective in simple settings, they come with drawbacks in more complex training scenarios. For instance, TimeVis [23] requires hyperparameter selection for each new dataset. In addition, it cannot extend to new scenarios except by rerunning the whole process, which is unsuitable given that we might constantly adjust our model. As for DVI [24], it cannot respond in real time. For a monitoring system, this is problematic because it would potentially waste time and resources if the training ultimately fails to achieve the desired outcomes.

Motivated by the need for an effective monitoring system, we propose SentryCam, a visual framework designed to monitor any model training process. In particular, we identify three key principles that any visual monitoring system should satisfy: automation, live updates, and extensibility. Building on previous work [4, 23], we construct a multislice graph by creating connections between hidden representations obtained from a single unit across multiple epochs and from multiple units within the same epoch. Technically, (1) we selectively incorporate partial historical context to balance informativeness and computational efficiency, (2) we introduce a new normalization scheme and a new distance function for temporal edges to enhance adaptation, and (3) we design a sampling-based preprocessing method, based on the empirical observation that the density of the pruned dataset used for visualization is positively correlated with visualization quality. We further validate the efficiency and visualization accuracy of our framework through extensive experiments on various datasets.

In summary, our contribution is three-fold:

  1. We lay out general principles that a visual monitoring system of the internal mechanisms of DNNs should satisfy.

  2. We propose an automatic, live-update, and extensible visualization technique for the evolution of hidden representations, extending [24] to achieve a new state of the art.

  3. We demonstrate that the visualized training dynamics can be applied in a continual learning setting and can detect data anomalies in the early training stage, prompting timely alerts.

2 Background

Parametric dimension reduction techniques are important for visualization research because of their high efficiency and good extensibility. In this work, we briefly review parametric UMAP [18], which preserves the topological structure of data on a Riemannian manifold.

Consider two embeddings $\mathbf{x}_{i},\mathbf{x}_{j}\in\mathbf{X}$ with an arbitrary distance metric $\mathrm{d}:\mathbb{R}^{h}\times\mathbb{R}^{h}\rightarrow\mathbb{R}_{\geq 0}$. The asymmetric similarity $p_{j|i}$ from $\mathbf{x}_{i}$ to $\mathbf{x}_{j}$ is defined as:

p_{j|i} := \exp\left(-\frac{d(\mathbf{x}_{i},\mathbf{x}_{j})-\rho_{i}}{\sigma_{i}}\right)   (1)

where $\rho_{i}$ is the distance between $\mathbf{x}_{i}$ and its nearest neighbor, and $\sigma_{i}$ is a normalization factor. For symmetry, the probabilistic t-conorm is applied, and the similarity between the pair $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ is given by:

p_{ij} = p_{ji} := p_{i|j} + p_{j|i} - p_{i|j}\cdot p_{j|i}   (2)

Similarly, let $\mathbf{Y}=[\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}]^{T}$ be the corresponding points in the low-dimensional embedding space $\mathbb{R}^{l}$; the similarity metric in low dimension is given by:

q_{ij} := \frac{1}{(1+a\lVert\mathbf{y}_{i}-\mathbf{y}_{j}\rVert^{2})^{b}}   (3)

$a$ and $b$ are predefined positive scaling factors. The UMAP cost function is the KL-divergence between $p_{ij}$ and $q_{ij}$:

\mathcal{C}_{umap} := \sum_{i}\sum_{j}\ \underbrace{p_{ij}\cdot\log\left(\frac{p_{ij}}{q_{ij}}\right)}_{\text{Cohesion}} + \underbrace{(1-p_{ij})\cdot\log\left(\frac{1-p_{ij}}{1-q_{ij}}\right)}_{\text{Repulsion}}   (4)

The algorithm comprises two parts. First, we construct a fuzzy simplicial complex, i.e., a high-dimensional graph, based on the local relationships among input samples. Then, we optimize a parametric model (e.g., an autoencoder) by minimizing the UMAP cost over sampled positively weighted edges together with randomly drawn negative edges. As a result, the low-dimensional embedding output by the model mimics the simplicial complex built from the high-dimensional data.
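To make the optimization concrete, the following minimal PyTorch sketch computes the per-edge UMAP cost of Eq. 4 for a batch of sampled edges; the encoder, the edge batches, and the curve parameters a and b are placeholders rather than the exact implementation used in this work.

    import torch

    def umap_edge_loss(encoder, x_i, x_j, p_ij, a=1.577, b=0.895, eps=1e-4):
        """Per-edge UMAP cost (Eq. 4): cohesion pulls positive edges together,
        repulsion pushes apart pairs with low (or zero, for negative samples) p_ij."""
        y_i, y_j = encoder(x_i), encoder(x_j)                 # 2-D embeddings
        d2 = (y_i - y_j).pow(2).sum(dim=1)
        q_ij = (1.0 + a * d2).pow(-b).clamp(eps, 1 - eps)     # Eq. 3
        cohesion = p_ij * torch.log(p_ij.clamp(min=eps) / q_ij)
        repulsion = (1 - p_ij) * torch.log((1 - p_ij).clamp(min=eps) / (1 - q_ij))
        return (cohesion + repulsion).mean()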

3 Visual Monitor Desiderata

We seek to enhance the understanding and debugging of neural networks by visualizing their training dynamics, serving as a complement to global metrics like loss and accuracy. We outline the essential requirements for a time-traveling visualization approach, ensuring its practicality for monitoring and debugging purposes.

  1. Automatic: The method should operate with minimal manual intervention, ensuring that it remains robust across different datasets and scenarios without the need for extensive user configuration. For instance, approaches like TimeVis [23] that require frequent hyperparameter adjustments for each new dataset do not meet this criterion.

  2. Live-update: The visualization tool must support real-time updates, providing immediate feedback during the training process. This live-update capability is crucial for timely interventions and adjustments, which can be particularly useful for long training sessions or when quick iteration over models is needed.

  3. Extensibility: In the context of dynamic training environments like continual learning or online learning, the visualization tool must not only incorporate historical training data but also adapt to new data as the learning process continues. This ensures that the tool remains applicable and useful, providing insights into the model's performance and behaviors as it learns and adapts over time.

4 Method

In this section, we present SentryCam, a graph-based visualization framework adept at producing high-quality visualizations efficiently and with minimal requirement for user input. The first step involves creating a composite graph that includes details about the spatial relationship of the current hidden activation and its temporal connection with past context (Graph Construction, Section 4.2). To guarantee a prompt response, we then proceed to prune the graph (Pruning, Section 4.3). Finally, we optimize an autoencoder to learn a low-dimensional embedding that can mimic the graph built by high-dimensional data.

4.1 Preliminary

Let $f_{\theta}:\mathbb{R}^{s}\rightarrow\mathbb{R}^{C}$ represent a neural network that maps an $s$-dimensional input to a $C$-dimensional output. We refer to $f_{\theta}$ as the subject model. Training involves a dataset $\mathcal{D}=\{\mathcal{D}_{1},\mathcal{D}_{2},\cdots,\mathcal{D}_{M}\}$ composed of $M$ tasks. Each task $m$ has a dataset $\mathcal{D}_{m}=\{(x^{m}_{i},y^{m}_{i})\}^{n^{m}}_{i=1}$ containing $n^{m}$ samples $(x_{i},y_{i})\in\mathcal{X}\times\mathcal{Y}$. When the number of tasks $M=1$, this setup defaults to a conventional supervised learning framework; with more than one task, it corresponds to an online learning scenario, introducing new data or tasks progressively. The training goal is to learn a supervised model $f_{\theta}:\mathcal{X}\rightarrow\mathcal{Y}$. Denote $[n]=\{1,2,\cdots,n\}$. Each layer $l\in[L]$ splits the subject model into two segments: $f_{1:l}:\mathbb{R}^{s}\rightarrow\mathbb{R}^{r}$ and $f_{l:L}:\mathbb{R}^{r}\rightarrow\mathbb{R}^{C}$. Let $\mathbf{R}^{m}_{l}(t)=f_{1:l}(\mathcal{D}_{m})$ denote the hidden activations of the network at epoch $t$.

During training, as each new subject model $f_{\theta}(t)$ emerges at epoch $t$, we aim to swiftly create a visualization function $V:\mathbb{R}^{r}\rightarrow\mathbb{R}^{2}$. This function projects the high-dimensional activations $\mathbf{R}^{m}_{l}(t)$ into a two-dimensional space, ensuring a prompt and responsive visualization.
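As a concrete illustration of how the hidden activations $\mathbf{R}^{m}_{l}(t)$ can be collected at each checkpoint, the following sketch uses a PyTorch forward hook; the ResNet-18 backbone and the chosen layer are illustrative assumptions rather than a fixed interface of SentryCam.

    import torch
    from torchvision.models import resnet18

    def collect_activations(model, layer, loader, device="cpu"):
        """Gather f_{1:l}(D_m): activations of `layer` over one pass of `loader`."""
        feats = []
        handle = layer.register_forward_hook(
            lambda module, inputs, output: feats.append(output.flatten(1).detach().cpu())
        )
        model.eval().to(device)
        with torch.no_grad():
            for x, _ in loader:
                model(x.to(device))
        handle.remove()
        return torch.cat(feats)  # shape (n_samples, r)

    # Example: penultimate features of a ResNet-18 subject model at epoch t
    model = resnet18(num_classes=10)
    # R_t = collect_activations(model, model.avgpool, train_loader)  # train_loader: a DataLoader over D_m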

4.2 Graph Construction

Working Memory Construction.

Given the full set of representations $\mathbf{R}^{m}_{l}(t)=f_{1:l}(\mathcal{D}_{m})$ obtained on the current dataset $\mathcal{D}_{m}$ at epoch $t$, we generate its visualization. The visualization needs to preserve not only the topological structure of $\mathbf{R}^{m}_{l}(t)$, but also its relationship with past representations from (possibly different) datasets $\mathcal{D}_{m'}$ at earlier epochs $[t-1]$. In particular, we build two graphs of pairwise dependencies: a k-nearest-neighbor graph $\mathcal{G}_{s}$ over the current representations and a bipartite graph $\mathcal{G}_{b}$ between the current representations and the past context. To simplify the notation, we refer to the current set of representations as $\mathbf{R}_{\mathcal{T}}$ and to the past representations as $\mathbf{R}_{\mathcal{C}}$.

Spatial Relation. Following the approach outlined in [4] for constructing the k-nearest-neighbor graph, we define $\mathcal{G}_{s}=(\mathbf{R}_{\mathcal{T}},\mathbf{E}_{\mathcal{T}})$. Here, the vertex set $\mathbf{R}_{\mathcal{T}}$ is the current representation set, and the edge set $\mathbf{E}_{\mathcal{T}}$ is defined as:

\mathbf{E}_{\mathcal{T}}=\{(r_{i},r_{j},p(r_{i},r_{j}))\mid r_{i},r_{j}\in\mathbf{R}_{\mathcal{T}}\}   (5)

The weight function $p(r_{i},r_{j}):\mathbb{R}^{r}\times\mathbb{R}^{r}\rightarrow[0,1]$ denotes the similarity measure between representations in $\mathbf{R}_{\mathcal{T}}$, as specified in Eq. 2.
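As an illustration, a minimal sketch of building the spatial edge set $\mathbf{E}_{\mathcal{T}}$ with scikit-learn follows; the neighborhood size k = 15 and the simplified choice of $\sigma_{i}$ (UMAP finds it by binary search) are assumptions made for readability.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def spatial_edges(R, k=15):
        """k-NN edges over the current representations R of shape (n, r), weighted per Eqs. 1-2."""
        dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(R).kneighbors(R)
        dist, idx = dist[:, 1:], idx[:, 1:]                # drop self-neighbors
        rho = dist[:, :1]                                  # distance to the nearest neighbor
        sigma = dist.mean(axis=1, keepdims=True) + 1e-8    # simplified stand-in for UMAP's sigma_i
        p = np.exp(-(dist - rho) / sigma)                  # Eq. 1 (asymmetric)
        edges = {}
        for i in range(R.shape[0]):
            for w, j in zip(p[i], idx[i]):
                key = (min(i, j), max(i, j))
                prev = edges.get(key, 0.0)
                edges[key] = prev + w - prev * w           # Eq. 2: probabilistic t-conorm
        return edges                                       # {(i, j): p_ij}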

Temporal Relation. We seek to capture the dynamics of network evolution with a bipartite graph $\mathcal{G}_{b}=(\mathbf{R}_{\mathcal{T}},\mathbf{R}_{\mathcal{C}},\mathbf{E}_{\mathcal{C}})$. Given the complexity, it is not practical to include representations from all time steps. To address this, we maintain a working memory that selectively incorporates partial historical context: we include the representation sets at time steps $t'$ whose distance from the current time step $t$ is a power of 2. Specifically, the vertex set is defined as:

\mathbf{R}_{\mathcal{C}}=\bigcup_{t-t^{\prime}=2^{n},\ n\in\mathbb{N}^{+},\ t^{\prime}\in[t-1]}\mathbf{R}_{l}(t^{\prime})   (6)

Note that these representations are not necessarily extracted from a single dataset.

This schedule ensures that as $t$ increases, the intervals between selected time steps $t'$ expand. Consequently, representations from more distant past time steps are included, capturing long-term memory; at the same time, time steps closer to $t$ are included for smaller powers of 2, integrating recent information and addressing short-term memory. The working memory reduces the complexity of the analysis to $\mathcal{O}(\log t)$, effectively balancing the need to preserve critical information with computational efficiency.
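The schedule is straightforward to state in code; a minimal sketch:

    def working_memory_steps(t):
        """Past epochs t' kept in working memory: those with t - t' a power of two (Eq. 6)."""
        steps, gap = [], 2
        while t - gap >= 1:
            steps.append(t - gap)
            gap *= 2
        return steps  # O(log t) selected epochs

    # e.g., working_memory_steps(100) -> [98, 96, 92, 84, 68, 36]
    #       working_memory_steps(10)  -> [8, 6, 2]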

In defining the edge weight function for the bipartite graph, a key challenge is that representations from different time steps are not directly comparable, since differences in their absolute values are not meaningful. To address this, we choose cosine similarity as our edge weight function, for two reasons: 1) cosine similarity emphasizes directional similarity, which captures the direction of change during network evolution rather than just its magnitude; 2) the outcome ranges between zero and one (for non-negative activations), providing a more normalized and consistent scale. We then arrive at the following formula:

E_{\mathcal{C}}=\{(r^{t},r^{t^{\prime}},\cos(r^{t},r^{t^{\prime}}))\mid r^{t}\in\mathbf{R}_{\mathcal{T}},\ r^{t^{\prime}}\in\mathbf{R}_{\mathcal{C}}\}   (7)

Here, $\mathbf{E}_{\mathcal{C}}$ represents the relationships between the current representations and the past context.
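A sketch of computing these temporal edge weights over the working-memory slices is given below; pairing each current sample with its own past representation (same sample index) is an illustrative instantiation, not necessarily the only pairing used in practice.

    import torch
    import torch.nn.functional as F

    def temporal_edges(R_current, past_slices):
        """Bipartite edge weights (Eq. 7): cosine similarity between each current sample
        and its counterpart in every working-memory slice.

        R_current  : (n, r) tensor at epoch t.
        past_slices: dict {t_prime: (n, r) tensor} selected by working_memory_steps(t)."""
        return {t_prime: F.cosine_similarity(R_current, R_past, dim=1)
                for t_prime, R_past in past_slices.items()}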

Optimizing Embedding.

Autoencoders (AEs) are highly effective for dimension reduction [5]. We select AEs as our parametric model for learning the latent structure of our composite graph $\mathcal{G}=\mathcal{G}_{s}\cup\mathcal{G}_{b}$. In this process, we apply the UMAP cost function (see Eq. 4) combined with a reconstruction loss, optimizing both the dimension reduction and the accuracy of data representation.

Further, without prior knowledge of the statistical mean and variance of the input representation set $\mathbf{R}_{\mathcal{T}}$, we incorporate a Batch Normalization (BN) layer [6] into our autoencoder to enhance training speed and stability. Formally, with a slight abuse of notation, consider the $k$-th layer of the AE defined as $z_{k}=\mathrm{ReLU}(W_{k}x_{k}+b_{k})$, where $x_{k}$ is the input of the $k$-th layer, $W_{k}$ is the weight matrix, $b_{k}$ is the bias, and $\mathrm{ReLU}(x)=\max(x,0)$. We modify this to $z_{k}=\mathrm{ReLU}(\mathrm{BN}(W_{k}x_{k}+b_{k}))$. However, this introduces a challenge: the representations $\mathbf{R}^{m}_{l}(t)$ at each epoch $t$ experience shifts in statistical mean and variance. When training the autoencoder with both the current and the past contexts, these distributional shifts can lead to unstable training. To overcome this, we note that the Group Normalization (GN) layer performs normalization at the instance level [15]. We therefore integrate a GN layer before the initial BN layer of the encoder and before the last BN layer of the decoder:

z_{1}=\mathrm{ReLU}(\mathrm{BN}(\mathrm{GN}(W_{1}x_{1}+b_{1})))   (8)

By coupling GN with BN, our model not only retains the benefits of BN but also overcomes the adverse effects of cross-distribution normalization, leading to more stable and reliable training results.
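A minimal PyTorch sketch of the first encoder layer in Eq. 8 follows; the group count is an illustrative assumption and should divide the layer width.

    import torch.nn as nn

    class GNBNBlock(nn.Module):
        """First encoder layer of Eq. 8: ReLU(BN(GN(W x + b)))."""

        def __init__(self, in_dim, out_dim, num_groups=8):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)
            self.gn = nn.GroupNorm(num_groups, out_dim)  # instance-level normalization
            self.bn = nn.BatchNorm1d(out_dim)
            self.act = nn.ReLU()

        def forward(self, x):  # x: (batch, in_dim)
            return self.act(self.bn(self.gn(self.linear(x))))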

4.3 Graph Pruning

Empirical Observation.

From Figure 1, it is evident that the visualization quality decreases gradually and then declines sharply once the pruning ratio surpasses a certain threshold. This decline likely results from the loss of crucial topological structures (such as decision boundaries), which become difficult to estimate when the sample density is too low. At the same time, the empirical data suggest that high-quality visualization can be achieved with a modest number of samples. These observations motivate us to use sample density as an indicator of when to halt the pruning process, leading to the "Density-Guided Pruning" approach described in Algorithm 1.

Figure 1: (a) The relation between the sampling ratio and the decrease in data density. (b) The visualization performance when using a random subset of the data.
Density-Guided Pruning.

This algorithm begins by calculating the initial density of the dataset (line 1). Subsequently, a binary search is employed to identify the optimal pruning ratio that meets a predefined density threshold (lines 4-12). To speed up this search, we specify a precision level for the pruning ratio, set at $p=0.1$, which lets the algorithm converge to the optimal pruning ratio within four iterations or fewer. Additionally, to mitigate variability due to the randomness of pruning, the algorithm calculates the density three times in each iteration and uses the average as the basis for the decision. This approach is particularly effective in data-intensive scenarios.

Algorithm 1 Density-Guided Pruning with Precision Regularization
Input: $D$: original dataset, $d_{threshold}$: density threshold for pruning, $p$: precision
Output: $opt_{r}$: optimal pruning ratio
1:  $\delta \leftarrow$ CalculateInitialDensity($D$)
2:  $lower \leftarrow 0$
3:  $upper \leftarrow 1$
4:  while $upper - lower > p$ do
5:     $PruneRate \leftarrow$ RoundToPrecision($(upper+lower)/2$, $p$)
6:     $\delta' \leftarrow$ AverageCalculatedDensity($D$, $PruneRate$, 3)
7:     if $\delta' < d_{threshold}$ then
8:        $lower \leftarrow$ RoundToPrecision($PruneRate + p$, $p$)
9:     else
10:       $upper \leftarrow$ RoundToPrecision($PruneRate$, $p$)
11:    end if
12: end while
13: $opt_{r} \leftarrow$ RoundToPrecision($upper$, $p$)
14: return $opt_{r}$
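A runnable Python transcription of Algorithm 1 is sketched below; the density estimate (the inverse of the mean distance to the k-th nearest neighbor of a random subsample) is an assumed proxy, since the exact density function is not spelled out here.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def density(X, k=15):
        """Assumed density proxy: inverse of the mean k-th nearest-neighbor distance."""
        dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        return 1.0 / (dist[:, -1].mean() + 1e-12)

    def average_density(X, prune_rate, trials=3, k=15, seed=None):
        """AverageCalculatedDensity (Alg. 1, line 6): mean density over random subsamples."""
        rng = np.random.default_rng(seed)
        n_keep = max(k + 2, int(round(len(X) * (1.0 - prune_rate))))
        return float(np.mean([
            density(X[rng.choice(len(X), n_keep, replace=False)], k) for _ in range(trials)
        ]))

    def density_guided_pruning(X, d_threshold, p=0.1):
        """Binary search of Alg. 1 (lines 2-14) to precision p."""
        lower, upper = 0.0, 1.0
        while upper - lower > p + 1e-9:                            # small tolerance for float error
            rate = round(round((upper + lower) / 2 / p) * p, 10)   # RoundToPrecision
            if average_density(X, rate) < d_threshold:
                lower = round(rate + p, 10)
            else:
                upper = rate
        return round(upper, 10)                                    # optimal pruning ratio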

5 Experiments

In this section, we aim to answer the following research questions:

  • (Live-update) How efficiently can SentryCam generate visualizations?

  • (Potential compromise) Does the gain in efficiency and extensibility require SentryCam to compromise on visualization quality, and if so, how?

  • (Extensibility) Can SentryCam be applied in a continual learning setting where new data emerge over time?

5.1 Experiment Setup

To test the generalization ability of SentryCam, we generate diverse training dynamics for visualization.

Datasets

We run image classification tasks on three datasets: CIFAR-10 [8], CIFAR-100 [8], and FOOD101 [2]. We selected image datasets with varying image sizes and numbers of classes to evaluate whether our approach is effective across different scenarios. Dataset details are given in Table 1.

Table 1: Dataset details for visualization
Datasets   Classes   Image Size   Train Size   Test Size   Num per Class
CIFAR10    10        32           50000        10000       5000
CIFAR100   100       32           50000        10000       500
FOOD101    101       224          75750        25250       750
Subject Models for Visualization.

We chose two commonly used families of architectures as subject models across all datasets: CNN-based models (e.g., ResNet and its variants) and Transformer-based models (e.g., ViT and its variants). To examine whether our visualization works well in different scenarios, we follow two training recipes: 1) training from scratch, and 2) fine-tuning from a pre-trained model. For the CIFAR10 and CIFAR100 datasets, we train ResNet18/ResNet34 and a ViT with 6 layers and 8 heads from scratch for 200 epochs. For FOOD101, we fine-tune ResNet50 and ViT-B/16 for 20 epochs. More details are in Table 2.

Table 2: Training recipes for subject models
Training Recipe     Dataset    Model Arch                lr    optimizer  batch size  scheduler          epochs  Final Accu
Train from Scratch  CIFAR10    ResNet18                  1e-2  SGD        128         MultiStepLR        200     0.9393
Train from Scratch  CIFAR100   ResNet34                  1e-2  SGD        128         MultiStepLR        200     0.7721
Train from Scratch  CIFAR10    ViT (8 heads, 6 layers)   1e-4  Adam       128         CosineAnnealingLR  200     0.8035
Train from Scratch  CIFAR100   ViT (8 heads, 6 layers)   1e-4  Adam       128         CosineAnnealingLR  200     0.5581
Fine-Tune           FOOD101    ResNet50                  1e-4  Adam       256         MultiStepLR        20      0.6723
Fine-Tune           FOOD101    ViT-B/16                  3e-4  Adam       256         CosineAnnealingLR  20      0.8328
Training Recipes for Visualization Models

For embeddings from a high-dimensional space with $d$ dimensions, the visualization model architecture is designed as follows: an encoder with layer widths $(d, \lfloor d/2\rfloor, \lfloor d/4\rfloor, \lfloor d/8\rfloor, \lfloor d/16\rfloor, 2)$ and a decoder with widths $(2, \lfloor d/16\rfloor, \lfloor d/8\rfloor, \lfloor d/4\rfloor, \lfloor d/2\rfloor, d)$. We use unified hyperparameters for all training cases: the Adam optimizer with learning rate 1e-2 and weight decay 1e-5, and a StepLR learning-rate scheduler with step size 4 and gamma 0.1.
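A sketch of this architecture and optimization setup in PyTorch follows; the input dimension d = 512 is illustrative, and the normalization layers of Section 4.2 are omitted for brevity.

    import torch
    import torch.nn as nn

    def make_visualizer(d):
        """Encoder with widths (d, d/2, d/4, d/8, d/16, 2) and the mirrored decoder."""
        dims = [d, d // 2, d // 4, d // 8, d // 16, 2]

        def mlp(sizes):
            layers = []
            for a, b in zip(sizes[:-1], sizes[1:]):
                layers += [nn.Linear(a, b), nn.ReLU()]
            return nn.Sequential(*layers[:-1])  # no activation after the final layer
        return mlp(dims), mlp(dims[::-1])

    encoder, decoder = make_visualizer(d=512)
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-2, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)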

Baselines.

We evaluate SentryCam against multiple baselines: (1) DVI, a parametric dimension reduction method that employs sequential training, and (2) TimeVis, an autoencoder-based method that works in a post-hoc manner. For the following experiments, we use the configurations stated in their original papers.

5.2 Efficiency (RQ1)

We first assess whether SentryCam can provide live-update visualizations concurrently with the training of the subject models. We report the average training time per epoch (ATT) for different models on different datasets and the average visualization time delay per epoch (AVT), i.e., the time to generate a visualization after a checkpoint is produced.

As demonstrated in Table 3, our method consistently outperforms the baseline approaches across various datasets and model architectures. Technically, TimeVis builds a unified graph over all checkpoints after the model training is finished, which makes it the slowest. While DVI is also a post-hoc method, we have modified it for our experiments to invoke the visualization function immediately after each checkpoint is created. Despite this adaptation, DVI's AVT is still twice that of ours, and its delay for the latest checkpoint accumulates quickly due to its sequential processing.

Moreover, our method could provide live-update visualization as shown in Table 3. For the FOOD101 dataset, SentryCam is capable of generating a visualization of one checkpoint before the next one appears. For the CIFAR10 and CIFAR100 datasets, SentryCam experiences a delay of 2-5 epochs, which is considered entirely acceptable given that the subject model is trained for hundreds of epochs.

Table 3: Efficiency comparison of visualization methods on the CIFAR10, CIFAR100, and FOOD101 datasets. ATT stands for the average training time per epoch; AVT stands for the average visualization time per epoch. The delay is the time it takes to generate a visualization after a checkpoint is produced. Values are grouped per dataset in the order ATT / AVT / Min Delay / Max Delay / Avg Delay, as reported.
CNN based
  DVI        CIFAR10: 10.6 / 112.6 / 2236.6 / 22521.6 / 12374.4;  CIFAR100: 16.0 / 113.8 / 3321.8 / 25974.6 / 14647.7;  FOOD101: 237.6 / 419.0 / 5089.4 / 5247.3 / 5171.4
  TimeVis    CIFAR10: - / 5083.0 / 5083.0 / 5083.3;  CIFAR100: - / 7179.5 / 7179.5 / 7179.5;  FOOD101: - / 8889.0 / 8889.0 / 8889.0
  SentryCam  CIFAR10: 65.7 / 40.4 / 87.4 / 65.7;  CIFAR100: 61.3 / 42.6 / 84.4 / 61.3;  FOOD101: 234.6 / 212.6 / 245.3 / 234.6
ViT based
  DVI        CIFAR10: 27.8 / 120.3 / 5654.5 / 24066.2 / 14863.4;  CIFAR100: 50.6 / 119.4 / 10229.4 / 23877.4 / 17058.3;  FOOD101: 683.2 / 677.5 / 14209.1 / 14441.6 / 14340.5
  TimeVis    CIFAR10: - / 8269.3 / 8269.3 / 8269.3;  CIFAR100: 7254.1 / 7254.1 / 7254.1;  FOOD101: - / 8854.7 / 8854.7 / 8854.7
  SentryCam  CIFAR10: 67.2 / 39.0 / 75.9 / 67.2;  CIFAR100: 66.1 / 38.8 / 70.3 / 66.1;  FOOD101: 484.1 / 430.0 / 510.3 / 484.1

5.3 Visualization Quality (RQ2)

We quantitatively evaluate how SentryCam performs in visualization quality compared to other baselines. We also conduct a qualitative evaluation of the visualization results.

5.3.1 Quantitative Analysis

Evaluation Metrics.

Following M-PHATE [4], DVI [24], and TimeVis [23], we evaluate the visualization performance quantitatively with the following metrics.

  • Trustworthiness: evaluates to what extent the local structure is retained. Formally,

    T(k)=1-\frac{2}{Nk(2N-3k-1)}\sum_{i=1}^{N}\sum_{j\in\mathcal{N}_{i}^{k}}\max(0,\,r(i,j)-k)   (9)

    where $N$ and $k$ denote the number of samples and the neighborhood size, respectively, $\mathcal{N}_{i}^{k}$ is the set of the $k$ nearest neighbors of sample $i$ in the low-dimensional space, and $r(i,j)$ is the rank of $j$ when ordering samples by their distance from $i$ in the high-dimensional space.

  • Intraslice Neighbor Preservation: evaluates how many of the k nearest neighbors are preserved after dimension reduction at a given time step $t$ (a sketch computing this metric follows this list). Formally,

    \frac{1}{N}\sum_{i=1}^{N}\Big{|}\mathcal{N}^{k}_{High}(t,i)\cap\mathcal{N}^{k}_{Low}(t,i)\Big{|}   (10)

    where $N$ is the total number of samples, and $\mathcal{N}^{k}_{High}(t,i)$ ($\mathcal{N}^{k}_{Low}(t,i)$) is the set of k nearest neighbors of sample $i$ at time step $t$ in the high-dimensional (low-dimensional) space.

  • Interslice Neighbor Ranking Correlation: evaluates whether the visualization method faithfully shows the movement of samples across time steps. Formally, for a sample $x^{t}_{i}$ at time step $t$, let $r^{t}_{i}$ and $\tilde{r}^{t}_{i}$ be the rankings of its counterparts at other time steps $\{x^{t'}_{i}\}_{t'\in[T]}$, ordered by distance in the high-dimensional and low-dimensional spaces respectively; following [23], we define the correlation between the two rankings as:

    Corr_{i}=\mathrm{spearman}(r^{t}_{i},\tilde{r}^{t}_{i})   (11)
  • Reconstruction Accuracy: evaluates whether the reconstructed representations yield the same predictions as the original ones.
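Minimal sketches of the intraslice neighbor preservation (Eq. 10) and the interslice ranking correlation (Eq. 11) are given below; trustworthiness (Eq. 9) can be computed directly with sklearn.manifold.trustworthiness. Using the first epoch as the reference slice for the interslice ranking is an illustrative choice, not necessarily the reference used in the experiments.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.neighbors import NearestNeighbors

    def knn_indices(X, k):
        """Indices of the k nearest neighbors of each sample (self excluded)."""
        return NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)[:, 1:]

    def intraslice_neighbor_preservation(R_high, R_low, k=15):
        """Eq. 10: average overlap of k-NN sets before and after dimension reduction."""
        nn_high, nn_low = knn_indices(R_high, k), knn_indices(R_low, k)
        return float(np.mean([len(set(h) & set(l)) for h, l in zip(nn_high, nn_low)]))

    def interslice_ranking_correlation(high_slices, low_slices, i, ref=0):
        """Eq. 11: Spearman correlation between sample i's cross-epoch distance rankings
        in high- and low-dimensional space (slices are lists of (n, dim) arrays over epochs)."""
        d_high = [np.linalg.norm(R[i] - high_slices[ref][i]) for R in high_slices]
        d_low = [np.linalg.norm(Y[i] - low_slices[ref][i]) for Y in low_slices]
        rho, _ = spearmanr(d_high, d_low)
        return rho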

Figure 2: Intraslice Neighbor Preservation (k=15). Panels: (a) ResNet, (b) ViT.
Figure 3: Intraslice Trustworthiness. Panels: (a) ResNet, (b) ViT.
Figure 4: Reconstruction Accuracy. Panels: (a) ResNet, (b) ViT.
Spatial Properties Evaluation

We conduct the neighbor preservation experiment with $k=15$. Figures 2 and 3 show the results for Intraslice Neighbor Preservation and Trustworthiness, respectively. We split the training process into three stages, namely early, middle, and late, and report results for each. As shown in the figures, our method outperforms existing approaches in most cases and performs on par with the best alternative in the remaining ones.

Figure 4 shows the reconstruction accuracy. Again, SentryCam outperforms existing approaches in most cases and performs on par with the best alternative in the remaining ones. In addition, our method is more robust than the other baselines: TimeVis fails to reconstruct data for the FOOD101 dataset with the ResNet architecture, and DVI fails for CIFAR100 with the ViT architecture. In contrast, our method performs without failures across all cases. Furthermore, we observe that SentryCam excels particularly on the more complex datasets.

Temporal Property Evaluation

Figure 5 presents the temporal neighbor ranking correlation between high-dimensional data and their low-dimensional representations. SentryCam performs comparably to TimeVis, and both significantly outperform DVI. Technically, TimeVis processes embeddings from all time steps, whereas DVI considers only the previous time step. In contrast, our approach strikes a balance by incorporating representations from variously spaced time intervals. Given that we utilize less data than TimeVis, it is expected that TimeVis would represent an upper performance bound for our method, which is confirmed by the results shown in Figure 5.

It is important to note that for ResNet50 trained on the FOOD101 dataset, DVI quantitatively outperforms SentryCam. However, despite its superior numerical performance in this instance, we show in Section 5.3.2 that its visualization results are less meaningful, as they tend to collapse into a single cluster, a significant drawback despite the good numeric results.

Figure 5: Interslice Neighbor Ranking Correlation. Panels: (a) ResNet, (b) ViT.

5.3.2 Qualitative Analysis

To further assess the effectiveness of various visualization techniques, we examine the visualization results from different methods applied to the embeddings of ResNet and ViT on the FOOD101 dataset at the 20th epoch, as depicted in Figures 7 and 6. Only SentryCam has managed to maintain the integrity of data clusters and present a coherent classification landscape without significant overlap or dispersion.

In particular, DVI produces a linear, elongated cluster as in Figure 7(a). Despite its superior performance in terms of interslice temporal neighbor correlation shown in Figure 5(a), it fails to adequately capture the complex relationships and variations within the data. This issue may stem from an overemphasis on certain dimensions or features, which could lead to a representation that overly highlights outliers and underrepresents the core structural elements of the dataset.

Therefore, SentryCam offers the more balanced approach among these visualization methods, ensuring both accuracy and comprehensiveness in representing data relationships.

Figure 6: Comparison of visualization results on ViT over the FOOD101 dataset. Panels: (a) DVI, (b) TimeVis, (c) SentryCam.
Figure 7: Comparison of visualization results on ResNet over the FOOD101 dataset. Panels: (a) DVI, (b) TimeVis, (c) SentryCam.

5.4 Extensibility to new data (RQ3)

Beyond the fundamental approach of supervised learning, more complex tasks such as online learning present significant challenges due to the continuous influx of new data. Continual learning is a paradigm designed to address these challenges effectively. In this section, we present a case study demonstrating how SentryCam can support detailed analysis in such a scenario.

We consider two common scenarios of continual learning: domain incremental learning and class incremental learning [21]. Domain incremental learning (DIL) involves learning to solve the same problem in different contexts. For example, in the MNIST dataset, the model needs to continually learn how to predict whether digits are odd or even as new numbers emerge. Class Incremental Learning (CIL) focuses on distinguishing between incrementally observed classes. For instance, in the MNIST dataset, the model needs to continually learn how to classify new digits as they appear.

We implement a simple CNN on split-MNIST with 400 units in the penultimate layer and perform DIL and CIL with two baselines: a regularization-based approach, Functional Regularization Of the Memorable Past (FROMP) [13], and a replay-based method, Experience Replay (ER) [17]. We use the same network architecture and the same dataset across all baselines and scenarios.

Table 4: Accuracy for continual learning under different scenarios.
Scenario  Baseline  Context 1  Context 2  Context 3  Context 4  Context 5  Avg Accu
Domain    ER        0.943      0.936      0.828      0.972      0.993      0.934
Domain    FROMP     0.608      0.931      0.607      0.977      0.993      0.823
Class     ER        0.931      0.823      0.764      0.895      0.988      0.880
Class     FROMP     0.910      0.821      0.713      0.572      0.738      0.751
Identify Catastrophic Forgetting

Catastrophic forgetting refers to the phenomenon where a model loses previously acquired knowledge upon learning new information. Figures 8 and 9 display visualizations from a Domain Incremental Learning scenario, utilizing the ER and FROMP strategies respectively, which illustrate the distribution of data from earlier contexts.

These visualizations indicate that both models experience some degree of forgetting as they acquire new tasks, with data from Context 1 exhibiting the most significant loss of detail. Conversely, the data from Context 4 retains a more defined cluster shape, suggesting that more recent contexts are less affected by forgetting. This pattern highlights the models’ varying ability to preserve earlier learned information over successive learning phases.

Figure 8: Visualization of data from previous contexts under the domain incremental learning setting with strategy ER. Panels: (a)-(e) data from Contexts 1-5.
Figure 9: Visualization of data from previous contexts under the domain incremental learning setting with strategy FROMP. Panels: (a)-(e) data from Contexts 1-5.
Figure 10: Visualization of data from later contexts under the class incremental learning setting with strategy ER. Panels: (a)-(e) data from Contexts 1-5.
Figure 11: Visualization of data from later contexts under the class incremental learning setting with strategy FROMP. Panels: (a)-(e) data from Contexts 1-5.
Evaluate Task Transfer

Task transfer in continual learning assesses a model’s capability to leverage knowledge acquired from previously learned tasks when addressing new, related tasks. Figures 10 and 11 display visualizations of data from later contexts in a Class Incremental Learning setting, specifically starting from Context 1. In the ER strategy, there is noticeable separation among the data points, although significant overlap still exists. This suggests that while there is some retention and differentiation of learned knowledge, there is room for improvement in how distinctly the model can separate new task data from existing contexts. This overlap, however, also implies a blending of features that might foster generalization, indicating a promising direction for future tasks. In contrast, the FROMP strategy exhibits a high degree of overlap among almost all data from later tasks. This extensive overlap could present challenges in future contexts, as it indicates a potential difficulty in distinguishing new task data from previous learning, which may complicate the learning of distinct new tasks.

6 Related Works

Hidden Activation Visualization.

Visualizing the evolving hidden representations of DNNs can be seen as a dimension reduction problem. Typical dimension reduction techniques can be divided into non-parametric and parametric approaches. Non-parametric methods directly optimize the low-dimensional embeddings towards a predefined cost function. t-SNE [22] and UMAP [10] are two typical non-parametric approaches: they construct a k-nearest-neighbor graph of the high-dimensional dataset and preserve the local relationships between samples. As for parametric methods, autoencoders and their variants are commonly used for dimension reduction [5]. Recently, Topological Autoencoders [12] added a loss term based on persistent homology to preserve the topological structure of the target dataset. Diffusion Maps [3] provide another perspective on dimension reduction: together with PHATE [11], they use an eigen-decomposition of a graph transition matrix to project the data into a lower-dimensional space that preserves important relational characteristics.

More recent work has designed methods specifically for visualizing the evolving state of DNNs to monitor and debug them. M-PHATE [4] adapts PHATE [11] to visualize the training dynamics of DNNs for model selection by applying MDS to the graph transition matrix. DVI [24] and TimeVis [23] are two autoencoder-based approaches that unveil how the embedding space and decision boundaries evolve during training. These approaches apply visualization in a post-hoc manner, making them a poor choice for real-time monitoring of DNNs.

Training dynamics for Model debugging.

Some researchers leverage the training dynamics of individual samples to identify their influence on the model. [20] evaluate an individual training example by counting the number of transitions from being classified correctly to incorrectly throughout training, and find that such "forgettable samples" generally do not contribute to the model's generalization performance. Data Maps [19] record the mean and standard deviation of the gold-label probabilities predicted for each example across training epochs to distinguish easy samples from hard ones. AUM [16] identifies mislabeled samples through their logit training traces, distinguishing them from clean samples. Example Difficulty [1] assigns scores to samples based on the effective prediction depth, i.e., the layer within the deep model at which a sample is first classified correctly. SSFT [9] differentiates hard samples from mislabeled ones by splitting the data into two subsets and training on them sequentially: mislabeled samples tend to be forgotten during training on the second subset, whereas hard samples are not.

Some researchers are also interested in how human-understandable concepts evolve during training. ConceptEvo [14] and Concept-Monitor [7] extract the concept learned by each neuron at each epoch and measure the concept diversity of neurons.

Different from these approaches, which each target one specific model defect, our approach supports open-ended exploration that can lead to the detection of various errors. Our method is complementary to theirs.

7 Conclusion

In this work, we present SentryCam, a novel visualization technique to monitor the internal progression of deep neural network training. We propose three requirements that visualization methods must satisfy to serve as a monitoring system for DNNs, and we show that SentryCam satisfies all of them. We demonstrate the superiority of SentryCam over other visualization methods in both quality and efficiency. SentryCam is further showcased through vignettes in standard training and continual learning, drawing conclusions that would be challenging to reach without such a visualization. In conclusion, this work highlights the utility of SentryCam as a valuable visualization tool for deep learning practitioners.

References

  • Baldock et al. [2021] Robert John Nicholas Baldock, Hartmut Maennel, and Behnam Neyshabur. Deep learning through the lens of example difficulty. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=WWRBHhH158K.
  • Bossard et al. [2014] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014.
  • Coifman and Lafon [2006] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
  • Gigante et al. [2019] Scott Gigante, Adam S Charles, Smita Krishnaswamy, and Gal Mishne. Visualizing the phate of neural networks. Advances in neural information processing systems, 32, 2019.
  • Hinton and Salakhutdinov [2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015.
  • Khan et al. [2023] Mohammad Ali Khan, Tuomas Oikarinen, and Tsui-Wei Weng. Concept-monitor: Understanding dnn training through individual neurons, 2023.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Maini et al. [2022] Pratyush Maini, Saurabh Garg, Zachary C. Lipton, and J. Zico Kolter. Characterizing datapoints via second-split forgetting, 2022.
  • McInnes et al. [2018] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
  • Moon et al. [2017] Kevin R Moon, David van Dijk, Zheng Wang, William Chen, Matthew J Hirn, Ronald R Coifman, Natalia B Ivanova, Guy Wolf, and Smita Krishnaswamy. Phate: a dimensionality reduction method for visualizing trajectory structures in high-dimensional biological data. BioRxiv, 120378, 2017.
  • Moor et al. [2020] Michael Moor, Max Horn, Bastian Rieck, and Karsten Borgwardt. Topological autoencoders. In International conference on machine learning, pages 7045–7054. PMLR, 2020.
  • Pan et al. [2020] Pingbo Pan, Siddharth Swaroop, Alexander Immer, Runa Eschenhagen, Richard Turner, and Mohammad Emtiyaz E Khan. Continual deep learning by functional regularisation of memorable past. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4453–4464. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/2f3bbb9730639e9ea48f309d9a79ff01-Paper.pdf.
  • Park et al. [2022] Haekyu Park, Seongmin Lee, Benjamin Hoover, Austin Wright, Omar Shaikh, Rahul Duggal, Nilaksh Das, Judy Hoffman, and Duen Horng Chau. Conceptevo: Interpreting concept evolution in deep learning training. arXiv preprint arXiv:2203.16475, 2022.
  • Pham et al. [2022] Quang Pham, Chenghao Liu, and Steven HOI. Continual normalization: Rethinking batch normalization for online continual learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vwLLQ-HwqhZ.
  • Pleiss et al. [2020] Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, and Kilian Q. Weinberger. Identifying mislabeled data using the area under the margin ranking, 2020.
  • Rolnick et al. [2019] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/fa7cdfad1a5aaf8370ebeda47a1ff1c3-Paper.pdf.
  • Sainburg et al. [2021] Tim Sainburg, Leland McInnes, and Timothy Q Gentner. Parametric umap embeddings for representation and semi-supervised learning, 2021.
  • Swayamdipta et al. [2020] Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of EMNLP, 2020. URL https://arxiv.org/abs/2009.10795.
  • Toneva et al. [2018] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.
  • van de Ven et al. [2022] Gido M van de Ven, Tinne Tuytelaars, and Andreas S Tolias. Three types of incremental learning. Nature Machine Intelligence, 4:1185–1197, 2022.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Yang et al. [2022a] Xianglin Yang, Yun Lin, Ruofan Liu, and Jin Song Dong. Temporality spatialization: A scalable and faithful time-travelling visualization for deep classifier training. In IJCAI, pages 4022–4028, 2022a.
  • Yang et al. [2022b] Xianglin Yang, Yun Lin, Ruofan Liu, Zhenfeng He, Chao Wang, Jin Song Dong, and Hong Mei. Deepvisualinsight: Time-travelling visualization for spatio-temporal causality of deep classification training. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5359–5366, 2022b.
