
Supplementary Material: FSL Framework to Reduce Inter-Observer Variability

I The ParESN Model

The Parallel ESN framework presented in this work is inspired by the previous works in [sohiniesn] and [esn2]. The primary difference from the model in [sohiniesn] is the variation in importance assigned to the hidden states with respect to a previous pixel in the same image vs. the same pixel in the previous image. With respect to [esn2], the parallel branches in this work generate similarly trained regional proposals (RPs), instead of the cumulative combination of the three parallel streams in [esn2]. Also, the choice of three parallel layers in this work is optimal among the candidate set {2, 3, 4, 5} of parallel RPs, based on a grid search and leave-one-out cross-validation across vendor image stacks. The detailed system setup is shown in Fig. 1.

Refer to caption
Figure 1: System setup of the proposed ParESN model. Each image and its 3 pre-processed planes are converted into an input matrix \mathbf{U} and fed to the 3 parallel ESN branches to update the reservoir state matrix \mathbf{X} per branch. At the end of the training process, a c-dimensional vector is output per pixel location, where c represents the number of classes to be predicted (c = 2 for binary segmentation). The per-pixel output p(k) \in \{P_1, P_2, P_3\} is the class label with maximal probability. The regional proposals (RPs), i.e. P_1, P_2, P_3 from the 3 parallel arms, represent cyst-like pixels that are trained from the same images and similar training setups. To demonstrate qualitative overlap between the RPs, the RPs from the top, middle and bottom ESN arms are visualized in the red, blue and green image planes, respectively.

The OCT cyst images per vendor stack represent a volumetric-level scan. This implies that if a large cyst appears in a scan, there is a high probability of the same cyst appearing in some shape and form in the previous and succeeding scans as well. The proposed setup is analogous to the video processing setup in [sohiniesn]. Thus, the reservoir states per pixel location of subsequent images will be affected by the previous and current image, as represented in (1).

\mathbf{x}_{\nu}(k) = (1-\alpha)\,\mathbf{x}_{\nu}(k-1) + \alpha\, f\big(\mathbf{W}_{\mathrm{in},\nu}\,[1;\mathbf{u}(k)] + \mathbf{W}_{\nu}\,\mathbf{x}_{\nu}(k-1)\big)    (1)
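As a concrete illustration, the leaky state update in (1) can be sketched in NumPy as below. The choice of tanh for the activation f is an assumption (a common ESN default, not stated here), and the weight shapes are illustrative:

```python
import numpy as np

def reservoir_update(x_prev, u, W_in, W, alpha=0.95):
    """One leaky-integrator reservoir state update, following Eq. (1).

    x_prev : previous reservoir state x_nu(k-1), shape (n,)
    u      : current input vector u(k), shape (m,)
    W_in   : input weights W_in,nu, shape (n, m+1); the +1 absorbs the bias
    W      : recurrent reservoir weights W_nu, shape (n, n)
    alpha  : leak rate (0.95 was found optimal in the cross-validation)
    """
    z = np.concatenate(([1.0], u))      # the extended input [1; u(k)]
    pre = W_in @ z + W @ x_prev         # pre-activation of the reservoir
    return (1 - alpha) * x_prev + alpha * np.tanh(pre)
```

With alpha close to 1, the state is dominated by the current input rather than the carried-over memory, matching the low "leaky-memory" requirement reported for this data set.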

At the end of the training stage, the output weights \mathbf{w}_{\mathrm{out},\nu}, \nu = \{1, 2, 3\}, are computed for each parallel layer using (2).

\mathbf{w}_{\mathrm{out},\nu} = \Big(\sum_{l=1}^{L} \mathbf{z}_{\nu,l}(k)\,\mathbf{z}_{\nu,l}^{\mathrm{T}}(k) + \lambda\,\mathds{1}\Big)^{-1}\Big(\sum_{l=1}^{L} \mathbf{z}_{\nu,l}(k)\,y(k)\Big)    (2)

where \mathbf{z}_{\nu,l}(k) = [1; \mathbf{u}(k); \mathbf{x}_{\nu}(k)] are the extended system states for the 3 parallel layers, evaluated over l = \{1, 2, \dots, L\} training images, y(k) represents the target label at pixel location k, and \mathds{1} represents the identity matrix.
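The closed-form ridge readout in (2) can be sketched as follows. Stacking the extended states column-wise into a single matrix Z is an implementation convenience assumed here, not a detail stated in the text:

```python
import numpy as np

def compute_readout(Z, y, lam=1e-5):
    """Closed-form ridge-regression readout, following Eq. (2).

    Z   : extended states z = [1; u(k); x(k)] stacked column-wise over
          all training samples, shape (d, L)
    y   : target labels y(k), shape (L,)
    lam : Tikhonov regularizer lambda (1e-5 was found optimal)
    """
    d = Z.shape[0]
    A = Z @ Z.T + lam * np.eye(d)   # sum of z z^T plus lambda * identity
    b = Z @ y                       # sum of z y(k)
    return np.linalg.solve(A, b)    # solve A w = b; avoids explicit inversion
```

Using np.linalg.solve rather than forming the inverse directly is a standard numerical-stability choice for this normal-equations form.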

The leave-one-out cross-validation experiment across vendor stacks helps identify the optimal parameter set \{\alpha, \lambda\} in (1) and (2), respectively. We observe \alpha = 0.95 to be optimal from the search set [0.30, 0.99] in increments of 0.02. This implies a very low "leaky-memory" requirement for the data set [DeepESN]. Also, the sensitivity to \lambda is found to be very low, with \lambda = 10^{-5} being optimal for the setup within the search set [10^{-10}, 0.1] in order-of-magnitude increments.
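A minimal sketch of the leave-one-stack-out grid search described above. Here train_eval is a hypothetical placeholder for training the ParESN on the retained stacks and scoring segmentation quality on the held-out stack; only the search loop itself is shown:

```python
import numpy as np
from itertools import product

def grid_search_loo(stacks, train_eval, alphas, lambdas):
    """Leave-one-vendor-stack-out grid search for (alpha, lambda).

    stacks     : list of vendor image stacks
    train_eval : hypothetical callable(train_stacks, held_out, alpha, lam)
                 returning a validation score (higher is better)
    """
    best, best_score = None, -np.inf
    for alpha, lam in product(alphas, lambdas):
        scores = []
        for i, held_out in enumerate(stacks):
            train = stacks[:i] + stacks[i + 1:]   # leave one stack out
            scores.append(train_eval(train, held_out, alpha, lam))
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best, best_score = (alpha, lam), mean_score
    return best, best_score

# Search grids as reported: alpha over [0.30, 0.99] in steps of 0.02,
# lambda over [1e-10, 0.1] in order-of-magnitude steps.
alphas = np.arange(0.30, 0.99 + 1e-9, 0.02)
lambdas = [10.0 ** e for e in range(-10, 0)]
```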

II Qualitative Analysis of the Target Label Selection Algorithm (TLSA)

The key contribution of our work is the use of the RPs to detect the "best" manual annotation at the image level. Examples of the TLSA for TL vs. noisy-TL selection and for G_1 vs. G_2 selection are shown below.
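A minimal sketch of how such an image-level selection could work, assuming Dice is used as the overlap metric (the specific metric is not restated here) and that candidate TLs are ranked by their mean overlap with the three RPs:

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom else 1.0

def select_target_label(rps, candidates):
    """Pick the candidate TL with the highest mean overlap against
    the regional proposals P1, P2, P3.

    rps        : list of 3 binary RP masks from the parallel ESN arms
    candidates : list of candidate TL masks (e.g. [G1, G2])
    Returns the index of the selected candidate.
    """
    mean_overlap = [np.mean([dice(rp, tl) for rp in rps]) for tl in candidates]
    return int(np.argmax(mean_overlap))
```

The mean over the three RPs, rather than any single RP, is what makes the selection robust when individual arms disagree, as illustrated in the figures below.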

II-A Examples of TLSA against noisy labels

For this experiment, the Random Crop and Paste (RCAP) function is invoked to generate noisy TLs. In Fig. 2, the actual TLs are shown in red, the noisy generations using RCAP in blue, and their intersections as white regions. Also, the RPs P_1, P_2, P_3 are represented in the red, green and blue planes, respectively. Fig. 2 provides qualitative explanations for manual interventions and for automatic selection of the actual TL over its noisy counterparts.
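A hedged sketch of what an RCAP-style perturbation could look like: a random patch of the annotation mask is cut and pasted at a random offset. The patch size and paste policy here are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def random_crop_and_paste(tl, crop_size=16, rng=None):
    """Illustrative Random-Crop-and-Paste perturbation of a binary TL mask.

    tl        : 2D boolean annotation mask
    crop_size : side length of the square patch to relocate (assumed)
    rng       : optional numpy Generator for reproducibility
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = tl.copy()
    h, w = tl.shape
    r = rng.integers(0, h - crop_size)
    c = rng.integers(0, w - crop_size)
    patch = tl[r:r + crop_size, c:c + crop_size].copy()
    noisy[r:r + crop_size, c:c + crop_size] = False        # crop out the patch
    r2 = rng.integers(0, h - crop_size)
    c2 = rng.integers(0, w - crop_size)
    noisy[r2:r2 + crop_size, c2:c2 + crop_size] |= patch   # paste it elsewhere
    return noisy
```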

Refer to caption
(a) Left: TL and noisy TLs. Right: RPs from the ParESN model. The very small difference between the TL and noisy TLs (left), together with the high variability in regional overlap between RPs (right), requires manual intervention for all these images.
Refer to caption
(b) Left: TL and noisy TLs. Right: RPs from the ParESN model. In all these cases, a different RP is found to have the most overlap with the TL or noisy TL, respectively. However, the mean overlap metric between the RPs and the TLs is the decisive metric for selecting the correct TL over its noisy counterparts in each case.
Figure 2: TL vs. noisy TL assessment.

II-B Examples of TLSA for best TL selection

In Fig. 3, qualitative examples of best-TL selection are shown. The left columns represent the TLs, with G_1 in red, G_2 in blue and their intersection in white. The right columns represent the RPs in the red, green and blue planes, respectively. In Fig. 3(a), the few-shot learning (FSL) models are trained on G_1, but the RPs agree more with G_2; hence, G_2 is selected as the best label for all these images. In Fig. 3(b), the FSL models are trained on G_2, but the RPs agree more with G_1; hence, G_1 is selected as the best label for all these images. The examples in Fig. 3 demonstrate the extent of variability between the manual target labels G_1 and G_2.

For the images in Fig. 3(a), the G_1 annotations represented many contiguous small cyst regions, based on which the RPs are trained to identify small contiguous cysts in test images. However, G_1 annotates some larger cyst areas in the test images, as opposed to G_2, which detects smaller cysts. Thus, G_2 is preferable in all such examples in Fig. 3(a). For the images in Fig. 3(b), the RPs trained on G_2 have an affinity for detecting the large cysts that appear in G_1. Hence, in all these examples, G_1 is the preferred TL in spite of the FSL models being trained on G_2. From the examples in Fig. 3, we are able to qualitatively assess the importance of the TLSA for standardizing cyst segmentation to overcome inter-observer variability.

Refer to caption
(a) Left: TLs. Right: RPs from the ParESN model. RPs are trained on G_1, but agree more with G_2 (in blue).
Refer to caption
(b) Left: TLs. Right: RPs from the ParESN model. RPs are trained on G_2, but agree more with G_1 (in red).
Figure 3: Assessment of best TL selection.