
Learning Class Unique Features in Fine-Grained Visual Classification: Appendix

Aeiau Zzzz    Bauiu C. Yyyy    Cieua Vvvvv    Iaesut Saoeu    Fiuea Rrrr    Tateu H. Yasehe    Aaoeu Iasoh    Buiui Eueu    Aeuia Zzzz    Bieea C. Yyyy    Teoau Xxxx    Eee Pppp
Machine Learning, ICML

1 Appendix 1: Experimental Details

Our experiments are performed on fine-grained visual classification (FGVC) benchmarks: CUB-200-2011 (wah2011caltech), FGVC-Aircraft (maji2013fine), and Stanford Cars (KrauseStarkDengFei-Fei_3DRR2013), and on standard visual classification benchmarks: CIFAR-10 (krizhevsky2009learning), CIFAR-100 (krizhevsky2009learning), and STL-10 (coates2011analysis). The statistics of the six datasets are shown in Table 1.

Different methods are compared on ResNet18 (he2016deep), VGGNet11 (simonyan2014very), and DenseNet161 (huang2017densely). We conduct all our experiments in the PyTorch framework (paszke2019pytorch) on NVIDIA 2080Ti GPUs. On all datasets, we perform three-fold cross-validation to choose the hyper-parameters (including the algorithm-specific parameters, learning rate, decay rate, and weight decay; see Table 2). We then evaluate all models over three runs and report the mean and standard deviation on the test set. We use Stochastic Gradient Descent (SGD) to update the model parameters. All images are normalized and augmented by random crop and random horizontal flip. For algorithm-specific hyper-parameters, we set the weight of Confidence Penalty (CP) to 1.00, the smoothing rate of Label Smoothing (LS) to 0.10, and p_{t} of Minimax Loss (MM) to 0.85. For FRL, we set K = 10 and \lambda = 1.

Table 1: Statistics of six datasets in this work.
Dataset #Training #Testing #Categories
CUB-200-2011 5994 5794 200
FGVC-Aircraft 6667 3333 100
Stanford Cars 8144 8041 196
CIFAR-10 50000 10000 10
CIFAR-100 50000 10000 100
STL-10 5000 8000 10
Table 2: Training hyper-parameters for experiments. In the LR policy column, n1/n2 means the learning rate is multiplied by n1 every n2 epochs.
Dataset Image size Crop size Batch size Epochs Learning rate Weight decay LR policy
CIFAR-10 32×32 32×32 128 200 0.1 0.0005 0.2/50
CIFAR-100 32×32 32×32 128 200 0.1 0.0005 0.2/50
STL-10 96×96 96×96 64 200 0.1 0.0005 0.2/50
CUB-200-2011 512×512 448×448 16 60 0.004 0.0005 0.9/2
FGVC-Aircraft 512×512 448×448 16 60 0.008 0.0005 Cosine
Stanford Cars 512×512 448×448 16 60 0.01 0.0005 Cosine
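As a concrete illustration of the algorithm-specific settings above, the sketch below (not the authors' implementation; the function names are ours) builds the Label Smoothing target with smoothing rate 0.10 and the Confidence Penalty term with weight 1.00:

```python
import numpy as np

def label_smoothing_targets(target, n_classes, eps=0.10):
    """LS target: 1 - eps on the true class, eps spread evenly elsewhere."""
    t = np.full(n_classes, eps / (n_classes - 1))
    t[target] = 1.0 - eps
    return t

def confidence_penalty(probs, weight=1.00):
    """CP term added to the loss: the negative entropy of the prediction,
    scaled by its weight; it is smallest for near-uniform predictions."""
    entropy = -np.sum(probs * np.log(probs))
    return -weight * entropy

# Example: a 10-class problem with true class 2.
targets = label_smoothing_targets(2, 10)
penalty = confidence_penalty(np.full(10, 0.1))  # uniform prediction
```

With eps = 0.10 the true class keeps probability 0.9 and each of the other nine classes receives 0.1/9.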

2 Appendix 2: The Proofs

2.1 Proof of Lemma 1

Proof.

From previous discussions, we have

I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) \leq \log(n-1) - H_{\theta}(\hat{Y}_{C\setminus t}|X^{t}).

On the other hand, when the conditional probability distribution over non-target classes is uniform, we have

H_{\theta}(\hat{Y}_{C\setminus t}|X^{t}) = \log(n-1).

Therefore I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) \leq 0; combining this with the fact that mutual information is always non-negative gives I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) = 0. Finally, since \Phi(X^{t}) is a function of X^{t}, the data processing inequality gives I_{\theta}(\hat{Y}_{C\setminus t};\Phi(X^{t})) \leq I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) = 0, and thus I_{\theta}(\hat{Y}_{C\setminus t};\Phi(X^{t})) = 0. ∎
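The lemma's boundary case can be checked numerically. The following is a minimal sketch with a hypothetical class count n = 5: when the predicted distribution over the n - 1 non-target classes is uniform, the conditional entropy reaches log(n - 1) and the upper bound on the mutual information collapses to zero:

```python
import math

n = 5  # hypothetical number of classes
# Uniform conditional distribution over the n - 1 non-target classes
q = [1.0 / (n - 1)] * (n - 1)
# Conditional entropy H of that distribution
H = -sum(p * math.log(p) for p in q)
# Upper bound on the mutual information from the lemma
bound = math.log(n - 1) - H
```

Since the bound is zero and mutual information is non-negative, the mutual information itself must equal zero.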

2.2 Proof of Theorem 1

Proof.

Since k = \mathop{\arg\min}_{z}(q_{C\setminus t})_{z}, we have \forall i \neq k,\ q_{i} \geq q_{k}. Therefore:

\begin{aligned}
D_{CE}(p||q) &= -\sum_{i\neq k,t} p_{i}\log q_{i} - p_{k}\log q_{k} - p_{t}\log q_{t}\\
&\leq -\sum_{i\neq k,t} p_{i}\log q_{k} - p_{k}\log q_{k} - p_{t}\log q_{t}\\
&= -\Big(\sum_{i\neq k,t} p_{i} + p_{k}\Big)\log q_{k} - p_{t}\log q_{t}\\
&= -(1-p_{t})\log q_{k} - p_{t}\log q_{t}\\
&= D_{CE}(p^{*}||q),
\end{aligned}

which concludes Theorem 1. ∎
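Theorem 1 can also be sanity-checked numerically. The following sketch uses our own hypothetical test setup (n = 6 classes, target class t = 0, random distributions): concentrating the non-target mass of p on the class where q is smallest can only increase the cross-entropy:

```python
import math
import random

def ce(p, q):
    """Cross-entropy D_CE(p || q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

random.seed(0)
n, t = 6, 0
violations = 0
for _ in range(100):
    p = [random.random() + 1e-9 for _ in range(n)]
    p = [x / sum(p) for x in p]
    q = [random.random() + 1e-9 for _ in range(n)]
    q = [x / sum(q) for x in q]
    # k: the non-target class where q is smallest
    k = min((i for i in range(n) if i != t), key=lambda i: q[i])
    # p*: keep p_t on the target, move all remaining mass onto class k
    p_star = [0.0] * n
    p_star[t] = p[t]
    p_star[k] = 1.0 - p[t]
    if ce(p, q) > ce(p_star, q) + 1e-12:
        violations += 1
```

No random trial violates the inequality, matching the derivation above.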

2.3 Proof of Theorem 2

Proof.

Let k = \mathop{\arg\min}_{z}(q_{C\setminus t})_{z}. Since \forall i \neq t,\ q_{i}^{*} = \frac{1-q_{t}}{n-1}, for any q \neq q^{*} we have \forall i \neq t,\ q_{i}^{*} \geq q_{k}, because the minimum of the non-target probabilities cannot exceed their mean:

\begin{aligned}
D_{CE}(p^{*}||q) &= -\sum_{i} p_{i}^{*}\log q_{i}\\
&= -(1-p_{t})\log q_{k} - p_{t}\log q_{t}\\
&\geq -(1-p_{t})\log q_{i}^{*} - p_{t}\log q_{t}\\
&= D_{CE}(p^{*}||q^{*}),
\end{aligned}

which concludes Theorem 2. ∎
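Theorem 2 admits a similar numeric check. This sketch uses our own hypothetical values (n = 6, p_t = 0.7, q_t = 0.6, target class at index 0): among all q with the same target probability q_t, the uniform q* over the non-target classes minimizes D_CE(p* || q):

```python
import math
import random

random.seed(1)
n, p_t, q_t = 6, 0.7, 0.6  # target class is index 0
# D_CE(p* || q*): the non-target mass of q* is spread uniformly
baseline = -(1 - p_t) * math.log((1 - q_t) / (n - 1)) - p_t * math.log(q_t)
violations = 0
for _ in range(200):
    # Random q with the same target probability q_t
    r = [random.random() + 1e-9 for _ in range(n - 1)]
    q = [q_t] + [(1 - q_t) * x / sum(r) for x in r]
    # p* concentrates the non-target mass on the smallest non-target q
    k = min(range(1, n), key=lambda i: q[i])
    value = -(1 - p_t) * math.log(q[k]) - p_t * math.log(q_t)
    if value < baseline - 1e-12:
        violations += 1
```

Every random q scores at least the uniform baseline, as the inequality in the proof requires.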

2.4 Proof of Theorem 3

Proof.

Under the strategy s^{*}, the expected payoff of the adversary can be calculated as:

U_{P} = \sum_{a_{P}\in a_{P}^{*}} s_{P}^{*}(a_{P})\, u_{P}(a_{P},a_{Q}), \quad a_{Q}\in a_{Q}^{*}.

Since {\rm argmin}_{z}(q_{C\setminus t})_{z} contains all indexes except y when Q distributes all probability uniformly over the non-target classes, we have \forall p \in a_{P}^{*},\ p = p^{*}. Moreover, a_{Q}^{*} contains only one action, which equals q^{*}; thus:

U_{P} = (n-1)\times\frac{1}{n-1}\, u_{P}(p^{*},q^{*}) = u_{P}(p^{*},q^{*}).

Assume that there is another strategy s_{P} that gives the adversary a higher payoff; then:

\begin{aligned}
U_{P}^{\prime} &= \sum_{a_{P}\in a_{P}} s_{P}(a_{P})\, u_{P}(a_{P},a_{Q}), \quad a_{Q}\in a_{Q}^{*}\\
&\leq \sum_{a_{P}\in a_{P}} s_{P}(a_{P})\, u_{P}(p^{*},a_{Q}), \quad a_{Q}\in a_{Q}^{*},\ \forall p^{*}\in a_{P}^{*}\\
&= u_{P}(p^{*},q^{*}) = U_{P},
\end{aligned}

which contradicts the assumption. This proves that s_{P}^{*} is the best response to s_{Q}^{*}.

Likewise, we can prove that s_{Q}^{*} is the best response to s_{P}^{*}. Thus we have proved that s^{*} = (s_{P}^{*}, s_{Q}^{*}) forms a Nash equilibrium. ∎