
Learning Class Unique Features in Fine-Grained Visual Classification: Appendix

Aeiau Zzzz    Bauiu C. Yyyy    Cieua Vvvvv    Iaesut Saoeu    Fiuea Rrrr    Tateu H. Yasehe    Aaoeu Iasoh    Buiui Eueu    Aeuia Zzzz    Bieea C. Yyyy    Teoau Xxxx    Eee Pppp
Machine Learning, ICML

1 Appendix 1: Experimental Details

Our experiments are performed on fine-grained visual classification (FGVC) benchmarks: CUB-200-2011 (wah2011caltech), FGVC-Aircraft (maji2013fine), and Stanford Cars (KrauseStarkDengFei-Fei_3DRR2013), and on standard visual classification benchmarks: CIFAR-10 (krizhevsky2009learning), CIFAR-100 (krizhevsky2009learning), and STL-10 (coates2011analysis). The statistics of the six datasets are shown in Table 1.

Different methods are compared on ResNet18 (he2016deep), VGGNet11 (simonyan2014very), and DenseNet161 (huang2017densely). We conduct all our experiments in the PyTorch framework (paszke2019pytorch) on NVIDIA 2080Ti GPUs. On all datasets, we perform three-fold cross-validation to choose the hyper-parameters (including the algorithm-specific parameters, learning rate, decay rate, and weight decay; see Table 2). We then evaluate all models over three runs and report the mean and standard deviation on the test set. We use Stochastic Gradient Descent (SGD) to update the model parameters. All images are normalized and augmented by random crop and random horizontal flip. For algorithm-specific hyper-parameters, we set the weight of Confidence Penalty (CP) to 1.00, the smoothing rate of Label Smoothing (LS) to 0.10, and p_{t} of Minimax Loss (MM) to 0.85. For FRL, we set K = 10 and \lambda = 1.

Table 1: Statistics of six datasets in this work.
Dataset #Training #Testing #Categories
CUB-200-2011 5994 5794 200
FGVC-Aircraft 6667 3333 100
Stanford Cars 8144 8041 196
CIFAR-10 50000 10000 10
CIFAR-100 50000 10000 100
STL-10 5000 8000 10
Table 2: Training hyper-parameters for experiments. In the LR policy column, n1/n2 means the learning rate is multiplied by n1 every n2 epochs.
Dataset Image size Crop size Batch size Epochs Learning rate Weight decay LR policy
CIFAR-10 32×32 32×32 128 200 0.1 0.0005 0.2/50
CIFAR-100 32×32 32×32 128 200 0.1 0.0005 0.2/50
STL-10 96×96 96×96 64 200 0.1 0.0005 0.2/50
CUB-200-2011 512×512 448×448 16 60 0.004 0.0005 0.9/2
FGVC-Aircraft 512×512 448×448 16 60 0.008 0.0005 Cosine
Stanford Cars 512×512 448×448 16 60 0.01 0.0005 Cosine
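As a concrete illustration of the algorithm-specific settings above, the sketch below (not the authors' implementation; the function names are ours) builds the Label Smoothing target with smoothing rate 0.10 and the Confidence Penalty term with weight 1.00:

```python
import numpy as np

def label_smoothing_targets(target, n_classes, eps=0.10):
    """LS target: 1 - eps on the true class, eps spread evenly elsewhere."""
    t = np.full(n_classes, eps / (n_classes - 1))
    t[target] = 1.0 - eps
    return t

def confidence_penalty(probs, weight=1.00):
    """CP term added to the loss: the negative entropy of the prediction,
    scaled by its weight; it is smallest for near-uniform predictions."""
    entropy = -np.sum(probs * np.log(probs))
    return -weight * entropy

# Example: a 10-class problem with true class 2.
targets = label_smoothing_targets(2, 10)
penalty = confidence_penalty(np.full(10, 0.1))  # uniform prediction
```

With eps = 0.10 the true class keeps probability 0.9 and each of the other nine classes receives 0.1/9.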

2 Appendix 2: The Proofs

2.1 Proof of Lemma 1

Proof.

From previous discussions, we have

I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) \leq \log(n-1) - H_{\theta}(\hat{Y}_{C\setminus t}|X^{t}).

On the other hand, when the conditional probability distribution over non-target classes is uniform, we have

H_{\theta}(\hat{Y}_{C\setminus t}|X^{t}) = \log(n-1).

Therefore I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) \leq 0; combining this with the fact that mutual information is always non-negative gives I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) = 0. Finally, since \Phi(X^{t}) is a function of X^{t}, the data processing inequality gives I_{\theta}(\hat{Y}_{C\setminus t};\Phi(X^{t})) \leq I_{\theta}(\hat{Y}_{C\setminus t};X^{t}) = 0, and thus I_{\theta}(\hat{Y}_{C\setminus t};\Phi(X^{t})) = 0. ∎
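The lemma's boundary case can be checked numerically. The following is a minimal sketch with a hypothetical class count n = 5: when the predicted distribution over the n - 1 non-target classes is uniform, the conditional entropy reaches log(n - 1) and the upper bound on the mutual information collapses to zero:

```python
import math

n = 5  # hypothetical number of classes
# Uniform conditional distribution over the n - 1 non-target classes
q = [1.0 / (n - 1)] * (n - 1)
# Conditional entropy H of that distribution
H = -sum(p * math.log(p) for p in q)
# Upper bound on the mutual information from the lemma
bound = math.log(n - 1) - H
```

Since the bound is zero and mutual information is non-negative, the mutual information itself must equal zero.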

2.2 Proof of Theorem 1

Proof.

Since k = \mathop{\arg\min}_{z}(q_{C\setminus t})_{z}, we have \forall i \neq k,\ q_{i} \geq q_{k}. Therefore:

\begin{aligned}
D_{CE}(p||q) &= -\sum_{i\neq k,t} p_{i}\log q_{i} - p_{k}\log q_{k} - p_{t}\log q_{t}\\
&\leq -\sum_{i\neq k,t} p_{i}\log q_{k} - p_{k}\log q_{k} - p_{t}\log q_{t}\\
&= -\Big(\sum_{i\neq k,t} p_{i} + p_{k}\Big)\log q_{k} - p_{t}\log q_{t}\\
&= -(1-p_{t})\log q_{k} - p_{t}\log q_{t}\\
&= D_{CE}(p^{*}||q),
\end{aligned}

which concludes Theorem 1. ∎
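Theorem 1 can also be sanity-checked numerically. The following sketch uses our own hypothetical test setup (n = 6 classes, target class t = 0, random distributions): concentrating the non-target mass of p on the class where q is smallest can only increase the cross-entropy:

```python
import math
import random

def ce(p, q):
    """Cross-entropy D_CE(p || q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

random.seed(0)
n, t = 6, 0
violations = 0
for _ in range(100):
    p = [random.random() + 1e-9 for _ in range(n)]
    p = [x / sum(p) for x in p]
    q = [random.random() + 1e-9 for _ in range(n)]
    q = [x / sum(q) for x in q]
    # k: the non-target class where q is smallest
    k = min((i for i in range(n) if i != t), key=lambda i: q[i])
    # p*: keep p_t on the target, move all remaining mass onto class k
    p_star = [0.0] * n
    p_star[t] = p[t]
    p_star[k] = 1.0 - p[t]
    if ce(p, q) > ce(p_star, q) + 1e-12:
        violations += 1
```

No random trial violates the inequality, matching the derivation above.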

2.3 Proof of Theorem 2

Proof.

Let k = \mathop{\arg\min}_{z}(q_{C\setminus t})_{z}. Since \forall i \neq t,\ q_{i}^{*} = \frac{1-q_{t}}{n-1}, for any q \neq q^{*} we have \forall i \neq t,\ q_{i}^{*} \geq q_{k}, because the minimum of the non-target probabilities cannot exceed their mean:

\begin{aligned}
D_{CE}(p^{*}||q) &= -\sum_{i} p_{i}^{*}\log q_{i}\\
&= -(1-p_{t})\log q_{k} - p_{t}\log q_{t}\\
&\geq -(1-p_{t})\log q_{i}^{*} - p_{t}\log q_{t}\\
&= D_{CE}(p^{*}||q^{*}),
\end{aligned}

which concludes Theorem 2. ∎
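Theorem 2 admits a similar numeric check. This sketch uses our own hypothetical values (n = 6, p_t = 0.7, q_t = 0.6, target class at index 0): among all q with the same target probability q_t, the uniform q* over the non-target classes minimizes D_CE(p* || q):

```python
import math
import random

random.seed(1)
n, p_t, q_t = 6, 0.7, 0.6  # target class is index 0
# D_CE(p* || q*): the non-target mass of q* is spread uniformly
baseline = -(1 - p_t) * math.log((1 - q_t) / (n - 1)) - p_t * math.log(q_t)
violations = 0
for _ in range(200):
    # Random q with the same target probability q_t
    r = [random.random() + 1e-9 for _ in range(n - 1)]
    q = [q_t] + [(1 - q_t) * x / sum(r) for x in r]
    # p* concentrates the non-target mass on the smallest non-target q
    k = min(range(1, n), key=lambda i: q[i])
    value = -(1 - p_t) * math.log(q[k]) - p_t * math.log(q_t)
    if value < baseline - 1e-12:
        violations += 1
```

Every random q scores at least the uniform baseline, as the inequality in the proof requires.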

2.4 Proof of Theorem 3

Proof.

Under the strategy s^{*}, the expected payoff of the adversary can be calculated as:

U_{P} = \sum_{a_{P}\in a_{P}^{*}} s_{P}^{*}(a_{P})\, u_{P}(a_{P},a_{Q}), \quad a_{Q}\in a_{Q}^{*}.

Since {\rm argmin}_{z}(q_{C\setminus t})_{z} contains all indexes except y when Q distributes all probability uniformly over the non-target classes, we have \forall p \in a_{P}^{*},\ p = p^{*}. Moreover, a_{Q}^{*} contains only one action, which equals q^{*}; thus:

U_{P} = (n-1)\times\frac{1}{n-1}\, u_{P}(p^{*},q^{*}) = u_{P}(p^{*},q^{*}).

Assume that there is another strategy s_{P} that gives the adversary a higher payoff; then:

\begin{aligned}
U_{P}^{\prime} &= \sum_{a_{P}\in a_{P}} s_{P}(a_{P})\, u_{P}(a_{P},a_{Q}), \quad a_{Q}\in a_{Q}^{*}\\
&\leq \sum_{a_{P}\in a_{P}} s_{P}(a_{P})\, u_{P}(p^{*},a_{Q}), \quad a_{Q}\in a_{Q}^{*},\ \forall p^{*}\in a_{P}^{*}\\
&= u_{P}(p^{*},q^{*}) = U_{P},
\end{aligned}

which contradicts the assumption. This proves that s_{P}^{*} is the best response to s_{Q}^{*}.

Likewise, we can prove that s_{Q}^{*} is the best response to s_{P}^{*}. Thus we have proved that s^{*} = (s_{P}^{*}, s_{Q}^{*}) forms a Nash equilibrium. ∎