Task-Agnostic Federated Learning with Imbalanced Data
Abstract
In the realm of medical imaging, leveraging large-scale datasets from multiple institutions is crucial for developing accurate deep learning models, yet privacy concerns frequently impede data sharing. Federated learning (FL) is a prominent solution for preserving privacy while enabling collaborative learning. However, its application in real-world scenarios faces several obstacles, such as task and data heterogeneity, label scarcity, non-identically distributed (non-IID) data, and computational variation. In practice, medical institutions may be unwilling to disclose their tasks to the FL server, and out-of-network institutions with unseen tasks may wish to join an ongoing federated system, posing a generalization challenge. This study addresses the task-agnostic setting and generalization to unseen tasks by adapting a self-supervised FL framework. Using a Vision Transformer (ViT) as a consensus feature encoder for self-supervised pre-training, with no initial labels required, the framework enables effective representation learning across diverse datasets and tasks. Our extensive evaluations on real-world non-IID medical imaging datasets validate the efficiency of the approach. The proposed model retains 90% of the F1 score and accuracy of centralized training for classification tasks and outperforms it on segmentation tasks, demonstrating adaptability to out-of-distribution data. These results indicate that federated learning is a potential approach toward multi-task foundation modeling.
Index Terms:
federated learning, self-supervised learning, image classification, task agnostic, vision transformer

I Introduction
Federated learning (FL) enables model training on data dispersed across multiple locations without direct data sharing. Compared with models trained at individual sites, federated models can draw on a significantly broader and larger dataset, potentially leading to better performance and greater generalizability. As a result, this training paradigm has found widespread adoption in crucial medical applications such as brain tumor detection and COVID-19 diagnosis, and has been applied to various data types. However, in a conventional FL system the central server has to know what task each site wants to achieve, e.g., predicting brain age, detecting tumors, classifying diabetic retinopathy, or segmenting brain regions. This task-awareness requirement limits the generalization of the FL system.

Recently, advances in unsupervised and self-supervised learning have narrowed the gap in understanding the context and modality of medical images without knowing the tasks. Self-supervised pre-training learns the intrinsic features of images at local clients without labels, embodies less label-specific inductive bias, and is thus less susceptible to label-distribution skew. The proposed method learns visual representations effectively across non-IID clients, even when data are limited at some clients. No task is revealed to the server, yet the pre-trained model still learns semantic, domain-specific information that can be used for few-shot learning at each FL site.
To the best of the authors' knowledge, most related works assume that the central model knows all, or at least some, of the tasks and possible labels of the local clients [31][32]. We address the challenges of data heterogeneity and task anonymity by proposing a novel SSL-FL framework that follows the unsupervised federated learning setting in [14]. It tackles the non-IID data issue by employing masked image modeling as the self-supervised task on a shared Vision Transformer encoder. The proposed pre-training scheme significantly advances the capability of federated models over highly heterogeneous data partitions. We also propose a client-specific fine-tuning scheme based on low-rank adaptation (LoRA), which enables an existing client to achieve high prediction performance by fine-tuning only 2% of the pre-trained model's parameters. Moreover, we show that our framework is robust to the merging of new institutions as well as new tasks. Our main contributions are summarized as follows:
• We define a new problem setting in federated learning in which tasks remain agnostic between clients and server. Accordingly, we propose a simple yet powerful approach to address this task-agnostic challenge at FL clients using self-supervised pre-training. After pre-training, the knowledge can be transferred to the target tasks through an efficient fine-tuning implementation.
• Using real-world datasets, we show that the combined framework is robust across multiple tasks: with only their own labeled data, clients achieve up to 90% of the F1 score and accuracy of centralized SSL training, and outperform the centralized model on segmentation tasks. Our experiments also show that tasks with less data benefit more from federated pre-training.
II Related Works
Client Heterogeneity in Federated Learning. As a decentralized approach, FL suffers from performance degradation due to client heterogeneity [36]. While several research efforts [14] have addressed the challenges caused by data heterogeneity, task heterogeneity has not been well investigated in the literature. Moreover, the success of such models largely relies on supervised ImageNet pre-training, which can suffer from domain discrepancy when fine-tuning on medical images and can be further improved by self-supervised pre-training on a centrally shared large-scale in-domain medical dataset [14]. However, such centrally shared datasets rarely exist in the medical domain due to privacy and ownership concerns. It is therefore desirable to build a self-supervised FL framework that collaboratively learns a global model by leveraging all available unlabeled data without sharing data among institutions.
Vision Foundation Models and Fine-tuning. Understanding medical images without labels is a long-standing problem, and label deficiency remains a common challenge in medical imaging. To address this issue, various semi-supervised and self-supervised learning methods [37][38][39][40] have been proposed to let models learn from partially labeled or unlabeled data. This work exploits the advantages of SSL in FL systems for task-agnostic settings.
Specifically, fine-tuning of large models and the use of adapters are evolving, pivotal strategies for large language models [15][16], vision models [17][18], and vision-language models [19][20]. These approaches tailor pre-trained models to specific tasks or datasets with relatively low computational cost and data requirements.
III Problem Statement
Suppose there are multiple distributed clients and one centralized server in a federated system. Each client owns a private dataset and performs multiple private tasks on that dataset. The server neither has access to nor knows the underlying tasks. By taking advantage of unlabeled, multi-modality data and anonymous tasks at the clients, we want to boost the performance of each client compared with local supervision. Related work on multi-task federated learning aggregates observed tasks, but the server must know every possible client task beforehand [41]; other work with an identical SSL-FL architecture addresses the non-IID and data heterogeneity problem, but only for a single task [14]. Unlike previous works, we introduce a strictly constrained problem that accounts for task-related leakage concerns at the federated server.
To address this problem, we assume that there is a unified block agreed upon by the clients and the server, namely the feature encoder. In this work, we pre-train the feature encoder at the clients and aggregate it on the server so that the feature space generalizes to out-of-distribution data and tasks.
IV Methodology
IV-A SSL-FL Pre-training
During the pre-training phase, each local model, indexed by $k$, functions as an autoencoder with an encoder $E_k$ and a decoder $D_k$. Training uses masked image modeling, in which a subset of image patches is masked and the original signal in these patches is reconstructed [33]. The Vision Transformer encoder processes image patches through multiple layers of self-attention and feed-forward networks to encode complex features and contextual relationships. The decoder reconstructs the original image from these encoded features, using similar layers to generate predictions.
Each local encoder $E_k$ and decoder $D_k$ are trained on the local data $\mathcal{D}_k$ to minimize the local objective $\mathcal{L}_k$, the mean squared error of the predicted pixel values of the masked patches. The local loss function is given by

$$\mathcal{L}_k = \frac{1}{|\mathcal{M}_k|} \sum_{i \in \mathcal{M}_k} \big\| D_k\big(E_k(\tilde{x})\big)_i - x_i \big\|_2^2,$$

where $x_i$ denotes the pixel values of the $i$-th patch, $\tilde{x}$ the masked input, and $\mathcal{M}_k$ the set of masked patch indices.
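To make this concrete, a minimal PyTorch sketch of the masked-patch reconstruction loss is given below; it assumes images have already been patchified into per-patch pixel vectors, and the function name `masked_mse_loss` is ours rather than part of the original implementation.

```python
import torch

def masked_mse_loss(pred_patches, target_patches, mask):
    """Mean squared error over masked patches only.

    pred_patches, target_patches: (B, N, P) tensors of per-patch pixel values.
    mask: (B, N) binary tensor, 1 where a patch was masked (to be reconstructed).
    """
    per_patch_err = ((pred_patches - target_patches) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch_err * mask).sum() / mask.sum().clamp(min=1)
```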
For pre-training, each local client $k$ updates its local encoder $E_k$ and decoder $D_k$ by minimizing its own loss $\mathcal{L}_k$ on its data $\mathcal{D}_k$. The server then takes a weighted average of all the resulting local models to update the global encoder $E_g$ and decoder $D_g$, which are sent back to the local clients for the next training round. Once pre-training is complete, the final pre-trained global encoder $E_g$ is saved.
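A minimal sketch of this server-side aggregation step is shown below, assuming a FedAvg-style weighted average of the clients' model state dictionaries; the function and argument names are illustrative, not taken from the original code.

```python
from typing import Dict, List
import torch

def aggregate_fedavg(client_states: List[Dict[str, torch.Tensor]],
                     client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Weighted average of client state_dicts, with weights proportional to local data size."""
    total = float(sum(client_sizes))
    global_state: Dict[str, torch.Tensor] = {}
    for name in client_states[0]:
        global_state[name] = sum(
            (n / total) * state[name].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```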
IV-B Downstream fine-tuning
During the federated fine-tuning phase, depicted in Figure 2, we initialize the local encoder of each client with the pre-trained global encoder obtained in the first stage. We then augment the encoder with a linear downstream task head, and the complete model is fine-tuned on the local labeled data.
Classification Task. We freeze the weights of the pre-trained ViT and add a trainable LoRA adapter, as well as a fully connected layer on top of the pre-trained ViT, for the specific classification tasks: multi-label classification, multi-class classification, and binary classification.
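The sketch below shows one way this setup could be implemented in PyTorch: a frozen linear layer wrapped with a trainable low-rank update, injected into the attention projections of the frozen ViT, plus a trainable fully connected head. The helper names, hyperparameters, and the assumption that the attention projections are named `qkv` (as in timm ViTs) are ours.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

def inject_lora(vit: nn.Module, r: int = 8) -> nn.Module:
    """Freeze the ViT and wrap its attention projections (assumed to be named 'qkv',
    as in timm ViTs) with LoRA adapters. Only the adapters remain trainable."""
    for p in vit.parameters():
        p.requires_grad = False
    for module in vit.modules():
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and "qkv" in child_name:
                setattr(module, child_name, LoRALinear(child, r=r))
    return vit

class ClassificationHead(nn.Module):
    """Trainable fully connected layer on top of the (frozen, LoRA-adapted) encoder features."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, features):  # features: (B, embed_dim), e.g. the CLS token
        return self.fc(features)
```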
Segmentation Task. The same pre-trained ViT is used for the segmentation tasks. We adopt the UNETR framework [42] for our evaluation, since the UNet family allows a pre-trained encoder to be injected.
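As a hedged sketch of this injection, the snippet below loads the federated pre-trained encoder weights into the transformer backbone of a UNETR-style model; the attribute name `vit` and the checkpoint format are assumptions for illustration, not details from the original implementation.

```python
import torch

def load_pretrained_encoder(seg_model: torch.nn.Module, encoder_ckpt: str):
    """Initialize the ViT backbone of a UNETR-style segmentation model with the
    federated pre-trained encoder; decoder weights stay randomly initialized.

    Assumes the model exposes its transformer backbone as `seg_model.vit`
    (illustrative attribute name) and that the checkpoint stores an encoder state_dict.
    """
    state = torch.load(encoder_ckpt, map_location="cpu")
    missing, unexpected = seg_model.vit.load_state_dict(state, strict=False)
    print(f"encoder init: {len(missing)} missing, {len(unexpected)} unexpected keys")
    return seg_model
```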
V Experiments and Implementation
V-A Datasets
In this work, we use six public fundus datasets; the first four are used for in-distribution pre-training and task evaluation. The last two, LACDHS and JSIEC, serve as out-of-distribution (OOD) test datasets and play the role of out-of-network new clients joining the federated system. We want to evaluate how well our approach generalizes to unseen tasks and OOD data.
TABLE I: The six public fundus datasets and their downstream tasks.

Dataset | Train | Test | Task |
---|---|---|---|
RFMiD | 1920 | 640 | Multi-label classification |
DR | 29600 | 6800 | Severity rating |
EYEPACS | 8000 | 770 | Binary classification |
REFUGE2 | 400 | 400 | Optic disc segmentation |
LACDHS | 80 | 20 | Vessel segmentation |
JSIEC | 800 | 200 | Multi-class classification |
TABLE II: Classification results (mean ± standard deviation, %).

Methods | Accuracy | Precision | Macro F1 |
---|---|---|---|
Multi-label classification (RFMiD) | | | |
Local Supervision (Split 1) | 57.5 ± 0.6 | 13.9 ± 0.4 | 15.7 ± 0.3 |
Local Supervision (Split 2) | 59.7 ± 0.3 | 15.3 ± 0.2 | 15.9 ± 0.2 |
Centralized SSL (No FL) | 63.3 ± 0.5 | 16.9 ± 0.3 | 17.3 ± 0.3 |
SSL-FL (Split 1) | 58.9 ± 0.6 | 14.3 ± 0.5 | 14.8 ± 0.5 |
SSL-FL (Split 2) | 60.9 ± 0.6 | 15.3 ± 0.4 | 16.3 ± 0.3 |
Binary classification (EyePACS) | | | |
Local Supervision (Split 1) | 58.9 ± 0.4 | 55.3 ± 0.2 | 52.1 ± 0.2 |
Local Supervision (Split 2) | 60.4 ± 0.3 | 54.9 ± 0.3 | 55.4 ± 0.4 |
Centralized SSL (No FL) | 62.7 ± 0.3 | 58.7 ± 0.3 | 56.9 ± 0.2 |
SSL-FL (Split 1) | 59.8 ± 0.6 | 56.4 ± 0.3 | 54.2 ± 0.3 |
SSL-FL (Split 2) | 61.1 ± 0.3 | 57.1 ± 0.2 | 53.2 ± 0.2 |
Severity rating (DR) | | | |
Local Supervision (Split 1) | 71.6 ± 0.3 | 17.2 ± 0.2 | 17.0 ± 0.3 |
Local Supervision (Split 2) | 73.3 ± 0.1 | 18.7 ± 0.0 | 18.5 ± 0.0 |
Centralized SSL (No FL) | 73.3 ± 0.1 | 18.3 ± 0.0 | 17.9 ± 0.0 |
SSL-FL (Split 1) | 71.5 ± 0.3 | 16.9 ± 0.1 | 17.2 ± 0.2 |
SSL-FL (Split 2) | 73.0 ± 0.3 | 18.4 ± 0.1 | 18.4 ± 0.1 |
TABLE III: Segmentation results (mean ± standard deviation).

Methods | Dice Loss | Dice Focal |
---|---|---|
Optic disc segmentation (REFUGE2) | | |
Local Supervision (Scratch) | 94.9 ± 0.8 | 76.0 ± 0.5 |
Centralized SSL (No FL) | 92.4 ± 0.7 | 73.9 ± 1.2 |
SSL-FL (Split 1) | 93.7 ± 1.1 | 69.4 ± 0.8 |
SSL-FL (Split 2) | 91.1 ± 1.1 | 74.5 ± 2.3 |
Vessel segmentation (LACDHS) | | |
Local Supervision (Scratch) | 43.9 ± 0.9 | 64.1 ± 3.2 |
Centralized SSL (No FL) | 44.3 ± 1.8 | 66.4 ± 7.2 |
SSL-FL (Split 1) | 45.2 ± 1.7 | 61.4 ± 1.4 |
SSL-FL (Split 2) | 44.7 ± 1.1 | 60.7 ± 1.6 |


• JSIEC Fundus Dataset: 1000 fundus images belonging to 39 classes, collected from the Joint Shantou International Eye Centre (JSIEC), Shantou, Guangdong Province, China. The copyright of these images belongs to JSIEC.
• RFMiD Small Dataset: 2560 fundus images captured using three different fundus cameras, with 46 conditions annotated through the adjudicated consensus of two senior retinal experts.
• EyePACS: the complete Rotterdam EyePACS AIROGS dataset, encompassing both training and testing sets, comprises 8000 color fundus images sourced from subjects across roughly 500 diverse sites with varying ethnic backgrounds.
• REFUGE2: the Retinal Fundus Glaucoma Challenge 2nd Edition (REFUGE2) dataset comprises 800 color fundus images accompanied by annotations for glaucoma classification, optic disc/cup segmentation, and fovea localization.
• DR Dataset: each image in the Diabetic Retinopathy dataset is assessed by a clinician, who assigned a rating on a scale of 0 to 4 corresponding to no diabetic retinopathy (No DR), mild, moderate, severe, and proliferative stages.
• LACDHS Dataset: 100 fundus digital images of the retina sourced from the Armed Forces Institute of Ophthalmology (AFIO) in Rawalpindi, Pakistan. The dataset includes annotations of the retinal blood vessel network, segmented artery/vein networks used for calculating the Arteriovenous Ratio (AVR), and annotations of the Optic Nerve Head (ONH).
V-B Experimental Settings
V-B1 Baseline
Since we define a new problem setting, there are no existing methods designed to solve it. We use supervised learning on local labels as the lower bound for comparison. For the upper bound, we assume a best-case scenario in which data from all sites are visible, train a self-supervised encoder on all the data, and fine-tune it on each task.
V-B2 Task and Data Heterogeneity Setup
We model task-imbalanced (Split 1, data split by dataset) and task-balanced (Split 2, data split by drawing the same number of samples from each dataset) distributions of the six datasets, as shown in 2. The centralized setting pre-trains the model on the six datasets together without aggregation. Local supervision pre-trains and fine-tunes on a client's own task dataset only, without federated learning from the other clients' data. Simulated data and task partitions allow a more flexible and thorough investigation of model behavior, since they can be easily manipulated to test data and task heterogeneity. All tasks are fine-tuned on their own downstream task and the corresponding dataset.
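For illustration, the sketch below simulates the two partitions under our reading of the setup; the function name, the one-client-per-dataset assumption for Split 1, and the equal-draw logic for Split 2 are ours.

```python
import random
from typing import Dict, List

def make_splits(datasets: Dict[str, List], n_clients: int, balanced: bool, seed: int = 0):
    """Simulate the two client partitions used in the experiments.

    Split 1 (balanced=False): each client holds exactly one dataset (task-imbalanced).
    Split 2 (balanced=True): each client draws the same number of samples from every dataset.
    """
    rng = random.Random(seed)
    names = list(datasets)
    if not balanced:
        # one dataset per client (assumes n_clients == number of datasets)
        return {c: list(datasets[names[c]]) for c in range(n_clients)}
    per_client = min(len(v) for v in datasets.values()) // n_clients
    clients = {c: [] for c in range(n_clients)}
    for name in names:
        samples = list(datasets[name])
        rng.shuffle(samples)
        for c in range(n_clients):
            clients[c].extend(samples[c * per_client:(c + 1) * per_client])
    return clients
```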
V-B3 Self-supervised FL Pre-training and Supervised FL Fine-tuning and Evaluation Metrics
Following [34], ViT-B [35] is chosen as the backbone of the proposed models. Following the MAE setup [33], the input is split into 16 × 16 patches, and in our main experiments we randomly mask at most 60% of the image patches. The downstream model is initialized with the pre-trained encoder and fine-tuned with the same base learning rate schedule for all tasks. We use accuracy, precision, and F1 score as the evaluation metrics for classification on all datasets.
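The random patch masking can be sketched as follows; this pairs with the masked reconstruction loss shown earlier, and the function name is ours.

```python
import torch

def random_patch_mask(batch_size: int, num_patches: int, mask_ratio: float = 0.6):
    """Sample a binary mask selecting `mask_ratio` of the patches per image (1 = masked)."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch_size, num_patches)
    ids = noise.argsort(dim=1)               # random permutation per image
    mask = torch.zeros(batch_size, num_patches)
    mask.scatter_(1, ids[:, :num_masked], 1.0)
    return mask
```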
VI Results and Discussion
VI-A Results for Classification
Table II shows the results of the different training methodologies on three classification tasks: multi-label classification, binary classification, and severity rating. Across all three tasks, Centralized SSL (No FL) consistently delivers the best performance, reflecting its advantage of unrestricted data access over federated learning, which must cope with limited data and no data communication. The federated results are nevertheless interesting. In the multi-label classification task, SSL-FL (Split 1) shows a drop in performance compared with centralized SSL, while SSL-FL (Split 2) regains some ground, suggesting that the balanced nature of Split 2 helps mitigate some of the challenges of the federated setting. In the binary classification task, it is worth noting that both federated approaches outperform local supervision, although only marginally for Split 2; this is likely because the large amount of local training data already yields a strong encoder, so information from the other datasets adds little. Split 1, with its unbalanced data, still poses challenges that hurt performance relative to the centralized approach and to Split 2. The severity rating task shows a similar pattern: Centralized SSL (No FL) achieves the best overall results, and Split 2 again improves over Split 1, echoing the trend in the multi-label task; federated learning again helps little when local data are already sufficient. Overall, Centralized SSL provides the best outcomes across all tasks and metrics. The performance gap between the SSL-FL splits underlines the importance of data balancing in federated settings for reducing disparities in model learning and generalization, and shows that clients with less data may benefit more from federated training.
VI-B Results for Segmentation and OOD Tasks

Table III presents the results of the various methods on two segmentation tasks in medical image analysis: optic disc segmentation on the REFUGE2 dataset and vessel segmentation on the LACDHS dataset. For optic disc segmentation on REFUGE2, Local Supervision achieves the lowest performance, with scores of 94.9 for optic disc and 76.0 for cup segmentation, and Centralized SSL without FL follows closely with slightly better results. When SSL is combined with federated learning (SSL-FL), there is an increase in performance, especially noticeable in the cup segmentation metric, with Split 1 achieving the best scores. For vessel segmentation on LACDHS, similar trends are observed: Local Supervision obtains the best results for vessel and background segmentation, with Centralized SSL without FL trailing slightly behind. When SSL is coupled with federated learning, there is again an increase in performance, particularly evident in vessel segmentation for Split 2, owing to the task balance between clients. These results suggest that while SSL can enhance segmentation performance, introducing federated learning in this context does not consistently improve results, and the choice of data split also influences the effectiveness of SSL with federated learning.
VII Conclusion
In this paper, we propose a privacy-preserving federated self-supervised learning framework that collaboratively trains models on decentralized data using masked image modeling as the heterogeneous self-supervised task, with Low-Rank Adaptation (LoRA) used in the fine-tuning stage. Our framework is robust to non-IID data distributions across clients and performs well under severe task heterogeneity and data imbalance across diverse medical datasets. Experiments show that while tasks (clients) with less data benefit more from federated pre-training, all clients, regardless of their downstream task, perform better than their local supervision baseline.
References
- [7] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. 1960
- [8] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [9] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [10] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In NeurIPS, 2021.
- [11] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. In NeurIPS, 2022.
- [12] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces. In NeurIPS, 2022.
- [13] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417
- [14] Yan, Rui, et al. ”Label-efficient self-supervised federated learning for tackling data heterogeneity in medical imaging.” IEEE Transactions on Medical Imaging (2023).
- [15] Mao, Y.; Mathias, L.; Hou, R.; Almahairi, A.; Ma, H.; Han, J.; Yih, W.-t.; and Khabsa, M. 2021. Unipelt: A unified framework for parameter-efficient language model tuning. arXiv preprint arXiv:2110.07577.
- [16] Mosbach, M.; Andriushchenko, M.; and Klakow, D. 2020. On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. In ICLR.
- [17] Mostafa, H.; and Wang, X. 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In ICML.
- [18] He, Xuehai, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, and Xin Eric Wang. ”Parameter-efficient fine-tuning for vision transformers.” arXiv preprint arXiv:2203.16329 3 (2022).
- [19] Sun, Jingchen, Jiayu Qin, Zihao Lin, and Changyou Chen. ”Prompt tuning based adapter for vision-language model adaption.” arXiv preprint arXiv:2303.15234 (2023).
- [20] Yu, Tao, Zhihe Lu, Xin Jin, Zhibo Chen, and Xinchao Wang. ”Task residual for tuning vision-language models.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10899-10909. 2023.
- [21] Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).
- [22] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751 [cs, stat], June 2019.
- [23] Guo, Yunhui, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. ”Spottune: transfer learning through adaptive fine-tuning.” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4805-4814. 2019.
- [24] Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-Efficient Prompt Tuning. arXiv:2104.08691 [cs], April 2021.
- [25] Jia, Menglin, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. ”Visual prompt tuning.” In European Conference on Computer Vision, pp. 709-727. Cham: Springer Nature Switzerland, 2022.
- [26] Guo, Ziyu, Renrui Zhang, Longtian Qiu, Xianzheng Ma, Xupeng Miao, Xuming He, and Bin Cui. ”Calip: Zero-shot enhancement of clip with parameter-free attention.” In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, pp. 746-754. 2023.
- [27] Cui, Baiyun, Yingming Li, Ming Chen, and Zhongfei Zhang. ”Fine-tune BERT with sparse self-attention mechanism.” In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3548-3553. 2019
- [28] Finn, Chelsea, Pieter Abbeel, and Sergey Levine. ”Model-agnostic meta-learning for fast adaptation of deep networks.” In International conference on machine learning, pp. 1126-1135. PMLR, 2017.
- [29] Snell, Jake, Kevin Swersky, and Richard Zemel. ”Prototypical networks for few-shot learning.” Advances in neural information processing systems 30 (2017).
- [30] Nichol, Alex, and John Schulman. ”Reptile: a scalable metalearning algorithm.” arXiv preprint arXiv:1803.02999 2, no. 3 (2018): 4.
- [31] Itahara, Sohei, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. ”Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data.” IEEE Transactions on Mobile Computing 22, no. 1 (2021): 191-205.
- [32] Lin, Haowen, Jian Lou, Li Xiong, and Cyrus Shahabi. ”Semifed: Semi-supervised federated learning with consistency and pseudo-labeling.” arXiv preprint arXiv:2108.09412 (2021).
- [33] He, Kaiming, et al. ”Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022
- [34] Yan, Rui, et al. ”Label-efficient self-supervised federated learning for tackling data heterogeneity in medical imaging.” IEEE Transactions on Medical Imaging 42.7 (2023): 1932-1943.
- [35] Dosovitskiy, Alexey, et al. ”An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
- [36] Mitra, Aritra, et al. ”Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients.” Advances in Neural Information Processing Systems 34 (2021): 14606-14619.
- [37] Atito, Sara, Muhammad Awais, and Josef Kittler. ”Sit: Self-supervised vision transformer.” arXiv preprint arXiv:2104.03602 (2021).
- [38] Baevski, Alexei, et al. ”Data2vec: A general framework for self-supervised learning in speech, vision and language.” International Conference on Machine Learning. PMLR, 2022.
- [39] Weng, Zejia, et al. ”Semi-supervised vision transformers.” European conference on computer vision. Cham: Springer Nature Switzerland, 2022.
- [40] Zhai, Xiaohua, et al. ”S4l: Self-supervised semi-supervised learning.” Proceedings of the IEEE/CVF international conference on computer vision. 2019.
- [41] Smith, Virginia, et al. ”Federated multi-task learning.” Advances in neural information processing systems 30 (2017).
- [42] Hatamizadeh, Ali, et al. ”Unetr: Transformers for 3d medical image segmentation.” Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2022.