Second Place Solution of WSDM2023 Toloka
Visual Question Answering Challenge
Abstract
In this paper, we present our solution to the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task: given an image and a question, the model answers the question by producing a bounding box on the image. We designed a three-stage solution built on the visual-language pre-trained model OFA. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model on it to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued to fine-tune the model on the competition dataset, transferring the semantic information learned in the first stage. Finally, we designed a bounding-box matching-and-replacing post-processing strategy to correct the model's predictions. Our team achieved a score of 76.342 on the final leaderboard, ranking second.
1 Introduction
Visual-language pretraining (VLP) [26, 25, 15, 5, 21, 10] models have seen rapid development and significant advancements in recent years. These models aim to bridge the gap between visual and linguistic information, enabling machines to understand and generate contextually rich content that combines images and text. Work such as OFA [14] and BLIP [6] laid the groundwork for integrating visual and textual information: OFA unifies vision-language tasks through large-scale pretraining, while BLIP improves data efficiency and model performance through self-supervised learning. More recent advancements, including GPT-4 [11], have led to even more powerful models capable of tasks such as image generation from text, multimodal conversation, and advanced visual reasoning.
One key application is Visual Question Answering (VQA) [4, 2, 13, 22, 8, 16], where the model answers questions based on an image. This task requires understanding both the image and the natural language question, combining image recognition, language processing, and reasoning. Visual grounding [7, 12, 24, 1, 9, 3] is a specific task within the broader domain of VQA. In VQA, the primary objective is to answer questions about an image using natural language understanding. Visual grounding within VQA focuses on precisely locating and identifying objects or regions in the image that correspond to specific elements mentioned in the textual query. This task requires the model to effectively link the textual description with relevant visual features in the image, enabling accurate and contextually relevant answers to questions posed about visual content.
VLP models [26, 25, 20, 17] achieve visual grounding tasks by first pretraining on large-scale datasets containing paired image and text data. These models extract feature embeddings from images using CNNs and process textual descriptions using transformer architectures like BERT. Through cross-modal attention mechanisms, they align visual and textual features, enabling precise linking of natural language queries to specific visual elements in images. Fine-tuning on task-specific datasets further refines their ability to understand and respond to queries that require nuanced visual comprehension within applications such as VQA and multimodal interaction systems.
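As a minimal illustration of the cross-modal attention described above, the sketch below lets text tokens attend to image-region features. This is a simplified, single-head formulation with assumed dimensions, not the internals of any particular VLP model.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-modal attention: text tokens attend to image regions."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # queries from text features
        self.k_proj = nn.Linear(dim, dim)   # keys from image-region features
        self.v_proj = nn.Linear(dim, dim)   # values from image-region features
        self.scale = dim ** -0.5

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  dim)
        # image_feats: (batch, num_regions, dim)
        q = self.q_proj(text_feats)
        k = self.k_proj(image_feats)
        v = self.v_proj(image_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Each text token becomes a weighted mixture of the image regions it attends to.
        return attn @ v

# Example: a 12-token question attending to 36 detected image regions.
fused = CrossModalAttention()(torch.randn(1, 12, 256), torch.randn(1, 36, 256))
```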
Multimodal pre-trained models have developed rapidly and shown strong generalization across downstream tasks such as visual grounding, visual question answering, cross-modal retrieval, and image captioning. Since visual grounding predicts a bounding box for a given piece of text in an image, we were inspired to transform the competition task, namely visual question grounding, into a visual grounding task, effectively leveraging the power of multimodal pre-trained models.
In the first stage, we collected and constructed a large amount of external data for coarse tuning, with image, question, and label distributions similar to the competition data; using the OFA model as the foundation, we coarse-tuned a model with strong generalization and rich semantics. In the second stage, we loaded the coarse-tuned weights from the previous stage, treated the competition task as a visual grounding task, constructed different question templates, and continued fine-tuning the OFA model on the competition dataset. Finally, we designed a bounding-box matching-and-replacing post-processing strategy: the bounding box predicted by visual grounding is roughly correct, but its coordinates are less accurate than those of an object detector, so we compute the IoU between the two and replace the predicted bounding box with the object detector's bounding box. Our team secured the runner-up position on the final leaderboard with a score of 76.342.
2 Related Works
2.1 Visual-Language Pretraining (VLP)
Recent developments in VLP [26, 25, 15, 5, 18, 19] have been propelled by pioneering models such as OFA [14] and BLIP [6], which pretrain transformer-based architectures on large-scale paired image-text data. OFA unifies various vision-language tasks through comprehensive pretraining, thereby enhancing performance in applications such as VQA and image-text retrieval, while BLIP focuses on data efficiency and model performance through advanced self-supervised learning techniques, contributing significantly to the robustness of multimodal understanding. In this evolving landscape, GPT-4 [11] has emerged as a notable advancement, showcasing remarkable capabilities in generating nuanced natural language responses and pushing the boundaries of language modeling.
2.2 Visual Question Answering (VQA)
VQA [4, 2, 13, 22] involves answering questions about images using natural language understanding and has been a pivotal area of research in multimodal AI. The task requires models to comprehend both the visual content of images and the semantic context of textual queries. Early approaches integrated vision and language through joint embeddings and attention mechanisms. More recent models such as OFA [14] and BLIP [6] have significantly improved the accuracy and efficiency of VQA systems by employing large-scale pretraining on diverse datasets, strengthening the ability to reason about complex scenes and produce accurate responses from multimodal inputs. VQA continues to evolve with transformer-based models such as GPT-4 [11], which further push the boundaries of multimodal understanding and response generation in AI systems.
2.3 Visual Grounding (VG)
Visual grounding [7, 12, 24, 1, 9, 23] tasks involve the localization and identification of specific objects or regions within an image based on textual descriptions. This task is crucial in multimodal AI applications, where models need to link natural language queries to corresponding visual elements effectively. By leveraging deep learning techniques, such as convolutional neural networks (CNNs) for image feature extraction and transformer architectures for textual understanding, models can align visual and textual modalities. This alignment enables precise object localization, scene understanding, and context-based reasoning, enhancing applications like VQA [4, 2], image retrieval, and interactive systems that require accurate interpretation and response generation based on visual content.
3 Methods
Our solution is built on the OFA [14] visual-language pre-training model, so we introduce OFA first. OFA is a large multimodal pre-training model that casts all pre-training tasks as text generation, so that every task can be handled by a single decoder module. For example, the visual grounding (VG) pre-training task constructs the template "Which region does the text 'Man in white shirt' describe?", and the decoder generates bounding box coordinates; the image-text matching (ITM) pre-training task constructs the template "Does the image describe 'Two boys playing frisbee on the grass'?", and the same decoder generates "yes" or "no". The visual grounding task is thus very similar to the competition task: both output bounding box coordinates. Therefore, we transformed the competition task into a visual grounding task and fine-tuned the large pre-trained model on it.
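For concreteness, the sketch below shows how such task templates can be produced as plain prompt strings. The wording follows the examples quoted above, and the helper names are ours; the exact prompts used inside OFA may differ.

```python
# Illustrative prompt builders for OFA-style "everything is text generation" tasks.

def visual_grounding_prompt(phrase: str) -> str:
    # The decoder is expected to generate bounding-box coordinate tokens.
    return f'which region does the text "{phrase}" describe?'

def image_text_match_prompt(caption: str) -> str:
    # The decoder is expected to generate "yes" or "no".
    return f'does the image describe "{caption}"?'

print(visual_grounding_prompt("Man in white shirt"))
print(image_text_match_prompt("Two boys playing frisbee on the grass"))
```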
Our method consists of three parts: coarse tuning, fine-tuning, and post-processing. In the first part, coarse tuning builds a synthetic dataset similar to the competition dataset and trains a model with weak semantics but strong generalization on it: semantics are weak because the synthetic dataset contains noisy data and labels, so semantic learning is not fully accurate, while generalization is strong because of the large scale of the synthetic dataset. In the second part, the fine-tuning stage loads the coarse-tuned weights and continues training on the competition dataset. The third part, post-processing, further improves prediction accuracy through bounding box matching and replacing as well as model ensembling.

3.1 Coarse Tuning Stage
The first part is the coarse tuning stage. We construct the coarse-tuning dataset from COCO images, following several principles. First, the image, question, and bounding box distributions of the coarse-tuning dataset should be as close as possible to the competition dataset, so that the model does not learn a wrong data distribution. Second, every sampling operation must be random sampling, which ensures that the model does not learn any data bias. Finally, no data from the public test set is sampled, which avoids overfitting to the public test set. The coarse-tuning paradigm is the same as the competition task; as in pre-training, we save the coarse-tuned weights for the subsequent fine-tuning part, and, as in pre-training, more coarse-tuning epochs yield stronger generalization and better performance.
The accompanying figure illustrates our process of constructing the synthetic dataset. For each sample in the training set, we feed the image and question into the multimodal pre-trained model to obtain a textual answer, which we define as a pseudo-answer. Each sample then consists of an image, a question, and a pseudo-answer, so we can build a mapping table from pseudo-answers to questions, where one pseudo-answer (e.g., "clock" or "vase") corresponds to multiple questions. Next, we randomly select an image from COCO and detect objects with an object detector. Each detected object has a confidence, a class name, a bounding box, and a class count, where the class count is the number of instances of that class in the image (e.g., one clock, two rolls of paper). We then sort the detected objects by confidence and select the highest-confidence object whose class count is one and whose class name matches a pseudo-answer, and we randomly select a question corresponding to that pseudo-answer from the mapping table. This yields a new sample: its question is the randomly selected question, its pseudo-answer is the selected object class, and its target bounding box comes from the object detector. Once the synthetic dataset is constructed, a model with weak semantics but strong generalization can be coarse-tuned in the form of visual grounding.
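A minimal sketch of this construction procedure is given below. The helpers `predict_answer` and `detect_objects` stand in for the multimodal model and the object detector; their names and signatures are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict

# Assumed helper signatures (hypothetical, for illustration):
#   predict_answer(image, question) -> str       # multimodal model producing a pseudo-answer
#   detect_objects(image) -> list of dicts with keys "class_name", "confidence", "bbox"

def build_answer_to_questions(train_samples, predict_answer):
    """Map each pseudo-answer to every training question that produced it."""
    mapping = defaultdict(list)
    for image, question in train_samples:
        pseudo_answer = predict_answer(image, question)
        mapping[pseudo_answer].append(question)
    return mapping

def make_synthetic_sample(coco_image, answer_to_questions, detect_objects):
    """Turn one random COCO image into an (image, question, bbox) training sample."""
    objects = detect_objects(coco_image)
    class_counts = Counter(obj["class_name"] for obj in objects)
    # Highest-confidence object that is unique in the image and matches a pseudo-answer.
    for obj in sorted(objects, key=lambda o: o["confidence"], reverse=True):
        name = obj["class_name"]
        if class_counts[name] == 1 and name in answer_to_questions:
            question = random.choice(answer_to_questions[name])
            return {"image": coco_image, "question": question,
                    "pseudo_answer": name, "bbox": obj["bbox"]}
    return None  # no suitable object found; skip this image
```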

3.2 Fine-tuning Stage
The second part is the fine-tuning stage. In the form of a visual grounding task, we load the weights obtained in the coarse tuning part and continue training the model on the competition dataset. Each sample consists of an image, a question, and a pseudo-answer; the question and pseudo-answer are combined into a template, which serves as the text input of the model. Among the four templates we constructed, we finally adopted the second one, in which the first word of the question (e.g., what, where, which) is replaced with "which region". After template construction, the template and image are fed into the OFA model, and after the text encoder, image encoder, and cross-modal fusion, the text decoder outputs the coordinates of the bounding box.
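The snippet below sketches the adopted template. The interrogative list is limited to the examples mentioned above, and the way the pseudo-answer is appended to the text input is an illustrative assumption rather than the exact string we used.

```python
# Sketch of the adopted template: replace the question's leading interrogative
# (e.g., "what", "where", "which") with "which region". How the pseudo-answer is
# appended here is an assumption for illustration.

INTERROGATIVES = ("what", "where", "which")

def build_grounding_text(question: str, pseudo_answer: str) -> str:
    words = question.strip().rstrip("?").split()
    if words and words[0].lower() in INTERROGATIVES:
        words = ["which region"] + words[1:]
    return " ".join(words) + f'? answer: "{pseudo_answer}"'

print(build_grounding_text("What is hanging on the wall?", "clock"))
# -> which region is hanging on the wall? answer: "clock"
```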

3.3 Postprocessing Stage
The third part is the post-processing stage. The bounding box predicted by OFA is roughly in the right position, but its coordinates are less accurate than those of an object detector. Therefore, we compute the IoU between the predicted box and detector boxes and replace the prediction with the detector's bounding box. For each image, all bounding boxes are detected by the object detectors YOLOR and ViTDet and sorted by confidence; we call these candidate bounding boxes. Then, for the predicted bounding box, the first candidate bounding box with IoU higher than 0.6 is selected to replace it. Finally, the model ensemble combines three variants: five folds with standard coarse tuning, ten folds with standard coarse tuning, and five folds with back-translation coarse tuning, where back translation is applied in the coarse tuning stage to expand the diversity of the questions.
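A minimal sketch of the matching-and-replacing step is shown below, assuming candidate boxes are already sorted by detector confidence; the helper names are ours.

```python
# Bounding-box matching and replacing. Boxes are (left, top, right, bottom) tuples.

def iou(a, b):
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def refine_prediction(predicted_box, candidate_boxes, threshold=0.6):
    """Replace the model's box with the first detector box whose IoU exceeds the threshold.

    candidate_boxes must already be sorted by detector confidence (descending).
    """
    for cand in candidate_boxes:
        if iou(predicted_box, cand) > threshold:
            return cand
    return predicted_box  # keep the original prediction if nothing matches
```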
4 Experiments
4.1 Dataset.
The Toloka dataset comprises images paired with textual questions, where each entry includes a question-image pair annotated with ground truth bounding box coordinates pinpointing the visual answer. In total, the dataset consists of 45,199 instances distributed across three subsets: 38,990 instances in the training set, 1,705 instances in the public test set, and 4,504 instances in the private test set.
The dataset is structured with several key columns: "image" contains URLs linking to images hosted on a public content delivery network; "question" provides English-language queries associated with each image. Additional metadata includes "width" and "height" integers representing the dimensions of each image. For bounding box annotation, the dataset includes "left", "top", "right", and "bottom" integers detailing the coordinates that define the spatial extent of the object or region in the image that corresponds to the answer to the question.
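As a minimal illustration of this schema, the snippet below reads one annotation record; the file name "train.csv" is an assumption, but the column names follow the description above.

```python
import pandas as pd

# Read the annotations; "train.csv" is an assumed file name, and the columns follow
# the schema described above (image, question, width, height, left, top, right, bottom).
df = pd.read_csv("train.csv")

sample = df.iloc[0]
bbox = (sample["left"], sample["top"], sample["right"], sample["bottom"])
print(sample["image"], sample["question"], bbox)
```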
4.2 Leaderboards.
Table 1: Contribution of each component to the final score.

Method | Score
---|---
Baseline | 71.0
Pseudo Answer | 73.5
Template | 74.2
Coarse Tuning | 75.1
Postprocessing | 75.8
Test Public | 76.5
Test Private | 76.342
Table 1 shows the improvement in model performance from each of our components. The baseline is the OFA model applied directly to the competition dataset for inference. Coarse tuning and the pseudo answer bring the largest gains, because the pseudo answer reveals the object category corresponding to the bounding box. Our method obtains a score of 76.5 on the public test set and 76.342 on the private test set, which shows that it generalizes well.
References
- Chen et al. [2023] Chongyan Chen, Samreen Anjum, and Danna Gurari. VQA therapy: Exploring answer differences by visually grounding answers. In ICCV, pages 15269–15279. IEEE, 2023.
- Ding et al. [2022] Yang Ding, Jing Yu, Bang Liu, Yue Hu, Mingxin Cui, and Qi Wu. MuKEA: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In CVPR, pages 5079–5088. IEEE, 2022.
- Fu et al. [2024] Zhongtian Fu, Kefei Song, Luping Zhou, and Yang Yang. Noise-aware image captioning with progressively exploring mismatched words. In AAAI, pages 12091–12099, 2024.
- Gupta et al. [2022] Vipul Gupta, Zhuowan Li, Adam Kortylewski, Chenyu Zhang, Yingwei Li, and Alan L. Yuille. SwapMix: Diagnosing and regularizing the over-reliance on visual context in visual question answering. In CVPR, pages 5068–5078. IEEE, 2022.
- Ke et al. [2023] Junjie Ke, Keren Ye, Jiahui Yu, Yonghui Wu, Peyman Milanfar, and Feng Yang. VILA: Learning image aesthetics from user comments with vision-language pretraining. In CVPR, pages 10041–10051, 2023.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022.
- Li et al. [2023] Mingxiao Li, Zehao Wang, Tinne Tuytelaars, and Marie-Francine Moens. Layout-aware dreamer for embodied visual referring expression grounding. In AAAI, pages 1386–1395, 2023.
- Li et al. [2024] Xin-Chun Li, Shaoming Song, Yinchuan Li, Bingshuai Li, Yunfeng Shao, Yang Yang, and De-Chuan Zhan. MAP: model aggregation and personalization in federated learning with incomplete classes. CoRR, abs/2404.09232, 2024.
- Liu et al. [2023] Yang Liu, Jiahua Zhang, Qingchao Chen, and Yuxin Peng. Confidence-aware pseudo-label learning for weakly supervised visual grounding. In ICCV, pages 2816–2826, 2023.
- Meng et al. [2024] Lingwu Meng, Jing Wang, Ran Meng, Yang Yang, and Liang Xiao. A multiscale grouping transformer with CLIP latents for remote sensing image captioning. IEEE Trans. Geosci. Remote. Sens., 62:1–15, 2024.
- OpenAI [2023] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
- Rigoni et al. [2023] Davide Rigoni, Luca Parolari, Luciano Serafini, Alessandro Sperduti, and Lamberto Ballan. Weakly-supervised visual-textual grounding with semantic prior refinement. In BMVC, page 229, 2023.
- Tian et al. [2022] Weidong Tian, Haodong Li, and Zhong-Qiu Zhao. Dual capsule attention mask network with mutual learning for visual question answering. In COLING, pages 5678–5688, 2022.
- Wang et al. [2022] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, pages 23318–23340, 2022.
- Wang et al. [2023] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: BEIT pretraining for vision and vision-language tasks. In CVPR, pages 19175–19186, 2023.
- Wu et al. [2024] Xiangyu Wu, Qing-Yuan Jiang, Yang Yang, Yi-Feng Wu, Qingguo Chen, and Jianfeng Lu. TAI++: Text as image for multi-label image classification by co-learning transferable prompt. CoRR, abs/2405.06926, 2024.
- Xi et al. [2023] Wenjuan Xi, Xin Song, Weili Guo, and Yang Yang. Robust semi-supervised learning for self-learning open-world classes. In ICDM, pages 658–667, 2023.
- Yang et al. [2022] Yang Yang, Jingshuai Zhang, Fan Gao, Xiaoru Gao, and Hengshu Zhu. DOMFN: A divergence-orientated multi-modal fusion network for resume assessment. In ACM MM, pages 1612–1620, 2022.
- Yang et al. [2023a] Yang Yang, Yurui Huang, Weili Guo, Baohua Xu, and Dingyin Xia. Towards global video scene segmentation with context-aware transformer. In AAAI, pages 3206–3213, 2023a.
- Yang et al. [2023b] Yang Yang, Yuxuan Zhang, Xin Song, and Yi Xu. Not all out-of-distribution data are harmful to open-set active learning. In NeurIPS, 2023b.
- Yang et al. [2024] Yang Yang, Nan Jiang, Yi Xu, and De-Chuan Zhan. Robust semi-supervised learning by wisely leveraging open-set data. CoRR, abs/2405.06979, 2024.
- Zhan et al. [2022] Huayi Zhan, Peixi Xiong, Xin Wang, Xin Wang, and Lan Yang. Visual question answering by pattern matching and reasoning. Neurocomputing, 467:323–336, 2022.
- Zhou et al. [2022] Da-Wei Zhou, Yang Yang, and De-Chuan Zhan. Learning to classify with incremental new class. IEEE Trans. Neural Networks Learn. Syst., 33(6):2429–2443, 2022.
- Zhou et al. [2023] Li Zhou, Zikun Zhou, Kaige Mao, and Zhenyu He. Joint visual grounding and tracking with natural language specification. In CVPR, pages 23151–23160, 2023.
- Zhu et al. [2023] Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, and Yao Zhao. CTP: Towards vision-language continual pretraining via compatible momentum contrast and topology preservation. In ICCV, pages 22200–22210, 2023.
- Zou et al. [2023] Bo Zou, Chao Yang, Chengbin Quan, and Youjian Zhao. SpaceCLIP: A vision-language pretraining framework with spatial reconstruction on text. In ACM MM, pages 519–528, 2023.