ViRED: Prediction of Visual Relations in Engineering Drawings
Abstract
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly treat text as the main modality, which makes them ill-suited to documents that carry substantial image information. Existing visual relation detection methods, in turn, are structured in a way that limits their capacity to assess relationships among all entity pairs in a drawing.
To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder.
We implement ViRED in PyTorch and validate its efficacy through a series of experiments. The experimental results indicate that, on the engineering drawing dataset, our approach attains an accuracy of 96% in the relation prediction task, a substantial improvement over existing methodologies. The results also show that ViRED maintains fast inference even when a single engineering drawing contains numerous objects.
Index Terms:
Document understanding, Visual relation prediction, Engineering drawing
I Introduction
Digitization of engineering design drawings constitutes a crucial component of contemporary industrial processes. Nevertheless, the automated recognition of these engineering drawings in image format still encounters considerable challenges. Electrical engineering drawings, a subset of engineering drawings, are primarily used to depict equipment related to electrical systems. Typically, electrical design engineers are required to review and recreate numerous electrical engineering drawings to transform technical illustrations into production-level drawings. To alleviate this workload, it is crucial to develop automated recognition methods for electrical engineering drawings, which are stored in image format.
In general, a single electrical engineering drawing includes multiple circuits and tables as depicted in Fig. 1. Each circuit describes different electrical devices, while each table presents the parameter settings for the corresponding circuits. To enhance the identification of circuits and tables, it is necessary to extract these components from diagrams and analyze them using specialized models. While there is a considerable amount of existing work on detecting and classifying specific objects within images [1, 2, 3], the correspondence between these objects is often lost in this process. Most current methods for extracting relations within documents depend on the relations between text pairs [4, 5, 6], rendering them insufficient for addressing relations between diagrams and tables. In visual relation detection methods [7, 8, 9], the task and model design limit their capability to predict relationships across all circuit-table pairs, thereby causing missed detections.

The circuit-to-table relations in electrical engineering drawings are complex, potentially exhibiting one-to-one, one-to-many, and other variations. Furthermore, the quantity of circuits and tables in each image differs, presenting difficulties in managing variable-length inputs using non-sequential modeling methods. To address these issues, we propose ViRED, a Visual Relation prediction model for Engineering Drawings based on Transformer architecture [10]. ViRED consists of a visual encoder, an object encoder, and a relationship decoder. The model is trained and fine-tuned utilizing the PubLayNet [11] dataset along with the proprietary electrical engineering drawing dataset, resulting in exceptional performance in the task of relation prediction.
The main contributions of this work are outlined as follows:
• We present a novel vision-based relation detection approach, named ViRED, to address the problem of predicting relations for non-textual components in complex documents. We apply this approach specifically to circuit-to-table relation matching in electrical design drawings.
• We build a dataset of electrical engineering drawings derived from industrial design data, and we annotate the instances and their relationships within the dataset.
• We evaluate our method with various metrics on the electrical engineering drawing dataset and provide a performance comparison against existing approaches.
• We perform extensive ablation studies to compare the impact of different model architectures, hyperparameters, and training methods on overall performance, and we refine our model architecture based on these comprehensive comparative analyses.
The following sections of this paper are structured as follows. Section II reviews the related work pertinent to this study. Section III elaborates on the proposed methods and models in detail. Section IV discusses the experimental evaluation results of the proposed approach. Lastly, the article ends with a summary of our work in Section V.
II Related Works
II-A Visual Document Understanding
The task of visual document understanding (VDU) focuses on understanding digital documents in image formats. There are several downstream tasks associated with VDU, including key information extraction, relation detection, document layout analysis, and others. Most contemporary VDU techniques rely on deep neural networks that utilize visual, textual, or a mixture of visual and textual modalities.
Approaches for document understanding that utilize computer vision were initially proposed following advancements in convolutional neural networks (CNNs). Hao et al. [12] proposed the use of CNNs for detecting tables in document images. With the introduction of the R-CNN series of models [13, 14, 15, 16], methods for detecting tables [17, 18, 19] and analyzing layouts [20, 11] in document images have benefited from these architectures, resulting in enhanced performance. Modern Optical Character Recognition (OCR) engines [21] have demonstrated significant efficacy in extracting text from document images, and the advancement of language models has further contributed to document understanding tasks. More recent VDU approaches additionally incorporate the textual modality as auxiliary information. LayoutLM [4] utilizes textual data in conjunction with the bounding boxes provided by an OCR engine for inference. DocFormer [22], LayoutLMv2 [23], and LayoutLMv3 [24] employ early fusion techniques to better integrate and leverage visual and textual modality information.

II-B Visual Relation Detection
Visual relation detection (VRD) is the task of predicting the relations or interactions between pairs of objects within a single image [25]. Typically, VRD is a mid-level vision task that builds on low-level vision tasks such as object detection and recognition [26]. Despite the advancement of VRD techniques, the task remains difficult because of conflicts arising among the many potential relations and the lack of labeled data [7].
Lu et al. [26] introduced the first VRD approach utilizing deep learning techniques. Their method employed R-CNN for object and predicate detection and incorporated language priors to enhance the accuracy of predicted relations. Subsequently, ViP-CNN was introduced by Li et al. [27] to identify subject, predicate, and object simultaneously; it incorporates a Phrase-Guided Message Passing Structure to investigate the interconnections among relation components. Zhuang et al. [28] proposed a context-aware interaction classification framework based on an attention mechanism, which is accurate, scalable, and generalizes well to unseen context-interaction combinations. Although these methods succeed in extracting relations from general images, few studies [29] have investigated the use of VRD techniques on domain-specific images such as structured documents, mechanical blueprints, and engineering drawings.
III Methodology
III-A Problem Definition
Given an image of an electrical engineering drawing, let there be $n$ circuits and $m$ tables present within the image. A relation may exist between any circuit and any table; that is, the drawing contains at most $n \times m$ relations. Assuming the bounding boxes and their associated instance type labels are given, we aim to determine whether a relation exists between each specific circuit-table pair.
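To make this formulation concrete, the short sketch below (with hypothetical box coordinates) enumerates the candidate circuit-table pairs over which the binary relation decisions are made; it illustrates the problem setup only and is not part of the model.

```python
# Minimal sketch of the problem formulation (hypothetical data layout).
# Each instance is a bounding box (x1, y1, x2, y2) plus a type label.
from itertools import product

circuits = [(10, 40, 200, 300), (220, 40, 400, 300)]   # n circuit boxes
tables   = [(10, 320, 200, 500)]                        # m table boxes

# The model must output a binary decision for each of the n * m pairs.
candidate_pairs = list(product(range(len(circuits)), range(len(tables))))
print(len(candidate_pairs))  # n * m candidate relations
```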
III-B Model Architecture
ViRED has three main components: a pretrained vision encoder, a lightweight object encoder, and a fast relation decoder. Fig. 2 presents the overview of our model pipeline. We will describe these components in the following sections.
III-B1 Vision encoder
The vision encoder processes an input image $I$ and produces an image embedding, either as a feature vector or as a feature map of dimensions $C \times H \times W$, where $C$, $W$, and $H$ denote the channels, width, and height of the feature map. When a feature map is produced, it is flattened into a sequence of feature vectors. To avoid bias, a vision encoder pretrained with the Masked Autoencoder (MAE) [30] scheme is adopted; specifically, the Document Image Transformer (DiT) [31] serves as our vision encoder. The vision encoder runs only once per image, regardless of the number of objects or relationships in the image.
$F = \mathrm{Enc}_{\mathrm{vis}}(I) \in \mathbb{R}^{C \times H \times W}$  (1)
$Z_{\mathrm{img}} = \mathrm{Flatten}(F) \in \mathbb{R}^{HW \times C}$  (2)
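As a rough illustration of Eqs. 1-2, the following PyTorch sketch wraps a stand-in backbone and flattens its feature map into a sequence of image tokens; the toy convolutional backbone and the 224-pixel input are placeholders for the actual DiT/DINOv2 encoder and input size.

```python
import torch
import torch.nn as nn

class VisionEncoderWrapper(nn.Module):
    """Wraps any backbone that returns a C x H x W feature map and
    flattens it into an (H*W) x C sequence of image tokens (Eqs. 1-2).
    The backbone here is a stand-in; the paper uses a DiT/DINOv2 ViT."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(image)             # (B, C, H, W) feature map
        return feat.flatten(2).transpose(1, 2)  # (B, H*W, C) image tokens

# Example with a toy convolutional backbone as the stand-in encoder.
toy_backbone = nn.Conv2d(3, 768, kernel_size=16, stride=16)
encoder = VisionEncoderWrapper(toy_backbone)
tokens = encoder(torch.randn(1, 3, 224, 224))   # -> (1, 196, 768)
```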
III-B2 Object encoder
The object encoder maps a bounding box and its type information into a vector embedding. We use a convolutional neural network (CNN) to encode the bounding boxes: the bounding box of object $i$ is rendered as a one-channel mask image $M_i$ with the same dimensions as the engineering drawing image, in which pixels inside the bounding box are set to 1 and pixels outside are set to 0. The mask image is then encoded by a three-layer CNN.
$z_i = \mathrm{CNN}(M_i)$  (3)
To inform the relation decoder whether an embedding token represents a circuit or a table, we add one of two learned type embeddings, $e_{\mathrm{circuit}}$ or $e_{\mathrm{table}}$, to the vector embedding of each bounding box.
$t_i = z_i + e_{\mathrm{type}(i)}, \quad e_{\mathrm{type}(i)} \in \{e_{\mathrm{circuit}}, e_{\mathrm{table}}\}$  (4)
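A minimal PyTorch sketch of the object encoder under the above description; the CNN layer widths, strides, and pooling below are illustrative choices rather than the exact configuration used in ViRED.

```python
import torch
import torch.nn as nn

def box_to_mask(box, height, width):
    """Rasterize a bounding box (x1, y1, x2, y2) into a one-channel mask:
    pixels inside the box are 1, pixels outside are 0."""
    x1, y1, x2, y2 = box
    mask = torch.zeros(1, height, width)
    mask[:, y1:y2, x1:x2] = 1.0
    return mask

class ObjectEncoder(nn.Module):
    """Three-layer CNN over the box mask plus a learned type embedding
    (circuit = 0, table = 1), following Eqs. 3-4."""
    def __init__(self, dim=768):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 7, stride=4), nn.ReLU(),
            nn.Conv2d(16, 64, 5, stride=4), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.type_embed = nn.Embedding(2, dim)  # circuit / table

    def forward(self, mask, type_id):
        z = self.cnn(mask.unsqueeze(0)).flatten(1)        # (1, dim) box feature
        return z + self.type_embed(type_id).unsqueeze(0)  # add type embedding

enc = ObjectEncoder()
m = box_to_mask((10, 40, 200, 300), height=512, width=512)
token = enc(m, torch.tensor(0))   # object token for a circuit
```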
III-B3 Relation decoder
The relation decoder takes the encoded masks and vectorized images as input and predicts if there is a relation between each circuit-table pair. In detail, the relation decoder consists of two components: the fusion model and the relation prediction model.
The fusion model performs feature fusion between object masks and image features. Inspired by DETR [3], a transformer-decoder-based model is adopted. We modify the standard transformer decoder by removing the relative position embedding and the causal mask, since objects have no inherent order; this turns the decoder into a bidirectional transformer decoder. As shown in Fig. 2, each decoder layer consists of four parts. 1. The tokens are first processed by a self-attention block, as shown in Eq. 5, which lets the mask tokens interact with each other and allows the objects to determine their relative positions. 2. Next, a cross-attention block is applied, in which the image embeddings serve as the key and value vectors and the mask tokens serve as the query vectors, as in Eq. 6. Through this image-to-object cross-attention, the mask tokens are updated with a combination of bounding box features and vision features. 3. Then, a feed-forward layer updates all the mask tokens, and a dropout layer is used to improve the model's generalizability. 4. Finally, a residual connection is added around every attention layer and feed-forward layer, in accordance with the standard transformer architecture [10].
$T' = T + \mathrm{SelfAttn}(Q{=}T,\ K{=}T,\ V{=}T)$  (5)
$T'' = T' + \mathrm{CrossAttn}(Q{=}T',\ K{=}Z_{\mathrm{img}},\ V{=}Z_{\mathrm{img}})$  (6)
where $T = [t_1, \ldots, t_N]$ stacks the object (mask) tokens.
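The following sketch implements one such bidirectional fusion layer with PyTorch's nn.MultiheadAttention, assuming post-norm residuals and a 4x feed-forward expansion; these are conventional choices, not necessarily those of the released model.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """One bidirectional decoder layer: object-object self-attention (Eq. 5),
    object-to-image cross-attention (Eq. 6), then a feed-forward block.
    No positional encoding or causal mask, since objects are unordered."""
    def __init__(self, dim=768, heads=12, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, obj_tokens, img_tokens):
        x = obj_tokens
        x = x + self.self_attn(x, x, x, need_weights=False)[0]        # Eq. 5
        x = self.norm1(x)
        x = x + self.cross_attn(x, img_tokens, img_tokens,
                                need_weights=False)[0]                # Eq. 6
        x = self.norm2(x)
        x = x + self.ffn(x)                                           # FFN + residual
        return self.norm3(x)

layer = FusionLayer()
objs = torch.randn(1, 5, 768)    # 5 object tokens
imgs = torch.randn(1, 196, 768)  # flattened image tokens
out = layer(objs, imgs)          # (1, 5, 768)
```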
After the fusion model produces the fused object tokens, the relation prediction model identifies the relationships between them. Pairs of object tokens are concatenated to generate combination tokens, as depicted in Fig. 3; to streamline training and inference, redundant and infeasible combinations are eliminated. The relation prediction model consists of a three-layer perceptron with ReLU activations followed by a linear projection layer, which maps the hidden features to a two-dimensional logit output indicating whether a relationship exists between the two objects.
$c_{ij} = [\,t''_i \,\|\, t''_j\,]$  (7)
$\hat{y}_{ij} = \mathrm{Linear}(\mathrm{MLP}(c_{ij})) \in \mathbb{R}^{2}$  (8)
where $t''_i$ and $t''_j$ are the fused tokens of a circuit and a table, respectively, and $\|$ denotes concatenation.
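A sketch of the pair construction and scoring in Eqs. 7-8, assuming the fused tokens have already been split into circuit tokens and table tokens so that only feasible circuit-table combinations are scored; hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Concatenate each circuit token with each table token (Eq. 7) and map
    the pair through an MLP + linear projection to 2-way logits (Eq. 8)."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )
        self.proj = nn.Linear(dim, 2)  # "describe" vs. "not describe"

    def forward(self, circuit_tokens, table_tokens):
        n, m = circuit_tokens.size(0), table_tokens.size(0)
        # Build all n * m circuit-table pairs; infeasible pairs (circuit-circuit,
        # table-table) are never constructed, which prunes redundant combinations.
        pairs = torch.cat([
            circuit_tokens.unsqueeze(1).expand(n, m, -1),
            table_tokens.unsqueeze(0).expand(n, m, -1),
        ], dim=-1)                         # (n, m, 2*dim)
        return self.proj(self.mlp(pairs))  # (n, m, 2) logits

head = RelationHead()
logits = head(torch.randn(3, 768), torch.randn(2, 768))  # 3 circuits, 2 tables
pred = logits.argmax(-1)  # 1 where a relation is predicted
```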
III-C Pretraining
Due to the scarcity of annotated training data for relation detection in engineering drawings, we utilize the PubLayNet [11] dataset during the pretraining phase. PubLayNet contains 340,000 document images with 3 million bounding boxes and type annotations for the instances within those images. To enhance the model's comprehension of positional information, we devise a pretraining task in which our model classifies the instance type of a masked region, as shown in Fig. 2-(d). The pretraining model differs only slightly from the relation prediction model: since the task is to predict the category of the provided bounding boxes, the type embeddings (i.e., circuit or table) that the object encoder normally adds to the bounding-box embeddings are excluded, and the relation prediction model is replaced with a multi-layer perceptron that performs classification.
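A sketch of what this pretraining head could look like, assuming PubLayNet's five instance categories (text, title, list, table, figure) and a simple classification MLP in place of the relation prediction model; the exact head depth and loss configuration of the paper are not reproduced.

```python
import torch
import torch.nn as nn

NUM_PUBLAYNET_CLASSES = 5  # text, title, list, table, figure

class MaskedRegionClassifier(nn.Module):
    """Pretraining head: given a fused object token for a masked region,
    predict its PubLayNet instance category (the relation head is not used)."""
    def __init__(self, dim=768, num_classes=NUM_PUBLAYNET_CLASSES):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_classes))

    def forward(self, obj_tokens):     # (B, N, dim) fused object tokens
        return self.head(obj_tokens)   # (B, N, num_classes) logits

criterion = nn.CrossEntropyLoss()
logits = MaskedRegionClassifier()(torch.randn(2, 7, 768))
loss = criterion(logits.flatten(0, 1), torch.randint(0, 5, (14,)))
```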
IV Experiments
IV-A Experiment Setup
IV-A1 Dataset
In the pretraining phase, our method is trained and tested on the PubLayNet dataset with the same train and test splits as the original release. For the finetuning phase, we use a self-annotated dataset of engineering design drawings containing 283 images, 4,566 entities, and 2,112 relations between entities; this dataset is randomly split, with 90% used for training and 10% for evaluation.
IV-A2 Implementation Details
The dimension of the latent representation throughout the model pipeline is set to 768. For the vision encoder backbone, we adopt the DINOv2-B model [35], which consists of 12 Transformer encoder layers, each containing a 12-head multi-head attention block. The input image is divided into patches of size 14, which are processed and encoded into 768-dimensional feature vectors. We finetune the backbone network with the same MAE training scheme as DiT [31], given its state-of-the-art performance on vision-based document downstream tasks. The bounding box encoder is a three-layer convolutional network with ReLU activations and a linear projection layer that maps the feature map to a 768-dimensional feature vector. The relation decoder is a 2-layer Transformer decoder for multi-modal feature fusion. The vision encoder and the object encoder operate on inputs of different sizes; despite this difference, we maintain the relative positions between the images and the masks to help the model better understand the positional relations between them.
We implement our model in PyTorch. For the pretraining phase, our model is trained with a batch size of 64 for 100 epochs on the PubLayNet dataset, consuming 90 NVIDIA A800 GPU hours. During the finetuning phase, we initialize the type embeddings to zero and the multi-layer perceptron of the relation extractor with uniformly random values, while the remaining parameters are loaded from the pretraining checkpoint.
We apply data augmentation during training, including horizontal and vertical flips with a probability of 0.2, dihedral-group transformations, and fixed-size random cropping. Unlike common random cropping, we ensure that no instance in an image is lost due to cropping.
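One possible way to realize such instance-preserving cropping is rejection sampling over crop positions, as in the sketch below; the function name and fallback behavior are hypothetical and not the paper's exact augmentation code.

```python
import random

def instance_preserving_crop(img_w, img_h, boxes, crop_w, crop_h, tries=100):
    """Sample a fixed-size crop whose window contains every instance box,
    so no instance is lost by cropping; returns None if no valid crop is found."""
    for _ in range(tries):
        x0 = random.randint(0, img_w - crop_w)
        y0 = random.randint(0, img_h - crop_h)
        if all(x0 <= x1 and y0 <= y1 and x2 <= x0 + crop_w and y2 <= y0 + crop_h
               for (x1, y1, x2, y2) in boxes):
            return x0, y0, crop_w, crop_h
    return None  # caller can skip cropping for this sample

crop = instance_preserving_crop(2048, 1448, [(100, 120, 600, 700)], 1024, 1024)
```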
For optimization, we use AdamW with weight decay. A linear learning rate schedule is employed, with the first 20% of epochs used for warmup.
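A sketch of this optimization setup with AdamW and a linear warmup-then-decay schedule; the learning rate, weight decay, and other numeric values below are placeholders, since the paper's exact settings are not reproduced here.

```python
import torch

# Placeholder hyper-parameters; not the paper's exact values.
base_lr, weight_decay, epochs, warmup_frac = 1e-4, 0.05, 100, 0.2

model = torch.nn.Linear(768, 2)  # stand-in for ViRED
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                              weight_decay=weight_decay)

warmup_epochs = int(epochs * warmup_frac)

def lr_lambda(epoch):
    """Linear warmup for the first 20% of epochs, then linear decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return max(0.0, (epochs - epoch) / (epochs - warmup_epochs))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    # ... one training epoch over the engineering-drawing dataset ...
    optimizer.step()   # (called per batch in practice)
    scheduler.step()   # step the schedule once per epoch
```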
IV-B Metrics
Unlike the R@x metric commonly used in relation detection tasks [26], we use mAP, precision, and recall to assess model performance. Standard relation detection benchmarks may not annotate all possible relations in an image, whereas in our task the relations between tables and circuits are unambiguous, so mAP is a more accurate indicator of model performance. Accuracy is used as the metric for the comparison experiments.
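Because every circuit-table pair carries a ground-truth label in our setting, these metrics reduce to standard binary-classification measures over all pairs, as the following sketch with scikit-learn (an assumed tooling choice) illustrates on toy data.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

# Ground-truth labels and predicted scores for every circuit-table pair.
y_true  = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.92, 0.10, 0.76, 0.40, 0.55, 0.05])
y_pred  = (y_score >= 0.5).astype(int)

print("AP:",        average_precision_score(y_true, y_score))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:",    recall_score(y_true, y_pred))
```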
IV-C Experiment Results
IV-C1 Qualitative Result
Qualitative results are shown in Fig. 4. The blue and green areas represent the bounding box regions of circuits and tables, respectively. In the presence of a relationship between a table and a circuit, a line segment will link the two entities.
IV-C2 Comparison
In this section, we evaluate our model against current methods for relation detection. We focus on visual relation detection instead of text-based methods for document relation extraction. To align with the visual task, we modify the dataset by categorizing textual descriptions as either “table” or “circuit.” Relations are established between all circuit-table pairs, with existing relations labeled as “describe” and non-existing relations labeled as “not describe.”
Table I provides a comparative analysis of our relation prediction method against several established approaches, with accuracy used as the evaluation metric. The results demonstrate that our method surpasses the existing approaches on our engineering drawing relation prediction dataset. Importantly, we follow the evaluation protocols of the compared approaches to keep the metrics consistent; for MBBR, accuracy is calculated over the top-100 relation predictions.
Vision Encoder | Object Encoder | Image Size | #Rel. Decoder Layer | mAP | Precision | Recall
ViT-B/14 | 3-Layer CNN | – | 2 | 89.00 | 88.41 | 76.39
ViT-S/14 | 3-Layer CNN | – | 2 | 86.37 | 85.83 | 59.27
ViT-L/14 | 3-Layer CNN | – | 2 | 89.80 | 82.79 | 72.66
ResNet-50 | 3-Layer CNN | – | 2 | 84.33 | 75.94 | 68.47
ViT-B/14 | 3-Layer CNN | – | 1 | 84.77 | 81.20 | 75.05
ViT-B/14 | 3-Layer CNN | – | 3 | 88.91 | 85.37 | 79.56
ViT-B/14 | 6-Layer ViT | – | 2 | 59.57 | 50.00 | 14.07
ViT-B/14 | 3-Layer CNN | – | 2 | 88.16 | 85.03 | 79.18
ViT-B/14 | 3-Layer CNN | – | 2 | 87.52 | 86.55 | 80.91
IV-C3 Ablation study
In this section, we evaluate our method in different settings and strategies.
We conduct an additive ablation study on the mAP, precision, and recall of relation prediction on the validation set, as illustrated in Tab. III. In the baseline model, we initialize the model with random parameters without pretraining and omit the type embedding from the object encoder. Introducing the type embedding allows the model to better differentiate the categories of object tokens: when computing self-attention over object tokens, the relation decoder can simultaneously consider the objects' categories, thereby reducing computation on improbable relations and improving its understanding of relations. Due to insufficient data in our task's dataset, we introduce a pretraining mechanism for both the vision encoder and the position encoder-decoder. Through unsupervised training on a large-scale document image dataset, the vision encoder extracts visual features from electrical engineering drawings more effectively, while supervised pretraining to predict the category of a selected region from its positional information enhances the model's understanding of positional inputs. As shown in Tab. III, including the type embedding and model pretraining significantly improves performance across all evaluation metrics.
Ablate | mAP | Precision | Recall |
Base model | 88.26 | 81.52 | 75.44 |
+ Type Embedding | 90.48 | 81.15 | 76.27 |
+ DiT-style Pretrain | 88.93 | 85.69 | 79.16 |
+ Position Pretrain | 91.65 | 90.11 | 83.23 |
We also conducted ablation experiments on the model architecture. We compared the effects of various vision encoders, diverse object encoders, different input image resolutions, and different numbers of transformer decoder layers in the relation decoder on the performance. The experimental results are shown in Tab. II. The hidden dimension of the model is determined by the output dimension of the vision encoder. For ViT-S, ViT-B, ViT-L [35], and ResNet-50 [36], the dimensions for latent representation are set to 384, 768, 1024, and 768, respectively. Upon analyzing the experimental results, it becomes evident that the object encoder architecture significantly influences model performance. The relation decoder struggles to interpret object tokens encoded by ViT, resulting in a notable decline in performance across various metrics. Additionally, the vision encoder impacts the latent representation dimension of the model, thereby also playing a crucial role in determining model performance. The findings suggest that ViT-S performs poorly due to its constrained embedding dimension. For ViT-B and ViT-L, the limited task difficulty results in negligible performance differences. Higher image resolutions improve the vision encoder’s capacity to capture more detailed image features. However, the vision encoder’s ability to encode all these details is restricted by the dimension of the image feature vector, leading to a minimal effect on performance.
IV-C4 Inference Efficiency
To evaluate the inference efficiency of the model, we compute the FLOPs (floating-point operations) of the inference process. For uniformity, a batch size of 1 is employed for all models. The input image resolution is held fixed, with the number of objects to be predicted ranging from 1 to 20; the relations to be predicted cover all possible circuit-table pairs among these objects. The results are presented in Fig. 5. Owing to the lightweight object encoder and relation decoder, the inference efficiency of ViRED is only minimally affected by the number of objects in an electrical engineering drawing, and it sustains a rapid inference speed even when a single drawing contains numerous objects.
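Such FLOPs measurements can be obtained with an off-the-shelf flop counter; the sketch below uses fvcore's FlopCountAnalysis on a stand-in model, which is an assumed tooling choice rather than the profiler used in the paper.

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis  # assumed available; other counters work too

model = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(64, 2))          # stand-in for ViRED
image = torch.randn(1, 3, 224, 224)              # batch size fixed to 1

flops = FlopCountAnalysis(model, image)
print(f"{flops.total() / 1e9:.2f} GFLOPs")       # total FLOPs for one forward pass
```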

V Conclusion
We propose a novel relation prediction method and present a dataset for relation detection in electrical engineering drawings. We apply this relation prediction method to the task of relation detection within the dataset, achieving high performance on the validation set through the processes of pretraining and finetuning the model. We conduct a series of experiments, including comparative analyses with existing methods and ablation studies on training strategies and model architectures. The experimental results indicate that our method achieves superior performance on this task.
References
- [1] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- [2] G. Jocher, “YOLOv5 by Ultralytics,” May 2020. [Online]. Available: https://github.com/ultralytics/yolov5
- [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision. Springer, 2020, pp. 213–229.
- [4] Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “Layoutlm: Pre-training of text and layout for document image understanding,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 1192–1200. [Online]. Available: https://doi.org/10.1145/3394486.3403172
- [5] W. Yu, N. Lu, X. Qi, P. Gong, and R. Xiao, “Pick: processing key information extraction from documents using improved graph learning-convolutional networks,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 4363–4370.
- [6] Q. Peng, Y. Pan, W. Wang, B. Luo, Z. Zhang, Z. Huang, T. Hu, W. Yin, Y. Chen, Y. Zhang et al., “Ernie-layout: Layout knowledge enhanced pre-training for visually-rich document understanding,” arXiv preprint arXiv:2210.06155, 2022.
- [7] K. Liang, Y. Guo, H. Chang, and X. Chen, “Visual relationship detection with deep structural ranking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- [8] S. Abdelkarim, A. Agarwal, P. Achlioptas, J. Chen, J. Huang, B. Li, K. Church, and M. Elhoseiny, “Exploring long tail visual relationship recognition with large vocabulary,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 921–15 930.
- [9] L. Zhao, L. Yuan, B. Gong, Y. Cui, F. Schroff, M.-H. Yang, H. Adam, and T. Liu, “Unified visual relationship detection with vision and language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6962–6973.
- [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [11] X. Zhong, J. Tang, and A. Jimeno Yepes, “Publaynet: Largest dataset ever for document layout analysis,” in 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1015–1022.
- [12] L. Hao, L. Gao, X. Yi, and Z. Tang, “A table detection method for pdf documents based on convolutional neural networks,” in 2016 12th IAPR Workshop on Document Analysis Systems (DAS), 2016, pp. 287–292.
- [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
- [14] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
- [15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28. Curran Associates, Inc., 2015. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf
- [16] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [17] S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed, “Deepdesrt: Deep learning for detection and structure recognition of tables in document images,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, 2017, pp. 1162–1167.
- [18] D. Prasad, A. Gadpal, K. Kapadni, M. Visave, and K. Sultanpure, “Cascadetabnet: An approach for end to end table detection and structure recognition from image-based documents,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020.
- [19] J. Fernandes, M. Simsek, B. Kantarci, and S. Khan, “Tabledet: An end-to-end deep learning approach for table detection and table image classification in data sheet images,” Neurocomputing, vol. 468, pp. 317–334, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231221015010
- [20] C. Soto and S. Yoo, “Visual detection with context for document layout analysis,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 3464–3470. [Online]. Available: https://aclanthology.org/D19-1348
- [21] R. Smith, “An overview of the tesseract ocr engine,” in Ninth international conference on document analysis and recognition (ICDAR 2007), vol. 2. IEEE, 2007, pp. 629–633.
- [22] S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, “Docformer: End-to-end transformer for document understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 993–1003.
- [23] Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. Florencio, C. Zhang, W. Che, M. Zhang, and L. Zhou, “LayoutLMv2: Multi-modal pre-training for visually-rich document understanding,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computational Linguistics, Aug. 2021, pp. 2579–2591. [Online]. Available: https://aclanthology.org/2021.acl-long.201
- [24] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “Layoutlmv3: Pre-training for document ai with unified text and image masking,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4083–4091.
- [25] J. Cheng, L. Wang, J. Wu, X. Hu, G. Jeon, D. Tao, and M. Zhou, “Visual relationship detection: A survey,” IEEE Transactions on Cybernetics, vol. 52, no. 8, pp. 8453–8466, 2022.
- [26] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei, “Visual relationship detection with language priors,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer, 2016, pp. 852–869.
- [27] Y. Li, W. Ouyang, X. Wang, and X. Tang, “Vip-cnn: Visual phrase guided convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1347–1356.
- [28] B. Zhuang, L. Liu, C. Shen, and I. Reid, “Towards context-aware interaction recognition for visual relationship detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 589–598.
- [29] L. Li, C. Yuhui, and L. Xiaoting, “Engineering drawing recognition model with convolutional neural network,” in Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, 2019, pp. 112–116.
- [30] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009.
- [31] J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei, “Dit: Self-supervised pre-training for document image transformer,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 3530–3539.
- [32] M.-J. Chiou, R. Zimmermann, and J. Feng, “Visual relationship detection with visual-linguistic knowledge from multimodal representations,” IEEE Access, vol. 9, pp. 50 441–50 451, 2021.
- [33] H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua, “Visual translation embedding network for visual relation detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5532–5540.
- [34] Z. Anastasakis, D. Mallis, M. Diomataris, G. Alexandridis, S. Kollias, and V. Pitsikalis, “Self-supervised learning for visual relationship detection through masked bounding box reconstruction,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1206–1215.
- [35] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
- [36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.