
Jielin Qiu, William Han, Winfred Wang, Christos Faloutsos, Lei Li
Carnegie Mellon University, USA
Email: [email protected]

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Lijuan Wang
Microsoft Cloud & AI, USA

Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition

Jielin Qiu · William Han · Winfred Wang · Zhengyuan Yang · Linjie Li · Jianfeng Wang · Christos Faloutsos · Lei Li · Lijuan Wang
(Received: date / Accepted: date)
Abstract

Open-domain real-world entity recognition is a critical and challenging task, involving the identification of a wide range of entities, from objects to specific scenes, in diverse and unstructured environments. This field emphasizes the difficulty of achieving precise recognition across broad and uncontrolled settings. A major obstacle in this field has been the lack of a suitable evaluation dataset, which is difficult to compile due to the sheer number of entities involved and the intensive human effort required for filtering and curating the data. To address this, we present Entity6K, a comprehensive open-domain evaluation dataset specifically designed for real-world entity recognition. Entity6K encompasses an extensive range of 5,700 real-world entities, organized into 26 primary categories. Each entity is supported by five human-verified images, complete with human annotations. A key attribute of Entity6K is its broad and diverse list of entity names and categorizations, filling a gap that existing datasets have not addressed. We also conducted benchmarks using existing models on four distinct tasks: image captioning, object detection, zero-shot image classification, and dense captioning. Through these benchmarks, we aim to demonstrate the effectiveness of our dataset in evaluating models’ abilities to recognize a wide range of real-world entities. We believe that Entity6K will serve as a valuable resource in this field, facilitating more advanced and accurate entity recognition in open-domain settings.

Figure 1: Comparison between Entity6K and existing datasets. Existing datasets may contain only a single large entity, ambiguous entity names, no bounding boxes, or short or missing captions; in contrast, our dataset contains entities in complex environments, with specific names, human-labeled bounding boxes, and human-written captions.

1 Introduction

Recognizing entities from images is inherently difficult due to several factors. First, the visual complexity and variability of real-world scenes pose challenges in accurately identifying and localizing entities of interest. Images can contain multiple entities, occlusions, variations in lighting conditions, and diverse object appearances, making it challenging to discern and differentiate entities. Second, the open-domain nature of the task introduces the need for extensive knowledge representation and generalization to encompass a wide range of entities, including those not present during training. This requires models to learn abstract representations that capture the underlying characteristics of entities across different visual contexts.

Despite these challenges, open-domain entity recognition from images offers significant value across various domains, including scene understanding, object detection, visual search, recommendation systems, and augmented reality, enhancing user experiences and providing valuable insights from visual data. Additionally, entity recognition from images contributes to the development of intelligent systems, including image captioning, visual question answering, and content generation.

To address the complexities of open-domain entity recognition from images, researchers have developed diverse approaches. Deep learning models with transfer learning techniques, leveraging models pretrained on large-scale datasets, have proven effective in improving recognition accuracy. However, there is no large and diverse evaluation dataset that can be used to test the entity recognition capabilities of different models.

The absence of an existing dataset for open-domain real-world entity recognition stems from several reasons. Firstly, the task necessitates a large and diverse entity list encompassing various real-world objects. Compiling and maintaining such a comprehensive list is a formidable challenge, given the constantly evolving nature of entities and the need for accurate and up-to-date information. Additionally, the manual filtering and curation process required to ensure data quality and relevance imposes a substantial human effort and time burden, making it hard to create a dataset at scale. Furthermore, the lack of standardized evaluation benchmarks inhibits progress and hinders meaningful comparisons between different approaches.

Therefore, in this research effort, we have unveiled “Entity6K," a substantial open-domain dataset tailored for the recognition of real-world entities. Entity6K encompasses a collection of 5,700 authentic entities, with each entity thoughtfully matched with five human-validated images and accompanying annotations, thus amassing a grand total of 28,500 images. What sets Entity6K distinctly apart from its predecessors is the remarkable diversity and comprehensiveness of its entity name list, a feature hitherto absent in existing datasets.

Our contributions can be summarized as follows:

  • We have introduced Entity6K, a vast and varied dataset encompassing 5,700 distinct entities, serving as a robust evaluation resource for assessing the entity recognition capabilities of diverse models.

  • Each image within the dataset has undergone meticulous manual scrutiny to guarantee its quality and accuracy. Additionally, we have enriched the dataset by incorporating additional human annotations, ensuring its suitability for thorough evaluation.

  • We have conducted a comprehensive benchmarking exercise to evaluate the performance of several pretrained models across four distinct tasks, including image captioning, object detection, zero-shot image classification, and dense captioning. This analysis sheds light on the capabilities of these models in real-world entity recognition scenarios.

2 Related Work

Open-domain Entity Recognition

in the context of image processing is a burgeoning field that focuses on the automated identification and extraction of a variety of entities, such as objects, people, and locations, from photographic images. This task is particularly challenging as it requires the recognition system to operate without the crutch of domain-specific knowledge or pre-established contextual information. Hu2023OpendomainVE introduced the task of open-domain visual entity recognition, in which a model links an image to a Wikipedia entity based on a text query. However, this method requires a text query to retrieve the entity name from the Wikipedia entity list.

Zero-Shot Image Classification

focuses on identifying image classes that weren’t seen during training, a concept explored in studies by (Lampert2014AttributeBasedCF), (Liu2019LargeScaleLR), and (Vinyals2016MatchingNF). Due to its complexity, researchers have also considered the few-shot setting, which deals with limited training data. Key contributions in this area include works by (Snell2017PrototypicalNF), (Finn2017ModelAgnosticMF), (Rusu2018MetaLearningWL), and (Ye2018FewShotLV), focusing on developing effective models for this more manageable approach.

Figure 2: The pipeline of dataset construction.
Object Detection

algorithms, such as Faster R-CNN (Ren2015FasterRT) or YOLO (Redmon2015YouOL), can be used to identify and localize objects within an image. These algorithms typically output bounding boxes around detected objects along with their corresponding class labels. Kuo2022FVLMOO proposed F-VLM, an open-vocabulary object detection method built upon Frozen Vision and Language Models. Li2021GroundedLP proposed a GLIP model for learning object-level, language-aware, and semantic-rich visual representations, which unified object detection and phrase grounding for pretraining. Zhang2022GLIPv2UL unified localization pretraining and Vision-Language pretraining, which can be used for object detection and instance segmentation.

3 Entity6K Dataset

In this section, we describe how the Entity6K dataset was collected and how the annotations were conducted.

3.1 Data Acquisition

Define the Scope

Our first step is to compile a diverse array of entity names covering a wide range of real-world entities, including businesses, products, and individuals. To accomplish this, we organized our selection into 26 distinct categories: mammals, fish, birds, reptiles, amphibians, landmarks, food items, electronics, crafts, fruits, vegetables, sports, household items, games, toys, currency, celebrities, beverages, healthcare topics, insects, plants, desserts, instruments, rocks, cars, and beauty entities.

Within each of these categories, we employed Wikipedia as a valuable resource to identify specific entity names. Our primary objective is to evaluate the system’s capacity to accurately recognize precise entities, so we prioritize names that exhibit a high level of specificity. For instance, we favor names like “German Shepherd" or “Alaskan Malamute" over more general terms such as “Dog." This unique approach differentiates our dataset from existing ones.
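To illustrate how such specific entity names can be gathered at scale, the sketch below enumerates page titles from a Wikipedia category via the public MediaWiki API. The category name and filtering choices are illustrative assumptions, not a description of our exact curation pipeline.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # public MediaWiki API endpoint

def category_members(category, limit=500):
    """Enumerate page titles in a Wikipedia category (one possible way to
    collect candidate entity names; the actual curation procedure is not
    specified here)."""
    titles = []
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",        # skip sub-categories and files
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params, timeout=30).json()
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:       # no more result pages
            break
        params.update(data["continue"])  # follow the continuation token
    return titles

# Illustrative usage: specific dog breeds rather than the generic "Dog".
candidates = category_members("Dog breeds")
```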

Data Collection and Licenses

After compiling a thorough and varied list of unique entities with no repetitions, the next step is to acquire images. We accomplish this by using the entity names as search queries on Flickr (https://www.flickr.com/photos/tags/dataset/). These images have been shared on Flickr by their respective creators under licenses that include Creative Commons BY 2.0, Creative Commons BY-NC 2.0, Public Domain Mark, and Public Domain CC 1.0, all of which permit usage, redistribution, and modification for non-commercial purposes.

Fidelity Control

The dataset comprises 28,500 high-quality images with significant diversity, all sourced from Flickr, thereby inheriting the biases in that database. Initially, we compiled 12,003 entity names across 26 categories. For each entity, we collected ten images from Flickr with approved licenses, saving the relevant metadata in a JSON file, including original image URLs, authors, and licenses. Subsequently, Amazon Mechanical Turk (https://www.mturk.com/) was employed to assess image quality through two key steps: (1) three human judges verified whether each saved image accurately corresponded to the entity; any mismatches led to image deletion; (2) following this verification, entities lacking five saved images were removed from our list. For entities with more than five images, five were randomly sampled, forming our final dataset. After these fidelity control measures, we retained 5,700 entities, a retention rate of approximately 47.5%. The detailed numbers of entities in each category before and after the fidelity control step are shown in Table 1.
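The fidelity control rules can be summarized by the following sketch, which keeps an entity only if at least five of its candidate images survive judge verification and then samples exactly five of them. Here fetch_candidates and approved_by_judges are hypothetical placeholders for the Flickr crawl and the Mechanical Turk verification described above.

```python
import random

IMAGES_PER_ENTITY = 10   # candidate images collected per entity
IMAGES_KEPT = 5          # images retained after verification

def apply_fidelity_control(entity_names, fetch_candidates, approved_by_judges):
    """Apply the fidelity-control rules to a raw entity list.

    fetch_candidates(name) -> list of image records (url, author, license, ...)
    approved_by_judges(name, image) -> True iff the judges agree the image
    depicts the named entity.  Both are placeholders for the Flickr download
    and the Mechanical Turk verification steps.
    """
    dataset = {}
    for name in entity_names:
        candidates = fetch_candidates(name)[:IMAGES_PER_ENTITY]
        verified = [img for img in candidates if approved_by_judges(name, img)]
        if len(verified) < IMAGES_KEPT:
            continue                      # entity removed from the final list
        dataset[name] = random.sample(verified, IMAGES_KEPT)
    return dataset

# Image metadata (original URLs, authors, licenses) is saved to JSON separately.
```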

Figure 3: Statistics of the entities in each category.
Table 1: More details for fidelity control, where “Initial Entities" and “Final Entities" mean the number of entities before/after the fidelity control step, respectively.

Main category     Initial Entities     Final Entities
mammals           778                  545
fish              1089                 277
birds             739                  705
reptiles          141                  63
amphibians        211                  162
landmark          500                  158
food              483                  181
electronics       432                  103
crafts            490                  214
fruit             361                  194
vegetable         389                  226
sports            694                  172
household         120                  102
games             198                  62
toys              231                  99
currency          157                  45
celebrity         1515                 1009
drink             300                  31
healthcare        100                  42
insect            369                  206
plant             606                  436
dessert           400                  323
instruments       477                  116
rock              217                  79
cars              588                  133
beauty            418                  17
Summary           12,003               5,700

Figure 4: Examples of the collected data in the Entity6K dataset, where each image is associated with the entity region (bounding box) and the textual descriptions, centering on the specific entity.

3.2 Human Annotation

The dataset labeling process comprises two distinct stages with Amazon Mechanical Turk:

Bounding Box Annotation

In the initial phase, a single annotator is assigned the task of outlining bounding boxes for each image. The annotator is provided with the corresponding entity name for the image and is responsible for marking the relevant region within that image. The goal is to establish a single bounding box for each image.

Textual Description Annotation

Following the completion of the initial bounding box marking phase by the first annotator, the second step involves five different annotators independently creating textual descriptions for each image. These annotators are given the entity name associated with each image to assist them in crafting their text captions. It’s crucial to emphasize that all annotators are expected to provide comprehensive and detailed textual descriptions, encompassing as much relevant information as possible. For example, annotators are encouraged to write descriptions such as “A cheerful boy, wearing a white helmet, is riding a vibrant green bicycle, while nearby, a young girl in a pink helmet is seated on a serene blue bicycle, sipping refreshing water" rather than simply stating “Two people riding bikes."
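Combining the two annotation stages, each image in Entity6K can be viewed as a record that pairs the entity name and Flickr metadata with one bounding box and five captions. The exact release format is not fixed here; the field names and values below are a purely illustrative Python sketch of such a record.

```python
# Illustrative (hypothetical) structure of one annotated Entity6K example;
# the actual field names and bounding-box convention may differ.
example_record = {
    "entity": "German Shepherd",                 # specific entity name, not "Dog"
    "category": "mammals",                       # one of the 26 main categories
    "image_url": "https://www.flickr.com/...",   # original Flickr URL (metadata)
    "author": "flickr-user",                     # image author, kept for attribution
    "license": "CC BY-NC 2.0",                   # one of the accepted licenses
    "bbox": [120, 45, 380, 290],                 # single human-drawn box, assumed [x, y, w, h]
    "captions": [                                # five independent detailed captions
        "A German Shepherd with a black and tan coat stands alert in a grassy park ...",
        # ... four more human-written descriptions
    ],
}
```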

3.3 Statistics of the Dataset

In Figure 3, we present the statistics of the gathered Entity6K dataset. Furthermore, Table 2 compares Entity6K with existing datasets. As shown in Table 2, our dataset covers substantially more entities than existing datasets (5,700 versus at most 1,000). Additionally, the entities are categorized and come with verified human annotations, rendering the proposed dataset a valuable resource for real-world entity recognition evaluations.

Table 2: Comparison with existing datasets, where HA is short for Human Annotations.

Dataset                                  Entity    Categories    HA
MSCOCO (Lin2014MicrosoftCC)              80        –             –
ObjectNet (Barbu2019ObjectNetAL)         313       –             –
SUN (Xiao2010SUNDL)                      397       –             –
Open Images (Kuznetsova2018TheOI)        600       –             –
NoCaps (Agrawal2019nocapsNO)             680       –             –
ImageNet (Russakovsky2014ImageNetLS)     1,000     –             –
Entity6K (ours)                          5,700     26            ✓

Table 3: Comparison of Image Captioning results, where the results are averaged across 26 categories.

4 Experimental Settings

4.1 Tasks

We have chosen four tasks to construct our evaluation benchmark: object detection, zero-shot image classification, image captioning, and dense captioning.

4.2 Evaluation Metrics

For each task, we select the corresponding standard metrics for evaluation. For object detection, we use Average Precision (AP). For zero-shot image classification, we use standard accuracy. For image captioning, we adopt BLEU (Papineni2002BleuAM), ROUGE (Lin2004ROUGEAP), METEOR (Banerjee2005METEORAA), and BERTScore (Zhang2020BERTScoreET). For dense captioning, we use mean Average Precision (mAP). Similar to the object detection metric, the dense captioning mAP is measured across a range of thresholds for both localization and description accuracy, following Johnson2015DenseCapFC. For localization, box IoU thresholds of 0.3, 0.4, 0.5, 0.6, and 0.7 are used. For the language description, METEOR score (Banerjee2005METEORAA) thresholds of 0, 0.05, 0.1, 0.15, 0.2, and 0.25 are used. The mAP is the average of the APs over all pairwise combinations of these two types of thresholds.
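To make the dense captioning metric concrete, the sketch below computes a pairwise-threshold mAP under simplifying assumptions: each prediction is given its confidence, its IoU with its best-matching ground-truth region, and a precomputed METEOR score against that region's caption, and the one-to-one matching constraint of the official DenseCap evaluation is omitted.

```python
# Simplified sketch of the dense captioning mAP (not the official evaluator).
IOU_THRESHOLDS = [0.3, 0.4, 0.5, 0.6, 0.7]
METEOR_THRESHOLDS = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25]

def average_precision(hits, num_gt):
    """Non-interpolated AP for a confidence-sorted prediction list.
    hits[k] is True if the (k+1)-th prediction counts as a true positive."""
    tps, ap = 0, 0.0
    for k, hit in enumerate(hits, start=1):
        if hit:
            tps += 1
            ap += tps / k            # precision at this recall point
    return ap / max(num_gt, 1)

def dense_captioning_map(predictions, num_gt):
    """predictions: list of (confidence, iou, meteor) tuples, where iou and
    meteor are measured against each prediction's best-matching ground-truth
    region; num_gt: number of ground-truth regions."""
    ranked = sorted(predictions, key=lambda p: -p[0])
    aps = []
    for t_iou in IOU_THRESHOLDS:
        for t_met in METEOR_THRESHOLDS:
            hits = [iou >= t_iou and met >= t_met for _, iou, met in ranked]
            aps.append(average_precision(hits, num_gt))
    return sum(aps) / len(aps)       # mean over all 5 x 6 threshold pairs

# Toy example: two predictions evaluated against three ground-truth regions.
print(dense_captioning_map([(0.9, 0.65, 0.18), (0.7, 0.35, 0.05)], num_gt=3))
```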

4.3 Benchmark Models

For each task, we selected different baseline models for the benchmark. Specifically, for object detection, we select GLIP (Li2021GroundedLP), GRiT (Wu2022GRiTAG), DINO (zhang2022dino), and ViT-Adapter (chen2022vitadapter). For zero-shot image classification, we select CLIP (Radford2021LearningTV), ALIGN (Jia2021ScalingUV), and GPT-4 (OpenAI2023GPT4TR). For image captioning, we select BLIP (Li2022BLIPBL), OFA (Wang2022UnifyingAT), GIT (Wang2022GITAG), and GRIT (Nguyen2022GRITFA) as baselines. For dense captioning, we adopt FCLN (Johnson2015DenseCapFC) and GRiT (Wu2022GRiTAG).

4.3.1 Object Detection

GLIP

For GLIP (Li2021GroundedLP), we use the GLIP-T model, which uses the Swin-Tiny backbone and is pretrained on Object365 (Shao2019Objects365AL), GoldG (Li2021GroundedLP), Cap4M (Li2021GroundedLP), SBU (NIPS2011_5dd9db5e), and Conceptual Captions (sharma-etal-2018-conceptual). The backbone for the text encoder is the BERT-base model.

GRiT

For GRiT (Wu2022GRiTAG), we use the base GRiT model pretrained with the 12-layer ViT initialized from the masked autoencoder (MAE), which was trained on ImageNet-1K. The text decoder is a 6-layer transformer. The provided checkpoint is also pretrained jointly on object detection and dense captioning.

DINO

For DINO (zhang2022dino), we use the 24-epoch, DINO-4scale pretrained checkpoint. This pretrained model uses ResNet-50 as the backbone, with a 6-layer encoder and a 6-layer decoder for the transformer network (zhang2022dino). The hidden dimension size is 256.

ViT-Adapter

For ViT-Adapter (chen2022vitadapter), we use the large model. The ViT has 24 layers with 16 heads and 303.3 million parameters. The adapter has 16 heads as well and 23.7 million parameters. The backbone used in this pretrained model is the BEiTv2 model (Peng2022BEiTVM).

4.3.2 Zero-shot Image Classification

CLIP-ViT-L

The CLIP (Radford2021LearningTV) model we utilize uses the ViT-Large transformer architecture as the image encoder and a masked self-attention transformer as the text encoder. We use the clip-vit-large-patch14 checkpoint in this setting.
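As a reference for how zero-shot classification is typically run with this checkpoint, the sketch below scores an image against candidate entity names using the Hugging Face transformers interface; the prompt template, candidate list, and image path are illustrative assumptions rather than the exact protocol used in our experiments.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")         # an Entity6K image (placeholder path)
entity_names = ["German Shepherd", "Alaskan Malamute"]   # candidate entity names (illustrative)
prompts = [f"a photo of a {name}" for name in entity_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)         # image-text similarity scores
predicted = entity_names[probs.argmax().item()]
```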

CLIP-ViT-H

This CLIP (Radford2021LearningTV) variant uses the ViT-Huge backbone and was trained on the English subset of LAION-5B. We use the CLIP-ViT-H-14-laion2B-s32B-b79K checkpoint in this setting.

ALIGN

The ALIGN model (Jia2021ScalingUV) uses the EfficientNet (Tan2019EfficientNetRM) as the vision encoder and the BERT model as the text encoder. We used ALIGN-base in this setting.

GPT4

GPT-4 (OpenAI2023GPT4TR) is a large multimodal model capable of processing image and text inputs and producing text outputs.

4.3.3 Image Captioning

BLIP

For BLIP (Li2022BLIPBL), we use the “blip-image-captioning-large" pretrained checkpoint, where ViT-Large is used as the vision transformer and BERT-base as the text transformer (Li2022BLIPBL). We use the phrase “a picture of" as the prompt for the model, following Li2022BLIPBL.
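For reference, a minimal sketch of this captioning setup with the Hugging Face checkpoint is given below; the image path is a placeholder and the generation length is an assumption, with other decoding parameters left at their defaults.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")   # an Entity6K image (placeholder path)
# "a picture of" is used as the captioning prompt, as in the BLIP setup above.
inputs = processor(image, text="a picture of", return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(generated[0], skip_special_tokens=True)
```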

OFA

For OFA (Wang2022UnifyingAT), we use the “OFA-base" pretrained checkpoint, where ResNet101 is used as the backbone (Wang2022UnifyingAT). This model has 180 million parameters, a hidden size of 768, and an intermediate size of 3072. There are 12 heads, six encoder layers, and six decoder layers.

GIT

For GIT (Wang2022GITAG), we use the “git-base-coco" pretrained checkpoint, which contains six layers for the transformer decoder with 12 attention heads. The hidden size is 768, and the model has 347 million parameters.

GRIT

For GRIT (Nguyen2022GRITFA), we use the checkpoint pretrained on four object detection datasets (i.e., COCO, Visual Genome, Open Images, and Object365) (Nguyen2022GRITFA). The hidden size is set to 512, and the number of heads to 8. The model has six layers for the object detector, three layers for the grid feature network, and three layers for the caption generator (Nguyen2022GRITFA).

4.3.4 Dense Captioning

FCLN (Johnson2015DenseCapFC)

FCLN uses the 13 convolutional layers of the VGG-16 architecture as the backbone and an RNN language model as the text decoder (Johnson2015DenseCapFC). The token and hidden layer sizes are both 512.

GRiT-MAE (Wu2022GRiTAG)

Similar to object detection, we use the base GRiT model pretrained with the 12-layer ViT initialized from the masked autoencoder (MAE). The text decoder is also a 6-layer transformer. Since the provided checkpoint is jointly pretrained on object detection and dense captioning, we use the same checkpoint for the two tasks.

4.4 Experimental Settings

In our evaluation of existing models, we followed the instructions provided with each model. Specifically, we used the pretrained weights directly, without any additional training or fine-tuning.

5 Experimental Results

5.1 General Insights

In this section, we provide comparison results and discussions on each task.

Table 4: Comparison of Object Detection results, where the results are averaged across 26 categories. Detailed results for each category can be found in the Appendix.

Object Detection

The Object Detection results are presented in Table 4. According to the findings, GRiT appears to outperform all other baselines across all metrics.

Table 5: Comparison of Zero-shot Image Classification results, where the results are averaged across 26 categories. Detailed results for each category can be found in the Appendix.

Zero-shot Image Classification

The Zero-shot Image Classification results are outlined in Table 5. It is evident that CLIP outperforms ALIGN. Moreover, CLIP with the ViT-H vision encoder exhibits superior performance compared to CLIP with the ViT-L vision encoder, indicating that a larger vision encoder can learn more effective visual representations, enhancing the model’s recognition capabilities. However, GPT-4 achieved the best performance compared with all the baselines, showing its superior ability to recognize real-world entities.

Table 6: Comparison of Dense Captioning results, where the results are averaged across 26 categories. Detailed results for each category can be found in the Appendix.

Image Captioning

The image captioning results are detailed in Table 3. Notably, different models excel under different evaluation metrics: BLIP surpasses the other baselines on ROUGE-1, BLEU, and METEOR, whereas OFA outperforms BLIP on ROUGE-2, ROUGE-L, SPICE, and BERTScore.

Dense Captioning

The Dense Captioning results can be seen in Table 6. While GRiT outperforms FCLN, it’s noteworthy that the results of both models are relatively low, indicating significant room for improvement in this area.

5.2 Detailed Results for Each Category

The detailed results for each category on each task are listed in the following tables. Tables 7, 8, 9, and 10 show detailed image captioning results for OFA, BLIP, GRiT, and GIT, respectively. Tables 11, 12, 13, and 14 show detailed object detection results for GLIP, GRiT, DINO, and ViT-Adapter, respectively. Table 15 shows the detailed Zero-shot Image Classification results, and Table 16 shows the detailed Dense Captioning results across 26 categories.

Across these results, an important observation is that a category's prevalence in our dataset is not directly correlated with performance. We can see this clearly in categories such as cars and birds, which comprise 2.3% and 12.4% of our dataset, respectively. However, in most of the results, the metrics for the birds category are lower than for the cars category. We attribute this to each model being pretrained on a different set of datasets. Overall, by observing the category-wise performance of all models on each task, we conclude that none of the models generalizes well to the complex scenes and detailed textual descriptions in our dataset, underscoring the complexity and challenge posed by the proposed dataset.

Table 7: Comparison of Image Captioning results for each category for OFA.

Table 8: Comparison of Image Captioning results for each category for BLIP.

Table 9: Comparison of Image Captioning results for each category for GRiT.

Table 10: Comparison of Image Captioning results for each category for GIT.

Table 11: Comparison of Object Detection results for each category for GLIP.

Table 12: Comparison of Object Detection results for each category for GRiT.

Table 13: Comparison of Object Detection results for each category for DINO.

Table 14: Comparison of Object Detection results for each category for ViT-Adapter.

Table 15: Comparison of accuracies for Zero-shot Image Classification across 26 categories. CLIP-ViT-L: CLIP-ViT-Large-patch14, CLIP-ViT-H: CLIP-ViT-H-14-laion2B-s32B-b79K.

Table 16: Comparison of mAP scores for Dense Captioning across 26 categories.

5.3 Human evaluation

To gain a more accurate assessment of model performance, we conducted human evaluations for both the Zero-shot Image Classification and Dense Captioning tasks. We enlisted three human judges sourced from Amazon Mechanical Turk, comprising two males and one female. The results for each task were calculated by averaging the scores provided by the three judges and are presented in Table 5 and Table 6, respectively.

Examining the data in Tables 5 and 6, it becomes evident that GPT-4 has achieved performance levels closely aligned with human capabilities in the Zero-shot Image Classification task. However, it’s important to note that in the Dense Captioning task, the results for both models fall notably below human performance levels. This suggests that there is substantial room for improvement in this particular domain.

5.4 Limitation

Although our proposed dataset addresses the shortcomings of current datasets, certain limitations remain that future research can potentially address.

  • The dataset size has the potential to be expanded further. Although we initially compiled a substantial list of entities, our fidelity control process led to the removal of over half of the entity names due to insufficient images. To address this issue, future endeavors could explore additional resources beyond the Flickr database we utilized, with the aim of augmenting the dataset.

  • Achieving data balance remains a challenge. Despite our efforts to create a diverse dataset, imbalances between different categories may persist. Future efforts could focus on balancing entities within each category while expanding the dataset. However, it’s important to note that certain categories, like species of mammals, may inherently have limited entities, while others, such as celebrity names, could be significantly larger. This inherent nature might lead to persistent imbalances in the enlarged dataset.

  • Insufficient baseline options, particularly in the context of dense captioning, pose a challenge. Currently, only two baselines with publicly available weights can be incorporated into this benchmark. It is anticipated that future research endeavors could expand the available baseline options as new work emerges, providing a more comprehensive selection for evaluation.

6 Conclusion

In this study, we delved into exploring the open-domain recognition capabilities of pretrained multimodal models. To facilitate this exploration, we introduced Entity6K, a substantial open-domain dataset tailored for real-world entity recognition. Comprising 5,700 diverse real-world entities within 26 distinct categories, this dataset is versatile and applicable to a range of tasks. We rigorously evaluated model performance across four tasks: image captioning, object detection, zero-shot image classification, and dense captioning. Through these evaluations, our aim is to provide a valuable evaluation resource for assessing models’ proficiency in recognizing open-domain real-world entities.

Data availability statement

In this paper, we introduced Entity6K, a large open-domain evaluation dataset for real-world entity recognition. Entity6K contains 5,700 real-world entities with 26 main categories, where each entity is associated with five human-verified images and human annotations/captions. Our dataset will be made publicly available soon.