ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map
Abstract
Multi-modal embeddings form the foundation of vision-language models; CLIP embeddings, for instance, are the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons of CLIP embeddings with existing dimensionality reduction (\eg, t-SNE and MDS) and data fusion (\eg, data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.
keywords:
Multi-modal embeddings, dimensionality reduction, data fusion, interactive alignment
\vgtccategoryResearch
\vgtcpapertypeApplications
\authorfooter
Y. Ye, S. Xiao, X. Zeng, and W. Zeng are with the Hong Kong University of Science and Technology (Guangzhou). E-mail: {yyebd@connect., sxiao713@connect., xzeng159@connect., weizeng@}hkust-gz.edu.cn.
Y. Ye and W. Zeng are also with the Hong Kong University of Science and Technology.
Wei Zeng is the corresponding author.
\teaser
Visual probing of multi-modal CLIP embeddings for text-to-image generation. (a) For the prompt "waterlily pond by Monet", users first discover misalignment in the pre-trained model, in the form of concept entanglement between "Monet" and "bridge", using our Modal Fusion Map projection and concept axis view. (b) Data augmentation based on weighted embedding generation can be performed to provide an extra alignment reference set. (c) Set-set alignment interaction is performed to align the initially generated images of "waterlily pond by Monet" with the augmented images reflecting user intent. (d) The post-alignment model can generate a more diverse set of images by disentangling "Monet" and "bridge".