ITEm: Unsupervised Image-Text Embedding Learning for eCommerce

AI-generated keywords: eCommerce, product embedding, multimodal learning, unsupervised learning, image-text dataset

AI-generated Key Points

  • Product embedding is crucial in eCommerce applications
  • Integration of multiple modalities in learning product embeddings shows advancements over single-modality embeddings
  • Challenges exist in effectively incorporating all modalities without neglecting valuable information
  • Image-text embedding model (ITEm) developed at eBay enhances attention to image and text modalities
  • ITEm model extends BERT by learning embeddings from both text and images without prior knowledge of regions of interest
  • Evaluation of the pre-trained ITEm model demonstrates substantial improvements over baseline models in tasks like searching for similar products and predicting categories
  • Importance of both product images and titles in distinguishing products highlighted, with a focus on avoiding biased embeddings dominated by a single modality
  • ITEm model aims to learn fine-grained embeddings from both images and titles in an unsupervised manner, so that the more informative modality does not dominate the embedding
  • Study proposes a method to generate fine-grained representations from product images and titles for eCommerce applications
  • Evaluation involved collecting a large-scale image-text product dataset annotated for similar-product search and classification, and comparing ITEm against state-of-the-art unimodal and multimodal models
  • Exploration of learning image-title embeddings without prior knowledge of regions of interest expands applicability beyond eCommerce datasets
  • ITEm proves more efficient than traditional two-stream models by adopting a single-stream model approach with one encoder for vision and text inputs

Authors: Baohao Liao, Michael Kozielski, Sanjika Hewavitharana, Jiangbo Yuan, Shahram Khadivi, Tomer Lancewicki

License: CC BY 4.0

Abstract: Product embedding serves as a cornerstone for a wide range of applications in eCommerce. The product embedding learned from multiple modalities shows significant improvement over that from a single modality, since different modalities provide complementary information. However, some modalities are more informatively dominant than others. How to teach a model to learn embedding from different modalities without neglecting information from the less dominant modality is challenging. We present an image-text embedding model (ITEm), an unsupervised learning method that is designed to better attend to image and text modalities. We extend BERT by (1) learning an embedding from text and image without knowing the regions of interest; (2) training a global representation to predict masked words and to construct masked image patches without their individual representations. We evaluate the pre-trained ITEm on two tasks: the search for extremely similar products and the prediction of product categories, showing substantial gains compared to strong baseline models.
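The input side of the approach described in the abstract (image patches taken without region-of-interest detection, masked words, masked patches, one shared sequence) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the patch size, the 15% mask ratio, the use of `None` as a masked-patch placeholder, and the BERT-style `[CLS]`/`[SEP]` layout. The encoder and the prediction heads that operate on the global representation are omitted.

```python
import random


def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch
    blocks, each flattened to a list. No region-of-interest detector is used."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - height % patch, patch):
        for left in range(0, width - width % patch, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            patches.append(block)
    return patches


def mask_items(items, ratio, mask_value, rng):
    """Replace a random subset of items with mask_value.
    Returns the masked list and a dict {position: original item}."""
    if not items:
        return [], {}
    n_mask = min(len(items), max(1, round(len(items) * ratio)))
    chosen = set(rng.sample(range(len(items)), n_mask))
    masked = [mask_value if i in chosen else item
              for i, item in enumerate(items)]
    targets = {i: items[i] for i in chosen}
    return masked, targets


def build_single_stream_input(title_tokens, image, patch=2, ratio=0.15, seed=0):
    """Build one sequence holding both modalities (hypothetical layout):
    [CLS] + masked title tokens + [SEP] + masked image patches."""
    rng = random.Random(seed)
    patches = split_into_patches(image, patch)
    masked_tokens, token_targets = mask_items(title_tokens, ratio, "[MASK]", rng)
    masked_patches, patch_targets = mask_items(patches, ratio, None, rng)
    sequence = ["[CLS]"] + masked_tokens + ["[SEP]"] + masked_patches
    return sequence, token_targets, patch_targets
```

In a model of this kind, the targets returned above would be what the training objective tries to recover: the original words at masked text positions and the original pixel values at masked patch positions.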

Submitted to arXiv on 22 Oct. 2023

Results of the summarizing process for the arXiv paper: 2311.02084v2

In eCommerce, product embedding plays a crucial role in various applications. Learning product embeddings from multiple modalities has shown significant improvement over single-modality embeddings, because different modalities offer complementary information. However, some modalities may be more informative than others, which makes it challenging to incorporate all modalities without neglecting valuable information from the less dominant one. During a research tenure at eBay, an image-text embedding model (ITEm) was developed as an unsupervised learning method that attends better to both the image and text modalities. The ITEm model extends BERT by learning embeddings from both text and images without prior knowledge of regions of interest. It trains a global representation to predict masked words and to construct masked image patches without their individual representations. Evaluation of the pre-trained ITEm model on two tasks, searching for extremely similar products and predicting product categories, demonstrated substantial improvements over strong baseline models. The study showcases examples where both product images and titles contribute significantly to distinguishing products. While titles often contain the dominant information for most products, neglecting images can lead to biased embeddings dominated by a single modality. To address this issue, the ITEm model aims to learn fine-grained embeddings from both images and titles in an unsupervised manner. Key contributions of the study include a method to generate fine-grained representations from product images and titles for eCommerce applications, which mitigates the over-dominance of the more informative modality.
The evaluation involved collecting a large-scale image-text product dataset annotated for tasks like searching for extremely similar products and classification against state-of-the-art unimodal and multimodal models. Furthermore, the study explores learning image-title embeddings without prior knowledge of regions of interest, expanding the applicability of the proposed method beyond eCommerce datasets. By adopting a single-stream model approach with one encoder for both vision and text inputs, ITEm proves to be more efficient than traditional two-stream models that require sequential processing of different modalities' embeddings. Overall, this research contributes valuable insights into effectively leveraging multiple modalities in learning product embeddings for enhanced performance in eCommerce applications.
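As an illustration of how product embeddings can serve the similar-product search task mentioned above, here is a minimal sketch of nearest-neighbor retrieval. Cosine similarity, the function names, and the toy two-dimensional vectors are assumptions chosen for illustration; the summary does not state which similarity measure or retrieval procedure the authors used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, top_k=3):
    """Rank catalog products (dict: product id -> embedding) by similarity to the query embedding."""
    scored = [(pid, cosine_similarity(query, emb)) for pid, emb in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; a real system would use vectors produced by the trained encoder.
catalog = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = most_similar([1.0, 0.1], catalog, top_k=2)
```

A single embedding per product keeps this lookup independent of how the image and title were combined, which is where an embedding that reflects both modalities would matter.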
Created on 08 Jun. 2024

