Improving Contextual Congruence Across Modalities for Effective Multimodal Marketing using Knowledge-infused Learning

AI-generated keywords: Multimodal Marketing Campaigns, Crowdfunding Platforms, Visual Language Models, Knowledge Graph Integration, Cross-Modal Semantic Relationships

AI-generated Key Points

  • Study focuses on predicting success of multimodal marketing campaigns on crowdfunding platforms
  • Integration of common sense knowledge into Visual Language Models (VLMs)
  • Dataset includes pairs of images and text with binary labels for campaign success
  • Goal is to determine likelihood of campaign reaching funding goal within specific timeline
  • Framework employs modular and flexible text and image encoders
  • Pretrained text encoders (BERT, RoBERTa) and image encoders (ViT, ResNet) are jointly fine-tuned using bidirectional transformers
  • Knowledge retrieval involves generating text captions for images using multimodal LVMs, with BLIP outperforming other models
  • Clustering analysis shows impact of external knowledge on semantic relationships between modalities
  • t-SNE visualizations demonstrate denser clusters with closer centroids when external knowledge included, reducing semantic distance between modalities
  • Semantic similarity between text and image modalities increases by approximately 9.9% with external knowledge inclusion
  • Research aims to improve prediction accuracy and advance marketing theory through early detection of persuasive multi-modal campaigns on crowdfunding platforms
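
The similarity comparison mentioned in the key points can be illustrated with a minimal sketch. It assumes each campaign's text and image caption have already been embedded as plain lists of floats, and that "semantic similarity" means the mean cosine similarity over paired embeddings. The summary does not say whether the 9.9% figure is a relative or absolute change, so the relative form used here is an assumption. All function names are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pair_similarity(text_vecs, image_vecs):
    """Average cosine similarity over aligned (text, image) embedding pairs."""
    if not text_vecs or len(text_vecs) != len(image_vecs):
        raise ValueError("need two non-empty lists of equal length")
    sims = [cosine_similarity(t, i) for t, i in zip(text_vecs, image_vecs)]
    return sum(sims) / len(sims)


def relative_change_percent(baseline, updated):
    """Percentage change of `updated` relative to `baseline` (assumed metric)."""
    return 100.0 * (updated - baseline) / baseline
```

In use, one would call `mean_pair_similarity` once on embeddings produced without external knowledge and once on embeddings produced with it, then pass the two means to `relative_change_percent`.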

Authors: Trilok Padhi, Ugur Kursuncu, Yaman Kumar, Valerie L. Shalin, Lane Peterson Fronczek

License: CC BY-NC-SA 4.0

Abstract: The prevalence of smart devices with the ability to capture moments in multiple modalities has enabled users to experience multimodal information online. However, Large Language Models (LLMs) and Large Vision Models (LVMs) are still limited in capturing holistic meaning with cross-modal semantic relationships. Without explicit, common sense knowledge (e.g., as a knowledge graph), Visual Language Models (VLMs) only learn implicit representations by capturing high-level patterns in vast corpora, missing essential contextual cross-modal cues. In this work, we design a framework to couple explicit commonsense knowledge in the form of knowledge graphs with large VLMs to improve the performance of a downstream task, predicting the effectiveness of multi-modal marketing campaigns. While the marketing application provides a compelling metric for assessing our methods, our approach enables the early detection of likely persuasive multi-modal campaigns and the assessment and augmentation of marketing theory.

Submitted to arXiv on 06 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.03607v1

This study addresses the challenge of predicting the success of multimodal marketing campaigns on crowdfunding platforms by integrating explicit common sense knowledge into large Visual Language Models (VLMs). The dataset consists of image-text pairs with binary labels indicating campaign success, and the goal is to determine the likelihood of a campaign reaching its funding goal within a specified timeline.

To enhance VLM performance, the framework uses modular, interchangeable text and image encoders. Pretrained text encoders (e.g., BERT, RoBERTa) and image encoders (e.g., ViT, ResNet) are jointly fine-tuned using bidirectional transformers. The vision encoders tested, such as ResNet-152 and Vision Transformers, produce an output vector for each image. Knowledge retrieval involves generating text captions for images using multimodal LVMs, with BLIP performing better than other models.

A clustering analysis over text and image captions shows how external knowledge affects the semantic relationships between modalities. t-SNE visualizations show that including external knowledge yields denser clusters with closer centroids, indicating reduced semantic distance between modalities. Semantic similarity between the text and image modalities increases by approximately 9.9% when external knowledge is included.

Beyond improving prediction accuracy, the research aims to advance marketing theory through early detection of persuasive multi-modal campaigns and assessment of marketing strategies on crowdfunding platforms.
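
The "denser clusters with closer centroids" observation can be made concrete with a small sketch. It assumes each modality's embeddings are available as lists of equal-length float vectors, and it measures density as the mean distance of points to their own centroid; both choices are assumptions for illustration, not the paper's stated procedure. Because t-SNE does not reliably preserve global distances, a check like this is better run on the original embeddings than on the 2-D projection. All names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_spread(points):
    """Mean distance from each point to the cluster centroid (lower = denser)."""
    c = centroid(points)
    return sum(euclidean(p, c) for p in points) / len(points)


def centroid_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters, e.g. text vs. image."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))
```

Comparing `centroid_distance(text_vecs, image_vecs)` and `mean_spread` for each modality, with and without external knowledge, gives numeric counterparts to what the visualizations suggest.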
Created on 18 Feb. 2024
