A Closer Look at Weakly-Supervised Audio-Visual Source Localization

AI-generated keywords: Weakly-supervised audio-visual source localization, Ground-truth annotations, Co-occurrence of audio and visual signals, Negative samples, Evaluation protocols

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors Shentong Mo and Pedro Morgado focus on predicting the location of visual sound sources in videos.
  • Traditional ground-truth annotation methods for sounding objects are costly, leading to the development of weakly-supervised localization methods.
  • Popular evaluation protocols have two major flaws: they allow a fully annotated dataset to be used for early stopping, which increases the annotation effort, and their metrics assume a sound source is always present.
  • The authors extend the test sets of two popular benchmarks, Flickr SoundNet and VGG-Sound Sources, to include negative samples (samples with no visible sound source).
  • Performance is measured with metrics that balance localization accuracy and recall, giving a more comprehensive evaluation of prior methods.
  • Most prior methods cannot identify negatives and overfit significantly, relying heavily on early stopping for their best results.
  • Mo and Morgado present a novel approach using extreme visual dropout and momentum encoders, achieving state-of-the-art performance on benchmarks.
  • The authors release their code and pre-trained models on GitHub (https://github.com/stoneMo/SLAVC).
  • This study emphasizes the importance of refining evaluation protocols in weakly-supervised audio-visual source localization research.

Authors: Shentong Mo, Pedro Morgado

Abstract: Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding objects can be costly, a plethora of weakly-supervised localization methods that can learn from datasets with no bounding-box annotations have been proposed in recent years, by leveraging the natural co-occurrence of audio and visual signals. Despite significant interest, popular evaluation protocols have two major flaws. First, they allow for the use of a fully annotated dataset to perform early stopping, thus significantly increasing the annotation effort required for training. Second, current evaluation metrics assume the presence of sound sources at all times. This is of course an unrealistic assumption, and thus better metrics are necessary to capture the model's performance on (negative) samples with no visible sound sources. To accomplish this, we extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples, and measure performance using metrics that balance localization accuracy and recall. Using the new protocol, we conducted an extensive evaluation of prior methods, and found that most prior works are not capable of identifying negatives and suffer from significant overfitting problems (rely heavily on early stopping for best results). We also propose a new approach for visual sound source localization that addresses both these problems. In particular, we found that, through extreme visual dropout and the use of momentum encoders, the proposed approach combats overfitting effectively, and establishes a new state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code and pre-trained models are available at https://github.com/stoneMo/SLAVC.
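
As a rough illustration of what a metric that balances localization accuracy and recall could look like once negatives are part of the test set, here is a minimal Python sketch. It is not the paper's metric definition, since only the abstract is available here. The `Sample` class, the per-sample confidence score, and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sample:
    """One test sample (hypothetical structure, not from the paper)."""
    confidence: float      # model's score that a visible sound source exists
    iou: Optional[float]   # IoU of the predicted region with ground truth;
                           # None marks a negative sample (no visible source)


def precision_recall_f1(
    samples: List[Sample],
    conf_threshold: float = 0.5,   # assumed decision threshold
    iou_threshold: float = 0.5,    # assumed localization-success threshold
) -> Tuple[float, float, float]:
    """Score localization on a test set that mixes positives and negatives."""
    # The model "claims" a source whenever its confidence clears the threshold.
    detected = [s for s in samples if s.confidence >= conf_threshold]
    # A true positive is a claimed source that exists and is localized well.
    true_pos = sum(
        1 for s in detected if s.iou is not None and s.iou >= iou_threshold
    )
    num_positives = sum(1 for s in samples if s.iou is not None)

    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / num_positives if num_positives else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three positives (one localized well, one badly, one missed) and two negatives
# (one wrongly flagged as containing a source).
example = [
    Sample(confidence=0.9, iou=0.8),
    Sample(confidence=0.8, iou=0.2),
    Sample(confidence=0.2, iou=0.9),
    Sample(confidence=0.7, iou=None),
    Sample(confidence=0.1, iou=None),
]
precision, recall, f1 = precision_recall_f1(example)
```

Under a scheme like this, a model that always claims a source is penalized in precision by every negative it flags, which is the failure mode that metrics assuming an ever-present source cannot reveal.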

Submitted to arXiv on 30 Aug. 2022


Results of the summarizing process for the arXiv paper: 2209.09634v1

This paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than the full article.

In "A Closer Look at Weakly-Supervised Audio-Visual Source Localization," Shentong Mo and Pedro Morgado study the task of predicting the location of visual sound sources in videos. Because collecting ground-truth annotations of sounding objects is costly, many weakly-supervised localization methods learn from datasets without bounding-box annotations by leveraging the natural co-occurrence of audio and visual signals. The authors identify two flaws in popular evaluation protocols: they allow a fully annotated dataset to be used for early stopping, which increases the annotation effort required for training, and their metrics assume that a sound source is present at all times.

To address these issues, the authors extend the test sets of Flickr SoundNet and VGG-Sound Sources with negative samples, that is, samples with no visible sound source, and measure performance with metrics that balance localization accuracy and recall. Under this protocol, they find that most prior methods cannot identify negatives and overfit significantly, relying heavily on early stopping for their best results. They also propose a new approach that uses extreme visual dropout and momentum encoders to combat overfitting, and they report state-of-the-art performance on both benchmarks. Code and pre-trained models are available on GitHub (https://github.com/stoneMo/SLAVC). The study underlines the importance of sound evaluation protocols in weakly-supervised audio-visual source localization research.
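
For readers unfamiliar with momentum encoders, the general idea is that a second "target" copy of the network is never trained directly and instead tracks an exponential moving average of the trained "online" weights. The Python sketch below shows only that generic update rule. It is not taken from the authors' code, and the parameter layout, the function name `momentum_update`, and the momentum value are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def momentum_update(target: Params, online: Params, m: float = 0.999) -> None:
    """Move the target weights toward the online weights in place.

    Each target weight becomes m * target + (1 - m) * online, so a momentum
    m close to 1 makes the target encoder change slowly and smoothly.
    """
    for name, online_values in online.items():
        target_values = target[name]
        for i, w in enumerate(online_values):
            target_values[i] = m * target_values[i] + (1.0 - m) * w


# Toy example with a single two-weight parameter.
target_params = {"w": [0.0, 1.0]}
online_params = {"w": [1.0, 1.0]}
momentum_update(target_params, online_params, m=0.9)
```

Because the target changes slowly, it gives more stable outputs than the online network, which is one common reason such encoders are used to reduce overfitting.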
Created on 28 Nov. 2025
