Enhancing Sound Texture in CNN-Based Acoustic Scene Classification

AI-generated keywords: Acoustic Scene Classification, CNN, CAM, DoG, Sobel

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Acoustic scene classification involves identifying the scene from which an audio signal is recorded
  • Convolutional neural network (CNN) models have been successful for this task
  • How a CNN perceives audio scenes is far less understood than in image recognition research
  • Class Activation Mapping (CAM) is used to analyze how log-Mel features of different acoustic scenes are learned in a CNN classifier
  • High-energy time-frequency components do not necessarily correspond to strong activation on CAM; instead, the background sound texture is well learned
  • Difference of Gaussian (DoG) and Sobel operator applied to process log-Mel features, enhancing edge information in time-frequency image representation
  • Evaluation on DCASE 2017 ASC challenge dataset shows that using edge-enhanced log-Mel images as input features significantly improves classification performance
  • Study provides insights into how acoustic scenes are perceived in CNNs and proposes a method to enhance sound texture representation for improved accuracy
  • Findings may benefit applications such as surveillance systems or environmental monitoring
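
The CAM analysis mentioned in the key points can be illustrated with a minimal sketch. It assumes the standard CAM setting, where the last convolutional layer is followed by global average pooling and a single linear layer; whether the paper's classifier has exactly this layout is not stated in the metadata. The function name `class_activation_map`, the array sizes, and the random inputs are all hypothetical.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of the last conv layer's feature maps for one class.

    feature_maps: (K, H, W) activations, here H = mel bands, W = time frames.
    fc_weights:   (C, K) weights of the linear layer after global average pooling.
    """
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

rng = np.random.default_rng(0)
feature_maps = rng.random((8, 5, 12))      # made-up sizes: 8 maps, 5 bands, 12 frames
fc_weights = rng.standard_normal((3, 8))   # 3 hypothetical scene classes

cam = class_activation_map(feature_maps, fc_weights, class_idx=1)

# With global average pooling, the class score (ignoring any bias term)
# equals the spatial mean of the CAM.
pooled = feature_maps.mean(axis=(1, 2))
logit = fc_weights[1] @ pooled
print(cam.shape, np.isclose(cam.mean(), logit))
```

Overlaying such a map on the log-Mel image shows which time-frequency regions contribute most to a class score, which is the kind of comparison the key points describe.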

Authors: Yuzhong Wu, Tan Lee

Submitted to ICASSP 2019

Abstract: Acoustic scene classification is the task of identifying the scene from which the audio signal is recorded. Convolutional neural network (CNN) models are widely adopted with proven successes in acoustic scene classification. However, there is little insight into how an audio scene is perceived in a CNN, unlike what has been demonstrated in image recognition research. In the present study, Class Activation Mapping (CAM) is utilized to analyze how the log-magnitude Mel-scale filter-bank (log-Mel) features of different acoustic scenes are learned in a CNN classifier. It is noted that distinct high-energy time-frequency components of audio signals generally do not correspond to strong activation on CAM, while the background sound texture is well learned in the CNN. In order to make the sound texture more salient, we propose to apply the Difference of Gaussian (DoG) and Sobel operator to process the log-Mel features and enhance edge information of the time-frequency image. Experimental results on the DCASE 2017 ASC challenge show that using edge-enhanced log-Mel images as input features of the CNN significantly improves the performance of audio scene classification.
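
The two edge-enhancement operators named in the abstract can be sketched in plain NumPy as follows. This is an illustration only: the sigma values, the edge padding, and the way the outputs would be fed to the CNN are assumptions, since the metadata does not give the paper's settings. The helper names (`filter_axis`, `difference_of_gaussians`, `sobel_magnitude`) are hypothetical, and the input is a synthetic step image rather than a real log-Mel spectrogram.

```python
import numpy as np

def gaussian_kernel(sigma):
    # Normalised 1-D Gaussian truncated at about 3 sigma.
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def filter_axis(img, kernel, axis):
    # 1-D convolution along one axis; edge padding keeps the shape unchanged.
    radius = len(kernel) // 2
    pad = [(0, 0)] * img.ndim
    pad[axis] = (radius, radius)
    padded = np.pad(img, pad, mode="edge")
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def gaussian_blur(img, sigma):
    k = gaussian_kernel(sigma)
    return filter_axis(filter_axis(img, k, 0), k, 1)

def difference_of_gaussians(img, sigma_small=1.0, sigma_large=2.0):
    # Band-pass response: fine blur minus coarse blur (sigmas are assumed values).
    return gaussian_blur(img, sigma_small) - gaussian_blur(img, sigma_large)

def sobel_magnitude(img):
    smooth = np.array([1.0, 2.0, 1.0])
    deriv = np.array([1.0, 0.0, -1.0])
    g_time = filter_axis(filter_axis(img, smooth, 0), deriv, 1)  # change along time
    g_freq = filter_axis(filter_axis(img, deriv, 0), smooth, 1)  # change along frequency
    return np.hypot(g_time, g_freq)

# Synthetic stand-in for a log-Mel image: 40 bands x 60 frames with an onset at frame 30.
image = np.zeros((40, 60))
image[:, 30:] = 1.0

dog = difference_of_gaussians(image)
edges = sobel_magnitude(image)
print(int(edges.mean(axis=0).argmax()))  # frame index next to the onset
```

Both operators respond only near the onset and return zero on flat regions, which is why they emphasize texture and edges over absolute energy. In practice, `scipy.ndimage.gaussian_filter` and `scipy.ndimage.sobel` provide equivalent filters.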

Submitted to arXiv on 06 Jan. 2019

Results of the summarizing process for the arXiv paper: 1901.01502v1

The paper "Enhancing Sound Texture in CNN-Based Acoustic Scene Classification" by Yuzhong Wu and Tan Lee addresses acoustic scene classification, the task of identifying the scene in which an audio signal was recorded. Convolutional neural network (CNN) models are widely and successfully used for this task, but how a CNN perceives an audio scene is far less understood than in image recognition research.

To address this gap, the authors use Class Activation Mapping (CAM) to analyze how log-magnitude Mel-scale filter-bank (log-Mel) features of different acoustic scenes are learned in a CNN classifier. They find that distinct high-energy time-frequency components of audio signals generally do not correspond to strong activation on CAM; instead, the background sound texture is well learned by the CNN.

To make the sound texture more salient, the authors propose processing the log-Mel features with the Difference of Gaussian (DoG) and the Sobel operator, which enhances edge information in the time-frequency image. On the DCASE 2017 ASC challenge dataset, using edge-enhanced log-Mel images as CNN input features significantly improves audio scene classification performance.

Overall, the study offers insight into how acoustic scenes are perceived in CNNs and proposes a method to enhance sound texture representation for improved classification accuracy. The findings may benefit applications such as surveillance systems or environmental monitoring.
Created on 05 Sep. 2023
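
For readers unfamiliar with the log-Mel features that the summary refers to, the following is a minimal NumPy sketch of one common way to compute them. The frame length, hop size, number of Mel bands, and the Mel-scale formula are assumptions for illustration and are not taken from the paper; the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose centres are equally spaced on the Mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(signal, sr, n_fft=1024, hop=512, n_mels=40, eps=1e-10):
    # Frame the signal, apply a Hann window, and take the magnitude spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    mel = magnitude @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + eps).T  # shape: (n_mels, n_frames)

# Example input: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
features = log_mel(tone, sr)
print(features.shape)  # (n_mels, n_frames)
```

The resulting array can be viewed as a time-frequency image, which is the representation that CAM analysis and edge enhancement operate on.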
