Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition

AI-generated keywords: STFT blocks, 3D CNNs, action recognition, feature learning, performance

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The authors propose spatio-temporal short-term Fourier transform (STFT) blocks, a new class of convolutional blocks for 3D convolutional neural networks (CNNs)
  • STFT blocks are designed to address the drawbacks of conventional 3D CNNs: high computational cost, high memory requirements, overfitting, and limited feature learning capabilities
  • An STFT block consists of non-trainable convolution layers that capture spatially and/or temporally local Fourier information using an STFT kernel at multiple low-frequency points
  • A set of trainable linear weights follows the non-trainable layers and learns channel correlations
  • Incorporating STFT blocks into 3D CNNs reduces space-time complexity while enhancing feature learning capabilities
  • Compared to state-of-the-art methods, STFT blocks generally use 3.5 to 4.5 times fewer parameters and have 1.5 to 1.8 times lower computational cost
  • Extensive experiments on seven action recognition datasets show that STFT-block-based 3D CNNs perform on par with or better than state-of-the-art methods
  • The proposed depthwise spatio-temporal STFT CNNs can serve as an alternative to the 3D convolutional layer and its variants, reducing computational complexity while remaining competitive on benchmark datasets
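
To make the idea of a non-trainable STFT kernel at low-frequency points more concrete, here is a minimal NumPy sketch of how such a fixed filter bank could be built. The function name `stft_kernel_bank`, the 3×3×3 window, and the three frequency points are illustrative assumptions; the page does not state which window size or frequencies the authors use.

```python
import numpy as np

def stft_kernel_bank(freqs, size=3):
    """Build fixed (non-trainable) local Fourier kernels.

    freqs: K frequency points, each a (t, h, w) triple in cycles per voxel.
    Returns an array of shape (2*K, size, size, size): the first K kernels
    are the real parts and the last K the imaginary parts of exp(-j*phase).
    """
    r = size // 2
    axis = np.arange(-r, r + 1)
    # Offset of every voxel in the window relative to its centre: (s, s, s, 3).
    coords = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    freqs = np.asarray(freqs, dtype=float)                               # (K, 3)
    phase = 2.0 * np.pi * np.tensordot(freqs, coords, axes=([1], [3]))   # (K, s, s, s)
    # exp(-j*phase) = cos(phase) - j*sin(phase), stored as two real kernels.
    return np.concatenate([np.cos(phase), -np.sin(phase)], axis=0)

# Hypothetical low-frequency points: one per axis (temporal, vertical, horizontal).
low_freqs = [(1 / 3, 0, 0), (0, 1 / 3, 0), (0, 0, 1 / 3)]
bank = stft_kernel_bank(low_freqs)
print(bank.shape)  # (6, 3, 3, 3)
```

Because the kernels are computed once from the frequency points and never updated, they contribute no trainable parameters; only the weights that combine their responses would be learned.
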

Authors: Sudhakar Kumawat, Manisha Verma, Yuta Nakashima, Shanmuganathan Raman

Extended version of our CVPR 2019 work

Abstract: Conventional 3D convolutional neural networks (CNNs) are computationally expensive, memory intensive, prone to overfitting, and, most importantly, limited in their feature learning capabilities. To address these issues, we propose spatio-temporal short-term Fourier transform (STFT) blocks, a new class of convolutional blocks that can serve as an alternative to the 3D convolutional layer and its variants in 3D CNNs. An STFT block consists of non-trainable convolution layers that capture spatially and/or temporally local Fourier information using an STFT kernel at multiple low-frequency points, followed by a set of trainable linear weights for learning channel correlations. The STFT blocks significantly reduce the space-time complexity in 3D CNNs. In general, they use 3.5 to 4.5 times fewer parameters and have 1.5 to 1.8 times lower computational cost than the state-of-the-art methods. Furthermore, their feature learning capabilities are significantly better than those of the conventional 3D convolutional layer and its variants. Our extensive evaluation on seven action recognition datasets, including Something-something v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51, demonstrates that STFT block-based 3D CNNs achieve on par or even better performance compared to the state-of-the-art methods.
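
The two-stage structure described in the abstract (fixed local Fourier filtering, then trainable linear weights across channels) can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `stft_block_forward`, the tensor layouts, the zero padding, and the toy one-frequency kernel bank are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def stft_block_forward(x, fixed_kernels, weights):
    """Sketch of a fixed-filter + pointwise block.

    x:             input features, shape (C, T, H, W)
    fixed_kernels: non-trainable kernels, shape (F, s, s, s)
    weights:       trainable linear weights, shape (C_out, C, F)
    Returns an array of shape (C_out, T, H, W).
    """
    s = fixed_kernels.shape[-1]
    r = s // 2
    # Zero-pad so the output keeps the input's temporal and spatial size.
    xp = np.pad(x, ((0, 0), (r, r), (r, r), (r, r)))
    windows = sliding_window_view(xp, (s, s, s), axis=(1, 2, 3))  # (C, T, H, W, s, s, s)
    # Depthwise step: every channel is filtered on its own by the same fixed kernels.
    local = np.einsum("cthwijk,fijk->cfthw", windows, fixed_kernels)
    # Pointwise step: trainable weights mix channels and frequency responses.
    return np.einsum("ocf,cfthw->othw", weights, local)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 16, 16))
# Toy fixed bank (assumption): cosine and sine at one temporal frequency.
t = np.arange(-1, 2).reshape(3, 1, 1) * np.ones((3, 3, 3))
phase = 2.0 * np.pi * t / 3.0
fixed = np.stack([np.cos(phase), -np.sin(phase)])   # (2, 3, 3, 3)
w = 0.1 * rng.standard_normal((6, 4, 2))            # the only trainable part
y = stft_block_forward(x, fixed, w)
print(y.shape)  # (6, 8, 16, 16)
```

In this sketch only `w` would receive gradient updates during training; `fixed` stays constant, which is the sense in which the convolution layers are non-trainable.
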

Submitted to arXiv on 22 Jul. 2020

Results of the summarizing process for the arXiv paper: 2007.11365v1

The authors propose spatio-temporal short-term Fourier transform (STFT) blocks to address the drawbacks of conventional 3D convolutional neural networks (CNNs): high computational cost, high memory requirements, a tendency to overfit, and limited feature learning capabilities. An STFT block is composed of non-trainable convolution layers that capture spatially and/or temporally local Fourier information using an STFT kernel at multiple low-frequency points, followed by a set of trainable linear weights for learning channel correlations. Incorporating STFT blocks into 3D CNNs significantly reduces the space-time complexity of the network while improving feature learning. Compared to state-of-the-art methods, STFT blocks generally use 3.5 to 4.5 times fewer parameters and have 1.5 to 1.8 times lower computational cost. Extensive experiments on seven action recognition datasets (Something-something v1 and v2, Jester, Diving-48, Kinetics-400, UCF 101, and HMDB 51) show that STFT-block-based 3D CNNs perform on par with or better than state-of-the-art methods. The proposed depthwise spatio-temporal STFT CNNs therefore offer a way to reduce the cost of 3D CNNs for action recognition while remaining competitive on benchmark datasets.
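
One way to see why fixed filters can lower the parameter count is a per-layer tally of trainable weights. The sketch below compares a standard 3D convolution with a block whose spatio-temporal filters are fixed and whose only trainable part is a set of linear weights over channels and filter responses. The layer sizes and the number of fixed responses per channel (`num_fixed = 8`) are hypothetical, and this is not the paper's own accounting: the 3.5 to 4.5 times figure is reported against state-of-the-art methods, not for a single layer.

```python
def conv3d_params(c_in, c_out, k=3):
    """Trainable weights of a standard 3D convolution (bias omitted)."""
    return c_in * c_out * k ** 3

def fixed_filter_block_params(c_in, c_out, num_fixed):
    """Trainable weights when the k*k*k filters are fixed.

    Only the linear weights that mix c_in * num_fixed responses into
    c_out channels are learned (bias omitted).
    """
    return c_out * c_in * num_fixed

standard = conv3d_params(64, 128)                          # 64 * 128 * 27
fixed = fixed_filter_block_params(64, 128, num_fixed=8)    # 128 * 64 * 8
print(standard, fixed, standard / fixed)  # 221184 65536 3.375
```

Under this simple tally the saving is the ratio of the kernel volume to the number of fixed responses (here 27 / 8), so it shrinks as more frequency responses are kept per channel.
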
Created on 09 Sep. 2023
