Tracking Anything with Decoupled Video Segmentation

AI-generated keywords: Video Segmentation, DEVA, Bi-directional Propagation, End-to-end Approaches, Annotation Challenges

AI-generated Key Points

  • Challenges of annotating training data for video segmentation hinder the extension of end-to-end algorithms to new tasks
  • Proposed solution: DEVA, a decoupled video segmentation approach
  • DEVA consists of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation
  • Bi-directional propagation allows for (semi-)online fusion of segmentation hypotheses from different frames, resulting in coherent segmentation
  • DEVA compares favorably to end-to-end approaches in data-scarce tasks such as large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation
  • Comprehensive overview of related works in end-to-end video segmentation provided
  • Bi-directional temporal propagation model denoises image segmentations and merges them with temporally propagated segmentations seamlessly
  • Effectiveness of the model demonstrated through empirical evaluations on datasets including YouTube-VOS and Cityscapes-VPS
  • DEVA framework offers a promising solution to challenges associated with training data annotation for video segmentation by leveraging external data and incorporating existing universal image segmentation models

Authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee

Accepted to ICCV 2023. Project page: https://hkchengrex.github.io/Tracking-Anything-with-DEVA
License: CC BY-NC-SA 4.0

Abstract: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA

Submitted to arXiv on 07 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.03903v1

The paper discusses how the cost of annotating training data for video segmentation hinders the extension of end-to-end algorithms to new video segmentation tasks, particularly in large-vocabulary settings. To address this issue, the authors propose a decoupled video segmentation approach called DEVA. DEVA consists of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. This design requires only an image-level model for the target task, which is cheaper to train, and a universal temporal propagation model that is trained once and applied across tasks. To combine these two modules effectively, the authors use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames, resulting in a coherent segmentation.

The authors show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks: large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. In addition to introducing DEVA, they provide an overview of related work in end-to-end video segmentation, highlighting recent advances and discussing how their approach differs from existing methods. They also present their bi-directional temporal propagation model in detail, explaining how it denoises image segmentations and merges them with temporally propagated segmentations. The effectiveness of this model is demonstrated through empirical evaluations on datasets including YouTube-VOS [69] and Cityscapes-VPS [27].

Overall, the proposed DEVA framework offers a promising solution to the challenges of training data annotation for video segmentation: it leverages external data and incorporates existing universal image segmentation models while achieving favorable results in several important video segmentation tasks.
Created on 08 Dec. 2023
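
The decoupled design summarized above can be illustrated with a small toy sketch. This is not the authors' implementation: DEVA uses a learned propagation network, and the summary does not specify its merging rule. Here the image-level model and the propagation model are passed in as plain functions, and fusion is approximated by greedy IoU matching. The names `iou`, `merge`, `run`, `segment_image`, `propagate`, `detect_every`, and the threshold `0.5` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]          # a segment stored as a set of (row, col) pixels
Tracks = Dict[int, Mask]   # track id -> current mask


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def merge(tracks: Tracks, detections: List[Mask], thresh: float = 0.5) -> Tracks:
    """Greedy IoU matching (an assumed stand-in for the paper's fusion step).

    A matched detection refreshes its track's mask, an unmatched detection
    starts a new track, and unmatched tracks are kept as propagated.
    """
    merged = dict(tracks)
    next_id = max(tracks, default=-1) + 1
    used: Set[int] = set()
    for det in detections:
        candidates = [(iou(mask, det), tid)
                      for tid, mask in tracks.items() if tid not in used]
        best_iou, best_id = max(candidates, default=(0.0, -1))
        if best_iou >= thresh:
            merged[best_id] = det
            used.add(best_id)
        else:
            merged[next_id] = det
            next_id += 1
    return merged


def run(frames: list,
        segment_image: Callable[[object], List[Mask]],
        propagate: Callable[[Tracks, object], Tracks],
        detect_every: int = 2) -> List[Tracks]:
    """Propagate tracks on every frame; query the image model every few frames."""
    tracks: Tracks = {}
    history: List[Tracks] = []
    for t, frame in enumerate(frames):
        tracks = propagate(tracks, frame)        # task-agnostic module
        if t % detect_every == 0:
            tracks = merge(tracks, segment_image(frame))  # task-specific module
        history.append(tracks)
    return history


def square(r: int, c: int, size: int = 3) -> Mask:
    """Toy object: a size x size square with top-left corner (r, c)."""
    return {(r + i, c + j) for i in range(size) for j in range(size)}


# Toy "video": each frame is simply the list of masks an image model would return.
frames = [
    [square(0, 0)],                # one object
    [square(0, 1)],                # it shifts right (image model not queried here)
    [square(0, 1), square(5, 5)],  # a second object appears
]
# Identity functions stand in for the real image and propagation models.
history = run(frames, segment_image=lambda f: f, propagate=lambda tr, f: dict(tr))
print([sorted(tr) for tr in history])
```

In this toy run the first object keeps its track id when it is re-detected, and the second object receives a new id. Only `segment_image` would need to change for a new task, which is the point of the decoupling; the propagation and merging code stays the same.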
