Learning and Verification of Task Structure in Instructional Videos
AI-generated Key Points
- Abundance of instructional videos available online makes it possible to learn a diverse range of multi-step task models from these videos
- VideoTaskformer is a new pre-trained video model that focuses on representing the semantics and structure of instructional videos
- Model is pre-trained using a simple yet effective objective - predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling)
- VideoTaskformer involves learning step representations globally by leveraging the entire surrounding task as context
- Using learned representations, authors can verify if an unseen video correctly executes a given task and forecast which steps are likely to be taken after a given step
- Two new benchmarks introduced for detecting mistakes in instructional videos - verifying if there is an anomalous step and ensuring that steps are executed in the correct order
- Long-term forecasting benchmark introduced where goal is to predict long-range future steps from a given step
- Method outperforms previous baselines on these tasks, which the authors propose as a way for the community to measure the quality of step representations
- VideoTaskformer evaluated on three existing benchmarks (procedural activity recognition, step classification, and step forecasting), outperforming existing baselines on each and achieving new state-of-the-art performance
- Unsupervised pre-training using neural networks with automatic speech recognition (ASR) outperforms previous unsupervised methods
- Competitive linear-probe performance reported and improved results when adding task labels
- Results from evaluating approach on activity recognition in EPIC Kitchens-100 included
- Model's performance on the step localization task in COIN reported
Authors: Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, Trevor Darrell
Abstract: Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
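The masked step modeling objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MASK` sentinel, the function names, the masking probability, and the use of plain lists in place of video features and a transformer are all assumptions made for illustration. The sketch shows only the two ideas the abstract states: steps are randomly masked out of a video, and the loss is computed on the weakly supervised label of each masked step.

```python
import math
import random

MASK = None  # hypothetical sentinel standing in for a masked-out step clip


def mask_steps(clips, weak_labels, mask_prob=0.25, rng=None):
    """Randomly hide step clips.

    Returns the masked clip sequence and a dict {position: weak label}
    holding the prediction targets for the hidden positions.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, (clip, label) in enumerate(zip(clips, weak_labels)):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = label
        else:
            masked.append(clip)
    # Assumption: force at least one masked step so the loss is defined.
    if not targets and clips:
        i = rng.randrange(len(clips))
        masked[i] = MASK
        targets[i] = weak_labels[i]
    return masked, targets


def masked_step_loss(logits_per_step, targets):
    """Mean cross-entropy over the masked positions only."""
    total = 0.0
    for pos, label in targets.items():
        logits = logits_per_step[pos]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[label]
    return total / len(targets)
```

In a real model, `logits_per_step` would come from a network that sees the unmasked clips of the whole task as context, which is what lets the representation of a step depend on the steps around it. Here it is simply passed in, so the sketch stays independent of any particular architecture.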