TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

ECCV 2022

Medhini Narasimhan    Arsha Nagrani    Chen Sun    Michael Rubinstein   
Trevor Darrell    Anna Rohrbach    Cordelia Schmid   
UC Berkeley    Google Research    Brown University   
[Paper]
[Bibtex]
[GitHub]

Summarizing Instructional Videos

Too Long; Didn't Watch? (TL;DW?) We introduce an approach for creating short visual summaries comprising important steps that are most relevant to the task, as well as salient in the video, i.e., referenced in the speech. For example, given a long video on "How to make a veggie burger" as shown above, the summary comprises key steps such as fry ingredients, blend beans, and fry patty.

Abstract

YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.
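As a rough illustration of the pseudo-summary heuristic described above, the sketch below scores each segment by how often its step recurs in other videos of the same task (Task Relevance) and by whether the step is also mentioned in the transcript (Cross-Modal Saliency), then keeps the top-scoring segments. The Segment structure, the string-matching saliency proxy, and the weights and thresholds are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A video segment with an assigned step label and its aligned ASR text."""
    step_id: str      # step/cluster identifier (assumed to come from a step-detection stage)
    transcript: str   # ASR text aligned with this segment

def pseudo_summary(video_segments, task_videos, rel_weight=0.5, sal_weight=0.5, keep_top=0.3):
    """Score segments by Task Relevance and Cross-Modal Saliency, keep the top fraction.
    All weights and the keep_top fraction are illustrative assumptions."""
    # Task Relevance: fraction of other videos of the same task that contain this step.
    def relevance(seg):
        hits = sum(any(s.step_id == seg.step_id for s in vid) for vid in task_videos)
        return hits / max(len(task_videos), 1)

    # Cross-Modal Saliency: crude proxy - the step label is mentioned in the segment's speech.
    def saliency(seg):
        return 1.0 if seg.step_id.lower() in seg.transcript.lower() else 0.0

    scored = [(rel_weight * relevance(s) + sal_weight * saliency(s), s) for s in video_segments]
    scored.sort(key=lambda x: x[0], reverse=True)
    k = max(1, int(len(scored) * keep_top))
    return [s for _, s in scored[:k]]
```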


Datasets

Pseudo Summaries.
WikiHow Summaries.

Method

Overview of Instructional Video Summarization. We first obtain pseudo summaries for a large collection of videos using our weakly supervised algorithm. Next, using the pseudo summaries as weak supervision, we train our Instructional Video Summarizer (IV-Sum). It takes an input video along with the corresponding ASR transcript and learns to assign importance scores to each segment in the video. The final summary is a compilation of the high-scoring video segments.
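To make the segment-scoring step concrete, here is a minimal PyTorch sketch of a transformer-based scorer operating on per-segment features that are assumed to already fuse video and ASR information. The SegmentScorer class, its dimensions, and the 0.5 selection threshold are hypothetical stand-ins; the actual IV-Sum architecture and training details are given in the paper.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Toy stand-in for IV-Sum's segment scoring transformer: contextualizes
    per-segment features and predicts an importance score for each segment.
    Dimensions and depth are illustrative assumptions."""
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, segment_feats):
        # segment_feats: (batch, num_segments, feat_dim) fused video + ASR features per segment
        context = self.encoder(segment_feats)           # context-aware segment representations
        return self.score_head(context).squeeze(-1)     # (batch, num_segments) importance scores

# Usage sketch: keep segments whose score exceeds a (hypothetical) threshold.
model = SegmentScorer()
feats = torch.randn(1, 12, 512)                         # 12 segments from one instructional video
scores = model(feats).sigmoid()
summary_idx = (scores[0] > 0.5).nonzero(as_tuple=True)[0]  # indices of segments kept in the summary
```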


Qualitative Results



Paper

M. Narasimhan, A. Nagrani, C. Sun, M. Rubinstein, T. Darrell, A. Rohrbach, C. Schmid.
TL;DW? Summarizing Instructional Videos with
Task Relevance & Cross-Modal Saliency

ECCV, 2022.

[Paper] | [Bibtex]



Acknowledgements

We thank Daniel Fried and Bryan Seybold for valuable discussions and feedback on the draft. This work was supported in part by DoD including DARPA's LwLL, PTG and/or SemaFor programs, as well as BAIR's industrial alliance programs.

Template cloned from here!