Abstract

We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one by stitching its frames together in a novel yet consistent order. This classic work, however, was constrained by its use of hand-designed distance metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that scales to more challenging dynamics, and to condition on other data, such as audio. We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high transition probabilities, generating diverse, temporally smooth videos with novel sequences and transitions. The model naturally extends to an audio-conditioned setting without requiring any finetuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of input videos, and can combine semantic and audio-visual cues to synthesize videos that synchronize well with an audio signal.
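
As a rough illustration of the sampling step described above, the Python sketch below turns per-frame embeddings into transition probabilities and then samples a new frame order from the most likely transitions. It is not the paper's implementation: the embeddings are made-up toy vectors, the cosine-similarity softmax is an assumed stand-in for the learned contrastive model, and the names `transition_probs`, `synthesize`, `temperature`, and `top_k` are hypothetical.

```python
import math
import random


def transition_probs(embeddings, temperature=0.1):
    """Row-wise softmax over cosine similarities between frame embeddings.

    Assumption: similarity between embeddings stands in for the learned
    transition score. Needs at least two frames and non-zero vectors.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    n = len(embeddings)
    probs = []
    for i in range(n):
        # Mask the self-transition i -> i so a frame never repeats itself.
        logits = [
            cosine(embeddings[i], embeddings[j]) / temperature if j != i else -math.inf
            for j in range(n)
        ]
        peak = max(logits)
        exps = [math.exp(value - peak) for value in logits]  # stable softmax
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def synthesize(probs, start, length, top_k=2, seed=0):
    """Sample a frame order, choosing each next frame among the top_k transitions."""
    rng = random.Random(seed)
    order = [start]
    for _ in range(length - 1):
        row = probs[order[-1]]
        # Keep only the top_k most probable next frames, then sample among them.
        candidates = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:top_k]
        weights = [row[j] for j in candidates]
        order.append(rng.choices(candidates, weights=weights, k=1)[0])
    return order


# Toy usage with four hypothetical 2-D frame embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_probs = transition_probs(toy_embeddings)
print(synthesize(toy_probs, start=0, length=10))
```

Restricting the draw to the `top_k` candidates is one simple way to trade diversity against smoothness: a larger value allows more varied sequences, a smaller one keeps only the most plausible transitions.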

Paper

M. Narasimhan, S. Ginosar, A. Owens,
A. A. Efros, T. Darrell.
Strumming to the Beat:
Audio-Conditioned Contrastive
Video Textures.

[Paper] | [Bibtex]

Acknowledgements

We thank Arun Mallya, Allan Jabri, Anna Rohrbach, Amir Bar, Suzie Petryk, and Parsa Mahmoudieh for very helpful discussions and feedback. This work was supported in part by DoD including DARPA's XAI, LwLL, and SemaFor programs, as well as BAIR's industrial alliance programs.
