We introduce a non-parametric approach for infinite video texture synthesis using a representation
learned via contrastive learning. We take inspiration from Video Textures, which showed that
plausible new videos could be generated from a single one by stitching its frames together in a novel
yet consistent order. This classic work, however, was constrained by its use of hand-designed distance
metrics, limiting its use to simple, repetitive videos. We draw on recent techniques from
self-supervised learning to learn this distance metric, allowing us to compare frames in a manner that
scales to more challenging dynamics, and to condition on other data, such as audio. We learn
representations for video frames and frame-to-frame transition probabilities by fitting a video-specific
model trained using contrastive learning. To synthesize a texture, we randomly sample frames with high
transition probabilities to generate diverse temporally smooth videos with novel sequences and
transitions. The model naturally extends to an audio-conditioned setting without requiring any
finetuning. Our model outperforms baselines on human perceptual scores, can handle a diverse range of
input videos, and can combine semantic and audio-visual cues in order to synthesize videos that
synchronize well with an audio signal.
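To make the synthesis step concrete, the sketch below is a simplified stand-in, not the paper's implementation: it assumes per-frame embeddings have already been produced by the contrastive encoder, and the function name `synthesize_texture`, the softmax temperature, and the cyclic-successor trick are all illustrative choices. It treats a transition from frame i to frame j as likely when frame j lies close, in the learned embedding space, to the frame that originally follows frame i, then samples a new frame ordering from the resulting transition distribution.

```python
import numpy as np

def synthesize_texture(frame_embeddings, num_steps, temperature=0.1, seed=0):
    """Sample a novel frame ordering from softmax transition probabilities.

    A transition i -> j is treated as plausible when frame j is close, in the
    learned embedding space, to the frame that originally follows frame i.
    (Hypothetical sketch; the learned model estimates these probabilities directly.)
    """
    rng = np.random.default_rng(seed)
    # Cosine-normalize so dot products are similarities in [-1, 1].
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    succ = np.roll(emb, -1, axis=0)           # embedding of each frame's successor (cyclic)
    logits = (succ @ emb.T) / temperature     # logits[i, j]: how well frame j continues frame i
    np.fill_diagonal(logits, -np.inf)         # disallow repeating the current frame
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    sequence = [int(rng.integers(len(emb)))]  # random starting frame
    for _ in range(num_steps - 1):
        sequence.append(int(rng.choice(len(emb), p=probs[sequence[-1]])))
    return sequence

# Toy usage: 100 frames with 128-dim embeddings standing in for the contrastive encoder's output.
embeddings = np.random.default_rng(1).standard_normal((100, 128))
print(synthesize_texture(embeddings, num_steps=12))
```

Sampling from the full distribution, rather than always taking the most similar frame, is what yields diverse yet temporally smooth output sequences.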