Title: Music2Video: Automatic Generation of Music Video with fusion of audio and text

Abstract: Image creation with generative adversarial networks has been widely adapted to the multi-modal regime with the advent of multi-modal representation models pre-trained on large corpora. Because various modalities share a common representation space, they can be used to guide generative models to create images from text or even from an audio source. Departing from previous methods that rely solely on either text or audio, we exploit the expressiveness of both modalities. Based on the fusion of text and audio, we create video whose content is consistent with the distinct modalities provided. Our method includes a simple approach to automatically segment the video into variable-length intervals and maintain time consistency in the generated video. Our proposed framework for generating music videos shows promising results at the application level, where users can interactively feed in a music source and a text source to create artistic music videos. Our code is available at this https URL
Subjects: Sound (cs.SD); Graphics (cs.GR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2201.03809 [cs.SD]
  (or arXiv:2201.03809v2 [cs.SD] for this version)

Submission history

From: Yoonjeon Kim
[v1] Tue, 11 Jan 2022 06:59:21 GMT (9464kb,D)
[v2] Thu, 9 Jun 2022 06:43:39 GMT (9466kb,D)
