Presto! Distilling steps and layers for accelerating music generation

Zachary Novack#♭*   Ge Zhu   Jonah Casebeer   Julian McAuley#   Taylor Berg-Kirkpatrick#   Nicholas J. Bryan  

#University of California, San Diego
♭Adobe Research
*Work done during an internship at Adobe Research

Abstract


Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by preserving hidden state variance. Finally, we combine our improved step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Furthermore, we find that our combined distillation method can generate high-quality outputs with improved diversity while accelerating our base model by 10-18x (32-second output in 230ms, 15x faster than the comparable SOTA model) -- the fastest high-quality TTM model to our knowledge.


Bibtex

@article{Novack2025Presto,
    title={Presto! Distilling steps and layers for accelerating music generation},
    author={Zachary Novack and Ge Zhu and Jonah Casebeer and
            Julian McAuley and Taylor Berg-Kirkpatrick and Nicholas J. Bryan},
    year={2024},
    eprint={TBD},
    archivePrefix={arXiv},
    primaryClass={cs.SD}
}

Cherry-picked Examples

Custom prompts and examples. Each example was generated with 435ms of latency (batch size = 1), consisting of our accelerated model plus VAE decoding (230ms) and our mono2stereo module from MusicHiFi (205ms).

Text
Presto (ours)

crazy, wild, dance party


castle, fantasy, medieval --bpm 100

childrens music with harp --bpm 110

classical french horn --bpm 80

driving electronic, trance, house --bpm 140

electro-jazz, west african percussion --bpm 130

hawaiian ukulele, upbeat --bpm 125

hindustani, tabla, bluegrass, fusion --bpm 110

horror film music

jazz, saxophone, big-band, high energy

latin samba song --bpm 90

moody drum-n-bass, rave --bpm 184

music for a car race in the desert --bpm 140

old-school hip-hop with a groovy beat

powerful rock music, distorted guitar

southern hip-hop, hard-core --bpm 80

thrash djent --bpm 150

uptempo swing, big-band, bebop --bpm 200
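
The latency figures quoted for the examples above can be turned into a rough real-time factor. The sketch below is only arithmetic on the numbers stated on this page; it assumes each clip is 32 seconds long (the length given in the abstract), and the helper name `real_time_factor` is hypothetical.

```python
# Latency figures quoted on this page (batch size = 1).
MODEL_VAE_MS = 230.0     # accelerated model + VAE decoding
MONO2STEREO_MS = 205.0   # mono2stereo module from MusicHiFi

# Assumption: each example is 32 seconds long, as in the abstract.
AUDIO_SECONDS = 32.0


def real_time_factor(audio_seconds: float, latency_ms: float) -> float:
    """Seconds of audio produced per second of wall-clock latency."""
    return audio_seconds / (latency_ms / 1000.0)


total_ms = MODEL_VAE_MS + MONO2STEREO_MS  # end-to-end stereo latency

if __name__ == "__main__":
    print(f"total latency: {total_ms:.0f} ms")
    print(f"mono RTF:   {real_time_factor(AUDIO_SECONDS, MODEL_VAE_MS):.1f}x")
    print(f"stereo RTF: {real_time_factor(AUDIO_SECONDS, total_ms):.1f}x")
```

Under the 32-second assumption, both the mono and the stereo pipeline run well above real time; the exact factors depend on the clip length actually used.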

Song Describer Prompt Examples (Random)

Randomly selected Song Describer prompts, with one output per prompt.

Text
Presto (ours)
Base DiT
Stable Audio Open

1055022-This song that features a piano, cello and oboe starts calmly and then arrives to a crescendo to give a triumphant atmosphere

108303-Delayed synth blips and electronic drums combine on an ambient electronica track

103887-Two electric guitars in conversation with each other, one with a wua-wua effect and the other with a strong delay effect

1051196-country-pop relaxing song with happy mood and acoustic guitar, ideal for young girls going out shopping or dreaming about their love

103892-bluesy guitar with a slow repetitive rythm in a smoky room in latin america

1061473-oh, this is kind of post punk music, remind me my high school memory

1009671-Upbeat fast tempo with a blues rock feel that one can dance

1063454-A deeply soothing track featuring two string instruments makes one peaceful with theirselves and the world

1036934-Indie alike track to time-travel to mid 2000s

1062831-Only instrumental and based on electronic samples that picks up as the song progresses

1066198-Feels like there is an argument happening between two people with the constant beat making it like time is progressing forward which can be used for either casual listening, reminiscing on a memory or

1050845-Synthetic orchestral piece in the style of a 90s war movie battle scene soundtrack, with prominent bass drums and dry brass section harmonies

1061369-Probably a three piece band with melodic guitar, heavy bass and drums playing an instrumental and energetic piece

1063332-Groovy instrumental funk rock track with occasional guitar solos that give you a feeling of longing

10575-A track with elements from eurodance and indie guitars

1063331-An instrumental hopefully positive track where the foreground melody features a classic guitar

1009672-A joyful and lively song that will make you want to dance right away

1054396-fast and fun beat-based indie pop to set a protagonist-gets-good-at-x movie montage to

1060600-Melodic jazz piece starting off with a solo sax intro and continues with smooth electric guitar and synthesizer riffs

1007274-acoustic guitar solo track with consistent rhythm and repeating progression, suitable for a relaxing afternoon tea

1004034-Electronic music that has a constant melody throughout with accompanying instruments used to supplement the melody which can be heard in possibly a casual setting

1053845-An electronic ambient track, with a dark tone The song progresses building the sections on top of each other A hollow percussion drum beat builds on top of a dark and soft harp, towards the end a viol

1051201-Im in a charismatic church and feeling sea sick from all the swaying to and fro

103333-This is an instrumental track based on electronic samples that can be used before the start of a movie to give a doomsday or post apocalypse feel

1066197-the song is moving piano is playing over long notes

1054342-A slow tempo guitar picking song of a folk genre which can be played on a road trip it evokes a relaxed sensation

1057890-A cinematic electronic soundtrack which announces the epic journey the protagonist will undertake in a post-apocalyptic world

1061384-An instrumental song that evokes some kind of a rush to go and declare your love

1033252-A 70s style british pop song with drums, guitars, a synth violin sounds and finally bright piano chords

1051193-A song that's intensely personal as its superficial lyrics that take you to nowhere

Qualitative Analysis

Continuous vs. Discrete Conditioning

The benefit of continuous conditioning is illustrated by two-step sampling with discrete vs. continuous inputs on hip-hop-adjacent prompts. The discrete-conditioned samples struggle to produce high-frequency content and render percussive transients (e.g. drum hits) poorly, while the continuous-conditioned samples contain high-frequency content with clear transients.
Text
Continuous (Presto)
Discrete

Hip-hop music

Inference Noise Schedule Control

Illustration of changing the inference noise schedule parameter rho, which changes where in the diffusion process each step occurs for 4-step sampling. Two sets of five files are shown with matching prompts and initial noise latents.
Text
rho=1000
rho=7

A squirrel dancing in the backyard, uplifting

Active winter on the montains

Epic videogame boss battle OST

Sea shanty for a drunken sailor

Song for my departed goldfish
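
As a rough illustration of what rho controls, the sketch below computes 4-step noise levels with the schedule from the EDM paper (Karras et al.): sigma_i = (sigma_max^(1/rho) + i/(N-1) * (sigma_min^(1/rho) - sigma_max^(1/rho)))^rho. The function name `edm_sigmas` is hypothetical, and the `sigma_min` and `sigma_max` defaults (0.002 and 80, the EDM image defaults) are assumptions for illustration; the values Presto! actually uses are not stated on this page.

```python
from typing import List


def edm_sigmas(n_steps: int, rho: float,
               sigma_min: float = 0.002,   # assumed: EDM image default
               sigma_max: float = 80.0     # assumed: EDM image default
               ) -> List[float]:
    """Noise levels from sigma_max down to sigma_min (EDM-style schedule)."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    inv_max = sigma_max ** (1.0 / rho)
    inv_min = sigma_min ** (1.0 / rho)
    # Interpolate linearly in sigma^(1/rho), then map back with ^rho.
    return [
        (inv_max + i / (n_steps - 1) * (inv_min - inv_max)) ** rho
        for i in range(n_steps)
    ]


if __name__ == "__main__":
    for rho in (7.0, 1000.0):
        levels = edm_sigmas(4, rho)
        print(f"rho={rho:g}: " + ", ".join(f"{s:.4g}" for s in levels))
```

Under these assumed endpoints, both settings start and end at the same noise levels; only the two intermediate steps move, and rho=1000 places them at lower noise levels than rho=7. As rho grows, the spacing approaches uniform in log-sigma.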