Generating audio with RAVE-Latent Diffusion: RLDG_0da02c80cb & RLDG_835770db1c

Most of my past work on audio generation and composition with neural nets has focused on the real-time capabilities that nn~-compatible models provide.

One thing I have recently picked up again is generating audio through offline inference, in particular with the RAVE-Latent Diffusion package and model that Moisés Horta Valenzuela (aka hexorcismos) released a while back.

RAVE-Latent Diffusion is a denoising diffusion model designed to generate new RAVE latent codes with a large context window, faster than real time, while maintaining musical structural coherence.

The key requirements are an existing RAVE neural net and an audio dataset (either the one used to train the RAVE model, or that dataset extended with other material). Training is performed on latent representations of the dataset: the audio is passed through the encoder of the RAVE neural net, and the resulting latents are cut to a predefined length.
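To make the fixed-length latent windows a bit more concrete: a RAVE encoder downsamples audio by a fixed compression ratio, so every latent frame covers a fixed number of audio samples. The sketch below assumes a ratio of 2048 samples per frame at 48 kHz and a training window of 2048 frames. Both values are hypothetical; the real ratio depends on the RAVE configuration, and the window length is a training choice. The encoder output is stood in for by a plain list of frames, and the function names are illustrative, not part of the RAVE-Latent Diffusion package.

```python
from typing import List

# Assumed values for illustration only: the real compression ratio depends on
# the RAVE model configuration, and the window length is a training choice.
SAMPLE_RATE = 48_000
COMPRESSION_RATIO = 2048   # audio samples per latent frame (assumption)
WINDOW_FRAMES = 2048       # latent frames per training example (assumption)

Frame = List[float]        # one latent vector, e.g. 16 dimensions (assumption)


def frames_to_seconds(n_frames: int,
                      ratio: int = COMPRESSION_RATIO,
                      sr: int = SAMPLE_RATE) -> float:
    """Duration of audio covered by n_frames latent frames."""
    return n_frames * ratio / sr


def slice_latents(latents: List[Frame], window: int = WINDOW_FRAMES) -> List[List[Frame]]:
    """Cut one encoded track into non-overlapping windows of `window` frames.

    A trailing remainder shorter than `window` is dropped, so every
    training example has the same length.
    """
    n_windows = len(latents) // window
    return [latents[i * window:(i + 1) * window] for i in range(n_windows)]


if __name__ == "__main__":
    # Stand-in for encoder output: 5000 frames of a 16-dimensional latent.
    fake_latents = [[0.0] * 16 for _ in range(5000)]
    windows = slice_latents(fake_latents)
    print(len(windows), "windows of", frames_to_seconds(WINDOW_FRAMES), "s each")
```

Under these assumed numbers, one window corresponds to roughly 87 seconds of audio, which gives a feel for what "large context window" means in practice.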

The trained RAVE-Latent Diffusion model can then generate new latent embeddings of a defined length, controlled by a seed value, a temperature and the number of diffusion steps. These embeddings are then passed through the RAVE decoder to turn them back into audio.
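The generation step can be pictured as a reverse diffusion loop. The sketch below is a generic deterministic DDIM-style sampler with a cosine noise schedule, not the package's actual code. The seed fixes the starting noise and `steps` sets how many denoising iterations run. Temperature is assumed here to scale the initial noise; the package may use it differently. `denoiser` is a hypothetical stand-in for the trained network, and the latent is flattened into a plain list of floats for simplicity. In the real pipeline the result would go on to the RAVE decoder.

```python
import math
import random
from typing import Callable, List

# Hypothetical denoiser signature: (noisy latent, alpha_bar at step t) -> predicted noise.
Denoiser = Callable[[List[float], float], List[float]]


def alpha_bar(t: float) -> float:
    """Cosine schedule: fraction of signal remaining at time t in [0, 1].

    Clamped away from 0 so the clean-latent estimate never divides by zero.
    """
    return max(math.cos(t * math.pi / 2) ** 2, 1e-5)


def sample(denoiser: Denoiser, length: int, steps: int, seed: int,
           temperature: float = 1.0) -> List[float]:
    rng = random.Random(seed)
    # Seeded Gaussian starting noise; scaling it by temperature is an assumption.
    x = [rng.gauss(0.0, 1.0) * temperature for _ in range(length)]
    times = [1.0 - i / steps for i in range(steps + 1)]  # runs from 1.0 down to 0.0
    for t, t_prev in zip(times[:-1], times[1:]):
        a_t, a_prev = alpha_bar(t), alpha_bar(t_prev)
        eps = denoiser(x, a_t)
        # Estimate the clean latent, then take a deterministic DDIM step (eta = 0).
        x0 = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x  # at t = 0, alpha_bar is 1, so this equals the final clean estimate


def toy_denoiser(x: List[float], a_t: float) -> List[float]:
    """Ideal noise predictor if every training latent were exactly zero.

    Stands in for the trained network: in this toy case all of x_t is noise.
    """
    return [xi / math.sqrt(1.0 - a_t) for xi in x]


if __name__ == "__main__":
    z = sample(toy_denoiser, length=16, steps=50, seed=42, temperature=0.8)
    print(max(abs(v) for v in z))  # should be ~0, since the toy "data" is all zeros
```

In this sketch, a fixed seed, temperature and step count always reproduce the same latent. Fewer steps run faster but give the denoiser fewer chances to refine its estimate.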

In principle, the generated audio inherits the structural coherence of the training material, e.g. build-ups, density and spectral distribution over time.

Using my own music and past releases as training data, for both sound aesthetics and structural information, I ended up with this:

…and this:

Each compilation was created with a dedicated RAVE neural net: one trained on an unedited version of my whole discography (excluding remixes and collaborations with other artists), the other on an augmented version of it. The output was generated with different settings for seed, temperature and diffusion steps.

Both are available via Nina and Bandcamp.

Apart from their obviously chaotic character, likely a result of the heterogeneous source data, I seem to be able to make out at least some structural similarity in the material from RLDG_0da02c80cb: the first 1 to 1:30 minutes tend to be a bit less packed than the second half of the audio, probably reflecting the intro sections of the tracks in the dataset.
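One rough way to put a number on the impression that the opening of a generated piece is less packed is to compare its RMS level before and after a split point. This is a minimal sketch under stated assumptions. RMS is only a crude proxy for density, since it ignores event density and spectral spread. The 90-second split is also a choice, and the input is assumed to be mono float samples already loaded from the rendered audio. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def intro_vs_rest(samples: Sequence[float], sr: int,
                  split_seconds: float = 90.0) -> Tuple[float, float]:
    """RMS before and after `split_seconds` (mono float samples assumed)."""
    split = int(split_seconds * sr)
    return rms(samples[:split]), rms(samples[split:])


if __name__ == "__main__":
    sr = 1000  # low rate just to keep the synthetic example small
    quiet_intro = [0.1 * math.sin(2 * math.pi * 5 * n / sr) for n in range(90 * sr)]
    loud_rest = [0.5 * math.sin(2 * math.pi * 5 * n / sr) for n in range(90 * sr)]
    intro, rest = intro_vs_rest(quiet_intro + loud_rest, sr)
    print(f"intro RMS {intro:.3f}, rest RMS {rest:.3f}")
```

Running this over many renders with different seeds would turn a listening impression into a simple comparison, though a proper density measure would need more than overall level.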

Since, according to Moisés/hexorcismos, structural coherence is much more noticeable in material with a higher degree of repetition (e.g. Techno), I intend to dive into more experiments and empirical research in the near future.
