Most of my past work on audio generation and composition with neural nets has focused on the real-time capabilities that nn~ compatible models provide.
One thing I have recently picked up again is generating audio through offline inference, in particular with the RAVE-Latent Diffusion package and model that Moisés Horta Valenzuela aka hexorcismos released a while back.
The key requirements are an existing RAVE neural net and an audio dataset (either the one used to train RAVE or one that also includes other material). The training itself is performed on latent representations of the dataset, cut to a predefined length and preprocessed through the encoder of the RAVE neural net.
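To make the "predefined length" a bit more tangible, here is a minimal sketch of how a long sequence of latent frames could be cut into fixed-length training windows, and how long one such window would be once decoded. Everything in it is an assumption for illustration: the function names are hypothetical, the sample rate (44.1 kHz), the encoder compression ratio (2048 samples per latent frame) and the window length (2048 frames) are example values, and dropping the leftover frames at the end is just one possible choice. It is not the preprocessing code of the RAVE-Latent Diffusion package.

```python
def split_into_windows(latent_frames, window_length):
    """Cut a sequence of latent frames into non-overlapping windows.

    Frames left over at the end (fewer than window_length) are dropped.
    This is an assumption of the sketch, not necessarily what the package does.
    """
    n_windows = len(latent_frames) // window_length
    return [
        latent_frames[i * window_length:(i + 1) * window_length]
        for i in range(n_windows)
    ]


def window_duration_seconds(window_length, compression_ratio, sample_rate):
    """Duration of one decoded window: frames * samples per frame / sample rate."""
    return window_length * compression_ratio / sample_rate


# Example values, all assumed for illustration.
SAMPLE_RATE = 44100        # audio sample rate in Hz
COMPRESSION_RATIO = 2048   # audio samples represented by one latent frame
WINDOW_LENGTH = 2048       # latent frames per training example

# Stand-in for an encoded track: 5000 latent frames (here just integers).
encoded_track = list(range(5000))

windows = split_into_windows(encoded_track, WINDOW_LENGTH)
duration = window_duration_seconds(WINDOW_LENGTH, COMPRESSION_RATIO, SAMPLE_RATE)

print(f"{len(windows)} windows of {WINDOW_LENGTH} frames, "
      f"about {duration:.1f} s of audio each")
```

With these assumed values, one window corresponds to roughly a minute and a half of audio, so the chosen latent length directly sets how much musical structure a single training example can contain.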
The finished RAVE-Latent Diffusion model can then be used to generate new latent embeddings of a defined length, configured by a seed value, a temperature and a number of diffusion steps, before they are passed through the RAVE decoder and turned back into audio.
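As a rough illustration of how these three settings could interact, here is a toy sampling loop in plain Python. It is a sketch under stated assumptions, not the package's generation code: the function names are hypothetical, the "denoiser" is a placeholder that simply shrinks values where a trained diffusion model would sit, and I am assuming that the temperature scales the initial noise. The RAVE decoding step is left out entirely.

```python
import random


def initial_noise(seed, temperature, n_dims, n_frames):
    """Seeded Gaussian noise, scaled by temperature (assumed role of temperature)."""
    rng = random.Random(seed)  # same seed -> same starting noise
    return [
        [rng.gauss(0.0, 1.0) * temperature for _ in range(n_frames)]
        for _ in range(n_dims)
    ]


def toy_denoise_step(latents):
    """Placeholder for one diffusion step.

    A real model would predict and remove noise with a trained network;
    here every value is just pulled 10% towards zero.
    """
    return [[value * 0.9 for value in row] for row in latents]


def generate_latents(seed, temperature, diffusion_steps, n_dims=4, n_frames=8):
    """Start from seeded noise and refine it for a fixed number of steps."""
    latents = initial_noise(seed, temperature, n_dims, n_frames)
    for _ in range(diffusion_steps):
        latents = toy_denoise_step(latents)
    return latents  # would be handed to the RAVE decoder afterwards


a = generate_latents(seed=42, temperature=1.0, diffusion_steps=10)
b = generate_latents(seed=42, temperature=1.0, diffusion_steps=10)
c = generate_latents(seed=43, temperature=1.0, diffusion_steps=10)

print("same seed reproduces the result:", a == b)
print("different seed gives a different result:", a != c)
```

The point of the sketch is only the division of labour: the seed fixes the starting noise and makes a result reproducible, the temperature controls how far that noise spreads, and the number of steps controls how often it is refined.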
In principle, the generated audio inherits the structural coherence of the material the model was trained on, e.g. build-up, density and spectral distribution over time.
When using my own music and release material from the past as training data both with respect to sound aesthetics and structural information, I ended up with this:
…and this:
Each compilation has been created using a dedicated RAVE neural net – one trained on an unedited version of my whole discography (excluding remixes and collaboration work with other artists), the other one on an augmented version of the same. The output has been generated with different settings of seeds, temperature and diffusion steps.
Both are available through Nina and Bandcamp.
Apart from their obviously chaotic character, which likely comes from the heterogeneous original data, I seem to be able to make out at least a certain amount of structural similarity in the material from RLDG_0da02c80cb: the first 1:00 to 1:30 minutes tend to be a bit less densely packed than the second half of the audio, probably because of the intro sections of the tracks in the dataset.
Since, according to Moisés/hexorcismos, structural coherence is much more noticeable in highly repetitive material (e.g. Techno), I intend to dive into more experiments and empirical research in the near future.