Embedded Realtime Neural Audio Synthesis using a Jetson Nano

In this project, we target the use of our Realtime Audio Variational autoEncoder (RAVE) model in realtime on a Jetson Nano. RAVE is a deep learning model that can be seen as a smart (or at least dataset-specific) compression pipeline, encoding signals into a compact high-level representation (also called a latent representation). This latent representation is then decoded back into sound, which allows the user to perform timbre transfer, latent exploration, or high-level manipulation. You can try an interactive demo here (I recommend trying the darbouka model while beatboxing).

One of the cool aspects of this model is its streaming ability: it can be used on audio streams in realtime instead of processing audio files in an offline fashion. Running RAVE models in realtime can be achieved through our nn~ external (plugin) for Max/MSP and PureData, built specifically to interface streaming deep learning models inside Max/MSP and PD (ports to SuperCollider and TidalCycles are also maintained by the community).
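Streaming here simply means the model consumes fixed-size audio buffers as they arrive, rather than seeing a whole file at once. A minimal sketch of that pattern in plain Python (the identity `process_block` is a stand-in for the actual RAVE encode/decode, and the block size is arbitrary):

```python
def stream_blocks(signal, block_size):
    """Yield consecutive fixed-size buffers from a signal,
    the way an audio driver delivers them in realtime."""
    for start in range(0, len(signal) - block_size + 1, block_size):
        yield signal[start:start + block_size]

def process_block(block):
    """Stand-in for the model: in nn~, each buffer would be
    encoded to the latent space and decoded back to audio."""
    return block

# Eight samples handled as two buffers of four: each block is
# processed as soon as it exists, without waiting for the rest.
output = []
for block in stream_blocks([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], 4):
    output.extend(process_block(block))
```

The same loop shape is what nn~ implements inside Max/MSP and PD, with the signal vector size playing the role of `block_size`.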

Thanks to the pre-built torch binaries available here, setting up torch on a fresh Jetson Nano install is pretty straightforward. Installing PureData and nn~ for the Jetson Nano can then be done following this script (at least partially, given that it is initially meant to be used with a Raspberry Pi).

And now the fun part!

A small demo of the kind of interactions made possible by RAVE using nn~

RAVE x nn~

Everything here is made with Max/MSP, but the PureData version of the external is close to identical in terms of features (plus it is compatible with GPU-accelerated processing!). For this project, we interact with the Jetson Nano during processing through a 6-DOF motion sensor, with acceleration and rotation fed directly to the decoder of the model, following this patch:

The motion sensor sends OSC data through port 8888, which is unpacked and filtered into ACC(1,2,3) and GYRO(1,2,3), available through outlets 1 to 6 of the [unpack f f f f f f] object. We then compute the global acceleration as an L2 norm [expr sqrt(...)], which is biased [- 2.6] and normalized [expr 4/...] using a scaled sigmoid function. The other dimensions from the sensor (GYRO 1, 2, 3) are simply multiplied by 2. DISCLAIMER: this choice of normalization/bias for the sensor data is purely based on trial and error, and does not reflect any kind of ground truth.
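The sensor mapping above can be sketched in plain Python. The bias (2.6), scale (4), and gyro gain (2) come from the patch; the exact form of the scaled sigmoid is my reading of the [expr] objects, so treat it as an assumption:

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def process_sensor(acc, gyro, bias=2.6, scale=4.0, gyro_gain=2.0):
    """Map raw 6-DOF sensor values to decoder inputs.

    acc, gyro: 3-tuples of floats (ACC1-3, GYRO1-3).
    The bias/scale constants mirror the patch and were found by
    trial and error, not derived from any ground truth.
    """
    # Global acceleration: L2 norm of the three accelerometer axes.
    g = math.sqrt(acc[0] ** 2 + acc[1] ** 2 + acc[2] ** 2)
    # Remove the offset, then squash through a scaled sigmoid,
    # keeping the value in (0, 4).
    first_dim = scale * sigmoid(g - bias)
    # Remaining dimensions: gyro axes with a fixed gain.
    return [first_dim] + [gyro_gain * r for r in gyro]
```

The returned list corresponds to the values the patch routes into the decoder's latent inputs.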

The model we use is called engine and has been trained on industrial drum loops. Unfortunately, we cannot share it. You can, however, see the patch in action here:

RAVE x nn~

It was a lot of fun building this project, and the use of the Nano opens up lots of possibilities for embedded instruments and hardware synths.

If you have any questions, feel free to ask them here or in the RAVE forum.

Happy hacking ❤