How to feed raw audio into the model with nvinferaudio?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) dGPU
• DeepStream Version 6.1.1
• JetPack Version (valid for Jetson only) None
• TensorRT Version 8.4
• NVIDIA GPU Driver Version (valid for GPU only) 515.65.01
• Issue Type (questions, new requirements, bugs) question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing) None
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description) None

I want to use nvinferaudio with my model, but I found that audio-transform lacks some features (setting fmin/fmax of the mel filter bank, pre-emphasis).
Therefore, I am trying to feed raw audio into the model and let the model do the rest (STFT, conv, classify).
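
For readers unfamiliar with the two missing features, here is a minimal sketch of what they usually mean. It is not nvinferaudio code: the function names, the default coefficient of 0.97, and the HTK-style mel formula are assumptions chosen for illustration, and other conventions exist.

```python
import math


def pre_emphasis(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample passes through.

    coeff=0.97 is a common default, assumed here for illustration.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def hz_to_mel(freq_hz):
    """HTK-style mel scale (one of several conventions in use)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(fmin, fmax, n_mels):
    """Return n_mels + 2 edge frequencies in Hz, evenly spaced in mel
    between fmin and fmax. Each triangular filter spans three
    consecutive edges."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]
```

For example, `mel_band_edges(50.0, 8000.0, 40)` gives the 42 edge frequencies of a 40-band filter bank limited to 50 Hz to 8 kHz, which is the kind of fmin/fmax control the post asks about.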

I assumed nvinferaudio supports raw audio input because the doc says it supports “Encoder Decoder RNN Architecture”, and RNNs usually take raw audio rather than a mel spectrogram, as far as I know.

My questions are:

  • How do I write the config to feed raw audio data into the model?
  • What is the input shape in that case (e.g., BATCH x LEN x 1)?
  • What is the input value range (e.g., <16bit-raw-value> / net-scale-factor)?

Another question:

  • How can I use parse-classifier-func-name? Is the function interface the same as in nvinfer?

There is a complete audio configuration sample in /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-audio

For the model used in /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-audio, the shape is batch x channel x frame_samples x sample_size

The input value range of the model in the sample /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-audio is -1.0 to 1.0, as 32-bit floating point.
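
As a rough illustration of how 16-bit PCM samples are commonly mapped into that -1.0 to 1.0 float range, here is a minimal sketch. The thread does not say how the plugin scales samples internally, so the divisor 32768 and the helper name `pcm16_to_float` are assumptions, not the plugin's documented behavior.

```python
import struct


def pcm16_to_float(raw_bytes):
    """Decode little-endian signed 16-bit PCM into floats in [-1.0, 1.0).

    Dividing by 32768 is a common convention, assumed here; a trailing
    odd byte, if any, is ignored.
    """
    count = len(raw_bytes) // 2
    ints = struct.unpack("<%dh" % count, raw_bytes[:count * 2])
    return [value / 32768.0 for value in ints]


# Usage: three samples packed as little-endian int16.
pcm = struct.pack("<3h", 0, 16384, -32768)
floats = pcm16_to_float(pcm)
```

With this convention the most negative sample maps to exactly -1.0, while the most positive one (32767) maps to slightly less than 1.0.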

Thank you for your reply.

I am asking about the case of raw audio signal input, not a mel spectrogram.
Raw audio means the left side of the figure, while the mel spectrogram is the middle of the figure.

deepstream-audio only shows the config for mel spectrogram input.
I guess batch x channel x frame_samples x sample_size is the mel spectrogram shape.


@Fiona.Chen Hi, I would appreciate your reply :)

Can I have some support?

Pinging the moderators (related to nvinferaudio). Sorry to bother you.
@Fiona.Chen @fanzh @Amycao @mchi

The nvinferaudio plugin does not support feeding raw audio input to the model. The input audio is first converted to a log mel spectrogram and then passed on to the model. We will add this feature to our roadmap.
