Nvinferaudio custom model

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) Jetson Nano
• DeepStream Version 6.0.1
• JetPack Version (valid for Jetson only) 4.6.1
• TensorRT Version 8.0.1
• NVIDIA GPU Driver Version (valid for GPU only) 470
• Issue Type( questions, new requirements, bugs) Question
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Good evening

I am trying to incorporate a custom audio classifier into nvinferaudio in DeepStream. I understand from the documentation that the model works on a mel spectrogram, and I am trying to understand the input tensor of sonyc_audio_classify.onnx, which I see has type float32[batch_size,1,635,128]. I am trying to relate this to the audio-transform parameter; from what I can see, the 128 is the number of mel bins. It would be helpful to have an example frame of input data.

Is this format the standard processing of the nvinferaudio block, or has some input customization been done? Alternatively, where can I find the definition of the SONYC classifier so that I can compare it against my model and make adjustments?

Any help would be appreciated

Any advice?

1. The Gst-nvinferaudio plugin applies a transform (log mel spectrogram) to each input frame, based on the audio-transform property setting. Please refer to the Gst-nvinferaudio page of the DeepStream 6.1.1 Release documentation.
2. Please refer to the demo app deepstream-audio in /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/.

Hi fanzh

Thanks, I have gone through that. It would be helpful to know more about the SONYC model used. Is it available? It does not seem to be a standard, well-known model, unless I just can't find it. I did find the SONYC dataset and the journal article, which describes a very basic classifier.

Alternatively, what does the tensor represent? For example, with images the input tensor is NCHW; what is the equivalent definition for audio? Am I right to say [batch_size, audio_channels (1), data points (635), mel_bins (128)]?

Hi fanzh, I didn't reply directly to your post. Where is the best place to request this info?

Please refer to the deepstream-audio config file. Here are some important parameters; you can find their descriptions in the documentation above.
audio-transform=melsdb,fft_length=2560,hop_size=692,dsp_window=hann,num_mels=128,sample_rate=44100,p2db_ref=(float)1.0,p2db_min_power=(float)0.0,p2db_top_db=(float)80.0
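To relate these parameters to the float32[batch_size,1,635,128] input shape, the value can be split into its fields and a little arithmetic done on them. The following is only a rough sketch: `parse_audio_transform` is a hypothetical helper written for this post, not part of DeepStream, and reading the 635 axis as a count of time frames is an assumption taken from the question above, not something the documentation states.

```python
def parse_audio_transform(value):
    """Split an audio-transform value into (transform name, parameter dict).

    Hypothetical helper for illustration only; not a DeepStream API.
    """
    name, *pairs = value.split(",")
    params = {}
    for pair in pairs:
        key, raw = pair.split("=", 1)
        # Drop GStreamer-style type casts such as "(float)1.0".
        if raw.startswith("(") and ")" in raw:
            raw = raw.split(")", 1)[1]
        # Try int first, then float, and fall back to the raw string.
        for cast in (int, float):
            try:
                params[key] = cast(raw)
                break
            except ValueError:
                continue
        else:
            params[key] = raw
    return name, params


AUDIO_TRANSFORM = (
    "melsdb,fft_length=2560,hop_size=692,dsp_window=hann,num_mels=128,"
    "sample_rate=44100,p2db_ref=(float)1.0,p2db_min_power=(float)0.0,"
    "p2db_top_db=(float)80.0"
)

name, params = parse_audio_transform(AUDIO_TRANSFORM)

# Model input shape reported for sonyc_audio_classify.onnx (batch dim omitted).
channels, frames, mel_bins = 1, 635, 128

# Assumption: the 635 axis counts spectrogram frames spaced hop_size samples apart.
hop_seconds = params["hop_size"] / params["sample_rate"]
span_seconds = frames * hop_seconds

print("transform:", name)
print("num_mels matches last axis:", params["num_mels"] == mel_bins)
print("seconds per hop:", round(hop_seconds, 5))
print("seconds spanned by 635 hops:", round(span_seconds, 3))
```

Under that assumption, 635 hops of 692 samples at 44100 Hz cover a little under 10 seconds of audio, and num_mels=128 lines up with the last axis of the tensor. The exact frame count also depends on how the plugin pads and windows the buffer, which this sketch does not model.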

Thanks, I have got it working now, by the looks of it.

I am getting another error, though:

0:00:08.963285580 11141 0xbd13680 WARN nvinferaudio gstnvinferaudio.cpp:578:gst_nvinferaudio_process_input:<audio_classifier> error: Failed buffering twice
ERROR from audio_classifier: Failed buffering twice

I’ll post another thread for that
