Please provide complete information as applicable to your setup.
• Hardware Platform (Jetson / GPU): Jetson Nano
• DeepStream Version: 6.01
• JetPack Version (valid for Jetson only): 4.61
• TensorRT Version: 8.01
• NVIDIA GPU Driver Version (valid for GPU only): 470
• Issue Type (questions, new requirements, bugs): Question
• How to reproduce the issue? (This is for bugs. Including which sample app is used, the configuration file contents, the command line used and other details for reproducing)
• Requirement details (This is for new requirements. Including the module name, for which plugin or for which sample application, and the function description)
Good evening
I am trying to incorporate a custom audio classifier into nvinferaudio in DeepStream. I understand from the documentation that the model itself works on a mel spectrogram, but I am trying to understand the input tensor of sonyc_audio_classify.onnx, which I see is of type float32[batch_size,1,635,128]. I am trying to relate this to the audio-transform parameter, and from what I can see the 128 is the number of mel bins. It would be helpful to have an example frame of input data.
Is this format the standard processing of the nvinferaudio block, or has some input customization been done? Alternatively, where can I find the definition of the SONYC classifier so that I can compare it against my model and make adjustments?
1. The Gst-nvinferaudio plugin applies a transform (log mel spectrogram) to the input frame according to the audio-transform property setting. Please refer to the Gst-nvinferaudio page of the DeepStream 6.1.1 Release documentation.
2. Please refer to the demo app deepstream-audio in /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/.
Thanks, I have gone through that. It would be helpful to know more about the SONYC model used. Is it available? It does not seem to be a standard, well-known model, unless I just can't find it. I did find the SONYC dataset and the journal article, which describes a very basic classifier.
Or alternatively, what does the tensor represent? For example, with images the input tensor is NCHW; what is the equivalent definition for audio? Am I right to say [batch_size, audio_channels (1), data points (635), mel_bins (128)]?
Please refer to the deepstream-audio config file. Here are some important parameters; their descriptions are in the documentation mentioned above.
audio-transform=melsdb,fft_length=2560,hop_size=692,dsp_window=hann,num_mels=128,sample_rate=44100,p2db_ref=(float)1.0,p2db_min_power=(float)0.0,p2db_top_db=(float)80.0
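To relate these parameters to the float32[batch_size,1,635,128] input, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not the plugin's code: the helper names are hypothetical, reading the axes as [batch, channel, time frames, mel bins] is the interpretation proposed in the question rather than something confirmed in this thread, and zero-padding the final partial window is an assumption about the framing.

```python
import math

# Transform string copied from the deepstream-audio config quoted above.
AUDIO_TRANSFORM = (
    "melsdb,fft_length=2560,hop_size=692,dsp_window=hann,num_mels=128,"
    "sample_rate=44100,p2db_ref=(float)1.0,p2db_min_power=(float)0.0,"
    "p2db_top_db=(float)80.0"
)


def parse_audio_transform(spec):
    """Split the transform string into (name, params). Hypothetical helper."""
    name, *pairs = spec.split(",")
    params = {}
    for pair in pairs:
        key, value = pair.split("=", 1)
        if value.startswith("(float)"):
            # Drop the GStreamer-style "(float)" type prefix.
            params[key] = float(value[len("(float)"):])
        elif value.isdigit():
            params[key] = int(value)
        else:
            params[key] = value
    return name, params


def samples_spanned(num_frames, fft_length, hop_size):
    """Samples covered by num_frames full windows spaced hop_size apart."""
    return (num_frames - 1) * hop_size + fft_length


def frames_for(num_samples, fft_length, hop_size):
    """Windows needed to cover num_samples.

    ASSUMPTION: the last partial window is zero-padded; the plugin's
    actual framing is not stated in this thread.
    """
    if num_samples <= fft_length:
        return 1
    return 1 + math.ceil((num_samples - fft_length) / hop_size)


name, params = parse_audio_transform(AUDIO_TRANSFORM)
span = samples_spanned(635, params["fft_length"], params["hop_size"])
seconds = span / params["sample_rate"]
frames_10s = frames_for(10 * params["sample_rate"],
                        params["fft_length"], params["hop_size"])
# Assumed layout: [batch, channel, time frames, mel bins].
input_shape = (1, 1, frames_10s, params["num_mels"])
print(name, span, round(seconds, 3), input_shape)
```

Under these assumptions, 635 windows of 2560 samples with a hop of 692 span 441,288 samples, which is roughly 10 seconds at 44.1 kHz, and a 10-second buffer yields 635 frames of 128 mel bins. That is consistent with reading 635 as the time axis, but it does not confirm how the plugin buffers or pads audio.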