I am trying to do audio processing on the GPU of a Jetson TK1. I am using Jack2 in real-time mode with a period of 128 samples at 48 kHz (about 2.7 ms).
I wrote a simple FIR filter using cuFFT (FFT -> complex multiply -> iFFT), with each of the two stereo channels on its own stream.
My problem is that most of the time is spent launching kernels, not computing, so the two channels do not even end up being processed in parallel.
Dynamic parallelism is not available on this platform, so I cannot reduce everything to a single launch, and I have not found a way to launch the FFT from inside a kernel.
I cannot process more data at once because I need low latency. I already use mapped memory to eliminate the memcpy time.
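For reference, the per-channel chain described above (FFT -> complex multiply -> iFFT on one 128-sample block) can be sketched on the CPU as follows. This is only an illustration and not the cuFFT code: the hand-written radix-2 `fft`, the helper names `makeFilterSpectrum` and `filterBlock`, and the FFT size of 256 (block zero-padded so the product is a linear rather than circular convolution) are all assumptions. Like cuFFT, the inverse transform here is unnormalised, so the result is divided by N afterwards.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cpx = std::complex<float>;

// In-place iterative radix-2 FFT; a.size() must be a power of two.
// inverse == true gives the unnormalised inverse transform, so the
// caller is responsible for dividing by N.
void fft(std::vector<cpx>& a, bool inverse) {
    const std::size_t n = a.size();
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double pi = std::acos(-1.0);
    // Butterfly stages; twiddles are accumulated in double precision.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = 2.0 * pi / static_cast<double>(len) * (inverse ? 1.0 : -1.0);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cpx u = a[i + k];
                const cpx v = a[i + k + len / 2] * cpx(w);
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}

// Hypothetical helper: spectrum of the FIR taps, zero-padded to fftSize.
// Computed once, outside the per-block path.
std::vector<cpx> makeFilterSpectrum(const std::vector<float>& taps, std::size_t fftSize) {
    std::vector<cpx> H(fftSize, cpx(0.0f, 0.0f));
    for (std::size_t i = 0; i < taps.size() && i < fftSize; ++i) H[i] = cpx(taps[i], 0.0f);
    fft(H, false);
    return H;
}

// Hypothetical helper: one block through FFT -> complex multiply -> iFFT.
// Returns H.size() samples; everything past block.size() is the filter tail,
// which a streaming filter would have to overlap-add into the next block
// (not done in this sketch).
std::vector<float> filterBlock(const std::vector<float>& block, const std::vector<cpx>& H) {
    const std::size_t n = H.size();
    std::vector<cpx> X(n, cpx(0.0f, 0.0f));
    for (std::size_t i = 0; i < block.size() && i < n; ++i) X[i] = cpx(block[i], 0.0f);
    fft(X, false);                                       // forward FFT
    for (std::size_t i = 0; i < n; ++i) X[i] *= H[i];    // complex multiply
    fft(X, true);                                        // inverse FFT (unnormalised)
    std::vector<float> y(n);
    for (std::size_t i = 0; i < n; ++i) y[i] = X[i].real() / static_cast<float>(n);
    return y;
}
```

On the GPU each of the three steps in `filterBlock` is at least one separate launch per channel per block, which is where the launch overhead comes from.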
Is there a way to run a cuFFT transform without involving the CPU, so that I can do the full chain (FFT -> filter -> iFFT) in a single kernel?
Is there a way to run the kernels repeatedly without a new launch each time, given that I do the same thing for every audio chunk?