Too much time spent on the CPU


I am trying to do audio processing on the GPU of a Jetson TK1. I am using Jack2 in real-time mode, with a 128-sample period at 48 kHz (2.7 ms).
I implemented a simple FIR filter using cuFFT (FFT->complex multiply->iFFT), with each of the two stereo channels on a different stream.

My problem is that most of the time is spent launching kernels, not computing. The two channels are not even processed in parallel.
Dynamic parallelism is not available on this platform, so I cannot reduce this to a single launch, and I did not find a way to launch an FFT from within a kernel.
I cannot batch more data at once, because I need low-latency behavior. I already use mapped memory to remove memcpy time.

Is there a way to launch a cuFFT transform without involving the CPU, so that I can do the full chain (FFT->filter->iFFT) in one single kernel?
Is there a way to run kernels repeatedly, without a new launch each time, since I do the same thing for each audio chunk?

Thanks !

“My problem is that most of the time is spent launching kernels, not computing. The two channels are not even processed in parallel.”

this sounds ‘wrong’, either way
so, i would wonder whether there is not something that can be done about it
personally, i would consider this an itch that i wish to scratch

“Is there a way to run kernels repeatedly, without a new launch each time”

generally, yes; with emphasis on ‘generally’

the only things that truly change between kernel blocks are indices, offsets, etc
you could easily instruct a kernel block to behave as multiple kernel blocks, by having it loop for a set count and update its indices itself
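A minimal sketch of that idea (the names, chunk count, and the per-sample gain operation are purely illustrative, not anything from the thread): one launch covers several chunks, with the block recomputing its own offset each iteration instead of relying on a fresh launch.

```cuda
// Sketch: one launch processes NCHUNKS consecutive 128-sample chunks.
// The block updates its own index each iteration, standing in for the
// per-chunk launches it replaces.
#define CHUNK   128
#define NCHUNKS 8

__global__ void processChunks(const float *in, float *out)
{
    for (int c = 0; c < NCHUNKS; ++c) {
        int i = c * CHUNK + threadIdx.x;   // self-updated offset
        out[i] = 0.5f * in[i];             // placeholder per-sample work
    }
}

// Host side: a single launch instead of NCHUNKS launches.
// processChunks<<<1, CHUNK>>>(d_in, d_out);
```

The trade-off is that the indexing logic moves into the kernel, but the per-launch overhead is paid once instead of NCHUNKS times.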

So, to be clear, you are doing 128-point FFTs/iFFTs?

I am in fact doing 512-point FFTs/iFFTs, using the overlap-save filtering method, for each new 128-sample chunk.
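For readers unfamiliar with overlap-save: with a 512-point FFT and a 128-sample hop, each call keeps the last 384 input samples, appends the 128 new ones, and emits only the last 128 output samples (the ones free of circular wrap-around). A host-side sketch of that bookkeeping, with a hypothetical `fftFilter512()` standing in for the actual FFT->multiply->iFFT chain (stubbed as an identity filter here):

```c
#include <string.h>

#define N 512   /* FFT size */
#define L 128   /* new samples per chunk; filter length must be <= N-L+1 */

/* stands in for the real FFT -> complex multiply -> iFFT chain */
static void fftFilter512(const float *in, float *out)
{
    memcpy(out, in, N * sizeof(float));   /* identity filter for the sketch */
}

static float x[N];   /* sliding input block, zero-initialized */

void onNewChunk(const float *newSamples, float *outSamples)
{
    /* keep the last N-L = 384 input samples, append the L new ones */
    memmove(x, x + L, (N - L) * sizeof(float));
    memcpy(x + (N - L), newSamples, L * sizeof(float));

    float y[N];
    fftFilter512(x, y);

    /* with overlap-save, only the last L outputs are free of circular
       wrap-around and may be emitted */
    memcpy(outSamples, y + (N - L), L * sizeof(float));
}
```

With the identity stub, the emitted samples equal the newly appended chunk; in the real filter they would be the convolved result.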

I don’t understand what you mean by “this sounds ‘wrong’, either way”.
I am new to CUDA, and I understood that computations on independent data should be done in different streams, to allow the scheduler/dispatcher to optimize the pipeline.
For repeated kernels, how do I wait for a CPU event? I saw functions for the CPU waiting on the GPU, but not for the GPU waiting on the CPU.

I am currently trying my own FFT/iFFT implementation directly in a kernel, not using cuFFT. This way, I can stay inside the kernel and limit the launch cost to one cudaLaunch (~50 µs) per chunk.

So, as you indicated, a 128-sample period occupies 2.7 ms, while kernel launch overhead is 0.05 ms. Why do you believe that launch overhead is a significant issue?

“My problem is that most of the time is spent launching kernels, not computing.”

I see it with the profiler, and with timing measurements.
Each FFT/iFFT through cuFFT is two launches (kernel + post/pre-processing), and each cudaLaunch for cuFFT is around 0.1 ms.
Each launch in fact costs more than 0.05 ms, once the ConfigureCall and SetupArguments (and BindTexture for cuFFT) are counted.
So in total, cudaLaunch alone takes 0.9 ms, i.e. 33% of the time I have. Counting all setup time, I am close to 2 ms, without synchronizing.

I switched to my own in-kernel filtering and reduced the time to around 0.3 ms. However, the StreamSynchronize sometimes hangs for a very long time (>1 s), even though the kernel has already finished.

Yes, CUFFT has a lot of overhead for doing transforms on very small data sets.

If your launch overhead is 50 µs, your processing takes 0.3 ms using your own method (not CUFFT), and you have a 2.7 ms period, I’m struggling to understand the issue.

I am sorry, I was not clear.
I started with cuFFT, and was still trying to make it work when I asked the question, trying to deal with this huge overhead.

In the meantime, I tested another solution, with my own FFT kernel derived from vvolkov’s, and that gives much better results. I still have the issue I mentioned in the previous post, but it is not related to launch overhead.


So the remaining issue is the occasional very long StreamSynchronize time?

Any chance you could provide a full reproducer, along with a complete platform description? (GPU, OS, CUDA version, compile command)

My knowledge of CUFFT is extremely limited, but it seems to me this FFT size is too small for GPUs in general. I do not see how it could use more than one thread per sample, and for full performance the GPU wants to run thousands, even tens of thousands, of threads. In addition, a general-purpose library like CUFFT will always have overhead, which is why a custom kernel can be faster.

This is probably not politically correct to say in a forum dedicated to CUDA programming, but depending on the context of these operations, performing such small FFTs on the CPU could be the most efficient solution. The small amount of data processed ensures that it stays in the CPU’s data cache, giving you low latency.

what exactly is the host in the case of the jetson? and what would service the StreamSynchronize?

if the host is onboard, how many cores does it have?

i wonder whether streams may not suffer from a lack of host cores on the jetson (without intervention)

Thanks. I know that GPUs are usually for big data sets, and that they usually do not need such low-latency processing. But NVIDIA promotes GPGPU, and I am evaluating whether it is possible to use the GPU for real-time, low-latency audio processing. I know that for this particular simple test the CPU would be faster, but my goal is to use the GFLOPS of the GPU so that the CPU can do other tasks. And I would like to do hundreds of small FFTs sequentially.
So yes, using my own FFT implementation is much more efficient. I currently use 64 threads per FFT, so I can do the two channels at the same time, as I still have free CUDA cores (I know it can be improved).

The host is a Tegra K1: an embedded ARM processor (quad-core Cortex-A15) with an integrated mobile Kepler GPU (1 SMX with 192 CUDA cores, compute capability 3.2). It is not a PCI device; the CPU and GPU directly share the same SDRAM (zero-copy). I run Linux4Tegra R21.4 without a graphical display, to be sure the display does not use the GPU.

I will investigate the Synchronize problem more, and try to provide code to reproduce it. I have not spent enough time on this yet… I will let you know.

well, in that case, my thought is that the host might be a little weak to carry the burden accompanying streams
how are the synchronization flags set? have you tried setting the flags such that the host yields, and does not busy-wait?
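For reference, these wait policies are selected per device (not per stream) via device flags; a minimal sketch of the call in question, which must run before the CUDA context is created:

```cuda
// Sketch: select the host's wait behavior before any runtime call
// that initializes the device.
cudaSetDeviceFlags(cudaDeviceScheduleYield);   // host yields while waiting
// alternatives:
//   cudaDeviceScheduleSpin          busy-wait (lowest latency, burns a core)
//   cudaDeviceScheduleBlockingSync  block the calling thread on sync calls
```

On a quad-core host like the TK1's Cortex-A15, yield or blocking sync frees a core at the price of some wake-up latency, while spin keeps one core pinned to polling.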

I’ve also seen high overhead with CUFFT for small transforms.

How “real time” does your processing have to be (i.e. are you going to play back the filtered audio in real time and care about stutter)? I heard one claim that the reason hearing aids are so expensive is that they cannot tolerate even 1 ms of latency, or it would disorient the wearer.

If not, have you considered using batched cuFFT transforms (cufftMakePlanMany())? For example, if you do eight 128-point transforms at a time, that would introduce at least 21 ms of latency for 48 kHz audio.
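A minimal sketch of such a batched plan, using the closely related cufftPlanMany() entry point (contiguous layout assumed, error checking omitted; the sizes match the example above, not anything the original poster actually runs):

```cuda
// Sketch: one plan covering eight 128-point C2C transforms, so a single
// exec call replaces eight separate launches.
cufftHandle plan;
int n[1] = { 128 };
cufftPlanMany(&plan, 1, n,
              NULL, 1, 128,     // input: contiguous, one transform per 128 elems
              NULL, 1, 128,     // output: same layout
              CUFFT_C2C, 8);    // batch of 8

cufftComplex *d_data;           // 8 * 128 interleaved complex samples
cudaMalloc(&d_data, 8 * 128 * sizeof(cufftComplex));
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place, all 8 at once
```

The batching amortizes the per-launch overhead, but as noted it buys throughput by adding latency: the eighth chunk must arrive before any of them is transformed.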

I probably agree with njuffa that the CPU would be better for audio. As much as I liked the hardware 3D positional audio and environment effects (EAX) on Creative Sound Blaster cards, it’s hard to justify the dedicated hardware when the processing would take only a few percent of the CPU.

I assume you really meant “sequentially” and not “in parallel”. So you have very small FFTs, perform them sequentially, and need low latency. Based on that, it seems that this processing task should be performed on the CPU, and that you should investigate using the GPU for other processing tasks that apparently also need to happen but which you did not specify.

It does not make sense to me to use a sub-optimal processing resource just because it is there. If, however, your CPU is already maxed out, trying to push some work to the GPU, even if the work isn’t the best fit for it, seems like a reasonable approach. As Uncle Joe pointed out, a small amount of audio processing should not tax the CPU much, even a relatively slow ARM processor.

Thanks all for suggestions.

@Uncle Joe:
I will definitely not use cuFFT, due to the general kernel launch overhead. I need to do a lot of processing between FFTs, so PlanMany is not an option.
Yes, it will be played back in real time, with audio feedback, in professional applications.
I want to process 128-sample chunks within one 128-sample period (or 64 if possible). I need to stay below 10 ms all included (in/process/out), and the processing will add intrinsic latency.

The CPU would be easier, but I am in fact evaluating whether the Tegra K1 could replace a Core i7 for this application. It cannot without using the GFLOPS of the GPU: the applications I would like to port use >100 GFLOPS on a Core i7.
Quickly said: if I cannot use the GPU for that, I won’t use the Tegra K1 (only 18.4 GFLOPS peak without the GPU).

I didn’t specify stream creation flags, but I don’t use stream 0.
I started an evaluation program without Jack audio, and it does not show this >1 s hang. So I need to investigate more before asking new questions and wasting your time (and this is not related to the primary question).

On the primary question, I would currently say the answer is:
The cuFFT library is not a good choice for very small FFTs, as the overhead is huge. I need to roll my own FFT directly in a kernel.
For small data sets, I should launch only one kernel and do everything in it, to avoid long synchronization times between CPU and GPU (even though the Tegra K1 shares the same physical memory).

“I didn’t specify stream creation flags, but I don’t use stream 0.”

Consider approaching this differently.

I am not sure about the peculiarities of the Jetson, but consider using an address space that is common to the GPU and the CPU (zero-copy memory). Have the GPU kernel run in an endless spin loop, so the GPU busy-waits for new data to be presented by the CPU.

Synchronize audio buffer contents using some kind of spinlock. Finding a safe and efficient buffer locking mechanism is the critical part here.

Kernel launch overhead is eliminated with this approach, though the GPU will consume power all the time. You also have to make sure the GPU’s watchdog timer does not interfere.
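The persistent-kernel idea above could be sketched like this (a hypothetical flag protocol with a single block of 128 threads, not a vetted implementation; the flags would live in zero-copy mapped memory, and the per-sample work is a placeholder for the real FFT->filter->iFFT chain):

```cuda
// Sketch: the CPU writes a chunk into 'buf', sets *newData = 1, then
// polls *done; a negative *newData asks the kernel to exit. 'volatile'
// forces re-reads of the mapped flags; a real implementation needs care
// with memory fences, buffer locking, and the watchdog timer.
__global__ void persistentFilter(volatile int *newData,
                                 volatile int *done,
                                 float *buf)
{
    __shared__ int cmd;
    while (true) {
        if (threadIdx.x == 0) {
            while (*newData == 0) ;        // busy-wait for the CPU
            cmd = *newData;
        }
        __syncthreads();                   // whole block sees the command
        if (cmd < 0) break;                // shutdown requested

        buf[threadIdx.x] *= 0.5f;          // placeholder per-sample work

        __syncthreads();                   // all samples processed
        if (threadIdx.x == 0) {
            *newData = 0;
            __threadfence_system();        // make results visible to the CPU
            *done = 1;
        }
    }
}
```

On the host, the buffers would come from cudaHostAlloc() with cudaHostAllocMapped (device pointers via cudaHostGetDevicePointer()), and the kernel is launched exactly once, e.g. `persistentFilter<<<1, 128>>>(...)`; after that, each chunk costs only the flag handshake, not a launch.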


Both cudaDeviceScheduleSpin and cudaDeviceScheduleYield lead to the same results: good most of the time, but sometimes longer.

Yes, that is the option I will try, now that I do the FFT directly in a kernel.
I just have to find a suitable mechanism.

I implemented a simple lock mechanism using unified memory.
Performance is stable and as expected (<50 µs for FFT/filter/iFFT).

I will have to ensure there are no limits on kernel size and things like that… but it looks promising now.
The drawback of this approach is that the code will be harder to write than with individual kernels, which could be launched with different thread counts.

I have to give up for the weekend.

Thanks all for your help and suggestions !