I have a problem running two cufftExecC2C calls concurrently, even though I use a different FFT plan for each call. The two cufftExecC2C calls are made from two different Windows (XP) threads that run independently. It happens regularly that the two threads call cufftExecC2C at the same time (or perhaps within microseconds of each other).
Is there any way to synchronize these two cufftExecC2C calls so that one runs after the other? Or is there something wrong in my implementation?
Many thanks in advance.
QY, from Montana
One issue might be that the two host threads use two different CUDA contexts to launch the kernels. You may want to ensure they share a context, or launch both from the same host thread.
Many thanks for the help.
I have verified that there is no problem when the two cufftExecC2C calls are launched from the same host thread, since in that case they use the CUDA resources serially.
My application has to run the two host threads asynchronously: one for a time-critical computation and one for a low-priority computation.
Regarding your comment, “One issue might be that the two host threads use two different CUDA contexts to launch the kernels” — could you tell me how to ensure the two cufftExecC2C calls share the same context?
Perhaps you need to associate a different CUDA stream with each plan in the different host threads. Kernel launches on different streams then have a chance of running concurrently. However, there are still many implicit synchronization events that can prevent concurrent kernel execution. You may want to check the “Implicit Synchronization” section of the CUDA 4.0 C Programming Guide for more details.
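As a minimal sketch of the stream-per-plan idea: each host thread could create its own stream, attach it to its plan with cufftSetStream, and synchronize only on its own stream. Names here are illustrative, and error checking is omitted for brevity (real code should check every cufft/cuda return value).

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Hypothetical helper: run an in-place C2C FFT on d_data in a private stream.
void run_fft_in_stream(cufftComplex *d_data, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, stream);        // kernels launched by this plan go to 'stream'

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaStreamSynchronize(stream);       // wait only for this stream's work
    cufftDestroy(plan);
    cudaStreamDestroy(stream);
}
```

With each thread calling a function like this on its own data, the two FFTs are at least eligible for concurrent execution, subject to the implicit-synchronization caveats above.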
The CUDA SDK also includes sample code for concurrent kernels. You may want to study how the kernels are launched concurrently there.
Your CUDA device also needs to support concurrent kernel execution, i.e., it must be of compute capability 2.0 or higher.
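You can check both the compute capability and the concurrent-kernel support flag at runtime. A small sketch (assuming device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    printf("concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```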
Contexts may not be an issue, as recent CUDA versions have host threads share the same “primary” device context. You could verify that by obtaining the current CUDA context in each thread and printing the pointer to compare.
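The context check above could be done with the driver-API query cuCtxGetCurrent. A sketch, to be called once from each host thread (the tag parameter is just for labeling the output):

```cuda
#include <cstdio>
#include <cuda.h>            // driver API
#include <cuda_runtime.h>

void print_current_context(const char *tag)
{
    cudaFree(0);             // force runtime initialization in this thread
    CUcontext ctx;
    cuCtxGetCurrent(&ctx);   // driver-API query of this thread's current context
    printf("%s: context = %p\n", tag, (void *)ctx);
}
```

If both threads print the same pointer, they share a context and contexts can be ruled out as the cause.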
I really appreciate your helpful suggestions, and will let you know the progress.
Happy New Year!
You will not be able to run concurrent FFTs from multiple threads. Even from a single thread it is very unlikely, since you can’t control the launch configuration.
Is it unlikely or impossible?
I haven’t done it using CUFFT, but I have done it with my own FFT kernels, exactly for real-time use (which often implies small jobs / small amounts of data).
With CUFFT, the number of blocks might be smaller than the number of SMs, and using independent streams would then enable concurrent kernel execution. But depending on the CUFFT implementation it might be impossible.
It is useful that CUFFT provides a way to associate stream with a plan, enabling the possibility of concurrent kernel execution.
So, I am curious, is it impossible or unlikely, and why?
The main reason for using streams is to overlap I/O and compute.
Since you can’t control the execution configuration for CUFFT, you may get concurrent execution only in very particular cases (very small FFTs that do not use all the SMs).
So I would say that it is highly unlikely, borderline impossible, to achieve concurrent execution with CUFFT.
If you are using custom kernels, you control the execution configuration and have a better chance of getting two kernels resident and running at the same time.
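For illustration, a sketch of the custom-kernel case: two small launches on separate streams, where the grids are deliberately kept small so neither kernel occupies all the SMs. The kernel and launch sizes here are hypothetical; on a compute-capability 2.0+ device the two launches then have a chance to execute concurrently.

```cuda
#include <cuda_runtime.h>

// Trivial example kernel: scale an array in place.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

void launch_two(float *d_a, float *d_b, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids (4 blocks each) leave SMs free for the other kernel.
    scale<<<4, 256, 0, s1>>>(d_a, 2.0f, n);
    scale<<<4, 256, 0, s2>>>(d_b, 0.5f, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```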