Concurrent cufftExecC2C: cufftExecC2C does not seem to be thread safe

I have a Linux system with dual Tesla M2090 cards. I've tried to utilise both cards for a
task that requires lots of 2D FFTs. Separate host threads are used, and the task is split into
two completely independent halves. cudaSetDevice is the first CUDA call in each thread so
that both cards are used. Sixteen streams are used within each host thread, and a
corresponding FFT plan (for cufftExecC2C) is associated with each stream. It runs but produces
incorrect results. With the same code, configured for one GPU, the results are correct (using
either card). If mutex locking is introduced to serialise the calls to cufftExecC2C, the
results are correct (but performance is obviously poor). This suggests to me that cufftExecC2C
is neither reentrant nor thread-safe. If that is really the case, it makes it impossible
to utilise multiple GPUs with cuFFT. Any suggestions?

Red Hat 6.0
CUDA 4.0
dev driver 270.41.19
libcufft 4.0.17
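
For reference, a minimal sketch of the setup described above: one worker thread per GPU, sixteen streams, and one plan per stream bound via cufftSetStream. Names such as workerThread, NSTREAMS, NX and NY are illustrative placeholders, not from the actual code.

```cpp
// Sketch of one host worker thread driving one GPU, per the description
// above. NSTREAMS, NX, NY, workerThread and the pointer arguments are
// illustrative placeholders.
#include <cufft.h>
#include <cuda_runtime.h>

void workerThread(int device, cufftComplex* d_data, int nBatches)
{
    const int NSTREAMS = 16, NX = 256, NY = 256;

    // First CUDA call in the thread selects the card this thread will use.
    cudaSetDevice(device);

    cudaStream_t streams[NSTREAMS];
    cufftHandle  plans[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cufftPlan2d(&plans[i], NX, NY, CUFFT_C2C);
        cufftSetStream(plans[i], streams[i]);  // one plan per stream
    }

    // Round-robin the independent 2D FFTs over the streams.
    for (int b = 0; b < nBatches; ++b) {
        cufftComplex* block = d_data + (size_t)b * NX * NY;
        cufftExecC2C(plans[b % NSTREAMS], block, block, CUFFT_FORWARD);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) {
        cufftDestroy(plans[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```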

Could you post a repro case?

The code where I noticed this issue is called from a MATLAB MEX function, which provides the input. I'll try to strip it down to a minimal example. That said, do you genuinely expect this function to be reentrant, or at the very least, thread-safe?

It should be thread-safe in CUDA 4.1.
Could you try to use 4.1?

The test case is attached. It expects two GPUs. A set of 2D FFT input blocks is set up (random data), then duplicated on the host. Each GPU processes half of the data and copies the result back over the input data on the host. The host compares the two copies and reports the maximum element-wise deviation. With the 'lock' command-line argument a mutex serializes the cufftExecC2C calls and the comparison returns (0,0). With the 'nolock' argument I get intermittent failures (comparison not zero). The failure rate is about 60% for me (over 100 runs on two Tesla M2090 cards). The compilation flags used are in a comment at the start of the code.
Haven't yet tried CUDA 4.1 - will do.
testcase.cpp (9.05 KB)
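
The 'lock' path amounts to something like the following sketch (pthreads assumed; the helper name lockedExecC2C is illustrative and not from the attached file).

```cpp
// Sketch of the 'lock' workaround: a process-wide mutex serializes every
// cufftExecC2C call across the host threads. Names are illustrative.
#include <pthread.h>
#include <cufft.h>

static pthread_mutex_t fftMutex = PTHREAD_MUTEX_INITIALIZER;

cufftResult lockedExecC2C(cufftHandle plan, cufftComplex* in,
                          cufftComplex* out, int direction)
{
    pthread_mutex_lock(&fftMutex);
    cufftResult r = cufftExecC2C(plan, in, out, direction);
    pthread_mutex_unlock(&fftMutex);
    return r;
}
```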

With CUDA 4.1 the result comparison is good, but there are intermittent failures of cufftExecC2C, which returns CUFFT_EXEC_FAILED.
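
For anyone reproducing this, a minimal wrapper to surface the intermittent failure might look like this (illustrative, not from the test case).

```cpp
// Illustrative wrapper: report the cufftResult of each execution so the
// intermittent CUFFT_EXEC_FAILED shows up immediately.
#include <cstdio>
#include <cufft.h>

cufftResult checkedExecC2C(cufftHandle plan, cufftComplex* in,
                           cufftComplex* out, int direction)
{
    cufftResult r = cufftExecC2C(plan, in, out, direction);
    if (r != CUFFT_SUCCESS)
        std::fprintf(stderr, "cufftExecC2C returned %d%s\n", (int)r,
                     r == CUFFT_EXEC_FAILED ? " (CUFFT_EXEC_FAILED)" : "");
    return r;
}
```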

It will be fixed in the final 4.1 release.

Great - thanks.
