Concurrent cufftExecC2C: cufftExecC2C does not seem to be thread safe

I have a Linux system with dual Tesla M2090 cards. I've tried to utilise both cards for a
task that requires lots of 2D FFTs. Separate host threads are used, and the task is split into
two completely independent halves. cudaSetDevice is the first CUDA call in each thread so
that both cards are used. Sixteen streams are used within each host thread, and a
corresponding FFT plan (for cufftExecC2C) is associated with each stream. It runs but produces
incorrect results. With the same code, configured for one GPU, the results are correct (using
either card). If mutex locking is introduced to serialise the calls to cufftExecC2C, the
results are correct (but performance is obviously poor). This suggests to me that cufftExecC2C
is neither reentrant nor thread-safe. If that is really the case, it makes it impossible
to utilise multiple GPUs with cuFFT. Any suggestions?

Red Hat 6.0
CUDA 4.0
dev driver 270.41.19
libcufft 4.0.17
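
For reference, a minimal sketch of the setup described above: one worker thread per GPU, sixteen streams, and one plan per stream bound via cufftSetStream. Names such as workerThread, NSTREAMS, NX and NY are illustrative placeholders, not from the actual code.

```cpp
// Sketch of one host worker thread driving one GPU, per the description
// above. NSTREAMS, NX, NY, workerThread and the pointer arguments are
// illustrative placeholders.
#include <cufft.h>
#include <cuda_runtime.h>

void workerThread(int device, cufftComplex* d_data, int nBatches)
{
    const int NSTREAMS = 16, NX = 256, NY = 256;

    // First CUDA call in the thread selects the card this thread will use.
    cudaSetDevice(device);

    cudaStream_t streams[NSTREAMS];
    cufftHandle  plans[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cufftPlan2d(&plans[i], NX, NY, CUFFT_C2C);
        cufftSetStream(plans[i], streams[i]);  // one plan per stream
    }

    // Round-robin the independent 2D FFTs over the streams.
    for (int b = 0; b < nBatches; ++b) {
        cufftComplex* block = d_data + (size_t)b * NX * NY;
        cufftExecC2C(plans[b % NSTREAMS], block, block, CUFFT_FORWARD);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) {
        cufftDestroy(plans[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```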

Could you post a repro case?

The code where I noticed this issue is called from a MATLAB MEX function, which provides the input. I'll try to strip it down to a minimal example. That said, do you genuinely expect this function to be reentrant, or at the very least, thread-safe?

It should be thread-safe in CUDA 4.1.
Could you try to use 4.1?

The test case is attached. It expects two GPUs. A set of 2D FFT input blocks is set up (random data), then duplicated on the host. Each GPU processes half of the data and copies the result back over the input data on the host. The host compares the two copies and reports the maximum element-wise deviation. With the 'lock' command-line argument a mutex serializes the cufftExecC2C calls and the comparison returns (0,0). With the 'nolock' argument I get intermittent failures (comparison not zero). The failure rate is about 60% for me (over 100 runs on two Tesla M2090 cards). The compilation flags used are in a comment at the start of the code.
Haven't yet tried CUDA 4.1 - will do.
testcase.cpp (9.05 KB)
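
The 'lock' path amounts to something like the following sketch (pthreads assumed; the helper name lockedExecC2C is illustrative and not from the attached file).

```cpp
// Sketch of the 'lock' workaround: a process-wide mutex serializes every
// cufftExecC2C call across the host threads. Names are illustrative.
#include <pthread.h>
#include <cufft.h>

static pthread_mutex_t fftMutex = PTHREAD_MUTEX_INITIALIZER;

cufftResult lockedExecC2C(cufftHandle plan, cufftComplex* in,
                          cufftComplex* out, int direction)
{
    pthread_mutex_lock(&fftMutex);
    cufftResult r = cufftExecC2C(plan, in, out, direction);
    pthread_mutex_unlock(&fftMutex);
    return r;
}
```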

With CUDA 4.1 the result comparison is good, but there are intermittent failures of cufftExecC2C, which returns CUFFT_EXEC_FAILED.
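
For anyone reproducing this, a minimal wrapper to surface the intermittent failure might look like this (illustrative, not from the test case).

```cpp
// Illustrative wrapper: report the cufftResult of each execution so the
// intermittent CUFFT_EXEC_FAILED shows up immediately.
#include <cstdio>
#include <cufft.h>

cufftResult checkedExecC2C(cufftHandle plan, cufftComplex* in,
                           cufftComplex* out, int direction)
{
    cufftResult r = cufftExecC2C(plan, in, out, direction);
    if (r != CUFFT_SUCCESS)
        std::fprintf(stderr, "cufftExecC2C returned %d%s\n", (int)r,
                     r == CUFFT_EXEC_FAILED ? " (CUFFT_EXEC_FAILED)" : "");
    return r;
}
```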

It will be fixed in the final 4.1 release.

Great - thanks.
