cuFFT cudaFuncSetCacheConfig

Hello,

I am working on running batched FFTs on two separate Tesla devices (2x C2070s).

I have the code structured so that there is one CPU thread per GPU device, and each thread reuses a cuFFT plan that executes a batch of 4 FFT computations. Each FFT is 1392 x 1040 and double-precision complex.
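Roughly, the plan setup looks like this (a simplified sketch, not my exact code; the row/column order in n[] is an assumption, swap it if your layout differs):

```cpp
#include <cufft.h>

cufftHandle makeBatchedPlan(void)
{
    cufftHandle plan;
    int n[2] = { 1040, 1392 };            // slowest-varying dimension first

    // NULL embeds => tightly packed input/output; the stride/dist
    // arguments are then ignored by cuFFT.
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, 1040 * 1392,
                  NULL, 1, 1040 * 1392,
                  CUFFT_Z2Z, 4);          // double-precision complex, batch of 4
    return plan;
}
```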

I have been analyzing the results in the Visual Profiler and noticed that cudaFuncSetCacheConfig is being called for each batch.

My questions are:

  1. How does cuFFT impact my other, non-cuFFT kernel calls?
  2. If one of my kernels uses a different cache preference, will cuFFT's cudaFuncSetCacheConfig call cause a device-wide synchronization that impacts all devices on the system?
  3. Where can I find out which cache configuration cuFFT uses, so that I can configure my kernels to use the same one?
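For context, my own kernels pin a cache preference along these lines (myKernel is just a placeholder name for illustration):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of my non-cuFFT kernels.
__global__ void myKernel(double *data)
{
    // ... application work ...
}

void setup(void)
{
    // Prefer a larger L1 for this kernel. If cuFFT's internal kernels
    // request a different L1/shared split, the runtime may reconfigure
    // the cache between launches, which may be why cudaFuncSetCacheConfig
    // shows up in the profiler for every batch.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}
```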

I noticed in the profiler that the two devices never run in parallel when executing FFTs. If you would like, I can post my profiler output.

On the other hand, I also have a separate (non-cuFFT) CUDA kernel, and its launches are able to run at the same time as kernels on the second GPU.

“I noticed in the profiler that the two devices never run in parallel when executing FFTs. […]
On the other hand, I also have a separate (non-cuFFT) CUDA kernel, and its launches are able to run at the same time as kernels on the second GPU.”

Checked with the cuFFT team, and they would be interested in looking into this. If you have self-contained repro code that you could share, it would be best to file a bug via the registered developer website. Thanks! To file bug reports:

Go to https://developer.nvidia.com/cuda-toolkit

Scroll down to:

"Members of the CUDA Registered Developer Program can report issues and file bugs"

and click "Login or Join Today".

A few updates regarding this matter.

I decided to make a simple test case to analyze the behavior of cuFFT.

Dataset:
16-bit grayscale images organized in an 8x8 grid.
Each image is 1392 x 1040.
FFTs are double-precision complex.

Here is the procedure:

  0. Decompose the grid into two equal parts (each part handles 4x8 images)
  1. Create 2 threads (each thread bound to a separate Tesla C2070)
  2. Allocate memory with cudaMalloc for 4x8 images
  3. Read all images from disk and barrier
  4. Copy the images from CPU memory to GPU memory with cudaMemcpy, then cudaDeviceSynchronize and barrier (yes… a bit overkill)
  5. For each tile, execute a forward FFT
  6. Run the experiments… (a per-thread skeleton is sketched below)
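Here is roughly what each thread does, as a simplified sketch (names are illustrative, not my exact code; the conversion of the 16-bit images to complex doubles is elided):

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

#define IMG_W   1392
#define IMG_H   1040
#define IMAGES  32                       // this thread's 4x8 half of the 8x8 grid
#define BATCH   4

void threadBody(int device, const cufftDoubleComplex *hostImages)
{
    cudaSetDevice(device);               // step 1: bind this thread to its GPU

    size_t bytes = (size_t)IMAGES * IMG_W * IMG_H * sizeof(cufftDoubleComplex);
    cufftDoubleComplex *d_data;
    cudaMalloc((void **)&d_data, bytes); // step 2: room for this thread's tiles

    // step 4: host -> device copy, then synchronize (the barrier is elided here)
    cudaMemcpy(d_data, hostImages, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    // One reusable batched plan per thread/device.
    cufftHandle plan;
    int n[2] = { IMG_H, IMG_W };
    cufftPlanMany(&plan, 2, n, NULL, 1, IMG_W * IMG_H,
                  NULL, 1, IMG_W * IMG_H, CUFFT_Z2Z, BATCH);

    // step 5: in-place forward FFT over each batch of 4 tiles
    for (int i = 0; i < IMAGES; i += BATCH) {
        cufftDoubleComplex *tile = d_data + (size_t)i * IMG_W * IMG_H;
        cufftExecZ2Z(plan, tile, tile, CUFFT_FORWARD);
    }

    cufftDestroy(plan);
    cudaFree(d_data);
}
```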

I ran a series of tests in the Visual Profiler to see what happens…

Here are the experiments I ran for step 6:
a) No synchronization – the profiler shows both GPUs executing all FFTs in parallel.
b) cudaDeviceSynchronize() – this kills all parallelism (one GPU executes all of its FFTs while the other waits; once the first GPU completes, the second begins).
c) A separate stream per cuFFT plan, with cudaStreamSynchronize(stream) after each execution – this yielded the same results as b); a sketch is below.
d) cudaMemcpy of the entire 4x8 grid from device to host – both GPUs executed all FFTs and then copied all data back to the CPU.
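Experiment c) looked roughly like this (a sketch reusing the names from the skeleton above, not my exact code):

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

// IMG_W, IMG_H, IMAGES, and BATCH are as defined in the skeleton above.
void runWithStreamSync(cufftHandle plan, cufftDoubleComplex *d_data)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cufftSetStream(plan, stream);        // route the plan's kernels into this stream

    for (int i = 0; i < IMAGES; i += BATCH) {
        cufftDoubleComplex *tile = d_data + (size_t)i * IMG_W * IMG_H;
        cufftExecZ2Z(plan, tile, tile, CUFFT_FORWARD);
        cudaStreamSynchronize(stream);   // this per-batch sync serialized the two GPUs
    }

    cudaStreamDestroy(stream);
}
```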

I am currently using CUDA 4.2 (I plan to upgrade to 5.0 later today).

Once I upgrade to 5.0 I will run the experiments again.
Also, I will post the issue on the registered developer site.

edit:
Unfortunately, due to the size of the data, I cannot package everything up and send it your way. If you would like the code without the data, I can send that, and you can modify it on your end to handle your own data.

After updating to CUDA 5.0, all my problems have gone away!

The profiler now shows the FFTs being computed in parallel, with or without synchronization.

/close topic