Bandwidth: a big problem when using cuFFT; the cudaMemcpy between device and host may be the bottleneck

Hardware: Tesla C1060
Software: Linux 2.6.18-53.el5

When I do FFT testing on the GPU, I find that the bandwidth between GPU memory and host memory becomes a big problem. For instance, using cuFFT as follows:

cudaMemcpy(d_data, h_data, data_SIZE*sizeof(Complex), cudaMemcpyHostToDevice);
cufftPlan1d(&plan, 65536, CUFFT_C2C, 128);
cufftExecC2C(plan, (cufftComplex *)d_data, (cufftComplex *)d_data, CUFFT_FORWARD);
cudaMemcpy(h_data, d_data, 65536*128*sizeof(Complex), cudaMemcpyDeviceToHost);

We use cudaMallocHost to allocate h_data, which speeds up the transfers, and gettimeofday for timing. In the end, the core computing time (plan & ExecC2C) is about 2.4 ms, while the cudaMemcpy calls take 52 ms.

On the other hand, if I change the plan to cufftPlan1d(&plan, 1024, CUFFT_C2C, 32*256), the core time is 0.075 ms and the total cudaMemcpy time is 27.6 ms. The difference between these two situations is the cudaMemcpyDeviceToHost time. That is a strange phenomenon I can't explain, while the cudaMemcpyHostToDevice time is always about 10 ms.

Doing the maths: 65536*128 / (54.4/1000) = 154,202,352 points per second. In this situation, cuFFT's Fmax is about 154 MHz. Comparing the GPU with some FPGA products, such as the Virtex-5 SXT-2, which claims a 200 MHz Fmax, the GPU gains no advantage over the FPGA. With fewer points, some FPGA products can reach 400 MHz.
So the key point is the bandwidth. I don't know what the bottleneck is: PCI-E, host memory, GPU memory or something else. This restriction makes the GPU look awkward. In many domains where an FFT is needed, real-time computing is the most important thing, so a very fast core time on the GPU is useless by itself. What can we do?

I never used CUFFT, but do you call a threadsynchronize before measuring the time? It might be that your timing is off.

Normally, on fast hardware (fast main memory, PCI-E v2) people are getting 5-6 GiB/s of host to device bandwidth.
65536*128 complex values is 64 MiB. That takes 10 to 13 milliseconds to transfer at those speeds.

If you can transfer data while the GPU is calculating, that would mean the only time needed is the transfer time. So it would take 20 - 26 milliseconds in total (322 MHz in your calculation above). So I think it all depends on what you want to do after this FFT with your data.
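As a rough sketch of that overlap idea (this assumes the batch splits into independent chunks, a cuFFT version that provides cufftSetStream, and an arbitrary chunk count of 4; it is not a tested implementation):

#include <cuda_runtime.h>
#include <cufft.h>

#define N_FFT     65536
#define BATCH     128
#define CHUNKS    4
#define PER_CHUNK (BATCH / CHUNKS)

int main(void)
{
    size_t chunkBytes = (size_t)N_FFT * PER_CHUNK * sizeof(cufftComplex);
    cufftComplex *h_data, *d_data;
    cudaMallocHost((void **)&h_data, chunkBytes * CHUNKS);   // pinned memory, required for async copies
    cudaMalloc((void **)&d_data, chunkBytes * CHUNKS);

    cudaStream_t stream[CHUNKS];
    cufftHandle  plan[CHUNKS];
    for (int i = 0; i < CHUNKS; ++i) {
        cudaStreamCreate(&stream[i]);
        cufftPlan1d(&plan[i], N_FFT, CUFFT_C2C, PER_CHUNK);
        cufftSetStream(plan[i], stream[i]);                  // run each chunk's FFT in its own stream
    }

    for (int i = 0; i < CHUNKS; ++i) {
        cufftComplex *h = h_data + (size_t)i * N_FFT * PER_CHUNK;
        cufftComplex *d = d_data + (size_t)i * N_FFT * PER_CHUNK;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
        cufftExecC2C(plan[i], d, d, CUFFT_FORWARD);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaThreadSynchronize();                                 // wait for all chunks to finish

    for (int i = 0; i < CHUNKS; ++i) {
        cufftDestroy(plan[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}

If I remember correctly the C1060 has a single copy engine, so a copy in one direction can overlap the FFT work but the two copy directions still serialise against each other; the transfers don't get faster, they just hide behind the compute.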

cuFFT is a well-wrapped library; there is nowhere inside it to add a sync.

Yes, the host-to-device cudaMemcpy takes about 11 milliseconds. But after the exec has finished, the device-to-host cudaMemcpy takes about 27-55 milliseconds. I can't tell why that happens.

Furthermore, cuFFT doesn't support double precision.

cudaMemcpy(d_data, h_data, data_SIZE*sizeof(Complex), cudaMemcpyHostToDevice);
cufftPlan1d(&plan, 65536, CUFFT_C2C, 128);
cufftExecC2C(plan, (cufftComplex *)d_data, (cufftComplex *)d_data, CUFFT_FORWARD);
cudaThreadSynchronize();
// time the exec here
cudaMemcpy(h_data, d_data, 65536*128*sizeof(Complex), cudaMemcpyDeviceToHost);
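Fleshed out a bit, it could look like this (only a sketch: the gettimeofday helper, the printf and the use of cufftComplex instead of your Complex type are mine):

#include <sys/time.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

#define N_FFT 65536
#define BATCH 128

static double ms_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    size_t bytes = (size_t)N_FFT * BATCH * sizeof(cufftComplex);
    cufftComplex *h_data, *d_data;
    cudaMallocHost((void **)&h_data, bytes);   // pinned host memory
    cudaMalloc((void **)&d_data, bytes);

    double t0 = ms_now();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    double t1 = ms_now();

    cufftHandle plan;
    cufftPlan1d(&plan, N_FFT, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaThreadSynchronize();                   // make sure the FFT is done before reading the timer
    double t2 = ms_now();

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    double t3 = ms_now();

    printf("H2D %.2f ms, plan+exec %.2f ms, D2H %.2f ms\n",
           t1 - t0, t2 - t1, t3 - t2);

    cufftDestroy(plan);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}

Without that cudaThreadSynchronize(), the kernel launches return immediately and the following cudaMemcpy silently waits for the FFT to finish, so the FFT time gets billed to the device-to-host copy.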

Are there a lot of FPGAs that do double-precision C2C FFT quickly? (I really have no clue.)

I think you are comparing apples and oranges a bit here.

If you have some processing that works on the FFT'd data and also runs on the GPU, only one of the memcopies hurts.
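Something like this, purely for illustration (the scaling kernel is a made-up stand-in for whatever the real post-processing would be):

#include <cufft.h>

__global__ void scale(cufftComplex *data, int n, float s)
{
    // made-up post-processing stage: normalise the unnormalised forward FFT in place on the device
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x *= s;
        data[i].y *= s;
    }
}

// usage, right after cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD):
//   int n = 65536 * 128;
//   scale<<<(n + 255) / 256, 256>>>(d_data, n, 1.0f / 65536.0f);
//   ...more device-side work on d_data...
//   cudaMemcpy(h_data, d_data, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);  // only when the result is finally needed

The data stays on the device between stages, so the expensive device-to-host copy is paid once at the end (or not at all if the next stage also lives on the GPU).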

DP support in CUFFT is coming. Not 100% sure of the timeframe beyond that it’s not coming in 2.1 final, but it is on the roadmap (I think it’s aimed for 2.2).