Bandwidth: a big problem when using cuFFT; the cudaMemcpy between device and host may be the bottleneck

Hardware: Tesla C1060
Software: Linux 2.6.18-53.el5

When I do FFT testing on the GPU, I find that the bandwidth between GPU memory and host memory becomes a big problem. For instance, using cuFFT as follows:

cudaMemcpy(d_data, h_data, data_SIZE*sizeof(Complex), cudaMemcpyHostToDevice);
cufftPlan1d(&plan, 65536, CUFFT_C2C, 128);
cufftExecC2C(plan, (cufftComplex *)d_data, (cufftComplex *)d_data, CUFFT_FORWARD);
cudaMemcpy(h_data, d_data, 65536*128*sizeof(Complex), cudaMemcpyDeviceToHost);

We use cudaMallocHost to allocate h_data, which speeds up the transfers, and gettimeofday for timing. In the end, the core computing time (plan & ExecC2C) is about 2.4 ms, while the cudaMemcpy calls take 52 ms.

On the other hand, if I change the plan to cufftPlan1d(&plan, 1024, CUFFT_C2C, 32*256), the core time is 0.075 ms and the total cudaMemcpy time is 27.6 ms. The difference between these two situations is the cudaMemcpyDeviceToHost time. That is a strange phenomenon I can't explain, while the cudaMemcpyHostToDevice time is always about 10 ms.

Doing the maths: 65536*128 / (54.4/1000) = 154,202,352 points per second. In this situation, cuFFT's Fmax is about 154 MHz. Comparing the GPU with some FPGA products, such as the Virtex-5 SXT-2, which claims a 200 MHz Fmax, the GPU gains no advantage over the FPGA. With fewer points, some FPGA products can reach 400 MHz.
So the key point is the bandwidth. I don't know what the bottleneck is: PCI-E, host memory, GPU memory or something else. This restriction makes the GPU look awkward. In many domains where an FFT is needed, real-time computing is the most important thing, so a very fast core time on the GPU is useless by itself. What can we do?

I never used CUFFT, but do you call a threadsynchronize before measuring the time? It might be that your timing is off.

Normally, on fast hardware (fast main memory, PCI-E v2) people are getting 5-6 GiB/s of host to device bandwidth.
65536*128 complex values is 64 MiB. That takes 10 to 13 milliseconds to transfer at those speeds.

If you can transfer data while the GPU is calculating, that would mean the only time needed is the transfer time. So it would take 20 - 26 milliseconds in total (322 MHz in your calculation above). So I think it all depends on what you want to do after this FFT with your data.
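As a rough sketch of that overlap idea (this assumes the batch splits into independent chunks, a cuFFT version that provides cufftSetStream, and an arbitrary chunk count of 4; it is not a tested implementation):

#include <cuda_runtime.h>
#include <cufft.h>

#define N_FFT     65536
#define BATCH     128
#define CHUNKS    4
#define PER_CHUNK (BATCH / CHUNKS)

int main(void)
{
    size_t chunkBytes = (size_t)N_FFT * PER_CHUNK * sizeof(cufftComplex);
    cufftComplex *h_data, *d_data;
    cudaMallocHost((void **)&h_data, chunkBytes * CHUNKS);   // pinned memory, required for async copies
    cudaMalloc((void **)&d_data, chunkBytes * CHUNKS);

    cudaStream_t stream[CHUNKS];
    cufftHandle  plan[CHUNKS];
    for (int i = 0; i < CHUNKS; ++i) {
        cudaStreamCreate(&stream[i]);
        cufftPlan1d(&plan[i], N_FFT, CUFFT_C2C, PER_CHUNK);
        cufftSetStream(plan[i], stream[i]);                  // run each chunk's FFT in its own stream
    }

    for (int i = 0; i < CHUNKS; ++i) {
        cufftComplex *h = h_data + (size_t)i * N_FFT * PER_CHUNK;
        cufftComplex *d = d_data + (size_t)i * N_FFT * PER_CHUNK;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
        cufftExecC2C(plan[i], d, d, CUFFT_FORWARD);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaThreadSynchronize();                                 // wait for all chunks to finish

    for (int i = 0; i < CHUNKS; ++i) {
        cufftDestroy(plan[i]);
        cudaStreamDestroy(stream[i]);
    }
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}

If I remember correctly the C1060 has a single copy engine, so a copy in one direction can overlap the FFT work but the two copy directions still serialise against each other; the transfers don't get faster, they just hide behind the compute.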

cuFFT is a well-wrapped library; there is nowhere inside it to add a sync.

Yes, the host-to-device cudaMemcpy takes about 11 milliseconds. But after the exec has finished, the device-to-host cudaMemcpy takes about 27-55 milliseconds. I can't tell why that happens.

Furthermore, cuFFT doesn't support double precision.

cudaMemcpy(d_data, h_data, data_SIZE*sizeof(Complex), cudaMemcpyHostToDevice);
cufftPlan1d(&plan, 65536, CUFFT_C2C, 128);
cufftExecC2C(plan, (cufftComplex *)d_data, (cufftComplex *)d_data, CUFFT_FORWARD);
cudaThreadSynchronize();
// time the exec here
cudaMemcpy(h_data, d_data, 65536*128*sizeof(Complex), cudaMemcpyDeviceToHost);
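Fleshed out a bit, it could look like this (only a sketch: the gettimeofday helper, the printf and the use of cufftComplex instead of your Complex type are mine):

#include <sys/time.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

#define N_FFT 65536
#define BATCH 128

static double ms_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    size_t bytes = (size_t)N_FFT * BATCH * sizeof(cufftComplex);
    cufftComplex *h_data, *d_data;
    cudaMallocHost((void **)&h_data, bytes);   // pinned host memory
    cudaMalloc((void **)&d_data, bytes);

    double t0 = ms_now();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    double t1 = ms_now();

    cufftHandle plan;
    cufftPlan1d(&plan, N_FFT, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaThreadSynchronize();                   // make sure the FFT is done before reading the timer
    double t2 = ms_now();

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    double t3 = ms_now();

    printf("H2D %.2f ms, plan+exec %.2f ms, D2H %.2f ms\n",
           t1 - t0, t2 - t1, t3 - t2);

    cufftDestroy(plan);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}

Without that cudaThreadSynchronize(), the kernel launches return immediately and the following cudaMemcpy silently waits for the FFT to finish, so the FFT time gets billed to the device-to-host copy.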

Are there a lot of FPGAs that do double-precision C2C FFT quickly? (I really have no clue.)

I think you are comparing apples and oranges a bit here.

If you have some processing that works on the FFT'd data and also runs on the GPU, only one of the memcopies hurts.
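Something like this, purely for illustration (the scaling kernel is a made-up stand-in for whatever the real post-processing would be):

#include <cufft.h>

__global__ void scale(cufftComplex *data, int n, float s)
{
    // made-up post-processing stage: normalise the unnormalised forward FFT in place on the device
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i].x *= s;
        data[i].y *= s;
    }
}

// usage, right after cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD):
//   int n = 65536 * 128;
//   scale<<<(n + 255) / 256, 256>>>(d_data, n, 1.0f / 65536.0f);
//   ...more device-side work on d_data...
//   cudaMemcpy(h_data, d_data, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);  // only when the result is finally needed

The data stays on the device between stages, so the expensive device-to-host copy is paid once at the end (or not at all if the next stage also lives on the GPU).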

DP support in CUFFT is coming. Not 100% sure of the timeframe beyond that it’s not coming in 2.1 final, but it is on the roadmap (I think it’s aimed for 2.2).