HardWare: Tesla C1060
software: Linux 2.6.18-53.el5
When I do FFT testing on GPU, I find that bandwidth between GPU memory and Host memory become a big problem. for instance, use cuFFT as following:
cudaMemcpy(d_data, h_data, data_SIZE*sizeof(Complex),cudaMemcpyHostToDevice);
cufftPlan1d(&plan, 65536, CUFFT_C2C, 128);
cufftExecC2C(plan, (cufftComplex *)d_data, (cufftComplex *)d_date, CUFFT_FORWARD)
cudaMemcpy(h_data, d_data,65536*128*sizeof(Complex),cudaMemcpyDeviceToHost);
We use cudaMallocHost to malloc h_data, which can speed up the transmit speed, use gettimeofday for timing. At last, The core computing time (plan & EcecC2C) is about 2.4ms, while cudaMemcpy use 52ms.
Otherwise, if I change cufftPlan1d, use cufftPlan1d(&plan, 1024, CUFFT_C2C, 32*256), the core time is 0.075ms, and the cudaMemcpy time total is 27.6ms. difference between these two siutation is cudaMemcpyDeviceToHost time. It’s also a strange phenomena that I can’t explain, while the cudaMemcpyHostToDevice time is always about 10ms.
Make a calculation, 65536*128/(54.4/1000) = 154,202,352 points per second. In this situation, cuFFT’s Fmax is 154MHz. compare GPU with some FPGA production, such as Virtex-5 SXT-2, claimed to be a 200Mhz Fmax, GPU gain no advantage over FPGA. When use less point, some FPGA production can achieve 400Mhz.
So, the key point is the bandwidth, I don’t know which can be the bottleneck, PCI-E, Host memory, GPU memory or some others. This restrict make GPU so awkward. In many domains where FFT is needed, real time computing is the most inportant. The core time with GPU is very fast but useless. What can we do ???