Great, but I have another problem: the performance of cuFFT on sizes that are not a power of 2.
I tested a 3D real FFT using two methods:
method 1: the Fortran 77 package by Roland A. Sweet and Linda L. Lindgren.
I converted it to C++ code with f2c and built it with the Intel C++ compiler 11.1.035 (CUDA 2.3)
method 2: cufftExecZ2Z or cufftExecC2C
platform: Q6600, 16GB RAM, GTX295, VC2005, icpc 11.1.035
Procedure of the experiment:
step 1: generate random real data
step 2: copy the data into a cufftDoubleComplex / cufftComplex array
step 3: transfer the data from host to device
step 4: call cufftExecZ2Z / cufftExecC2C
step 5: transfer the data from device to host
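Steps 1 and 2 can be sketched on the host roughly as follows. This is only an illustration, not the code I actually ran: std::complex<double> stands in for cufftDoubleComplex (to my understanding both are a pair of doubles, real part first), and the function names and the seed parameter are made up for the example.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

// Step 1: fill an nx*ny*nz volume with uniform random reals in [0, 1).
// The seed parameter is hypothetical; it only makes runs reproducible.
std::vector<double> make_random_real(std::size_t nx, std::size_t ny,
                                     std::size_t nz, unsigned seed = 12345u) {
    assert(nx > 0 && ny > 0 && nz > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> data(nx * ny * nz);
    for (double& x : data) x = dist(gen);
    return data;
}

// Step 2: widen the real samples to complex ones with zero imaginary part.
// std::complex<double> is an assumed stand-in for cufftDoubleComplex.
std::vector<std::complex<double>> to_complex(const std::vector<double>& real) {
    std::vector<std::complex<double>> out(real.size());
    for (std::size_t i = 0; i < real.size(); ++i)
        out[i] = std::complex<double>(real[i], 0.0);
    return out;
}
```

The resulting buffer is what steps 3 to 5 would then copy to the device, transform, and copy back.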
Description of the table entries:
h2d: host-to-device data transfer
d2h: device-to-host data transfer
CPU FFTF: forward FFT on the CPU (sequential)
GPU FFTF: forward FFT on the GPU via cufftExecZ2Z or cufftExecC2C
time: rounded to ms (measured with QTime, a Qt class)
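Since some entries in the tables come out as 0 ms, a finer measurement might be obtained by averaging over several calls. The helper below is a sketch of that idea using std::chrono instead of QTime; the name mean_time_ms and the default repeat count are assumptions. For GPU work, the callable would also need to wait for the device to finish (for example with cudaThreadSynchronize() in CUDA 2.3) before returning, because kernel launches return to the host before the work is done.

```cpp
#include <cassert>
#include <chrono>

// Run f() `repeats` times and return the mean wall-clock time per call in ms.
// Hypothetical helper; for GPU calls, f must synchronize with the device.
template <typename F>
double mean_time_ms(F&& f, int repeats = 10) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) f();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / repeats;
}
```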
[codebox]table 1: results for "double precision"
------------+--------+---------+---------+---------+---------+
            |        |   CPU   |   GPU   |   GPU   |   GPU   |
N           |  size  |  FFTF   |   h2d   |  FFTF   |   d2h   |
------------+--------+---------+---------+---------+---------+
64,64,64    |    4MB |   47 ms |  141 ms |    0 ms |   15 ms |
------------+--------+---------+---------+---------+---------+
128,128,128 |   32MB |  359 ms |  172 ms |   16 ms |   15 ms |
------------+--------+---------+---------+---------+---------+
256,256,256 |  256MB | 2156 ms |  359 ms |   78 ms |  141 ms |
------------+--------+---------+---------+---------+---------+
255,255,255 |  253MB | 2735 ms |  359 ms | 1657 ms |  140 ms |
------------+--------+---------+---------+---------+---------+
300,300,300 |  412MB | 4250 ms |  500 ms | 2125 ms |  250 ms |
------------+--------+---------+---------+---------+---------+
310,310,310 |  455MB | 5953 ms |  515 ms | 4344 ms |  266 ms |
------------+--------+---------+---------+---------+---------+
340,340,340 |  600MB | 7265 ms |  656 ms | 2813 ms |  328 ms |
------------+--------+---------+---------+---------+---------+
table 2: results for "single precision"
------------+--------+---------+---------+---------+---------+
            |        |   CPU   |   GPU   |   GPU   |   GPU   |
N           |  size  |  FFTF   |   h2d   |  FFTF   |   d2h   |
------------+--------+---------+---------+---------+---------+
64,64,64    |    2MB |   47 ms |  141 ms |    0 ms |    0 ms |
------------+--------+---------+---------+---------+---------+
128,128,128 |   16MB |  297 ms |  141 ms |    0 ms |   16 ms |
------------+--------+---------+---------+---------+---------+
256,256,256 |  128MB | 3000 ms |  235 ms |   15 ms |   78 ms |
------------+--------+---------+---------+---------+---------+
255,255,255 |  127MB | 2843 ms |  235 ms |  141 ms |   78 ms |
------------+--------+---------+---------+---------+---------+
300,300,300 |  206MB | 3687 ms |  313 ms |  156 ms |  125 ms |
------------+--------+---------+---------+---------+---------+
310,310,310 |  227MB | 5094 ms |  328 ms |  328 ms |  125 ms |
------------+--------+---------+---------+---------+---------+
340,340,340 |  300MB | 6172 ms |  390 ms |  375 ms |  172 ms |
------------+--------+---------+---------+---------+---------+[/codebox]
d2h reaches the maximum bandwidth (1.7 GB/s on my machine), so I focus on
the timing of the FFT itself (CPU FFTF and GPU FFTF).
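For reference, the "size" column and the bandwidth figure can be reproduced with a little arithmetic. The sketch below assumes 1 MB = 2^20 bytes, 16 bytes per cufftDoubleComplex element and 8 bytes per cufftComplex element; the function names are only illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Size in MB (1 MB = 2^20 bytes) of an n*n*n complex volume.
// bytes_per_element: 16 for cufftDoubleComplex, 8 for cufftComplex.
double volume_mb(std::size_t n, std::size_t bytes_per_element) {
    assert(n > 0 && bytes_per_element > 0);
    const double bytes = static_cast<double>(n) * n * n * bytes_per_element;
    return bytes / (1024.0 * 1024.0);
}

// Same size, rounded to the nearest MB as in the tables.
long volume_mb_rounded(std::size_t n, std::size_t bytes_per_element) {
    return std::lround(volume_mb(n, bytes_per_element));
}

// Transfer rate in GB/s (1 GB = 1024 MB) for `mb` megabytes moved in `ms`.
double bandwidth_gb_per_s(double mb, double ms) {
    assert(ms > 0.0);
    return (mb / 1024.0) / (ms / 1000.0);
}
```

For the 256,256,256 double-precision row, 256 MB moved in 141 ms works out to roughly 1.77 GB/s, which is in line with the 1.7 GB/s quoted above.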
It is clear that when N is a power of 2, cuFFT is about 20 times
faster than the CPU version, even in double precision.
However, if N is not a power of 2, performance drops dramatically,
and in double precision it becomes comparable to the CPU version.
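To put numbers on this, the ratio CPU FFTF / GPU FFTF can be computed per row. The sketch below uses the values from table 1 (double precision); the struct and function names are made up for the example.

```cpp
#include <cassert>
#include <vector>

// One row of table 1 (double precision): cubic size n and FFT times in ms.
struct Row {
    int n;
    double cpu_fft_ms;
    double gpu_fft_ms;
};

// Rows copied from table 1; the 64,64,64 row is left out because its
// GPU time was measured as 0 ms.
std::vector<Row> table1_rows() {
    return {
        {128, 359.0, 16.0},    {256, 2156.0, 78.0},   {255, 2735.0, 1657.0},
        {300, 4250.0, 2125.0}, {310, 5953.0, 4344.0}, {340, 7265.0, 2813.0},
    };
}

// Speedup of the GPU FFT over the CPU FFT, with transfers excluded.
double speedup(const Row& r) {
    assert(r.gpu_fft_ms > 0.0);
    return r.cpu_fft_ms / r.gpu_fft_ms;
}
```

With these values the power-of-2 rows give a speedup of about 22 (N = 128) and 28 (N = 256), while the other rows fall between about 1.4 (N = 310) and 2.6 (N = 340).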
This suggests that for N = (255,255,255), a CPU FFT with OpenMP would be better than cuFFT.
Is this normal?
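One thing that might help narrow this down is how each tested N factors into primes, since FFT implementations generally handle small prime factors more efficiently than large ones. Whether that explains the cuFFT timings above is an assumption, not something I measured. A small sketch with made-up function names:

```cpp
#include <cassert>
#include <vector>

// Prime factors of n in ascending order, with multiplicity (trial division).
std::vector<int> prime_factors(int n) {
    assert(n >= 2);
    std::vector<int> factors;
    for (int p = 2; p * p <= n; ++p) {
        while (n % p == 0) {
            factors.push_back(p);
            n /= p;
        }
    }
    if (n > 1) factors.push_back(n);  // what remains is itself prime
    return factors;
}

// Largest prime factor of n; 2 exactly when n is a power of 2.
int largest_prime_factor(int n) {
    return prime_factors(n).back();
}
```

For the sizes tested here this gives 255 = 3·5·17, 300 = 2·2·3·5·5, 310 = 2·5·31 and 340 = 2·2·5·17.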