I’m working on some Xeon machines running Linux, each with a Tesla C1060. I’m trying to reproduce the performance shown on some PowerPoint slides on the NVIDIA site, which claim 150+ GFLOPS for a 256-point single-precision C2C FFT. I’m only seeing about 30 GFLOPS.

I’m timing only the FFT itself, with a thread synchronize around the FFT and timer calls. I also double-checked the timer by calling both the CUDA timer and the Linux getTime; both give me the same elapsed time. I tried both the 1D and 2D FFTs, with the 2D giving slightly better performance. I’m calculating flops as 5n log2(n) per 1D transform (times 2n for the 2D FFT, using 256x256).

The 30 GFLOPS figure is very close to what I would expect from the Intel processor with the MKL libraries, which makes me suspicious. deviceQuery seems to indicate that the Tesla board is working. Is it possible that the FFT is really being done on the Xeon? Or is there something else I may be doing wrong?
Results of deviceQuery:
Device 0: “Tesla C1060”
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
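For reference, this is roughly how I’m timing things. A minimal sketch (not my exact code) using CUDA events around cufftExecC2C, assuming the CUDA 2.3-era runtime and cuFFT API; the warm-up run and the NRUNS loop count are my own choices for this sketch:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <math.h>

#define NX 256
#define NY 256
#define NRUNS 100  /* average over many transforms to amortize launch overhead */

int main(void)
{
    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, sizeof(cufftComplex) * NX * NY);

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

    /* warm-up run so plan setup and first-launch cost are excluded */
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < NRUNS; ++i)
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  /* block until the last FFT has finished */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    /* 2D C2C flop estimate: 5*N*log2(N) per 1D transform, 2N transforms */
    double flops  = 5.0 * NX * log2((double)NX) * 2.0 * NX;
    double gflops = flops * NRUNS / (ms * 1e-3) / 1e9;
    printf("%.2f ms total, %.1f GFLOPS\n", ms, gflops);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}
```

Since cufftExecC2C launches asynchronously, the cudaEventSynchronize before reading the elapsed time should rule out timing only the launch rather than the transform itself.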