cuFFT and low occupancy on Kepler

Hi,

I have a single-precision application that uses cuFFT, and I have noticed that performance degraded when the application was moved from a Fermi-based GTX 560 to a Kepler-based Quadro K4000.
The application processes batches of 256 FFTs of 8K (8192) points each.
I profiled the FFT with the NVIDIA Visual Profiler, which reported that the kernel spVector8192D::kernelTex achieves only 25% occupancy because it uses 36 KB of shared memory per block (the K4000 offers 48 KB per SM, so only a single block can be resident on each SM).

Is there any way to influence how cuFFT chooses its launch configuration, or is there a version of cuFFT that is optimized for Kepler (I am using CUDA 6.0)?
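
For reference, a minimal sketch of how a batched plan of this shape is created (assuming an in-place complex-to-complex transform; the names NX/BATCH and the reduced error handling are illustrative only, not my exact code):

#include <cstdio>
#include <cufft.h>
#include <cuda_runtime.h>

#define NX    8192   // 8K points per FFT
#define BATCH 256    // FFTs per batch

int main(void)
{
    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * NX * BATCH);

    // One plan covering the whole batch; cuFFT selects the kernels
    // (e.g. spVector8192D::kernelTex) and the launch configuration itself.
    cufftHandle plan;
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        return 1;
    }

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFTs
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}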

How big is the performance difference? The occupancy does not look alarmingly low; it should be enough to cover the relevant latencies. In general, FFTs are memory bound, so you may want to compare the memory bandwidth of the two GPUs. I found 134 GB/sec listed for the Quadro K4000:

http://www.nvidia.com/content/PDF/data-sheet/DS_NV_Quadro_K4000_OCT13_NV_US_LR.pdf

There seem to be multiple versions of the GTX 560, at least some of which provide higher memory bandwidth than the Quadro K4000:

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-560ti/specifications

Since the GTX 560 and GTX 560 Ti are consumer cards, they can be vendor-configured for higher memory clocks than NVIDIA's reference designs. What is the memory clock reported for yours?
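
You can also compute the theoretical peak directly from the device properties: memory clock × 2 (double data rate) × bus width in bytes. A minimal sketch (the figures in the comment assume the K4000's 192-bit bus and reference memory clock):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is in kHz, memoryBusWidth in bits.
    // Factor 2 because GDDR transfers twice per clock.
    double peakGBs = 2.0 * (prop.memoryClockRate * 1e3)
                         * (prop.memoryBusWidth / 8.0) / 1e9;

    // e.g. a 192-bit bus at 2808 MHz: 2 * 2.808e9 * 24 B = ~134.8 GB/s,
    // matching the 134 GB/s figure in the K4000 datasheet.
    printf("%s: theoretical peak memory bandwidth %.1f GB/s\n",
           prop.name, peakGBs);
    return 0;
}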

Hi,

You should read this (Volkov's GTC 2010 talk on why low occupancy does not necessarily imply low performance):

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

For this specific cuFFT workload (a batch of 256 FFTs of 8K points each), the GTX 560 needs 570 µs and the K4000 needs 1360 µs, i.e. the GTX 560 is more than twice as fast.
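
Timings of this kind can be taken with CUDA events around the transform, along the following lines (a minimal sketch, reusing the plan and data buffer from the sketch in my first post; not necessarily the exact harness I used):

#include <cufft.h>
#include <cuda_runtime.h>

// Time one execution of the batched plan with CUDA events.
// 'plan' and 'data' are assumed to be set up as sketched above.
float timeFFT(cufftHandle plan, cufftComplex *data)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // warm-up run

    cudaEventRecord(start);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f;  // return microseconds
}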

The K4000 has a lower memory bandwidth, but not to this extent.

For the K4000, bandwidthTest reports:

Running on…

Device 0: Quadro K4000
Quick Mode


Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 87440.3

For the GTX 560 it reports:

Running on…

Device 1: GeForce GTX 560 Ti
Quick Mode

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 101306.9

Here is the deviceQuery output:

Detected 1 CUDA Capable device(s)

Device 0: “Quadro K4000”
CUDA Driver Version / Runtime Version 6.0 / 6.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3071 MBytes (3220504576 bytes)
( 4) Multiprocessors, (192) CUDA Cores/MP: 768 CUDA Cores
GPU Clock rate: 811 MHz (0.81 GHz)
Memory Clock rate: 2808 MHz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 66 / 0

Device 1: “GeForce GTX 560 Ti”
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 2048 MBytes (2147024896 bytes)
( 8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1800 MHz (1.80 GHz)
Memory Clock rate: 2004 MHz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0

Given that the measured memory throughput of the GTX 560 is only about 15% higher than that of the K4000 (101.3 vs. 87.4 GB/s in bandwidthTest) while the FFT executes more than twice as fast on the GTX 560, and given that you are using CUDA 6.0 (which should be fully optimized for Kepler), I would suggest filing a bug using the bug reporting form linked from the registered developer website.

I filed the bug report as you suggested.

@testi2: Thank you for the Volkov paper reference.