32-bit and 64-bit device code performance

Hello,

I have written a memory-intensive kernel involving a 2D stencil computation, and I have noticed a significant performance degradation when it is built as a 64-bit binary compared to a 32-bit binary. Kernel performance drops by ~20% in the 64-bit binary. Timing is measured using CUDA events. In addition, I noticed that the 64-bit kernel uses 31 registers whereas the 32-bit kernel requires just 27. However, that doesn't seem to be enough to cause an occupancy drop, as my configuration uses 128 threads per block (the occupancy calculator spreadsheet gives 100% occupancy for both register counts).

I ran the experiments on a system with a GTX 660 GPU. The performance degradation was evident on both 64-bit Ubuntu Linux and 64-bit Windows 7.

Thank you.

With few exceptions, 64-bit integer operations on the GPU are emulated through multiple native instructions. In CUDA, the address size used by device code matches the address size of the host. Moving CUDA code to a 64-bit platform makes all addresses (and therefore pointers) 64-bit quantities, and in this way introduces 64-bit arithmetic for address computations.
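To make "emulated through multiple native instructions" concrete, here is a minimal host-side C++ sketch of how a 64-bit add can be built from 32-bit operations. The names `U64Pair`, `split`, `join`, `add64` and `check_add64` are made up for illustration, and this is not meant to show the exact instruction sequence the CUDA compiler emits, only the general idea: one 64-bit value occupies two 32-bit registers, and one 64-bit add turns into an add, a carry, and another add.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value held as two 32-bit halves, mirroring how a 64-bit
// quantity occupies a pair of 32-bit registers. (Hypothetical type.)
struct U64Pair {
    uint32_t lo;
    uint32_t hi;
};

// Split a native 64-bit value into its low and high 32-bit halves.
U64Pair split(uint64_t v) {
    return { static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) };
}

// Reassemble the two halves into a native 64-bit value.
uint64_t join(U64Pair p) {
    return (static_cast<uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add built only from 32-bit operations: add the low halves,
// detect the carry out of the low half, then add the high halves
// together with that carry.
U64Pair add64(U64Pair a, U64Pair b) {
    U64Pair r;
    r.lo = a.lo + b.lo;                        // wraps modulo 2^32
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // wrapped => carry out
    r.hi = a.hi + b.hi + carry;
    return r;
}

// Small self-check: the emulated add agrees with the native 64-bit add,
// including a case where the low half carries into the high half.
void check_add64() {
    assert(join(add64(split(0xFFFFFFFFull), split(1ull))) == 0x100000000ull);
    assert(join(add64(split(0x123456789ABCDEF0ull), split(0x0FEDCBA987654321ull)))
           == 0x123456789ABCDEF0ull + 0x0FEDCBA987654321ull);
}
```

Every pointer offset that has to be computed this way costs extra instructions and extra registers compared to a single 32-bit add, which is where the overhead in address computations comes from.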

The compiler tries to avoid full 64-bit computation where possible, but in general you will observe an increase in register usage (as each register can hold only 32 bits of data) and an increase in dynamic instruction count. The increase in register usage can lead to decreased occupancy, which in turn can lead to lower performance. In your case that does not seem to happen (though you may want to double-check with the profiler); instead, the increase in the number of instructions to be executed leads to the slowdown.

Not knowing anything about the particulars of the code, I can only say that in my experience a 20% drop is within the range seen with transitions from 32-bit to 64-bit platforms. You may want to dig in with the profiler to see where slight modifications to the code could restore some of the lost performance. A 2D stencil operation seems to imply that the code could be limited by memory bandwidth rather than by instruction throughput, which in turn would suggest a smaller performance drop from moving to a 64-bit platform. On a modern GPU like the GTX 660, the profiler has ample hardware counters at its disposal to pinpoint what is impacting performance, so I would suggest spending some time with it.
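As one example of the kind of slight modification meant here, a common thing to try is keeping the stencil's index arithmetic in 32-bit `int` so that only the final pointer offset needs widening, instead of carrying `size_t` through every multiply and add. The sketch below is plain host-side C++ with made-up names (`laplacian64`, `laplacian32`, `check_laplacian`) and a made-up 5-point stencil, since I have not seen the actual kernel. It assumes the whole grid can be indexed with a signed 32-bit `int`. Whether this actually helps depends on what the compiler already does for your code, so treat it as something to verify with the profiler rather than a guaranteed win.

```cpp
#include <cassert>
#include <cstddef>

// Variant A: all index arithmetic is done in size_t, which is 64 bits
// wide on a 64-bit platform, so every multiply and add is 64-bit.
float laplacian64(const float* grid, size_t x, size_t y, size_t pitch) {
    size_t c = y * pitch + x;
    return grid[c - 1] + grid[c + 1] + grid[c - pitch] + grid[c + pitch]
           - 4.0f * grid[c];
}

// Variant B: index arithmetic stays in 32-bit int; only the final
// pointer offset has to be widened to the address size.
// Assumption: pitch * height fits in a signed 32-bit int.
float laplacian32(const float* grid, int x, int y, int pitch) {
    int c = y * pitch + x;
    return grid[c - 1] + grid[c + 1] + grid[c - pitch] + grid[c + pitch]
           - 4.0f * grid[c];
}

// Small self-check on a 3x3 grid with grid[i] = i*i: both variants
// must give the same value at the interior point (1,1).
void check_laplacian() {
    float grid[9];
    for (int i = 0; i < 9; ++i) grid[i] = static_cast<float>(i * i);
    // Neighbours 9 + 25 + 1 + 49 = 84, centre 16, so 84 - 64 = 20.
    assert(laplacian32(grid, 1, 1, 3) == 20.0f);
    assert(laplacian64(grid, 1, 1, 3) == 20.0f);
}
```

Both variants compute the same result; the only difference is the width of the integer arithmetic that feeds the memory accesses. Comparing register counts and instruction counts between two such variants of the real kernel would show whether the address computations are what costs the 20%.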