A question about registers on the Kepler GK110 architecture (Quadro K6000)

Hello, everyone

I am working on a project with GPU acceleration. I need to write data to global memory frequently inside the loop of my kernel. However, the amount of local data used in the kernel is small relative to global memory. Shared memory is not enough to meet my requirements, so I am trying to make use of registers instead.

To my understanding, the Quadro K6000 has a 256k register file per SMX. That means each thread can have 255 registers when the maximum of 1024 threads is used on an SMX.

Without keeping the values in registers, that is, when the data is written back to global memory in each iteration of the kernel loop, the kernel uses about 90 registers per thread (confirmed by disassembling the .cubin file) and takes about 120 seconds.

However, if I keep the local values in registers during the loop and write to global memory only once after the loop finishes, the kernel uses about 200 registers per thread, and the calculation time more than doubles, to about 287 seconds.

I am a little confused. I didn’t use up the registers for each thread. Why does the kernel become slower?

Thank you very much for your suggestion and comments!

Have you tried using the CUDA profiler to pin-point the performance bottlenecks in your code?

In general, you would want to structure your code so as to minimize data movement, and maximize the amount of computation performed for each piece of data moved.

I believe the K6000 is a GK110 part, so that means 65536 registers per SM. The more registers you use per thread, the more you limit your occupancy. Unless your code has a lot of ILP (instruction-level parallelism), you won’t get good performance with high register counts.
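
To make the occupancy point concrete, here is a minimal sketch that estimates how many threads can be resident on one SM for a given per-thread register count. It assumes the 65536 registers per SM mentioned above and the compute capability 3.5 limit of 2048 resident threads per SM. It ignores register allocation granularity, shared memory, and block-size limits, so treat the result as an upper bound only. The function name is made up for this example.

```python
REGS_PER_SM = 65536        # 32-bit registers per SM on GK110 (cc 3.5)
MAX_THREADS_PER_SM = 2048  # resident-thread limit per SM on cc 3.5
WARP_SIZE = 32


def max_resident_threads(regs_per_thread):
    """Upper bound on resident threads per SM from register use alone.

    Simplification: real hardware allocates registers in fixed-size
    chunks, so the true figure can be somewhat lower.
    """
    by_registers = REGS_PER_SM // regs_per_thread
    # Threads are scheduled in whole warps, so round down to a multiple of 32.
    by_registers -= by_registers % WARP_SIZE
    return min(by_registers, MAX_THREADS_PER_SM)


for regs in (32, 64, 90, 200, 255):
    print(f"{regs:3d} regs/thread -> at most "
          f"{max_resident_threads(regs):4d} resident threads per SM")
```

With the roughly 90 and 200 registers per thread reported in the question, this bound comes out at 704 and 320 resident threads respectively, so the 200-register version has far fewer threads available to hide latency.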

Thank you very much for your quick response.
If the K6000 only has 65536 registers, the performance I got seems reasonable, since, split evenly, each thread only gets 64 registers.
Where can I confirm this information?

  1. run deviceQuery and look at the line that says “Total number of registers available per block”

or

  2. the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability

Not sure what compute capability your device is? Run deviceQuery or look it up here:

https://developer.nvidia.com/cuda-gpus

From the cuda-gpus page, K6000 has compute capability 3.5

From the programming guide (table 13) a compute 3.5 device has:

Maximum number of 32-bit registers per thread block: 64 K
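
As a rough sketch of what that per-block limit means in practice, the snippet below computes how many registers each thread could get for a given block size. It assumes the compute capability 3.5 figures of 64 K registers per thread block and 255 registers per thread, and it ignores register allocation granularity. The function name is hypothetical.

```python
MAX_REGS_PER_BLOCK = 64 * 1024  # cc 3.5: 32-bit registers per thread block
MAX_REGS_PER_THREAD = 255       # cc 3.5: 32-bit registers per thread
WARP_SIZE = 32


def regs_available_per_thread(threads_per_block):
    """Registers each thread can get if the block's budget is split evenly."""
    # A block occupies whole warps, so round the thread count up.
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    even_share = MAX_REGS_PER_BLOCK // (warps * WARP_SIZE)
    return min(even_share, MAX_REGS_PER_THREAD)


for block_size in (128, 256, 512, 1024):
    print(f"{block_size:4d} threads/block -> up to "
          f"{regs_available_per_thread(block_size)} registers per thread")
```

Under these assumptions a 1024-thread block leaves 64 registers per thread, and the 255-register maximum is only reachable with blocks of 256 threads or fewer.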

Thank you very much, txbob.

I double-checked the parameters in your first link:
Maximum number of 32-bit registers per thread block 32 K 64 K 32 K
Maximum number of 32-bit registers per thread 63 255

To my understanding, the register file in all released GPU cards, even including Maxwell, doesn’t have 256k registers. The maximum is 64k. If 1024 threads are used per block, each thread can only get 64 registers when they are split evenly, and can never get 255 registers. Is that right?

If I am right, I think you need to revise your blog article at http://devblogs.nvidia.com/parallelforall/maxwell-most-advanced-cuda-gpu-ever-made/.
It says the register file is 256k.

That blog mentions the register file as 256KB, not 256k. Please reread it:

Register File Size / SM	256KB	256KB

256KB is 256 kilobytes. Since each 32-bit register is composed of 4 bytes, the numbers are in agreement.

64K registers is 256KB.
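
The unit conversion can be checked with a few lines. This is only an illustrative sketch; the function name is made up for the example.

```python
BYTES_PER_REGISTER = 4  # one 32-bit register occupies 4 bytes


def register_file_bytes(num_registers):
    """Size in bytes of a register file holding num_registers 32-bit registers."""
    return num_registers * BYTES_PER_REGISTER


registers = 64 * 1024  # 64K registers
size_kb = register_file_bytes(registers) // 1024
print(f"{registers} registers = {size_kb} KB")
```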

And while your question doesn’t directly touch on this, you should be aware that some GPUs (cc3.7, for example) have a register file on the SM that is larger than the register file available to a given threadblock. You can observe this by studying the previously linked table 13: compare the “Maximum number of 32-bit registers per thread block” line with the line above it, “Number of 32-bit registers per multiprocessor”.