I am working on a project with GPU acceleration. I need to write data to global memory frequently inside the loop of the kernel. The amount of local data each thread works on is small (relative to global memory), but shared memory is still not enough to meet my requirements. Therefore, I am trying to make use of registers.
To my understanding, the Quadro K6000 has a 256 KB register file per SMX. I take that to mean each thread can have 255 registers when the maximum of 1024 threads is used on an SMX.
Without using registers for this, that is, when the data is written back to global memory in every iteration of the kernel loop, the kernel uses about 90 registers per thread (confirmed by disassembling the .cubin file). The kernel calculation takes about 120 seconds.
However, if I use registers to hold the local values inside the loop and write to global memory only once after the loop finishes, the kernel uses about 200 registers per thread, and the calculation time more than doubles, to about 287 seconds.
I am a little confused. I did not use up the registers available to each thread, so why does the kernel become slower?
Thank you very much for your suggestion and comments!