The code is like this: a = share[idx]; where “a” is a local variable (supposedly held in a register) and “share” is an array in shared memory.
The “a” register does not really exist. In fact, the program re-reads shared memory every time “a” is used;
Actually this question has been raised before. I have tried declaring the variable “volatile”, but it does not work. Another option seems to be modifying the assembly code by hand.
Is there any other method? I would greatly appreciate your help.
How are you verifying that when “a” is accessed, it is actually reading from shared memory?
If your kernel has something like:
shared[tid] = global[idx];
a = shared[tid];
printf("%i\n", a);
Assuming “shared” has one element for each thread and that “a” is an integer: if you profile exactly this, how many registers does it say are used?
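For reference, a self-contained version of that test could look like this (kernel and buffer names are my own placeholders, and I'm using the 240-thread block size mentioned in this thread; printf in device code needs compute capability 2.0+):

```cuda
#include <cstdio>

__global__ void copyTest(const int *global, int *out)
{
    __shared__ int shared[240];   // one element per thread (240-thread blocks)
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    shared[tid] = global[idx];
    __syncthreads();

    int a = shared[tid];          // is this kept in a register afterwards?
    printf("%i\n", a);
    out[idx] = a;
}
```

Compiling with `nvcc --ptxas-options=-v` prints the number of registers used per thread, so you can see what the allocator actually did.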
AFAIK, register allocation is up to the compiler, and given the register limit of the hardware, it will only spill to local memory once that limit is reached. Is that the case in your program?
Hi, saulocpp,
In fact I declare dozens of variables, but the register usage is only 32. My aim is to load shared-memory data into registers once and keep it there, but I suspect the data is reloaded every time it is used. For example, when I access “a” the second time, it should still be sitting in a register, but the program loads “a” from shared memory into a register again. The compiler seems to reduce register usage this way, but that is not what I want.
I found a method that helps my program: __threadfence_block(). This function forces the pending shared-memory accesses to complete before the fence (note it is a memory fence within the block, not a thread barrier like __syncthreads()). However, the time cost is too large and overall performance is worse.
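A sketch of that workaround, with placeholder names (whether the fence actually pins “a” in a register depends on the compiler; this only shows the pattern described above):

```cuda
__global__ void fencedLoad(const int *global, int *out)
{
    __shared__ int share[240];      // one slot per thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    share[threadIdx.x] = global[idx];
    __syncthreads();

    int a = share[threadIdx.x];     // hopefully loaded once into a register
    __threadfence_block();          // memory fence: the load must complete here

    out[idx] = a * a + a;           // reuse "a" without re-reading shared memory
}
```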
You cannot force the compiler to keep data in registers, you can only facilitate this.
Have you used __launch_bounds__ (or the -maxrregcount flag to nvcc) to relieve register pressure by ensuring the compiler has enough registers at its disposal?
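As a sketch (hypothetical kernel, using the block size from this thread): __launch_bounds__ tells ptxas the launch configuration so it can budget registers accordingly, and a low minBlocksPerMultiprocessor value gives it a larger per-thread register budget:

```cuda
// maxThreadsPerBlock = 240, minBlocksPerMultiprocessor = 2:
// with only 2 resident blocks required, ptxas may use up to the
// Fermi hardware maximum of 63 registers per thread.
__global__ void __launch_bounds__(240, 2)
myKernel(const float *in, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx] * 2.0f;
}
```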
hi, tera,
I declare dozens of variables, but there is no register spill. Whether I access shared memory directly or load the shared data into registers first, the register usage is always 32, while the maximum is 63. I have 240 threads per block and there are 32768 registers in each SM. Registers are enough.
How did you arrive at that conclusion?
At a blocksize of 240 threads, 32768 registers per SM would allow running just two blocks in parallel on each SM with 64 registers/thread. At 32 registers/thread, it allows four blocks in parallel.
You appear to be on compute capability 2.x (Fermi). On Fermi generation devices, 63 registers/thread would only allow an occupancy of 33%, while 32 registers/thread allows 67%. It therefore seems very likely that the compiler’s heuristics chose 32 registers/thread as the preferred value, and the register allocator makes the best use of registers up to that mark.