relationship between used registers and invoked blocks

Concerning this issue, Professor Hwu (UIUC) said that the number of blocks running on an SM is reduced if the number of registers needed by the automatic variables in a kernel exceeds some threshold. But today I was told that the number of blocks on an SM won't be reduced, because some of the automatic variables will be placed in local memory instead, at a cost in performance, and that this is decided at compile time. Which is true? I am a little puzzled and hope someone can confirm.

Thanks in advance

Gimurk

Well, both statements are actually true.

Firstly, the number of blocks that can run on an SM basically comes down to this relation: num_blocks * threads_per_block * registers_per_thread < num_registers on SM (8k for compute 1.0/1.1 and 16k for compute 1.4).
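To make the relation concrete, here is a minimal sketch in Python that solves it for the number of blocks. The function name and the 256-thread block size are made up for illustration, and the bound is treated as inclusive, which is an assumption. It also ignores register allocation granularity and the other per-SM limits, so the occupancy calculator remains the better source for real numbers.

```python
def blocks_limited_by_registers(threads_per_block, registers_per_thread, registers_per_sm):
    """Largest num_blocks such that
    num_blocks * threads_per_block * registers_per_thread <= registers_per_sm.
    Hypothetical helper: registers are the only limit considered here."""
    registers_per_block = threads_per_block * registers_per_thread
    return registers_per_sm // registers_per_block


REGISTERS_PER_SM = 8192   # the "8k" figure quoted for compute 1.0/1.1
THREADS_PER_BLOCK = 256   # illustrative block size, not taken from the thread

for regs in (4, 10, 16, 25):
    blocks = blocks_limited_by_registers(THREADS_PER_BLOCK, regs, REGISTERS_PER_SM)
    print(f"{regs} registers/thread -> at most {blocks} block(s) by registers alone")
```

With these inputs the register limit alone allows 8, 3, 2 and 1 blocks respectively, which is the reduction Professor Hwu described.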

For the most part, the more automatic variables you use and the more complicated your expressions in your kernel (necessitating temporary registers), the more registers will be used. You can get this value from the cubin. So Wen-mei Hwu is correct. Simple kernels use as few as 4 while slightly more complicated kernels go up to 25.

If you have an EXTREMELY complicated kernel that would require more than 40 or 50 registers, THEN the compiler might start moving some of them to local memory.

All this information is printed in the cubin (if you compile with the -cubin or -keep option) or on stdout if you compile with --ptxas-options -v. You can input it into the occupancy calculator spreadsheet to determine how many blocks will run and what occupancy the SMs have (though a higher occupancy doesn’t always lead to higher performance).

One additional big factor in the number of blocks running on an SM is the shared memory usage of each block, which I assume you already know about.
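As a rough sketch of how the two limits combine, the block count is the smaller of what the registers allow and what shared memory allows. The function name and the example figures below are hypothetical. The 16 KB of shared memory per SM is the compute 1.x figure and is not stated in this thread. Allocation granularity and the hardware caps on blocks and threads per SM are left out.

```python
def resident_blocks(threads_per_block, registers_per_thread, shared_bytes_per_block,
                    registers_per_sm, shared_bytes_per_sm):
    """Blocks per SM when both registers and shared memory are considered.
    Hypothetical helper; other hardware limits are ignored."""
    by_registers = registers_per_sm // (threads_per_block * registers_per_thread)
    if shared_bytes_per_block == 0:
        # No shared memory used, so only the register limit applies.
        return by_registers
    by_shared = shared_bytes_per_sm // shared_bytes_per_block
    return min(by_registers, by_shared)


# Illustrative kernel: 256 threads/block, 10 registers/thread, 6 KB shared per block.
# Registers alone would allow 3 blocks, but shared memory only allows 2.
print(resident_blocks(256, 10, 6 * 1024, 8192, 16 * 1024))
```

Here shared memory is the tighter limit, so reducing register usage further would not let more blocks run.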

You have some hot new hardware? ;)

If only that were true… The 3 is next to the 4 on the keyboard after all (at least it is on mine…). I of course meant compute 1.3 not 1.4.

I got it, and I already know the relationship between shared memory usage and invoked blocks.

Thanks a lot for your kind answer, MisterAnderson42

Gimurk