Registers available per thread (newbie question)

(see: CUDA Programming Guide Version 1.0; page 53)

My Questions:

Is the “number of concurrent blocks” the number of blocks that can be processed at the same time (8 per multiprocessor, see page 64), or does it mean the total number of blocks I launch with my kernel?

What happens when I have fewer registers available than I need?

B is the number of blocks that are actually running concurrently on the same multiprocessor, not the total number of blocks in your grid. In practice, you usually turn the formula around. nvcc compiles your kernel and decides how many registers per thread it needs. Then, when the CUDA runtime loads your kernel, it works out B from R (the number of registers per multiprocessor, a property of the hardware), T (the number of threads per block, specified in the kernel launch), and that per-thread register count. Once it knows B, it runs that many blocks concurrently on each multiprocessor for you. Note that if B < 1, you’ll get a launch error. There is a nice occupancy spreadsheet on the CUDA page that calculates B for you (among other things).

If your kernel needs more registers than are available, the compiler will push temporary values into “local memory”. Local memory is thread-local scratch space that is stored in global memory, so it is significantly slower than registers or shared memory. I’m not entirely clear on the heuristic nvcc uses to decide when to use local memory. If you use the --maxrregcount option, you can force nvcc to use fewer registers, which can also force it to start using local memory. To find out whether you are using local memory, examine the .cubin file generated when you compile with the --keep option and check the lmem field.

Thanks for your good answer!

I got my first kernel to run…

CPU → ~19 ms
GPU → ~2.4 ms (8800 GTS, 640 MB)