Registers and local memory

Hi guys,

I am wondering whether registers are automatically spilled to local memory (I know it is slow) if a kernel has too many variables.
My CUDA program runs fine as long as the kernel does not call the actual functionality. In that case up to 22x22 threads per block are possible.

If I add the functionality to the kernel, the more code I add, the fewer threads per block I can use. For example, if I increase the block size from 16x16 to 17x17, the program crashes.

My question: do I have to copy variables to local memory explicitly, or why else does my program crash?

Thanks

nvcc can choose to spill registers to local memory at compile time. The heuristic is not exactly documented, but register spilling happens when it thinks your kernel is using “too many” registers, or if it bumps into a hard platform limit. Additionally, you can force a lower register limit with the --maxrregcount option to nvcc.

The crash is a different issue, though. Since nvcc does not assume anything about the block configuration of your kernel (which is specified at runtime) and the register requirements are fixed at compile time, you can launch your kernel with impossible block configurations. If you pass --ptxas-options=-v to nvcc, ptxas will print out how many registers your kernel uses per thread. Multiply that by the number of threads in your block to get the registers the block needs, and compare the result to the maximum number of registers available to a block on your particular CUDA device.
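
To make that check concrete, here is a minimal sketch that does the same comparison at runtime instead of reading the ptxas output. The kernel myKernel and the helper blockFitsInRegisters are made-up names for illustration, device 0 is an assumption, and the check ignores register allocation granularity, so treat a "fits" as optimistic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel from the question.
__global__ void myKernel(float *data)
{
    int i = threadIdx.y * blockDim.x + threadIdx.x;
    data[i] = 2.0f * data[i];
}

// True if regsPerThread * threadsPerBlock fits into the per-block register
// budget. Ignores allocation granularity, so the answer is optimistic.
bool blockFitsInRegisters(int regsPerThread, int threadsPerBlock, int regsPerBlock)
{
    return regsPerThread * threadsPerBlock <= regsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 assumed
        printf("no usable CUDA device\n");
        return 1;
    }

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);  // attr.numRegs = registers per thread

    printf("registers per thread: %d, registers per block: %d\n",
           attr.numRegs, prop.regsPerBlock);

    const int sides[] = {16, 17, 22};
    for (int s = 0; s < 3; ++s) {
        int threads = sides[s] * sides[s];
        printf("%dx%d: %s\n", sides[s], sides[s],
               blockFitsInRegisters(attr.numRegs, threads, prop.regsPerBlock)
                   ? "fits" : "too many registers");
    }

    // A launch with an impossible configuration is normally reported as a
    // launch error, which has to be checked explicitly.
    float *d = 0;
    cudaMalloc(&d, 17 * 17 * sizeof(float));
    cudaMemset(d, 0, 17 * 17 * sizeof(float));
    myKernel<<<1, dim3(17, 17)>>>(d);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Checking cudaGetLastError() right after the launch is usually the quickest way to tell a rejected launch apart from a bug inside the kernel.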

The generated code for your kernel cannot dynamically spill registers to local memory if your launch configuration would require it. If the compiler is using too many registers for your desired block configuration, you have to work backwards and figure out the maximum registers per thread you can accommodate given your block size and pass that value to nvcc with the --maxrregcount option.
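
Working backwards is just an integer division. The following sketch prints a candidate --maxrregcount value for a few block sizes. maxRegsPerThread is a hypothetical helper, device 0 is an assumption, and the result is an upper bound only, since the hardware may round register allocations up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Largest per-thread register count such that threadsPerBlock threads still
// fit into regsPerBlock registers (integer division rounds down).
int maxRegsPerThread(int regsPerBlock, int threadsPerBlock)
{
    return regsPerBlock / threadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 assumed
        printf("no usable CUDA device\n");
        return 1;
    }

    const int sides[] = {16, 17, 22};
    for (int s = 0; s < 3; ++s) {
        int threads = sides[s] * sides[s];
        printf("%dx%d block: try --maxrregcount %d or lower\n",
               sides[s], sides[s],
               maxRegsPerThread(prop.regsPerBlock, threads));
    }
    return 0;
}
```

As a worked example, with a hypothetical budget of 8192 registers per block this gives 32 registers per thread for 16x16, 28 for 17x17 and 16 for 22x22.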

At the moment my code only runs with a 16x16 block. Does that mean that if I want a bigger block, the only way is to reduce the number of variables I use? I really wonder about that. What happens when the code size and the number of variables grow much further? When exactly is local memory used then?
Thanks a lot

You do not need to reduce the number of variables, only the number of registers the kernel uses. You can do that either with the --maxrregcount n option to nvcc, as Seibert explained, or with a __launch_bounds__(max_threads_per_block) qualifier on the kernel (see Appendix B.16 of the Programming Guide for details). With the latter, the compiler automatically determines the maximum number of registers the kernel may use so that it can still be launched with max_threads_per_block threads per block.
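
For illustration, a minimal sketch of the qualifier on a kernel that should stay launchable with 22x22 threads. boundedKernel, flatIndex and MAX_THREADS_PER_BLOCK are made-up names, and the kernel body is only a placeholder for the real work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed upper bound on the block size this kernel is ever launched with.
#define MAX_THREADS_PER_BLOCK (22 * 22)

// Flattens a 2D thread index inside a block; usable on host and device.
__host__ __device__ int flatIndex(int tx, int ty, int blockDimX)
{
    return ty * blockDimX + tx;
}

// The qualifier asks the compiler to keep register usage low enough that a
// block of MAX_THREADS_PER_BLOCK threads can be launched, spilling if needed.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK)
boundedKernel(float *data, int n)
{
    int i = flatIndex(threadIdx.x, threadIdx.y, blockDim.x);
    if (i < n)
        data[i] = 2.0f * data[i] + 1.0f;  // placeholder work
}

int main()
{
    const int n = 22 * 22;
    float h[n];
    for (int i = 0; i < n; ++i)
        h[i] = 1.0f;

    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    boundedKernel<<<1, dim3(22, 22)>>>(d, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}
```

Compared with --maxrregcount, which applies to every kernel in the compiled file, the qualifier only affects the kernel it is attached to.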

Variables at the C level and registers in the final binary do not have a one-to-one relationship. ptxas takes the PTX assembly language generated by the compiler and uses a variety of techniques to assign the minimum number of registers required. In order to get the compiler to use local memory instead of registers, you have to tell the compiler (one way or the other) to limit the register usage.

Oh, that’s interesting. I will try to figure that out! I think I know what to do now ;) If I get stuck I will let you know.
Thanks a lot for showing me the way, both of you, seibert and tera.
