(env: CUDA SDK 1.0 on Ubuntu)
In CUDA, a set of registers is shared by all threads within a block, and the number of available registers is limited. So the more registers a kernel function uses, the fewer threads its execution configuration can have.
However, nvcc compiles code that exceeds the register limit just as usual, and when you run it, CUDA fails silently without any error message.
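To make the silent failure concrete, here is a minimal sketch of checking a launch by hand. The kernel `heavyKernel` and the block size of 512 are hypothetical, picked only for illustration. It assumes a toolkit newer than SDK 1.0, where the synchronization call is named `cudaDeviceSynchronize` (older releases called it `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one that needs many registers.
__global__ void heavyKernel(float *out)
{
    out[threadIdx.x] = static_cast<float>(threadIdx.x);
}

int main()
{
    const int threads = 512;  // illustrative block size, not a recommendation
    float *d_out = NULL;

    cudaError_t err = cudaMalloc((void **)&d_out, threads * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    heavyKernel<<<1, threads>>>(d_out);

    // The launch statement itself returns nothing. A configuration the
    // device cannot satisfy is only visible by asking for the last error.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Errors raised while the kernel runs surface at the next synchronization.
    // (Assumption: a toolkit that provides cudaDeviceSynchronize.)
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    cudaFree(d_out);
    return 0;
}
```

So the failure is silent only in the sense that nothing is printed unless the host code asks for the error after the launch.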
Is there a way to find out how many registers a compiled kernel function uses, so that the execution configuration can be adjusted accordingly?
For example, let’s say the register limit is 1000 for the sake of discussion, and I have a kernel that uses 100 registers per thread:
__global__ void use100registers();
Then I can use at most 10 threads per block:
dim3 grid(1); dim3 threads(10);
use100registers<<<grid, threads>>>();
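The arithmetic behind the example can be written down as a small host-side helper. This is only a sketch of the simplified model used in this post (a fixed register budget per block, every thread using the same number of registers); real hardware may allocate registers in coarser units. The function and parameter names are hypothetical.

```cpp
#include <cassert>

// Largest block size under the simplified model: a fixed register budget per
// block, with every thread using the same number of registers.
// Function and parameter names are hypothetical, chosen for this sketch.
int maxThreadsPerBlock(int registerLimit, int registersPerThread)
{
    assert(registerLimit >= 0 && registersPerThread > 0);
    // Integer division rounds down: a thread that does not fully fit
    // in the remaining budget cannot be part of the block.
    return registerLimit / registersPerThread;
}
```

With a budget of 1000 registers and 100 registers per thread, `maxThreadsPerBlock(1000, 100)` returns 10.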
But how do I know how many registers a function uses? For my project I use trial and error to find the maximum thread count. Is there an easier way to find that out?
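One possible way to avoid trial and error, sketched under the assumption of a later CUDA toolkit that provides `cudaFuncGetAttributes` (it may not be available in SDK 1.0), is to ask the runtime for the per-thread register count of a compiled kernel and compare it with the device limit. The kernel `myKernel` is a hypothetical placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the kernel of interest.
__global__ void myKernel(float *out)
{
    out[threadIdx.x] = 0.0f;
}

int main()
{
    // Per-kernel attributes, as recorded by the compiler.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Per-device limits for device 0.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:            %d\n", attr.numRegs);
    printf("max threads per block (kernel):  %d\n", attr.maxThreadsPerBlock);
    printf("registers available per block:   %d\n", prop.regsPerBlock);
    return 0;
}
```

The same per-thread count can also be read at compile time by passing `--ptxas-options=-v` to nvcc, which prints the resource usage of each kernel.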