How many registers are used in the code?

Hi, I just started learning CUDA, but I don’t know how to calculate how many registers are used by a piece of code, such as:

__shared__ float S[8][8];  // element type is not given in the post; float assumed
S[tidx][tidy]=Global1[bidx*tidx+tidx][bidy*tidy+tidy];
__syncthreads();

Global2[tidx][tidy]=S[tidx-1][tidy-1]+S[tidx][tidy-1]+S[tidx+1][tidy-1]+S[tidx-1][tidy]+S[tidx][tidy]+S[tidx+1][tidy]+S[tidx-1][tidy+1]+S[tidx][tidy+1]+S[tidx+1][tidy+1];
__syncthreads();

And if I don’t use shared memory, is there any difference?

Just add “-keep” to the command line properties of the .cu file (“nvcc -keep -I $…”) and open the generated .cubin file in the project folder. It lists the number of registers and the shared memory usage.
Also, shared memory access is faster than global memory access, so it is advisable to take advantage of shared memory by caching data from global memory.

Happy Learning

Or just use -cubin to generate only the cubin file (so you don’t get the whole mess of intermediate preprocessing files):

nvcc -cubin file.cu

Yes, reading from shared memory is faster. You can time your kernels and test for yourself. You should be careful in your app, though: it looks like you will need to over-read into shared memory (load a border of extra elements around the tile) so that you can access the tidx+1, tidx-1, tidy+1, and tidy-1 indices. (Shared memory is visible to all threads within a block, but not across blocks.) You can refer to the Sobel example for an example of shared memory and over-reading.
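To make the over-reading point concrete, here is a minimal sketch of a 3x3 neighbourhood sum that loads an 8x8 tile plus a one-element border into shared memory. This is not the Sobel sample's code. The names boxSum3x3, countMismatches and TILE are made up for illustration, and the flat row-major float arrays and the "out-of-range neighbours count as zero" rule are assumptions, since the original post does not say how its arrays are laid out.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 8  // interior tile width, matching S[8][8] in the question

// Writes to out[] the sum of the 3x3 neighbourhood of each element of in[].
// Assumption: neighbours outside the image contribute 0.
__global__ void boxSum3x3(const float* in, float* out, int width, int height)
{
    // TILE x TILE interior plus a one-element border on every side.
    __shared__ float S[TILE + 2][TILE + 2];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int x0 = blockIdx.x * TILE - 1;  // global x of column 0 of S
    const int y0 = blockIdx.y * TILE - 1;  // global y of row 0 of S

    // Cooperative load: the 64 threads of the block fill all 100 elements.
    for (int i = ty * TILE + tx; i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        const int sy = i / (TILE + 2), sx = i % (TILE + 2);
        const int gx = x0 + sx, gy = y0 + sy;
        const bool inside = gx >= 0 && gx < width && gy >= 0 && gy < height;
        S[sy][sx] = inside ? in[gy * width + gx] : 0.0f;
    }
    __syncthreads();

    const int gx = blockIdx.x * TILE + tx;
    const int gy = blockIdx.y * TILE + ty;
    if (gx < width && gy < height) {
        // This thread's own element sits at S[ty + 1][tx + 1].
        float sum = 0.0f;
        for (int dy = 0; dy <= 2; ++dy)
            for (int dx = 0; dx <= 2; ++dx)
                sum += S[ty + dy][tx + dx];
        out[gy * width + gx] = sum;
    }
}

// Host helper: runs the kernel and compares against a plain CPU loop.
// Returns the number of elements that differ.
int countMismatches(int width, int height)
{
    const size_t n = static_cast<size_t>(width) * height;
    std::vector<float> in(n), out(n, 0.0f);
    for (size_t i = 0; i < n; ++i) in[i] = static_cast<float>(i % 7);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const dim3 block(TILE, TILE);
    const dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    boxSum3x3<<<grid, block>>>(dIn, dOut, width, height);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    int bad = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float ref = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                        ref += in[ny * width + nx];
                }
            if (ref != out[y * width + x]) ++bad;
        }
    return bad;
}
```

Calling countMismatches(30, 21) from main, for example, exercises partial tiles on the right and bottom edges. The inputs are small integers, so the float sums are exact and the comparison can use ==.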

Just pass the --ptxas-options=-v option to nvcc. It will cause the number of registers and the smem and lmem bytes to be printed during compilation.
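For completeness, the same kind of figures can also be read back at run time with cudaFuncGetAttributes, assuming your toolkit is recent enough to provide it. The sketch below is only an illustration: the kernel scale and the helper printKernelResources are hypothetical names, not something from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to query.
__global__ void scale(float* data, float factor, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Prints the per-kernel resource usage reported by the runtime.
// Returns false if the attributes could not be read.
bool printKernelResources()
{
    cudaFuncAttributes attr;
    const cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }
    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("static shared memory:    %zu bytes\n", attr.sharedSizeBytes);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return true;
}
```

The numbers depend on the target architecture and compiler version, so expect them to change between builds.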

Paulius

Right, that’s the much better alternative.

Now the obvious question: Why the hell doesn’t nvcc print this info automatically?