Just add “-keep” to the .cu file's command-line properties (e.g. “nvcc -keep -I $…”) and open the generated .cubin file in the project folder. It lists the number of registers and the shared memory usage.
And shared memory access is faster than global memory access, so it's advisable to take advantage of shared memory by caching data from global memory.
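A minimal sketch of that caching pattern (the kernel and its names are made up for illustration, not from the thread): each block stages its slice of the input in shared memory once, then all subsequent reads hit the fast on-chip copy.

```cuda
#define TILE 256  // threads per block; assumed launch configuration

// Hypothetical kernel: one global-memory read per element into the
// shared-memory tile, then the block works out of the cached copy.
__global__ void scaleKernel(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];          // on-chip, visible to this block only

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];        // single read from global memory

    __syncthreads();                      // make the tile visible to all threads

    if (i < n)
        out[i] = 2.0f * tile[threadIdx.x];  // this read comes from shared memory
}
```

The `__syncthreads()` barrier is what makes the cached data safe to use: no thread reads the tile until every thread in the block has finished writing its element.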
Or just use -cubin to generate only the cubin file (without the whole mess of preprocessing files):
nvcc -cubin file.cu
Yes, reading from shared memory is faster. You can time your kernels and test for yourself. You should be careful in your app, though, because it looks like you will need to over-read into shared memory so that you can access the tidx+1, tidx-1, tidy+1, and tidy-1 indices. (Shared memory is visible to all threads within a block, but not across blocks.) You can refer to the Sobel example for an example of shared memory and over-reading.
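A sketch of that over-read (halo) pattern in 1D, using a hypothetical three-point stencil; the kernel name, BLOCK size, and edge clamping are illustrative assumptions, not the Sobel example itself:

```cuda
#define BLOCK 128  // assumed threads per block

// Each block loads its BLOCK elements plus one extra element on each
// side (the halo), so tidx-1 and tidx+1 are always in shared memory.
__global__ void stencil1D(const float *in, float *out, int n)
{
    __shared__ float s[BLOCK + 2];        // +2 for the left and right halo

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // local index, shifted past halo

    if (g < n) {
        s[l] = in[g];
        // Boundary threads also fetch the halo, clamping at array edges.
        if (threadIdx.x == 0)
            s[0] = (g > 0) ? in[g - 1] : in[g];
        if (threadIdx.x == blockDim.x - 1 || g == n - 1)
            s[l + 1] = (g < n - 1) ? in[g + 1] : in[g];
    }

    __syncthreads();  // halo and interior are now loaded for the whole block

    if (g < n)
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;  // neighbors from shared memory
}
```

The same idea extends to 2D (as in Sobel): the block loads a tile one row and one column wider than the region it computes, so the tidy±1 neighbors are covered too.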