I think this will help in figuring out the best block and grid dimensions for running this kernel.
My line of thought is this:
The number of threads (irrespective of the blocks) that one can run on a multi-processor depends on the register usage of the kernel.
a) For an MP that has 8192 registers, it takes 512 threads using 16 registers
each to use up the entire register file.
b) I would ideally like to place at least 2 blocks on this MP.
So, having 256 threads per block would be ideal in this scenario.
c) I also know that 512 threads correspond to 16 warps. The remaining
8 warps of the MP are unused. They simply cannot be used and they are
being wasted.
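To make the arithmetic in a) to c) concrete, here is a rough sketch. The constants are just the numbers from this post (8192 registers per MP, 16 + 8 = 24 warps per MP, 32 threads per warp); the function names are made up for illustration. It deliberately ignores shared memory, block granularity and how the hardware actually rounds register allocations, so treat it as an estimate and not as a replacement for the occupancy calculator.

```python
# Numbers taken from the post; change them for other hardware.
REGISTERS_PER_MP = 8192
MAX_WARPS_PER_MP = 24   # 16 used + 8 unused warps, as in point c)
WARP_SIZE = 32          # 512 threads / 16 warps

def max_threads_by_registers(regs_per_thread, registers_per_mp=REGISTERS_PER_MP):
    """Threads that fit in the register file (no allocation rounding)."""
    return registers_per_mp // regs_per_thread

def report(regs_per_thread):
    """Estimate resident threads/warps and the warps left unused on one MP."""
    limit = MAX_WARPS_PER_MP * WARP_SIZE
    threads = min(max_threads_by_registers(regs_per_thread), limit)
    warps = threads // WARP_SIZE          # only whole warps count
    return {
        "threads": warps * WARP_SIZE,
        "warps": warps,
        "unused_warps": MAX_WARPS_PER_MP - warps,
    }

if __name__ == "__main__":
    for regs in (10, 16, 32):
        print(regs, report(regs))
```

With 16 registers per thread this gives 512 threads, 16 warps and 8 unused warps, matching the numbers above. Dropping to 10 registers per thread would let all 24 warps fit under these assumptions.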
At this point, I can think about how to optimize my kernel so that I can fit more threads onto the multi-processor.
This is why I would like to know the register usage of the kernel and how it can be optimized to get the maximum concurrency.
Yes, I did, Mr. Bangalore, and I found out how to get my register usage from that post by Mark Harris. That is what I had posted above.
To use the CUDA occupancy calculator, you first need to know your register usage. The XLS sheet does NOT do anything fancy. You need to feed it the right data.
Harris’ post basically says that you need to use the “-cubin” option to generate the cubin file, which has the info about your program. Just look at the “registers=xxx” line and you will know how many registers you are using.
Alternatively, my own trick: use the -keep option and count the registers in the generated PTX assembly file. :D (PTX registers are virtual, so treat that count as a rough estimate only.)
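If you do not want to eyeball the file every time, a small script can pull the number out for you. This is only a sketch: it assumes the cubin output is plain text and contains a line of the form “registers=xxx” as described above. The alternative spelling `reg = N` accepted by the pattern is my own guess at a variant, and the function name and sample text are hypothetical.

```python
import re

def register_count(cubin_text):
    """Return the first register count found in the text, or None.

    Assumes a line like 'registers=16'; also accepts 'reg = 16'
    (an assumed variant, not confirmed by the post).
    """
    match = re.search(r"^\s*(?:reg|registers)\s*=\s*(\d+)",
                      cubin_text, re.MULTILINE)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    sample = "name = my_kernel\nregisters=16\n"   # made-up sample text
    print(register_count(sample))
```

The number it returns is what you would then type into the occupancy calculator.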
dlmeetei (Dalai Lama in a meeting?? huh…),
I am from Bangalore too, man. Nice to know there is another guy out there.