Registers per thread

How can I determine how many registers (per thread) an OpenCL kernel uses? I can see the PTX code in the Parallel Nsight profiler, and there are “.reg” declarations in it. Can I just sum them up? Or might the actual cubin code use fewer registers (since some optimization is done when compiling PTX into a binary)?


IIRC, the Compute Visual Profiler that comes with the CUDA Toolkit displays that info (also for OpenCL programs).

Thanks a lot, it is indeed the Compute Visual Profiler. I managed to get the “registers per work item” values (and they are much lower than in the PTX file), but failed to get other important values (I get the error “In this profiling session some profiler output rows are dropped due to incorrect gpu time stamp values and the profiler output is incomplete.”).

Thanks again.

You can add an extra option as an argument when you compile the kernel from source.


This is mentioned here (it’s CUDA 3.0). The extra output can be fetched with clGetProgramBuildInfo (CL_PROGRAM_BUILD_LOG).

It should say something like:

ptxas info    : Used 16 registers, 36+8 bytes smem, 180 bytes cmem[1]

However, it didn’t work every time for me. Sometimes there is more information, and after recompiling the program with slightly changed code it wouldn’t show up anymore.
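If you want the register count programmatically instead of reading the log by eye, something like the following might do. It is only a sketch: it assumes the build log has already been fetched as a string and that the ptxas line has the same shape as the sample above. The function name `registers_from_build_log` is made up for illustration. It returns an empty list when the line is missing, which covers the case where the info doesn't show up.

```python
import re

# Matches the register count in a ptxas summary line such as
# "ptxas info    : Used 16 registers, 36+8 bytes smem, 180 bytes cmem[1]".
# Assumption: the log keeps this exact wording.
_PTXAS_REGS = re.compile(r"ptxas info\s*:\s*Used\s+(\d+)\s+registers")


def registers_from_build_log(log):
    """Return every register count reported in the build log, in order.

    An empty list means the log contained no "Used N registers" line.
    """
    return [int(n) for n in _PTXAS_REGS.findall(log)]


sample = "ptxas info    : Used 16 registers, 36+8 bytes smem, 180 bytes cmem[1]"
print(registers_from_build_log(sample))  # prints [16]
```

A program with several kernels would presumably produce one such line per kernel, hence the list rather than a single number.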


Sebastian, thank you, it might be useful for me. For the time being I am able to use the Compute Visual Profiler; it always shows the actual register usage. The “funny” thing is that once I managed to get the Compute Visual Profiler running… Parallel Nsight stopped collecting any OpenCL info when profiling my program. It kind of sucks, as I am no longer able to see the PTX code generated from my kernels.