Hi, it’s really nice to have this forum back :)
I’m currently profiling the performance of my CUDA application using the ptxas output from the nvcc compiler and the Visual Profiler, and I’ve run into a strange inconsistency between the two tools.
My application is compiled with -maxrregcount=32 and runs on both a GTX680 and a GTX480. The compiler’s ptxas info reports 15 registers per thread, whereas the Visual Profiler reports 32 registers per thread.
I’m confused by this inconsistency. Am I doing something wrong, or is there some additional setting I need when using the Visual Profiler?
Thanks in advance for any reply. :)
Your kernel itself may require only 15 registers per thread, but if it calls other kernels (or system functions like malloc) then the CUDA driver may need to allocate more registers per thread for its execution.
Are you using separate compilation?
Does your kernel call other kernels? If so, what does the compiler report as the registers/thread of those kernels?
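As a possible third data point alongside ptxas and the profiler, the runtime can be asked what register count it sees for the loaded kernel via cudaFuncGetAttributes. The following is only a minimal sketch: kernelA here is a hypothetical stand-in (it just uses __ffs and atomicAdd so that it resembles the kind of kernel discussed), and registersPerThread is an illustrative helper name, not part of any CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute your own kernels here.
__global__ void kernelA(int *out)
{
    // Uses an intrinsic and an atomic, as in the kernels described above.
    atomicAdd(out, __ffs(threadIdx.x + 1));
}

// Illustrative helper: returns the registers per thread that the runtime
// reports for the given kernel, or -1 if the query fails.
static int registersPerThread(const void *kernel)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}

int main()
{
    printf("kernelA: %d registers per thread\n",
           registersPerThread((const void *)kernelA));
    return 0;
}
```

Building this with the same flags as the real application (for example the same -arch and -maxrregcount values) and comparing the printed number with the ptxas output and the profiler's figure may help show which of the two tools the runtime agrees with.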
Hi, thanks for replying.
I use separate compilation for my CUDA application. Both of my kernels run independently, without dynamic parallelism. I do use some intrinsic functions (e.g. __ffs) and an atomic function (atomicAdd) in my kernels, but there are no calls to other kernels.
I’ve compiled and profiled the same source code on two different machines. The only difference in compilation is that I use arch=sm_20 for the GTX480 and arch=sm_30 for the GTX680.
The register counts reported on the GTX480 and GTX680 are listed below:
Kernel A used 15 registers, Kernel B used 15 registers
Kernel A used 30 registers, Kernel B used 31 registers
Kernel A used 15 registers, Kernel B used 17 registers
Kernel A used 32 registers, Kernel B used 32 registers
Is register usage per thread defined or scoped differently on the GTX480 vs. the GTX680, or by the compiler vs. the profiler?
I’m also wondering whether the profiler itself adds any register overhead, depending on the profiling settings.
In fact, I once got a profiling result (.csv) whose register usage exactly matched the compiler’s ptxas info. Since I have only slightly modified the source code for performance comparison, I’ve tried both modifying the source and changing the profiler’s metrics configuration to reproduce that result, but without success.
I’m not sure what to do next. Do you have any advice on these issues?
Thanks in advance :)