Thanks a lot! I just upgraded CUDA from 7 to 7.5, and now I can use the Kernel Profile - PC Sampling feature to see hotspots at the source-line level.
I am trying to understand the output. nvvp reports memory dependency as the leading bottleneck, accounting for 56% of the PC samples. However, the top offending lines, which account for most of the memory-dependency samples, contain only register operations. For example, the line with the second-highest memory dependency is
where v is a register variable with 4 float members (like a float4). The corresponding assembly is
LDL R4, [R3+0xc];
FADD.FTZ R4, R4, 1; # this line has high memory dependency
STL [R3+0xc], R4;
I am curious: how could a register operation cause a memory dependency problem?
My kernel is heavyweight, using about 80 registers. Could this be caused by register spilling? I profiled on a GTX 980 Ti, which, according to the nvvp report, supports a maximum of 255 registers per thread.
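To test the spilling hypothesis, here is a minimal sketch I could compile and inspect. It is not my real kernel. The kernel names `bump_dynamic` and `bump_static`, the use of `float4` as a stand-in for my 4-float type, and the run-time index `idx` are all made up for illustration. LDL and STL are local-memory load and store instructions, so the idea is to compare a case where the compiler may be forced into local memory (run-time indexing into a per-thread array) with one where the variable should stay in registers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: idx is only known at run time. Registers cannot be
// indexed dynamically, so the compiler may place `a` in local memory and
// emit an LDL / FADD / STL sequence for the increment.
__global__ void bump_dynamic(const float *in, float *out, int idx)
{
    float a[4];
    for (int i = 0; i < 4; ++i) a[i] = in[i];
    a[idx & 3] += 1.0f;
    out[0] = a[0] + a[1] + a[2] + a[3];
}

// Hypothetical kernel: the member is selected at compile time, so `v` is
// expected (not guaranteed) to stay in registers.
__global__ void bump_static(const float *in, float *out)
{
    float4 v = make_float4(in[0], in[1], in[2], in[3]);
    v.w += 1.0f;
    out[0] = v.x + v.y + v.z + v.w;
}

int main()
{
    const float h_in[4] = {0.f, 1.f, 2.f, 3.f};
    float h_out[2] = {0.f, 0.f};
    float *d_in = NULL, *d_out = NULL;

    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    // One thread each is enough; the point is the generated code, not speed.
    bump_dynamic<<<1, 1>>>(d_in, d_out, 3);
    bump_static<<<1, 1>>>(d_in, d_out + 1);

    // This blocking copy also surfaces errors from the launches above.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("dynamic: %f  static: %f\n", h_out[0], h_out[1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Compiling with `nvcc -arch=sm_52 -Xptxas -v` (sm_52 being the GTX 980 Ti's architecture) prints, per kernel, the register count together with the stack frame size and the spill store and spill load bytes. `cuobjdump -sass` on the resulting binary shows whether LDL/STL appear. If my real kernel reports a non-zero stack frame but zero spill bytes, that would suggest the variable lives in local memory for a reason other than register spilling.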