Nsight Compute is showing R0, initially loaded from SR_CTAID.X (blockIdx.x), as having a register dependency that persists beyond a kernel’s instruction address space, as shown below.
I am unsure how to replicate this situation. I could not attribute any specific problems to it, as I was steadily modifying the code (which was already syntactically functional when this pattern appeared), and it eventually disappeared in later optimized versions.
However, since it might be indicative of an underlying pattern that could be useful to stay aware of, I would like to ask two general questions: has this behavior been encountered before, and what does it suggest, in your experience?
I haven’t studied it carefully, nor do I have previous experience to rely on. I notice a CALL instruction, and it strikes me that the register frame used for a function call could (from a “liveness” perspective) “persist beyond a kernel’s instruction address space.”
If the “later optimized versions” resulted in inlining of that function, or otherwise avoided calling it, such a dependency might not be observed.
Also, when investigating anomalies (I’m not sure this is one), it’s not a bad idea to try the latest available version of the tool.
If my observation is relevant, then I doubt such a dependency pattern is cause for concern, although a possible optimization in CUDA is to avoid non-inlined function calls.
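If you want to test that guess, a minimal sketch along the following lines might help. It is not taken from your code: the kernels, the helper names (`scale_called`, `scale_inlined`, `run_demo`) and the launch sizes are all made up for illustration. Both kernels compute the same value from `blockIdx.x`; one goes through a `__noinline__` device function, the other through a `__forceinline__` one. `__noinline__` is only a request to the compiler, so whether a CALL actually shows up in the SASS may depend on the toolkit version and flags.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post). __noinline__ asks the
// compiler to keep this as a real call rather than inlining it.
__device__ __noinline__ float scale_called(float x, int block) {
    return x * static_cast<float>(block + 1);
}

// Same arithmetic, but the compiler is asked to inline it at the call site.
__device__ __forceinline__ float scale_inlined(float x, int block) {
    return x * static_cast<float>(block + 1);
}

__global__ void kernel_with_call(float* out, const float* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = scale_called(in[i], blockIdx.x);   // blockIdx.x is passed into the callee
}

__global__ void kernel_inlined(float* out, const float* in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = scale_inlined(in[i], blockIdx.x);  // no call expected here
}

// Runs both kernels and checks that they agree with a host-side reference.
// Returns false on any CUDA error or mismatch.
bool run_demo() {
    const int threads = 128, blocks = 2, n = threads * blocks;  // illustrative sizes
    float h_in[n], h_a[n], h_b[n];
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }
    if (cudaMalloc(&d_b, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); cudaFree(d_a); return false; }

    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    kernel_with_call<<<blocks, threads>>>(d_a, d_in);
    kernel_inlined<<<blocks, threads>>>(d_b, d_in);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);

    for (int i = 0; ok && i < n; ++i) {
        float expected = h_in[i] * static_cast<float>(i / threads + 1);
        ok = (h_a[i] == expected) && (h_b[i] == expected);
    }
    return ok;
}

int main() {
    bool ok = run_demo();
    std::printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Building this with `nvcc` and dumping the SASS with `cuobjdump -sass` should let you compare the two kernels side by side. If the CALL version shows the same kind of dependency on the register loaded from SR_CTAID.X in Nsight Compute and the inlined version does not, that would support the explanation above, though I have not confirmed it myself.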