I apologise if this is a novice question already covered here or in the programming guide, but my searches have me dumbfounded still.
When I inspect my kernel variables with Parallel Nsight 1.51, some values appear as ???. But after I do some arithmetic on that value, it yields an actual (and correct) value. The value stored in the first variable must therefore be correct, but why does it not show in the first instance?
Both calculations use kernel argument parameters, passed in by value. A portion of the kernel code is like below:
__global__ void myKernel(uint3 domainDim, uint3 superCellDim, ... ) {
...
const unsigned int numCells = domainDim.x * domainDim.y * domainDim.z; // Total number of cells in my domain
const unsigned int numSuperCells = numCells / (superCellDim.x * superCellDim.y * superCellDim.z);
...
}
In this case inspecting with Nsight shows that numCells's value is ???:
The nature of code generation on the GPU means that some variables only ever exist in register. Because of register reuse, it can mean that variables in C code are not “live” all the time, because the register they were assigned has been reused. And because variables are not available at all times, the C source debugger cannot display their values at all times. In cuda-gdb, the debugger provides a warning of the form “variable not live, displaying garbage” to let you know that you aren’t looking at the register in the state it was when the variable of interest was using it. I am guessing that is what the ??? nsight is reporting mean. In your specific example, it is possible that numCells is never used after numSuperCells is calculated, so the assembler has chosen to use the same register for both.
If you need this for debugging, one possibility is to declare both as volatile. That should force them into local memory, which will be slower, but should eliminate the register re-use problem.
avidday, thanks for your insight. you are indeed correct!
When I break immediately after the numCells = … line, the register still holds the expected value. Stepping into the numSuperCells = … line sees the value of numCells turn into a ‘???’, as before. I can be fairly confident your statement is correct since Nsight inspector now shows that BOTH these variables turn red, so the register has been reassigned from holding the value of numCells to hold the value of numSuperCells now.
As an aside, numCells is in fact used ONCE more, further down in my code, as the guard in my while loop (in the typical manner of while(tid<N)…).
Would I be correct in assuming that to save registers the compiler simply hardcodes in the value it obtained for numCells before the register reassignment?
It might do that, but that would imply that numCells was fully defined at compile time. That doesn’t look to be the case (although of coarse I could be wrong). Another possibility is that the compiler is doing some else - repeating the calculation, for example, or doing a local memory store of the result in a “copy” and reading the value back later on. The compiler is also pretty good at re-arranging code to eliminate dependencies, so it might have worked out a crafty way to not need numCells again, even though the original C implies that.
The only way to be sure is to have a look at the PTX output from the compiler, and then to disassemble the output from the assembler and see what is really going on.
Indeed, as you suspect, numCells isn’t defined at compile time, so it can’t be hardcoded. Perhaps some variable rearrangement is taking place.
While I’m not there yet, the intention is to eventually assign superCellDim dynamically at runtime based on the size of the input data (and hence numCells).
Threadblock(s) work on “superCells” [font=“Courier New”][1][/font] and within these, cells map to one or more thread(s). The intention is to optimise the kernel launch based on the available device properties.
With all this behind-the-scenes tinkering by the compiler, is it possible to turn it off completely?
[font=“Courier New”][1][/font] I know superCell sounds a bit wanky but because it won’t necessarily be a 1-to-1 mapping to a threadblock, I originally distinguished them as cellBlocks… until one day I realised how ridiculous the name was. So I changed it, cos I didn’t want to change name the grid ‘prison’.
And because sometimes the dimensions of the domain are not exact multiples of the superCellDim, it had to pad some with zeros.
During that time I called these paddedCellBlocksExternal Image