My understanding of the “built-in variables” is that they are intermediate values evaluated prior to execution for each grid/block/thread.
Now I would like to know whether even a complex address calculation like the one below will be compiled into a single constant value:
float val = sharedMem[ (threadIdx.x>>1)*40+(threadIdx.y>>2)*20+(threadIdx.y & 3) ];
No. threadIdx is a value in shared memory, so that address calculation will result in many shared memory reads plus all the shifts and multiplies. You can verify this yourself by compiling a simple kernel with the -keep option and reading the ptx.
If the threadIdx variable is part of shared memory, doesn’t that mean the threads of a half-warp will encounter bank conflicts every time threadIdx is needed? … or is the use of threadIdx actually a broadcast access, where all 16 threads read the same variable, thereby avoiding a 16-way bank conflict?
Well, I’ve never actually looked at the exact addressing done to read threadIdx, but I would assume that it is done properly to avoid shared memory conflicts. Reads of the *Dim variables can easily be done via the broadcast mechanism.
Now that I think a little more on it, I’m fairly certain that blockDim, gridDim, and blockIdx come from shared memory, but I’m not sure about threadIdx. That would require 4 bytes of shared memory per thread in the block, a usage of shared memory I’ve never noticed.
In any case, the compiler still cannot simply optimize all the threadIdx references away. After all, the same code is being executed on all ALUs in the multiprocessor, so there must be some mechanism by which they read threadIdx.
threadIdx is initially in register R0 of each thread (the x and y components are in the 16 LSBs and MSBs, respectively), not in shared memory. It is placed there by the hardware on invocation of each block. If the kernel doesn’t use threadIdx, the compiler may choose to use the register for something else.
blockIdx, blockDim, and gridDim are passed as parameters in shared memory, because all threads in a block will read the same location when these are read.
For the maximum of 512 threads per block, does that mean 512 registers from the SM’s register file are set aside for this? Are these 512 registers separate from the 8192 (?) registers per SM?
They are regular registers, they are just initialized with threadIdx. Since they are regular registers, they can be overwritten by other values if your program doesn’t use threadIdx, or for example, if it uses only some constant function of threadIdx:
a = g[2 * threadIdx.x + 1];
…
b = f[2 * threadIdx.x + 1];
In this case the compiler would probably replace the initial value of R0 with 2 * R0 + 1.
So, if we have a block size of 256 threads, for example, does that mean that the number of registers available for the blocks active in each SM must now be reduced to (8192 - 256)?
The register count is still 8192. It’s just that the minimum register usage by any kernel is 1.
Think about it this way: even if threadIdx wasn’t in a register to begin with, it would have to be moved into a register at some point in order to serve as an operand to the instructions of the program.
Ok, but if I’m calculating the number of registers my code can use per SM, I guess I should always remember to decrease that number (8192) by the number of threads in a block, since each thread in the block will need a register for threadIdx. Isn’t that right?
In the case of blockIdx/blockDim/gridDim you would be right, I assume, because when those are read, all threads in the block access the same shared memory variable. This is an optimized case, free of bank conflicts.
I think an earlier possibility is to run two kernels concurrently, with each kernel using some of the multiprocessors. I believe this was even hinted at by some NVIDIA employee.
Well, I think the reason is simply practical: all kernel parameters are passed in shared memory. Also, writing to constant memory requires a DMA operation, whereas shared memory is automatically set up for each block.