Address evaluation: are threadIdx and blockDim treated as constants?


My understanding of the “built-in variables” is that they are essentially intermediate values that will be evaluated prior to execution for each grid/block/thread.

Now I would like to know if even complex address calculations like below will be compiled into one single constant value:

float val = sharedMem[ (threadIdx.x>>1)*40+(threadIdx.y>>2)*20+(threadIdx.y & 3) ];

Would the machine code look something like this?

mov reg0, memPtr, constValue

Thanks for your help,


No. threadIdx is a value in shared memory, so that address calculation will result in many shared memory reads plus all the shifts and multiplies. You can verify it yourself by compiling a simple kernel with the -keep option and reading the generated PTX.
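Either way, the original question has a simple answer: the index expression depends on the thread coordinates, so it cannot fold to one constant. A plain C stand-in (hypothetical helper name; tx/ty stand for threadIdx.x and threadIdx.y) makes this visible:

```c
/* A C stand-in for the index expression from the original post.
 * Because the result depends on the thread coordinates, it cannot be
 * folded into a single compile-time constant. */
unsigned shared_index(unsigned tx, unsigned ty) {
    return (tx >> 1) * 40 + (ty >> 2) * 20 + (ty & 3);
}
```

For example, shared_index(0, 0) is 0 while shared_index(2, 0) is 40, so each thread computes its own offset at run time.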


If the threadIdx variable is part of shared memory, doesn’t that mean the threads of a half-warp will hit banking issues every time threadIdx is needed? … or is the use of threadIdx actually a broadcast access, where all 16 threads read from the same variable, thereby preventing a 16-way bank conflict?

Well, I’ve never actually looked at the exact addressing done to read threadIdx, but I would assume that it is done properly to avoid shared memory conflicts. Reads of the *Dim variables can easily be done via the broadcast mechanism.
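To make the broadcast point concrete, here is a small C sketch of the bank mapping, assuming the 16-bank, 32-bit-word layout of compute 1.x hardware (hypothetical helper name):

```c
/* Shared memory bank of a byte address, assuming 16 banks of 32-bit
 * words (G80-class hardware): successive words map to successive banks.
 * A half-warp reading 16 DIFFERENT words in the same bank conflicts,
 * but 16 reads of the SAME address are served in one cycle by the
 * broadcast mechanism. */
unsigned bank_of(unsigned byte_addr) {
    return (byte_addr / 4) % 16;
}
```

So if all 16 threads read the one shared memory word holding blockDim, they all hit the same address and the broadcast path applies.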

Now that I think a little more on it, I’m fairly certain that blockDim, gridDim, and blockIdx come from shared memory, but I’m not sure about threadIdx. That would require 4 bytes of shared memory per thread in the block, a usage of shared memory I’ve never noticed.

In any case, the compiler still cannot simply optimize all the threadIdx references away. After all, the same code is being executed on all ALUs in the multiprocessor, so there must be some mechanism by which the threads read threadIdx.

threadIdx is initially in register R0 of each thread (the x and y components are in the 16 LSBs and MSBs, respectively), not in shared memory. It is placed there by the hardware on invocation of each block. If the kernel doesn’t use threadIdx the compiler may choose to use the register for something else.
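The packing described above can be illustrated with a small C sketch (hypothetical helper names; the layout, x in the low 16 bits and y in the high 16 bits, is as stated in the post):

```c
/* Unpack threadIdx.x / threadIdx.y from a 32-bit value laid out the
 * way the post describes R0: x in the 16 LSBs, y in the 16 MSBs. */
unsigned r0_x(unsigned r0) { return r0 & 0xFFFFu; }
unsigned r0_y(unsigned r0) { return r0 >> 16; }
```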

blockIdx, blockDim, and gridDim are passed as parameters in shared memory, because all threads in a block will read the same location when these are read.



For the maximum of 512 threads per block possible, does it mean that you have 512 registers from the Register File in the SM set aside for this? Are these 512 registers separate from the 8192 (?) number for each SM?

They are regular registers, they are just initialized with threadIdx. Since they are regular registers, they can be overwritten by other values if your program doesn’t use threadIdx, or for example, if it uses only some constant function of threadIdx:

a = g[2 * threadIdx.x + 1];

b = f[2 * threadIdx.x + 1];

In this case the compiler would probably replace the initial value of R0 with 2 * R0 + 1.
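As a plain C sketch (hypothetical names; g and f stand for the arrays in the example above), the transformation amounts to computing the derived index once and reusing it, since the raw thread index is never needed again:

```c
float g[64], f[64];   /* stand-ins for the arrays in the example */
float a, b;

/* tid stands for threadIdx.x. The compiler can overwrite the register
 * holding tid with 2*tid + 1 and use that value for both loads. */
void do_loads(unsigned tid) {
    unsigned idx = 2u * tid + 1u;
    a = g[idx];
    b = f[idx];
}
```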


So, if we have a block size of 256 threads, for example, does that mean that the number of registers available for the blocks active in each SM must now be reduced to (8192 - 256)?

The register count is still 8192. It’s just that the minimum register usage by any kernel is 1.
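A back-of-the-envelope check (hypothetical helper, assuming the 8192-register file per SM of that generation): the number of blocks that can be resident on an SM is bounded by the register budget, with at least one register per thread since threadIdx starts out in R0.

```c
/* Maximum resident blocks per SM, limited only by registers.
 * Assumes 8192 registers per SM (G80-class); regs_per_thread >= 1
 * because threadIdx occupies R0 at kernel start. */
unsigned max_blocks_by_regs(unsigned regs_per_thread,
                            unsigned threads_per_block) {
    return 8192u / (regs_per_thread * threads_per_block);
}
```

For example, a kernel using 10 registers per thread with 256-thread blocks is limited to 3 blocks per SM by registers alone.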

Think about it this way: even if threadIdx wasn’t in a register to begin with, it would have to be moved into a register at some point in order to serve as an operand to the instructions of the program.


Ok but, if I’m calculating the number of registers that my code can use per SM, I guess I should always remember to decrease that number (8192), by the number of threads in a block, since each thread in the block will need a register for threadIdx. Isn’t that right?

Just read the register count in the cubin output from nvcc -cubin. It will include the register used by threadIdx.

Or add this to your nvcc command line: --ptxas-options=-v

I don’t remember who posted it, but it is way faster than searching through the cubin file for the information you need.

In the case of blockIdx/blockDim/gridDim you would be right, I assume: when these are read, all threads in the block access the same shared memory variable. That is an optimized case, free of bank conflicts.

I was wondering: since gridDim and blockDim are constants across all multiprocessors (given how a kernel is launched), why are these shared memory variables and not constant variables? Perhaps, in the future, the user will be able to vary the block sizes across processors? Thanks.


I think an earlier possibility is running two kernels concurrently, with each kernel using some of the multiprocessors. I believe this was even hinted at by an NVIDIA employee.

I think the more plausible reason is that shared memory is, in general, much faster than constant memory.

Is it? Reading from constant memory is as fast as a register if the value is already in the cache. Quote from the programming guide:

“For all threads of a half-warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address.”

So another reason might be to not ‘pollute’ the constant cache.

Well, I think the reason is just practical, as all kernel parameters are passed in shared memory. Also, writing to constant memory requires a DMA operation, whereas shared memory is automatically set up for each block.