How fast is threadIdx.x?


I'd like to know whether it's quicker to store values like threadIdx.x in registers if they are going to be used multiple times, or whether it is just as fast to call threadIdx.x each time it is needed, thereby saving registers.

For example:

// Inside kernel:

	// Block index
	int bx = blockIdx.x;
	int by = blockIdx.y;

	// Thread index
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	// Grid dimensions
	int gx = gridDim.x;
	int gy = gridDim.y;

	// Element index
	int I = bx*BLOCK_SIZE_SQ*gy + by*BLOCK_SIZE + tx*BLOCK_SIZE*gy + ty;

// Example use:

	// Ignore first row and first and last columns
	if( (by == 0 && ty == 0) || (k == 0 && bx == 0 && tx == 0) || (k == N && bx == gx-1 && tx == BLOCK_SIZE-1) )
	{	}
	else if( act_s[ty+1][tx] && act_s[ty][tx] )
	{
		float ttmp = gplus[I-1] + gminus[I];
		tvde[I] = ttmp * detm1[I];
		tvdu[I] = ttmp * dqxm1[I];
		tvdv[I] = ttmp * dqym1[I];
	}


You don't need to worry about that. threadIdx.x[y,z] is a special register, and the compiler will copy it to a 32-bit register.

If you still feel uncomfortable, you can check the assembly code with cuobjdump yourself.
(cuobjdump is included in the CUDA 4.0 RC; you must be a registered developer to download RC4.0.)

Thank you for the fast reply!

So that basically means I can get rid of the six registers (tx, ty, bx, by, gx, gy) and just call threadIdx.x[y], blockIdx.x[y], etc. every time instead? Essentially freeing up more registers to use elsewhere? Excellent.

No, it basically means that the compiler will generate the same code whether you copy threadIdx.x to an extra variable or use it directly. The compiler is quite smart.
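To illustrate the point, here is a hedged sketch of the two styles under discussion (the kernel names and the scale parameter are invented for this example). Compiling both and comparing the disassembly should show essentially identical machine code:

```cuda
// Style A: read the special register directly at each use.
__global__ void scaleDirect(float *data, float s)
{
    data[threadIdx.x] = data[threadIdx.x] * s + threadIdx.x;
}

// Style B: copy the index into a local variable first.
__global__ void scaleCopied(float *data, float s)
{
    int tx = threadIdx.x;  // the compiler maps this to the same register
    data[tx] = data[tx] * s + tx;
}
```

After building with nvcc, you can dump the machine code for both kernels with cuobjdump (e.g. `cuobjdump -sass` in recent toolkits) and compare.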

tx, ty, bx, by, gx, gy are C variables, not registers. It is up to the compiler to decide how to map your data to hardware registers when the PTX is assembled to machine code.

What I am saying is valid for compute 1.x devices. I have not really optimized anything for Compute 2.x yet.

The threadIdx and blockIdx variables are 16-bit unsigned shorts and are internally stored in shared memory (although this is hidden from the PTX). However, for arithmetic operations the 16-bit values often have to be converted to 32 bits first, which may create a tiny bit of overhead each time this happens.

If you really intend to hold values in registers, the volatile keyword can be useful.

volatile unsigned int tx = threadIdx.x;

In your case this is probably only advisable for the index I, which is the result of a computation involving several thread and block indices.

volatile int I = …
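To make the suggestion concrete, here is a minimal sketch of how the trick might slot into the kernel from the original post (the kernel name, signature, and BLOCK_SIZE value are invented for illustration; whether it actually helps has to be verified case by case):

```cuda
#define BLOCK_SIZE    16                        // illustrative value
#define BLOCK_SIZE_SQ (BLOCK_SIZE * BLOCK_SIZE)

__global__ void tvdKernel(const float *gplus, const float *gminus,
                          const float *detm1, float *tvde)
{
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int gy = gridDim.y;

    // volatile asks the compiler to materialize I in a register once,
    // rather than rematerializing the index expression at every use.
    volatile int I = bx*BLOCK_SIZE_SQ*gy + by*BLOCK_SIZE
                   + tx*BLOCK_SIZE*gy + ty;

    float ttmp = gplus[I-1] + gminus[I];
    tvde[I] = ttmp * detm1[I];
}
```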

I have observed reduced overall register use and in some cases improved performance when using the volatile keyword at strategically chosen locations. When doing so, I always inspect the resulting PTX assembly and the overall register count reported by nvcc. Sometimes things change for the worse, and sometimes things change for the better with the “volatile trick”. Of course I keep the changes only if things have improved ;)