Analysing the registers


I am keen on knowing how the registers are assigned in CUDA kernel, because in my code registers is the limiting factor. My kernel is using way too many registers than expected. So, I tried to analyse register usage for a simple vector addition code.

#define L 1024

__global__ void add(int *A_dev, int *B_dev, int *C_dev)


	int tid = threadIdx.x + blockIdx.x * blockDim.x;

	while (tid<L)



		tid += blockDim.x * gridDim.x;



int main (void)


	int A[L],B[L],C[L];

	int *A_dev,*B_dev,*C_dev;

	for (int i=0;i<L;i++)






	// allocate the memory on the GPU

	cutilSafeCall(cudaMalloc( (void**)&A_dev, L * sizeof(int) ) );

	cutilSafeCall(cudaMalloc( (void**)&B_dev, L * sizeof(int) ) );

	cutilSafeCall(cudaMalloc( (void**)&C_dev, L * sizeof(int) ) );


	cutilSafeCall(cudaMemcpy(A_dev, A, L *sizeof(int) , cudaMemcpyHostToDevice));

	cutilSafeCall(cudaMemcpy(B_dev, B, L *sizeof(int) , cudaMemcpyHostToDevice));

	add<<< (L/512),512 >>> (A_dev,B_dev,C_dev);

	( cudaThreadSynchronize() );

	cutilSafeCall(cudaMemcpy(C,C_dev, L*sizeof(int), cudaMemcpyDeviceToHost));


	for (int i=0;i<L;i++)


		printf("C[%d] is %d\n",i,C[i]);


	cudaFree( A_dev );

	cudaFree( B_dev );

	cudaFree( C_dev );

	return 0;


I compiled the code using --ptxas-options=-v flag and got the following

ptxas info : Compiling entry function ‘Z3addPiS_S’ for ‘sm_20’

ptxas info : Function properties for Z3addPiS_S

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 13 registers, 56 bytes cmem[0]

Where are 13 registers being used? Is there any way to actually find out how the registers are assigned?

Any help would be appreciated.



Yes your kennel is using 13 registers. Every variable you define in kernel uses registers. For example the int tid takes 1 register, but the operations use registers as well.

Add a [font=“Courier New”]launch_bounds()[/font] directive to the kernel to limit register use (see appendix B.18 of the Programming Guide).

Compile with [font=“Courier New”]nvcc -cubin[/font] and use [font=“Courier New”]cuobjdump -sass[/font] to disassemble the kernel and see exactly what each register is used for.

As tera alrready stated, to find out how registers are being used simply disassemble the generated machine code with cuobjdump and backannotate the source code. Please note that this test case is not particularly instructive for understanding register usage because for sm_2x the compiler has no incentive to try and reduce register usage if fewer than 16 registers are used. In fact the compiler will ignore any directives that try to push register usage below 16 registers on sm_2x (and will print an advisory message to that effect). The reason is that full occupancy can be achieved on sm_2x devices with 16 registers per thread, as there are 32768 registers and at most 1536 threads per SM.

Thank you all for the replies!

I found out that using ‘–use_fast_math flag’, number of registers used per thread can be reduced. I executed a same code on a GTX 560Ti and a Tesla C2070. I expected a better performance on Tesla. But that is not the case!! In fact, on Tesla the code performs better( but still worse than GTX 560Ti) in terms of timing without fast_math flag. Am I missing something here?

I am attaching the kernel code

__global__ void calculateCD1(int i)


	const float eps=1.0e-16;

	__shared__ float temp11sh[512];

	__shared__ float temp12sh[512];

	__shared__ float temp21sh[512];

	__shared__ float temp22sh[512];

	__shared__ float temp13sh[512];

	__shared__ float temp14sh[512];

	__shared__ float temp23sh[512];

	__shared__ float temp24sh[512];


	int ind,offset;

	int tid = threadIdx.x + blockIdx.x * blockDim.x;


	for (int iter=0;iter<28;iter++)


			for (int k=0;k<stage;k++)


					if (tid<N/2*nblocks)




							temp11sh[ind]=R[(2*ind)*(1+stage)+k+N*(1+stage)*offset]; //shuffle


                                                        temp13sh[ind]=R[(2*ind)*(1+stage)+k+N*(1+stage)*offset]; //shuffle


							temp21sh[ind]=L[ind*(1+stage)+(k+1)+N*(1+stage)*offset]; //itersassign1

							temp22sh[ind]=L[(ind+N/2)*(1+stage)+(k+1)+N*(1+stage)*offset]; //iterassign1

							temp23sh[ind]=L[ind*(1+stage)+(k+1)+N*(1+stage)*offset]; //itersassign1

							temp24sh[ind]=L[(ind+N/2)*(1+stage)+(k+1)+N*(1+stage)*offset]; //iterassign1
























							R[ind*(1+stage)+k+1+N*(1+stage)*offset] = __fdividef((1+ temp11sh[ind]*temp12sh[ind]*temp22sh[ind]),(temp11sh[ind]+temp12sh[ind]*temp22sh[ind]));

							R[(ind+N/2)*(1+stage)+k+1+N*(1+stage)*offset] = __fdividef(temp12sh[ind]*(1+temp11sh[ind]*temp21sh[ind]),(temp11sh[ind]+temp21sh[ind]));

							R[ind*(1+stage)+k+1+N*(1+stage)*offset] = __fdividef((1+ temp13sh[ind]*temp14sh[ind]*temp24sh[ind]),(temp13sh[ind]+temp14sh[ind]*temp24sh[ind]));

							R[(ind+N/2)*(1+stage)+k+1+N*(1+stage)*offset] = __fdividef(temp14sh[ind]*(1+temp13sh[ind]*temp23sh[ind]),(temp13sh[ind]+temp23sh[ind]));


							L[(2*ind)*(1+stage)+k+N*(1+stage)*blockIdx.x]= __fdividef((1+ temp21sh[ind]*temp22sh[ind]*temp12sh[ind]),(temp21sh[ind]+temp22sh[ind]*temp12sh[ind]));

							L[(2*ind)*(1+stage)+k+N*(1+stage)*blockIdx.x+(1+stage)]= __fdividef(temp22sh[ind]*(1+temp21sh[ind]*temp11sh[ind]),(temp21sh[ind]+temp11sh[ind]));

							L[(2*ind)*(1+stage)+k+N*(1+stage)*blockIdx.x]= __fdividef((1+ temp23sh[ind]*temp24sh[ind]*temp14sh[ind]),(temp23sh[ind]+temp24sh[ind]*temp14sh[ind]));

							L[(2*ind)*(1+stage)+k+N*(1+stage)*blockIdx.x+(1+stage)]= __fdividef(temp24sh[ind]*(1+temp23sh[ind]*temp13sh[ind]),(temp23sh[ind]+temp13sh[ind]));







	tid += blockDim.x * gridDim.x;



Say I launch a kernel which uses the entire shared memory of 48KB. And all the computations in the kernel are done using the shared memory only. Let this kernel take ‘x’ registers per thread. Now I launch the same kernel with half the shared memory (24KB). Assuming that register usage is not the limiting factor, I will be able to launch 2 blocks. Now, let this kernel take ‘y’ registers per thread. Can we relate x and y?


-use_fast_math implies -prec_div=false. On sm_2x, by default every single-precision floating-point division or reciprocal maps to an IEEE-754 division / reciprocal. With -prec_div=false it instead maps to approximate versions of these operations. The register usage of the approximate versions is significantly lower than the accurate versions. If memory serves, the difference in register use is on the order of 8 registers.

But how come the timing performance is worse on Tesla C2070 compared to GTX560Ti. Both are Fermi devices. In fact Tesla C2070 should fare better because of the grater number of cores and SMs.
Tesla C2070 has 14SMs, 448 cores operating at 1.15GHz. GTX 560Ti has 8SMs, 384 cores operating at 1.66GHz.
So, theoretically shouldn’t there be a speedup of b(448/384)(1.15/1.66) ~ 1.4[/b] on Tesla card?


Well if the code you posted here is still the one you use for running your test, there’s no wonder GTX 560Ti runs faster:

#define L 1024


   add<<< (L/512),512 >>> (A_dev,B_dev,C_dev);

So you use only 2 threadblocks for your kernel, and therefore 2 SMs maximum. What use are the 14 SMs of the C2070 vs. the 8 SMs of the GTX 560Ti here?

Moreover, the GTX has an higher frequency, and (I haven’t check neither the memory frequency nor the bus width so the following might not apply) since it doesn’t support ECC correction, it’s memory bandwidth can get higher too.

Try a configuration where you at least use the whole hardware on both cards, and then compare the performances.

I believe the GTX560Ti is based on the sm_21 architecture. I have no experience with the sm_21 consumer parts at all. [earlier: I seem to recall that the sm_21 cores have higher throughput on a cycle-by-cycle basis than sm_20 cores due to some dual issue capabilities.] Strike that. After searching the internet I find that sm_21 has 48 cores serving each 32-thread warp, whereas for sm_20 there are 32 cores for each 32-thread warp. So the operation throughput for sm_21 is higher per thread, not per core. However my earlier comment below still applies.

Even assuming the per-cycle operation throughput of the two core types is identical, 384 cores @ 1.66 GHz would deliver 637.44 Giga-operations / second, while 448 cores @ 1.15 GHz deliver 515.2 Giga-operations/second, where I use the term Giga-operation to refer to 10**9 operations.

Thanks a lot guys. That was really helpful!