Number of registers per thread increases massively with loops

Hi, I’m currently working on a kernel that uses about 41 registers per thread. However, when I add a loop on top of the code, the number of registers in use gets multiplied by the loop count!
Why does the compiler keep allocating new registers instead of reusing old ones that are no longer needed?

And why does every array index that gets calculated occupy a new register? How can I fix that?

Please help
Thanks.

Are you accessing an array inside your loop?
If the array is in register space/local-memory space, then a register is used for each array index.
If the array is in shared memory, the compiler might nevertheless try to optimise by caching the array in registers. If that is the case, you could try __threadfence_block() in the loop or declare the shared array volatile.

This is all guesswork. Without seeing the actual code (possibly reduced to show only this behaviour, without all the unrelated stuff), it is hard to say what is happening.
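
To illustrate the register-array guess, here is a minimal made-up example (the kernel name, the array and the 8-element size are all invented, not taken from your code): when a small per-thread array is only indexed with values known at compile time, e.g. inside a fully unrolled loop, the compiler can keep every element in its own register, so register use grows with the number of distinct indices.

__global__ void perThreadArray(float* out)
{
	float acc[8];					/* small per-thread array */

	#pragma unroll					/* indices become compile-time constants */
	for(int i = 0; i < 8; i++)
		acc[i] = out[threadIdx.x + i];		/* each acc[i] can live in its own register */

	float sum = 0.0f;
	#pragma unroll
	for(int i = 0; i < 8; i++)
		sum += acc[i];

	out[threadIdx.x] = sum;
}

If the indices were computed at run time instead, the array could not stay in registers and would be spilled to local memory.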

Thanks for your reply. Here is the actual code.

The loop I’m talking about is the one at “for(line = 0; line < 2; line++)” … register usage just doubles with the loop count … so how do I fix that? And I want to know why it is using 41 registers per thread … I guess that’s too much for what’s written here.

__global__ void ComputeHistogram(int* Histogram, int x, int y)
{
	/* BINS, texReference and addWord() are defined elsewhere */
	__shared__ ushort SubHistograms[BINS*BINS*6];

	int t_id = threadIdx.x;		/* thread id */
	int w_id = threadIdx.y;		/* warp id */
	int b_id = blockIdx.x;		/* block id */
	int color, r, g, b;
	float total;
	int i, j, l, line;
	ushort* PerWarpHistogram = &SubHistograms[BINS*BINS*w_id];

	/* each thread puts a zero in BINS locations */
	/* to minimize bank conflicts here, each thread per half-warp should write to a separate 1024 */
	/* that is basically impossible to do here */
	#pragma unroll
	for(int l=0; l < BINS; l++)
		PerWarpHistogram[l*32 + t_id] = 0;

	__threadfence_block();

	#pragma unroll 1
	for(line = 0; line < 2; line++)
	{
		color = tex2D(texReference, x+t_id+w_id*blockDim.x, y+b_id+line);

		r = ((color >> 16) & 0xFFU);
		g = ((color >>  8) & 0xFFU);
		b = ((color >>  0) & 0xFFU);
		total = r + b + g + 1;
		r = ((color >> 16) & 0xFFU) / total * BINS;
		g = ((color >>  8) & 0xFFU) / total * BINS;

		addWord(PerWarpHistogram, r*32+g, (t_id % 32) << 11);

		for(l=0; l < BINS; l++)
			PerWarpHistogram[l*32 + t_id] &= 0x07FFU;

		__syncthreads();

		/* Add all the sub histograms */
		/* each warp adds a row from all of the histograms to the first */
		/* we have 32 rows .. we can use all the warps for the first 30 rows .. i.e. add 5 times */
		for(i=0; i < 5; i++)
			for(j=1; j < 6; j++)
				SubHistograms[(w_id+i*6)*BINS + t_id] += SubHistograms[(w_id+i*6+j*BINS)*BINS + t_id];

		/* use 6 warps to reduce the final 2*6 rows to 2*3 rows */
		/* row 30 for histogram (0)0 <- 0+1  (1)2 <- 2+3  (2)4 <- 4+5 */
		/* row 31 for histogram (3)0 <- 0+1  (4)2 <- 2+3  (5)4 <- 4+5 */
		SubHistograms[(30+w_id/3)*BINS + t_id + (w_id%3)*2*BINS*BINS] +=
		SubHistograms[(30+w_id/3)*BINS + t_id + ((w_id%3)*2+1)*BINS*BINS];

		/* use 2 warps to finalize the reduction to 2*1 rows */
		/* row 30 (0)0 <- 0 + 2 + 4 .. row 31 (1)0 <- 0 + 2 + 4 */
		if(w_id < 2)
			SubHistograms[(30+w_id)*BINS + t_id] +=
			SubHistograms[(30+w_id)*BINS + t_id + 2*BINS*BINS] +
			SubHistograms[(30+w_id)*BINS + t_id + 4*BINS*BINS];

		/* zero all but the first histogram */
		if(w_id != 0)
		{
			#pragma unroll
			for(l=0; l < BINS; l++)
				PerWarpHistogram[l*32 + t_id] = 0;
		}

		__syncthreads();
	}

	Histogram += b_id*BINS*BINS;

	__syncthreads();

	for(int i=0; i < BINS; i++)
		Histogram[i*BINS+t_id] = SubHistograms[i*BINS+t_id];
}

The compiler tends to unroll all loops with a known trip count. Maybe that is what causes the register usage. Try to put unroll 0 everywhere. And I doubt this code will work anyway, because the warps could be executed in a different order.

The logic of the code is fine, it’s just the number of registers in use … unroll 0 would make the compiler unroll the loop … you mean unroll 1 … which I’m already using in the code.

I need the assistance of an NVIDIA employee.

You do not use unroll 1 on the inner loops.
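
For example, applied to the inner loops of the kernel above it would look something like this (just a sketch of the change; which loops actually matter is best checked against the compiler’s register count):

		#pragma unroll 1
		for(l=0; l < BINS; l++)
			PerWarpHistogram[l*32 + t_id] &= 0x07FFU;

		#pragma unroll 1
		for(i=0; i < 5; i++)
		{
			#pragma unroll 1
			for(j=1; j < 6; j++)
				SubHistograms[(w_id+i*6)*BINS + t_id] += SubHistograms[(w_id+i*6+j*BINS)*BINS + t_id];
		}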

Guess what … it worked :-) … only 16 registers in use per thread now.

Thanks

Still, I don’t understand why loop unrolling would increase the register usage.
Maybe the compiler is trying to hoist the addresses from the unrolled inner loops outside the outer loop as well, since they do not depend on the line variable?
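
If so, a rough picture of what the compiler may be generating would be the following (this is an assumption, not verified compiler output; hist, a0…a3 and the 4-bin size are hypothetical stand-ins):

	/* rolled, the loop needs roughly one address/induction register:
	   for(l = 0; l < 4; l++)
		   hist[l*32 + t_id] &= 0x07FFU;                              */

	/* fully unrolled, and with the addresses hoisted out of the outer
	   "line" loop because they do not depend on it:                  */
	ushort* a0 = hist + 0*32 + t_id;	/* four addresses, each kept alive in */
	ushort* a1 = hist + 1*32 + t_id;	/* its own register across the whole  */
	ushort* a2 = hist + 2*32 + t_id;	/* outer loop                         */
	ushort* a3 = hist + 3*32 + t_id;

	for(int line = 0; line < 2; line++)
	{
		*a0 &= 0x07FFU;
		*a1 &= 0x07FFU;
		*a2 &= 0x07FFU;
		*a3 &= 0x07FFU;
	}

Scale that up to BINS addresses per unrolled loop and the register count climbs quickly; #pragma unroll 1 keeps the loops rolled, so only a couple of registers stay live across the outer loop.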