Compiler bug? Position of a statement causes the program to fail! The statement is "isolated"

I observed a strange behavior of nvcc with my cellular automaton kernel.

The kernel runs without errors in emulation mode, but in device mode it works or fails depending on where I put one statement in the code.

This statement has no effect on the rest of the code, and vice versa.

I get this error when it is in the wrong position:

-------->Cuda error: Kernel_11: too many resources requested for launch.

Here is the kernel … I modified the Sobel example for my purposes, but some of the variable names stayed the same …

__global__ void
CA_Kernel_11( Pixel *pSobelOriginal, unsigned int *cost, unsigned int *cost_temp,
              unsigned int *bit_mask, unsigned int *bit_mask_write, unsigned int pitch,
              int w, int h, float fScale )
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

	__shared__ unsigned int tmp[16];

	if(threadIdx.x == 0)
	{
		tmp[threadIdx.y] = bit_mask[y*(pitch/32)+(x/32)];
		bit_mask_write[y*(pitch/32)+(x/32)] |= 1<<threadIdx.x;
	}
	__syncthreads();

	if((tmp[threadIdx.y] & (1 << threadIdx.x)) >> threadIdx.x == 1)
	{
		unsigned int up      = (UINT_MAX-65540);
		unsigned int down    = (UINT_MAX-65540);
		unsigned int left    = (UINT_MAX-65540);
		unsigned int right   = (UINT_MAX-65540);
		unsigned int min_val = (UINT_MAX-65540);

		// center node
		unsigned int center = cost[y*pitch + x];
		center = tex2D( tex_cost, x, y);

		// -> the kernel runs when the statement is here <-

		//up = cost[(y-1)*pitch + x];
		up    = tex2D( tex_cost, x+0, y-1 );
		left  = tex2D( tex_cost, x-1, y+0 );
		right = tex2D( tex_cost, x+1, y+0 );
		down  = tex2D( tex_cost, x+0, y+1 );

		// this statement ---> (in this position the launch fails)
		if(threadIdx.x == 0){bit_mask_write[y*(pitch/32)+(x/32)] |= 1<<threadIdx.x;}

		// Weights
		up    += tex2D( tex, x+0, y-1 ) << 8;
		left  += tex2D( tex, x-1, y+0 ) << 8;
		right += tex2D( tex, x+1, y+0 ) << 8;
		down  += tex2D( tex, x+0, y+1 ) << 8;

		min_val = min(
			min(
			min(up & 0xffffff00, right & 0xffffff00),
			min(down & 0xffffff00, left & 0xffffff00)), center & 0xffffff00);

		if(min_val == (center & 0xffffff00))
		{
			cost_temp[y*pitch + x] = center;
		}
		else if(min_val == (right & 0xffffff00))
		{
			cost_temp[y*pitch + x] = right;
		}
		else if(min_val == (down & 0xffffff00))
		{
			cost_temp[y*pitch + x] = down;
		}
		else if(min_val == (left & 0xffffff00))
		{
			cost_temp[y*pitch + x] = left;
		}
		else if(min_val == (up & 0xffffff00))
		{
			cost_temp[y*pitch + x] = up;
		}
	}
}

The statement makes no sense in either position; I was looking for the problem and tried copying it to different positions …

I start the kernel with:

dim3 dimBlock(32, 16, 1);
dim3 dimGrid(iw / dimBlock.x, ih / dimBlock.y, 1);

CA_Kernel_11<<<dimGrid, dimBlock>>>(odata, cost, cost_temp, bit_mask, bit_mask_2, iw, iw, ih, fScale );

I don’t really know where the problem is; maybe I’m just too tired …

Best regards,

capjo

I got the error message (after I checked the kernel launch for an error) … w*f … my code was correct …
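For reference, a check along these lines right after the launch is enough to surface it (just a minimal sketch using the plain runtime API, placed in the host code that launches the kernel):

CA_Kernel_11<<<dimGrid, dimBlock>>>(odata, cost, cost_temp, bit_mask, bit_mask_2, iw, iw, ih, fScale );

// A failed launch (e.g. "too many resources requested for launch") is
// reported by cudaGetLastError() directly after the launch; errors that
// happen during kernel execution only show up after a synchronize.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
	printf("Cuda error: Kernel_11: %s.\n", cudaGetErrorString(err));

cudaThreadSynchronize();
err = cudaGetLastError();
if (err != cudaSuccess)
	printf("Cuda error: Kernel_11 (execution): %s.\n", cudaGetErrorString(err));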

Changing the position of ONE statement, with no semantic effect, causes nvcc to use an additional 7 registers per thread, and then the kernel cannot be launched because it would need more registers than the 8192 available on a compute capability 1.1 device. My block size was 512 and each thread used 21 registers: 512 × 21 = 10752 > 8192.

How can I force nvcc to reuse registers without spilling them to local memory???

What CUDA version are you using?

You can see why register usage is higher in the 2nd case. The kernel has to keep all the registers required for the if statement live while doing unrelated stuff. I’m not sure why nvcc doesn’t swap these things around automatically - I guess it doesn’t care about saving registers until you explicitly tell it to using the -maxrregcount flag. There may be a reason why you would want the load to be after the texture reads (though I can’t really think of a good one at the moment … maybe something internal to do with latency???).

I’m using CUDA 2.3.

I realized what caused the error … but I don’t know why nvcc shows this behavior.

The semantics of the code don’t change, but the register count grows to 21 registers per thread.

The position of this statement makes no sense for my algorithm … I was only trying to find the reason for the error, copied it to different positions, and then observed this strange behavior.

Are you using any optimisation flags when you compile? Without any, it may not look for changes in execution order to save registers. I dunno. It’s very clear why changing the order of execution changes the number of registers required; what is less clear is why nvcc didn’t reorganise things to minimise redundant register usage. I can’t think of any performance advantage to doing it the way specified, off the top of my head.

Low on registers? Try the new G200 series of cards. Those with compute capability 1.2 or higher.

And don’t even think about launching thread blocks of 512 threads with very complex kernels ;)

Thank you for your reply. 512 threads seemed like a good number :-).

Yes, I’m out of registers, but even when I try to save registers by reusing variables, the register count grows …

The compiler does things that I don’t want (it tries to optimize for … something else … speed?).

G200 would be a possible solution. Unfortunately, most PCs here in the company only have CUDA 1.1 devices (Quadro FX 1800).

I will switch to G200 if this problem is hard to solve … I will give it a second try tomorrow.

I work on a segmentation algorithm in the area of medical image processing (big data sets), and since memory on the GPU is very limited (it is also used for OpenGL at the same time), I have to balance memory consumption and speed. I pack an 8-bit value and a 24-bit value into one integer and unpack them when necessary. It’s quite ugly, but it works and is quite fast. That is probably one of the reasons for the high register count.
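Roughly, the packing works like this (only a sketch of the general scheme; the helper names are made up and not what the kernel above actually uses):

// Sketch: a 24-bit cost in the upper three bytes and an 8-bit value in the
// lowest byte of a single unsigned int. Names are illustrative only.
__host__ __device__ unsigned int pack24_8(unsigned int cost24, unsigned int val8)
{
	return (cost24 << 8) | (val8 & 0xffu);
}

__host__ __device__ unsigned int unpack_cost24(unsigned int packed)
{
	return packed >> 8;       // upper 24 bits
}

__host__ __device__ unsigned int unpack_val8(unsigned int packed)
{
	return packed & 0xffu;    // lowest 8 bits
}

The masking with 0xffffff00 in the kernel above is the same idea: only the 24-bit part is compared, and the low byte just tags along.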

In my experience, packing 8-bit values into one integer backfires into much more complex code and possibly more registers :P
The easiest way to reduce the register count is to set --maxrregcount to some value (e.g. 16); see the example invocation after the list below.
If you do so, the compiler will be bound by that value and will optimise the code with that limit in mind. That includes:

  • not caching some values from shared memory
  • recomputing some parts of the code. Note that normally the compiler can, for example, use a register to store “y*pitch”, since that expression repeats throughout the code. But if it runs out of registers, it will recompute that value instead.
  • if everything else fails, spilling the least-used registers into local memory. Local memory is as slow as global memory, but in memory-bound algorithms one or two registers spilled into local memory in exchange for a serious increase in occupancy can be worth it.
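For example (an illustrative invocation; the architecture and file names are placeholders, and --ptxas-options=-v additionally makes ptxas print the per-kernel register count so you can see the effect):

nvcc -arch=sm_11 --maxrregcount=16 --ptxas-options=-v -c CA_Kernel.cu -o CA_Kernel.o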

I will try your suggestions tomorrow.

One of the reasons for packing 8-bit values into integers is that coalesced reads/writes are not possible in many cases on CUDA devices <= 1.1.

CUDA devices > 1.1 would solve the problem of accessing shorts in a coalesced way … but as before, many devices here are <= 1.1 :-(.

Another solution would be to read the short values as uchar4 and process 4 “units” per thread. That would make the masking operations unnecessary.
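Something along these lines (just a sketch of the idea, not code I actually have; the kernel name and the per-unit operation are made up):

// Sketch: each thread does one coalesced 32-bit load as a uchar4 and then
// processes the four 8-bit "units" it contains, with no bit masking needed.
__global__ void process_four_units(const uchar4 *in, uchar4 *out, int n4)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n4)
	{
		uchar4 v = in[i];          // one coalesced 32-bit read per thread
		v.x = 255 - v.x;           // placeholder per-unit operation,
		v.y = 255 - v.y;           // each 8-bit unit handled separately
		v.z = 255 - v.z;
		v.w = 255 - v.w;
		out[i] = v;                // one coalesced 32-bit write per thread
	}
}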