Register usage too high: how can I reduce it?

My code only runs with 1 block per SM because my kernel currently uses 49 registers, on compute capability 1.3 (a Tesla C1060 in a Unix machine).

I need to get it down to around 32 registers so I can have 2 blocks running per SM, which I think will improve performance.

Does anyone know a good way of going about this? I looked through the PTX code, but I couldn’t make sense of where registers are assigned. If you have any tips, I’m sure I’m not the only one looking for answers.

Thanks!


Don’t look at registers in PTX. PTX is in SSA form; register allocation happens after the PTX stage.
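The numbers that actually matter are the ones ptxas reports. Assuming you build with nvcc, verbose ptxas output prints them per kernel:

nvcc -arch=sm_13 -Xptxas -v -c kernel.cu

This produces the “ptxas info : Used N registers, … lmem, … smem, …” lines you see quoted further down this thread.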

As a first measure, add a suitable __launch_bounds__() qualifier to the kernel (see appendix B.17 of the Programming Guide) to express your intent and see how the compiler manages. This may then lead to some extra local memory usage for register spills, which you might want to tackle as a next step.
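A minimal sketch of what that looks like (kernel name and body are just placeholders):

// Ask the compiler to limit register use so that at least 2 blocks of
// up to 256 threads each can be resident per SM.
__global__ void
__launch_bounds__( 256 /* maxThreadsPerBlock */, 2 /* minBlocksPerMultiprocessor */ )
myKernel( float* data )
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;  // placeholder body
}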

Thanks for the tip, tera. I’ve read through the manual a few times and never noticed this qualifier…

After trying it, my registers dropped to 32 as expected, but my lmem went up to 88 bytes from around 40. This version of the code was at about 42 registers before (my other version was about 49).

Any tips on how to figure out what has been “spilled” to lmem?

It seems the compiler has just moved almost all of the saved registers into local memory (10 registers saved, 48 more bytes of local memory used). It is difficult to give general advice. Can you post some code?


My kernel is quite large; I’ll try to simplify it and post it. I also have to check whether I’m allowed to post the code, since I don’t own the original code, I’ve just ported it to CUDA.

I have the following in my code:

#define MULTIPLIER 1

#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )

What I found weird was that when I change my MULTIPLIER constant, I get different register and cmem usage. I use SUB_FRINGE_SIZE to allocate memory on the device, e.g.:

cudaMalloc( (void **) &r_tableDevice, SUB_FRINGE_SIZE * GRID_SIZE * BLOCK_SIZE * sizeof(int) );

cudaMalloc( (void **) &fringeSize, SUB_FRINGE_SIZE * SUB_FRINGE_SIZE * sizeof(float) );

I’m using the following values currently in my code:

#define BLOCK_SIZE 16

#define GRID_SIZE 256

Using __launch_bounds__( BLOCK_SIZE*BLOCK_SIZE, 2 ) and

#define SUB_FRINGE_SIZE ( BLOCK_SIZE*GRID_SIZE*MULTIPLIER )

I get the following compiler output for each MULTIPLIER value:

#define MULTIPLIER 1
ptxas info    : Used 32 registers, 88+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]

#define MULTIPLIER 2
ptxas info    : Used 30 registers, 80+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]

#define MULTIPLIER 3
ptxas info    : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 116 bytes cmem[1]

#define MULTIPLIER 4
ptxas info    : Used 30 registers, 80+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 116 bytes cmem[1]

#define MULTIPLIER 5
ptxas info    : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]

#define MULTIPLIER 6
ptxas info    : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 116 bytes cmem[1]

Any ideas why changing that MULTIPLIER define would cause this? Something to do with power-of-2 allocations?

If I remove the __launch_bounds__ declaration, the register usage goes back up to 43 as expected, but when I fiddle with MULTIPLIER as before, the register usage does not change, only the cmem does.

While I can’t see the code that uses MULTIPLIER, it may well be that for power-of-2 values the compiler is able to replace a multiplication (or even a division) with a shift. I wouldn’t look into this particular detail too much.
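To illustrate with made-up values (nothing from your code):

int a = i * 4096;  // power of 2: can compile to a shift, i << 12
int b = i / 8;     // can also become a shift (plus a small fix-up for negative i)
int c = i * 3;     // not a power of 2: needs an actual multiply (or shift-and-add)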

Literal constants in code can either be encoded in an instruction’s immediate field, or placed in a constant bank (separate from the constant bank used for user constants). There are restrictions as to which constants fit into the immediate field of an instruction, and these restrictions may vary by instruction type. The compiler can also create derived constants. In general the compiler makes sensible choices about where to place literal constants (I don’t recall any issues with this since before Fermi shipped), so this is not something to worry about. If you want to know exactly what is happening, you can dump the SASS code (--dump-sass) and the constant bank contents (--dump-elf shows this, I think).
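With cuobjdump that would be something along these lines (the executable name is just an example):

cuobjdump --dump-sass myapp    # disassemble the embedded machine code (SASS)
cuobjdump --dump-elf  myapp    # dump the ELF sections, including the constant banks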

I think the more important question is: which version of your code is faster? The one using 43 registers, no spilling, running one thread block per SM, or the one using 32 registers, with spilling, running two thread blocks per SM?

In general the compiler will try to spill where it hurts performance the least, e.g. in a loop nest it tries to spill in the outermost loop.

I’m running the Tesla servers now to see which is faster; it takes a few hours.

njuffa: I’ll have to find some time to check out those dumps, it might be handy for my knowledge.

Here is the code. It’s slightly simplified, but all the same calculations are here. It’s very large and complicated, but I guess it gives you guys an idea.

I haven’t recalculated our bandwidth usage, but previously it was more than the card could handle. Can you use the CUDA profiler for this instead of doing it by hand? I’m also wondering whether we are limited by FLOPs, since we do a number of double-precision calculations and this card doesn’t like doubles. I’ll have to see what our bottleneck is; I just assumed it was the registers holding us back the most. Maybe this calculation just isn’t suited to GPGPU. I’ll be trying a shader implementation, hopefully in the new year. I’m aiming for a 10x speedup and currently have ~5x.

__global__ void
//__launch_bounds__( BLOCK_SIZE*BLOCK_SIZE, 2 )
calcImageNaive( const HoloPoint* pHPs,
                float* pImage,
                int*   pR_table,
                int    pXStart,
                int    pYStart )
{
    const unsigned int row = blockIdx.y * blockDim.y + threadIdx.y;
    const unsigned int col = blockIdx.x * blockDim.x + threadIdx.x;
    int numInit = (NUMTBL/(BLOCK_SIZE*BLOCK_SIZE))+1; // 4 if 16x16 blocks
    int i, po, idx;
    int theta;
    float amp0;
    float x0, y0, z0, x, y;
    float XXXX, XX, ZZZZ, YY, YYYY;
    int r;

    // xstart/ystart updated to work with streams
    int ystart = pYStart + row * YSIZE; // y start position in the entire fringe
    int xstart = pXStart + col * XSIZE; // x start position in the entire fringe
    int yStartLocal = row * YSIZE;      // start of the subset
    int xStartLocal = col * XSIZE;      // start of the subset

    int r_mid, r_diff;
    int xs = XSIZE;
    int ys = YSIZE;
    int phase;
    float phase0;

    // If the start positions are past the maximum width or height of the fringe pattern, return.
    if( xstart >= XPIXEL || ystart >= YPIXEL ||
        yStartLocal >= SUB_FRINGE_SIZE || xStartLocal >= SUB_FRINGE_SIZE )
    {
        return;
    }

    // Shrink if the tile goes over bounds in X
    if( xstart+xs > XPIXEL )
    {
        int shrinkBy = ( xstart+xs )-XPIXEL;
        xs = xs - shrinkBy;
    }
    else if( xStartLocal+xs > SUB_FRINGE_SIZE )
    {
        int shrinkBy = ( xStartLocal+xs )-SUB_FRINGE_SIZE;
        xs = xs - shrinkBy;
    }

    // Shrink if the tile goes over bounds in Y
    if( ystart+ys > YPIXEL )
    {
        int shrinkBy = ( ystart+ys )-YPIXEL;
        ys = ys - shrinkBy;
    }
    else if( yStartLocal+ys > SUB_FRINGE_SIZE )
    {
        int shrinkBy = ( yStartLocal+ys )-SUB_FRINGE_SIZE;
        ys = ys - shrinkBy;
    }

    __shared__ float pCOStbl[ NUMTBL ];

    // Load the cos table into shared memory
    for( int x = 0; x < numInit; x++ )
    {
        int indx = ( threadIdx.y*numInit*BLOCK_SIZE ) + (numInit*threadIdx.x) + x;
        if( indx >= NUMTBL )
            break;
        pCOStbl[ indx ] = (float)cos( (M_PI + M_PI) * (float)(indx + indx - 1) / (float)(2 * NUMTBL) );
    }

    // Initialise the image to be black
    initImage( pImage, xs, ys, row, col );

    double xref, yref;
    __syncthreads();

    for ( po = 0; po < NUMHOLOPOINTS; po++ )
    {
        float x1=0.0f, x2=0.0f, y1=0.0f, y2=0.0f, wk=0.0f, wkz=0.0f;
        float vw2=0.0f, vh2=0.0f;
        // Small offset from the holographic plane, otherwise there are problems with division by zero
        float zmin = 0.001f;
        int ix1=0, ix2=0, iy1=0, iy2=0;

        x0 = SCALE * pHPs[po].x * ( 1.0 );
        y0 = SCALE * pHPs[po].y * ( 1.0 );
        z0 = SCALE * pHPs[po].z * ( 1.0 ) - Z_OFFSET;

        phase0 = (float)( z0 * WAVENUMBER2 + pHPs[po].phase / (M_PI + M_PI) );
        phase = (int)( phase0 - (float)((int)(phase0)) ) * NUMTBL;

        // IMAGE_COLOR
        if( COLOR == 2 )
            amp0 = pHPs[po].g;

        wkz = V_DIST / (V_DIST - z0);
        vw2 = V_WIDTH * 0.5;
        vh2 = V_HEIGHT * 0.5;
        x1 =  vw2 - wkz * ( vw2 - x0 );
        x2 = -vw2 + wkz * ( vw2 + x0 );
        if (x1 > x2)
        {
            wk = x1;
            x1 = x2;
            x2 = wk;
        }
        y1 =  vh2 - wkz * ( vh2 - y0 );
        y2 = -vh2 + wkz * ( vh2 + y0 );
        if (y1 > y2)
        {
            wk = y1;
            y1 = y2;
            y2 = wk;
        }

        // Physical position of the start of this subsection of the image
        wk = ((float)(0 + xstart) - XPIXEL / 2) * X_PITCH;
        // Calculate the pixel indices
        ix1 = (int)((x1 - wk) / X_PITCH);
        if (ix1 < 0) ix1 = 0;
        ix2 = (int)((x2 - wk) / X_PITCH);
        if (ix2 > xs) ix2 = xs;
        if (ix1 > ix2) continue;

        // Determine the start and end Y, in final fringe pixel space
        wk = ((float)(0 + ystart) - YPIXEL / 2) * Y_PITCH;
        iy1 = (int)((y1 - wk) / Y_PITCH);
        if (iy1 < 0) iy1 = 0;
        iy2 = (int)((y2 - wk) / Y_PITCH);
        if (iy2 > ys) iy2 = ys;
        if (iy1 > iy2) continue;

        // ysize is the number of pixels per division
        ZZZZ = z0*z0;
        // Put y into proportion with the total scale
        y = (float)(((ys*0.5 + ystart) - YPIXEL * 0.5) * Y_PITCH);
        YY   = (y0 - y);
        YYYY = YY * YY;
        x = (float)(((xs*0.5 + xstart) - XPIXEL * 0.5) * X_PITCH);
        XX   = (x - x0);
        XXXX = XX*XX;
        r_mid = (int)( WAVENUMBER * sqrt(XXXX + YYYY + ZZZZ) );
        //pFringeRes[ threadIdx.y * 16 + threadIdx.x ] = WAVENUMBER;
        int j;

        for (j = ix1; j < ix2; j++ )
        {
            int indx = ( (SUB_FRINGE_SIZE*row) + (col * XSIZE+j) );
            x = (float)(((j + xstart) - XPIXEL / 2) * X_PITCH);
            XX   = (x - x0);
            XXXX = XX*XX;
            pR_table[ indx ] = -1*(int)( WAVENUMBER * sqrt( XXXX + YYYY + ZZZZ ) );
        }

        x = (float)(((xs*0.5 + xstart) - XPIXEL * 0.5) * X_PITCH);
        XX   = (x - x0);
        XXXX = XX*XX;
        xref = ((float)(xs / 2 + xstart) - XPIXEL / 2.0) * X_PITCH;

        // Loop through the pixels allocated to this thread and accumulate
        // this point's effect on each of them.
        for (i = iy1; i < iy2; i++)
        {
            int j = 0;
            int bigYOffset = 0;
            y = (float)(((i + ystart) - YPIXEL / 2) * Y_PITCH);
            YY   = (y0 - y);
            YYYY = YY * YY;
            r = (int)( WAVENUMBER * sqrt(XXXX + YYYY + ZZZZ) );
            r_diff = (int)(-1*(r - r_mid) + phase);
            bigYOffset = (row*(SUB_FRINGE_SIZE*YSIZE)) + (i*SUB_FRINGE_SIZE);
            yref = ((float)(i + ystart) - YPIXEL / 2.0) * Y_PITCH;
            for (j = ix1; j < ix2; j++)
            {
                float storedValue = 0.0f;
                int bigXOffset = ( ( (XSIZE)*col ) + j );
                theta = (int)( pR_table[ ( (SUB_FRINGE_SIZE*row) + (col * XSIZE+j) ) ]
                               - ((int)(WAVENUMBER * ((double)sqrt(( X_POINT - xref) * ( X_POINT - xref)
                                 + ((double)(Y_POINT - yref) * (Y_POINT - yref) + (double)Z_POINT * (double)Z_POINT))
                                 - ((double)(sqrt((double)Y_POINT * (double)Y_POINT + (double)Z_POINT * (double)Z_POINT))))))
                               + r_diff );
                idx = (theta) & (NUMTBL2);
                storedValue = (pCOStbl[ idx ] * (amp0 / (0.005 + r_mid)));
                pImage[ bigYOffset + bigXOffset ] += storedValue;
            }
        }
    }
}

From a quick glance, the most worthwhile optimization seems to be to move pR_table to shared memory, or to eliminate the table lookup altogether and recompute the data instead.
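A rough, untested sketch of the recompute variant for the innermost loop (YYYYmid and refTermI are names I am making up here: YYYYmid would hold the mid-row YYYY value the table was built with, refTermI the X_POINT/Y_POINT sqrt term hoisted out of the loop):

// Recompute the r_table entry instead of re-reading it from global memory:
float xj    = (float)(((j + xstart) - XPIXEL / 2) * X_PITCH);
float XXj   = xj - x0;
int   rTabj = -(int)( WAVENUMBER * sqrtf( XXj*XXj + YYYYmid + ZZZZ ) );
theta = rTabj - refTermI + r_diff;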

I’m also not sure the pCOStbl table is worth having. If you use cospi() instead of cos(), recalculation might be faster than a table lookup (and would save a lot of shared memory).
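The table entry for index idx is cos( 2*PI * (idx + idx - 1) / (2*NUMTBL) ), which is just cospi of (idx + idx - 1)/NUMTBL, so the lookup could become (untested):

// Replace the shared-memory lookup pCOStbl[ idx ] with a direct recomputation:
storedValue = cospif( (float)(idx + idx - 1) / (float)NUMTBL )
            * (amp0 / (0.005f + r_mid));   // 0.005f keeps this in single precision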

Finally, what is the initImage() call doing? Could that just be dropped?

Maybe remove pImage[ bigYOffset + bigXOffset ] += storedValue; from the loop?

I would definitely recommend using the profiler to zero in on the performance bottleneck; there is no point in optimizing things that do not affect performance in a meaningful way. I cannot readily tell what’s limiting this kernel.

I wonder whether (1) all uses of double-precision computation in this code are strictly necessary, and (2) whether all uses of double-precision computation in this code are intentional, or possibly the unintended side-effect of using double-precision literal constants like 0.5, 2.0, and M_PI. Use of double-precision computation will tend to increase both register usage and instruction count, plus double-precision throughput is much lower than single-precision throughput on sm_13. Could double-precision sqrt() be avoided in the innermost loop?
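To illustrate with a line from the posted kernel (assuming V_WIDTH expands to a float variable or float literal):

vw2 = V_WIDTH * 0.5;    // 0.5 is a double constant: the multiply is performed in double, then converted back
vw2 = V_WIDTH * 0.5f;   // stays entirely in single precision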

I second tera’s recommendation to try cospi(), or its single-precision relative cospif(). The register and instruction footprint of cospi() is somewhat smaller than the footprint of cos(), because the argument reduction is much simplified, and re-computation can beat table lookup.


tera: yeah, I think computing the r_table on the fly is possibly a good idea. I have moved a lot of things to be recomputed on the fly so far, and it has helped with bandwidth, but my FLOPs will take a beating. And yes, I’m not sure the cos table is worth being in shared memory; it gave a slight speed increase, maybe around 5%, by having it in shared. I’ll check out cospi().

initImage initialises the image passed in to all zeros. It is done in a function because I’m using streams, and for some reason there isn’t a cudaMemsetAsync function I could use (though I read there is one).
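For reference, where it is available the call would look like this (pImageDevice and stream being placeholder names for my device image pointer and stream):

// Clear the device sub-image in the stream the kernel launches in,
// instead of zeroing it inside the kernel via initImage():
cudaMemsetAsync( pImageDevice, 0,
                 SUB_FRINGE_SIZE * SUB_FRINGE_SIZE * sizeof(float),
                 stream );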

Lev: where would I move it to? That is where I’m storing the effect that each pHPs[po] has on that element of pImage.

My kernel is still running.

It seems the doubles are necessary at the moment, since if I use floats the output is much different; please view the attached images. (I’m not sure whether it is classified as incorrect output, I still need to determine this.)

The literal double constants are from the original code and I’m not sure they are needed; actually I’m pretty sure they aren’t. I’ll play with them.

Another annoying thing is that CUDA seems to deal with floats differently than its CPU counterpart. From what I read, certain assembly instructions truncate the value, which gives me different, but not necessarily incorrect, results.

I’ll change these constants and use cospi and see how I go.

We have a whitepaper that explains some of the reasons why floating-point results may differ between CPUs and GPUs, you may find it helpful:

http://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus

One of the reasons for CPU/GPU differences can be FMAD/FMA merging (where the compiler contracts an FMUL and a dependent FADD into FMAD/FMA). CUDA 4.1 has a new compiler switch -fmad={true|false} to turn that off globally for a compilation unit. I believe that the flag is limited to sm_2x and higher at the moment, otherwise you could run a quick check with -fmad=false (which disables the contraction) to see whether that makes a difference.
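On an sm_2x target the quick check is just a compile flag, e.g.:

nvcc -arch=sm_20 -fmad=false -o app app.cu    # disable FMUL+FADD contraction for the whole compilation unit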

As for reducing double-precision operations by converting literal constants to single precision, I’d first try that around expensive operations such as sqrt, division, and transcendentals, where it should make the biggest performance difference. Note that on sm_1x, single-precision sqrt and division are not rounded according to IEEE-754, which may contribute to differences you are seeing when switching to double precision. You could try the properly rounded single-precision intrinsics __fdiv_rn() and __frsqrt_rn() instead.
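Usage is a drop-in replacement (a, b, x standing for whatever operands the code has):

float q = __fdiv_rn( a, b );   // correctly rounded single-precision division, instead of a / b
float s = __frsqrt_rn( x );    // correctly rounded single-precision reciprocal square root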

Above all, it will be important to time and profile the code as changes are made, otherwise one could easily spend too much time on items that have little to no impact on performance.

Results:

43-register version (no __launch_bounds__):
real 369m20.235s
ptxas info : Used 43 registers, 40+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]

32-register version (__launch_bounds__, with spilling):
real 440m55.979s
ptxas info : Used 32 registers, 96+0 bytes lmem, 4128+16 bytes smem, 144 bytes cmem[0], 112 bytes cmem[1]

njuffa: thanks for the paper, I saw it a week or so ago but haven’t had a chance to read through it. With the __fdiv_rn() and __frsqrt_rn() functions, is there a flag for sm_13 that automatically uses these for me?

Agreed, I think the best way is for me to learn how to interpret the CUDA profiler output. I have been using it to monitor values. The main thing I don’t understand is, say I have a divergent branch count of 910179: is this good or bad? I’m mainly interested in seeing whether I’m maxing out the memory bandwidth or hitting the FLOP limit; does the CUDA profiler tell me things like this? I also run the profiler on a much smaller image so I don’t have to wait 6 hours for it to complete (it takes about 6 minutes instead), but I assume it gives me a general idea. If you guys have any tips on how to interpret the CUDA profiler output, or any good documents, I would very much appreciate it. I have the document from NVIDIA, but from what I read it just describes what the values are, not whether they are within a good or bad range, or what I should be looking for when reading the results.

I’m guessing the “glob mem overall throughput” is the total bandwidth my kernel is using? It is currently 5.98957 GB/s. I assume the max total throughput for my card is 72 GB/s, which means I’m currently way under?

I’m not sure what instruction throughput is, but it’s at 0.0674076.

(1) Toggling between approximate and IEEE-compliant single-precision division and square root via compiler flags is only possible for sm_2x targets (with the -prec-div and -prec-sqrt flags). For sm_1x the only way to get correctly rounded single-precision operations is via the intrinsics. Note that the correctly rounded versions are a lot slower than the approximate ones, as sm_1x doesn’t have single-precision FMA (fused multiply-add), which really helps to make these fast.

(2) As you note, a bandwidth of 6 GB/sec is low compared to what the card can provide, so that would indicate the code is not bound by GMEM throughput.
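If you do want a by-hand sanity check, the usual back-of-the-envelope estimate is

effective bandwidth (GB/s) = (bytes read + bytes written) / (kernel time in seconds * 10^9)

which you can then compare against the number the profiler reports.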

(3) I don’t remember the profiler metric for instruction throughput, but the number you report strikes me as low. Maybe tera has better information handy. What is the occupancy reported?

I think it would be helpful to watch a couple of recorded presentations by two of my colleagues on how to identify performance bottlenecks (there are additional useful presentations on this webpage):

http://developer.nvidia.com/gpu-computing-webinars
CUDA Optimization: Identifying Performance Limiters by Dr Paulius Micikevicius
CUDA Optimization : Instruction Limited Kernels with Live Q&A by Gernot Ziegler

njuffa: OK, thought so, the 6 GB/s is low, so I will assume I don’t need to reduce bandwidth. How can I determine whether I’m maxing out FLOPs?

My occupancy is 0.25 for my kernel function (which I guess matches a single 256-thread block resident per SM, i.e. 8 out of the 32 warps sm_13 can hold).

Thanks for the videos, I think I’ll have to give up another weekend!

Also I’m running 2 cards together.

Just to give you a bit more background:
What happens is I calculate a sub-part of a massive image of around 160,000 x 160,000 pixels for a medium size, but only allocate a sub-part of around 20,480 x 20,480 pixels, since that’s all the card can fit. Once it’s finished calculating, I async-memcopy the fringe off and run the kernel again on the next section, initialising the sub-part to 0’s in the kernel, hence the initImage function.
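Roughly like this, as a sketch (all names made up; the real code uses streams and is more involved):

// Each pass computes one sub-fringe on the device, then copies it out.
// Operations in the same stream serialize, so the next launch waits for
// the previous copy to finish.
for (int cy = 0; cy < YPIXEL; cy += SUB_FRINGE_SIZE) {
    for (int cx = 0; cx < XPIXEL; cx += SUB_FRINGE_SIZE) {
        calcImageNaive<<<grid, block, 0, stream>>>( dHPs, dImage, dRTable, cx, cy );
        cudaMemcpyAsync( hFringe, dImage, subFringeBytes,
                         cudaMemcpyDeviceToHost, stream );
        cudaStreamSynchronize( stream );  // before the host stitches hFringe into the full image
    }
}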

What is the sense of calculating the (int)(WAVENUMBER * (sqrt((X_POINT - xref) * (X_POINT - xref) + …) - …)) term inside the inner loop when it is loop-independent? Also, 0.25 is low occupancy; with 32 registers you should have 75% occupancy. You need to make a small test case and check everything.
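I.e. compute it once per i iteration, outside the j loop, something like this (untested, refTermI is a made-up name):

int refTermI = (int)( WAVENUMBER * ( sqrt( (X_POINT - xref) * (X_POINT - xref)
                                         + (Y_POINT - yref) * (Y_POINT - yref)
                                         + (double)Z_POINT * Z_POINT )
                                   - sqrt( (double)Y_POINT * Y_POINT
                                         + (double)Z_POINT * Z_POINT ) ) );
for (j = ix1; j < ix2; j++)
{
    theta = pR_table[ (SUB_FRINGE_SIZE*row) + (col * XSIZE + j) ] - refTermI + r_diff;
    // ... rest of the inner loop unchanged
}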