Lowering register usage

Hi everyone,

CUDA version: 2.0

Platform: OS X 10.5

The following CUDA kernel fails to launch giving the error:

too many resources requested for launch

I’ve restructured most of the code and have reduced the register count from about 48 to 20, but I still cannot reach the limit of 16 hinted at by the occupancy calculator. I’m using a lot of texture fetches and, in between, the same number of calls to the fminf function (which gives a reg count of 23). Surprisingly, using a MIN macro instead reduces the register count to 20. Any hints from here on will be highly appreciated. The kernel code follows; the texture fetches read the 26 neighbours of a voxel in a volume.

Thread block size: 8x8x8

grid size: 4096x1

.cubin:

    smem = 48

    reg = 20
    #define MIN(a, b) (((a) < (b)) ? (a) : (b))

    #define BLOCK_DIM 8
    #define SUBVOL_DIM 128

    texture<float, 3, cudaReadModeElementType> tex; // 3D texture

    __global__ void d_kernel(cudaPitchedPtr d_volPPtr, cudaExtent logicalGridSize, float d)
    {
      unsigned int __x, __y, __z;
      unsigned int pitchz;

      pitchz = logicalGridSize.width*logicalGridSize.height;
      __z = (unsigned int)floorf((float)blockIdx.x/(float)pitchz);
      __y = (unsigned int)floorf((float)(blockIdx.x - __umul24(pitchz, __z))/(float)logicalGridSize.width);
      __x = blockIdx.x - __umul24(pitchz, __z) - __umul24(logicalGridSize.width, __y);

      // compute local coordinates (within subvolume)
      __x = __umul24(BLOCK_DIM, __x) + threadIdx.x;
      __y = __umul24(BLOCK_DIM, __y) + threadIdx.y;
      __z = __umul24(BLOCK_DIM, __z) + threadIdx.z;

      if(__x < 1 || __x > (SUBVOL_DIM-2) || __y < 1 || __y > (SUBVOL_DIM-2) || __z < 1 || __z > (SUBVOL_DIM-2)) return;

      float x = (float)__x;
      float y = (float)__y;
      float z = (float)__z;

      float val = tex3D(tex, x, y, z);
      if(fabsf(val) <= (d - 1.0)) return; // this voxel is inside the distance band

      float *row = (float*)((char*)d_volPPtr.ptr + (__z*d_volPPtr.ysize + __y)*d_volPPtr.pitch); // output voxel: row[__x]

      float texval;
      float region_sign = (val > 0.0) ? 1.0 : -1.0;

      // Read the 6-neighbours that are at a distance of 1.0 units from the central voxel
      texval = region_sign*tex3D(tex, x-1.0, y, z);
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y, z));
      texval = MIN(texval, region_sign*tex3D(tex, x, y-1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x, y+1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x, y, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y, z+1.0));
      if(texval < region_sign*d)
      {
        row[__x] = region_sign*d;
        return;
      }

      // The 6-neighbours had Inf values, so read the planar 12-diagonal neighbours now
      texval = region_sign*tex3D(tex, x-1.0, y-1.0, z);
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y+1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y-1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y+1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y-1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y+1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y-1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y+1.0, z+1.0));
      if(texval < region_sign*d)
      {
        row[__x] = region_sign*(d + sqrtf(2.0));
        return;
      }

      // The 12-neighbours had Inf values, so lastly check the cubic 8-diagonals
      texval = region_sign*tex3D(tex, x-1.0, y-1.0, z+1.0);
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y-1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y+1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y+1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y-1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y-1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y+1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y+1.0, z-1.0));
      if(texval < region_sign*d)
        row[__x] = region_sign*(d + sqrtf(3.0));

      // If all the neighbours had Inf values, we reach here without assigning any value to row[__x] and return.
    }

Many thanks,

Ojaswa

(unsigned int)floorf((float)blockIdx.x/(float)pitchz); should give the same result as just blockIdx.x/pitchz, but without the casts to float; maybe it will be cheaper.

You could also try writing two versions of this code, one for region_sign == 1 and another for region_sign == -1, inside two big ifs. That should save you one register.

You could also try rewriting it to load the data plane by plane: store a plane in shared memory, process it, load the next plane, and repeat. That probably won't save you registers, but it might give you a speedup, as each value would be read fewer times. The loaded planes would be a little bigger than the calculated volume, and maybe some threads would compute useless data, but I think it is worth a try. The technique is described in detail in the convolutionSeparable whitepaper from the NVIDIA SDK (2D version only, but the same idea applies to 3D).
I tried this for computing a 3D convolution and got a great speedup compared to the CPU :).

Thanks frea!

That helps lower the reg count to 18, but the kernel still cannot be launched. I also suspect that, by using the MIN macro, I'm calling tex3D() twice as many times as required, since the macro expands (and thus evaluates) an argument twice. But if I use the fminf function or my custom function:

    __device__ float mymin(float a, float b)
    {
      return (a < b) ? a : b;
    }

the reg count increases. Is it because of the copying of function arguments? But if the GPU is happy fetching the texture twice, I have no problem with that.

Note that I would like to be able to use 3D texture for volume processing in this application.

Any other pointers?
Many thanks :)

Sometimes "__syncthreads" helps the compiler to optimize…

Just introduce a __syncthreads in a path that all threads execute.

Check it out.

No, that didn’t really help!

Oh…ok. And, this is my final tip – use it when you have no options –

  1. Split the kernel into two parts. (Kernel launch overhead is usually very, very minimal. I myself have seen a 112x speedup even with multiple (10) kernel launches…)

  2. Use the GTX 2xx series: NVIDIA increased the register count per multiprocessor in one of their latest hardware generations. I am sure they did, but not certain it was the GTX 2xx series (90% sure).

This might be a long shot, but for me it works:

Introduce the keyword “volatile” in front of any local variable (that you would expect the compiler to put into a register).

Often if you declare local variables without “volatile”, the compiler chooses to inline the expression by computing the value only when the variable is actually used. I found that this may lead to some redundant computation when the variable is used in several places.

Use of volatile prevents this and makes the compiler allocate the register right away and assign its value. In many instances I found it to reduce the amount of overall PTX code generated, as well as reducing the overall register use.

Christian

@cbuchner1,

That seems like a good observation. But how does making a variable volatile, which actually forces a register to be allocated for it, decrease the register count?

Many thanks Christian,

It worked like a charm. My reg count is now 12 after making x, y, z, and texval volatile and the kernel executes now :)

Thanks to everyone for helping me out.

Cheers,

Ojaswa

See http://forums.nvidia.com/index.php?showtopic=74752&hl=

Edit: well, I see now that you've actually participated in it! That's as good an answer as I have found, for now anyway.

And volatile doesn't force the variable into lmem or anything? (I would have thought the 'register' keyword would do this… stupid compiler.) Very cool, thanks for the PTX-massaging tip.

EDIT: I see I participated in that thread too. (wow, I have such great memory.) My question from there still stands. It looked like the trick was that the variable was being assigned a constant. ‘volatile’ made the value get retrieved from the constant address space each time instead of being stored in a register (as is the meaning of the keyword in other areas of C programming). I don’t think it’d work in the general case if you just want a register to hang around with one value the whole time.

Ailleur,

I see I have participated as well… My memory…

But there are still a lot of open questions out there.

Anyway, as long as it works, I really won't complain :-)

Thanks.

It doesn't seem that the volatile keyword forces the variable into local mem! Unfortunately, the keyword is poorly documented.

Looking at the OP source code… Hmm. Yeah, looks like it just forces a value to hang around in its own register. I’ll remember to give this a try.

I wonder whether this is intentional on NVIDIA's part or just a strange artifact of the compiler. I.e., is this going to break in the future?

It's good to ask questions as long as you don't expect an answer… Anyway, you won't get one here…