Lowering register usage

Hi everyone,

CUDA version: 2.0

Platform: OS X 10.5

The following CUDA kernel fails to launch giving the error:

too many resources requested for launch

I’ve restructured most of the code and have reduced the register count from about 48 to 20, but I still cannot reach the limit of 16 hinted at by the occupancy calculator. I’m using a lot of texture fetches and, in between, the same number of calls to the fminf function (which gives a reg count of 23). Surprisingly, using a MIN macro instead reduces the register count to 20. Any hints from here on will be highly appreciated. The kernel code follows; the texture fetches read the 26 neighbours of a voxel in a volume.

Thread block size: 8x8x8

grid size: 4096x1

.cubin:

    smem = 48

    reg = 20
    #define MIN(a, b) (((a) < (b)) ? (a) : (b))

    #define BLOCK_DIM 8
    #define SUBVOL_DIM 128

    texture<float, 3, cudaReadModeElementType> tex; // 3D texture

    __global__ void d_kernel(cudaPitchedPtr d_volPPtr, cudaExtent logicalGridSize, float d)
    {
      unsigned int __x, __y, __z;
      unsigned int pitchz;

      pitchz = logicalGridSize.width*logicalGridSize.height;
      __z = (unsigned int)floorf((float)blockIdx.x/(float)pitchz);
      __y = (unsigned int)floorf((float)(blockIdx.x - __umul24(pitchz, __z))/(float)logicalGridSize.width);
      __x = blockIdx.x - __umul24(pitchz, __z) - __umul24(logicalGridSize.width, __y);

      // compute local coordinates (within subvolume)
      __x = __umul24(BLOCK_DIM, __x) + threadIdx.x;
      __y = __umul24(BLOCK_DIM, __y) + threadIdx.y;
      __z = __umul24(BLOCK_DIM, __z) + threadIdx.z;

      if(__x < 1 || __x > (SUBVOL_DIM-2) || __y < 1 || __y > (SUBVOL_DIM-2) || __z < 1 || __z > (SUBVOL_DIM-2)) return;

      float x = (float)__x;
      float y = (float)__y;
      float z = (float)__z;

      float val = tex3D(tex, x, y, z);
      if(fabsf(val) <= (d - 1.0)) return; // this voxel is inside the distance band

      float *row = (float*)((char*)d_volPPtr.ptr + (__z*d_volPPtr.ysize + __y)*d_volPPtr.pitch); // output voxel: row[__x]

      float texval;
      float region_sign = (val > 0.0) ? 1.0 : -1.0;

      // Read the 6-neighbours that are at a distance of 1.0 units from the central voxel
      texval = region_sign*tex3D(tex, x-1.0, y, z);
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y, z));
      texval = MIN(texval, region_sign*tex3D(tex, x, y-1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x, y+1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x, y, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y, z+1.0));
      if(texval < region_sign*d)
      {
        row[__x] = region_sign*d;
        return;
      }

      // The 6-neighbours had Inf values, so read the planar 12-diagonal neighbours now
      texval = region_sign*tex3D(tex, x-1.0, y-1.0, z);
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y+1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y-1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y+1.0, z));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y-1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y+1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y-1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x, y+1.0, z+1.0));
      if(texval < region_sign*d)
      {
        row[__x] = region_sign*(d + sqrtf(2.0));
        return;
      }

      // The 12-neighbours had Inf values, so lastly check the cubic 8-diagonals
      texval = region_sign*tex3D(tex, x-1.0, y-1.0, z+1.0);
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y-1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y+1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y+1.0, z+1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y-1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y-1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x-1.0, y+1.0, z-1.0));
      texval = MIN(texval, region_sign*tex3D(tex, x+1.0, y+1.0, z-1.0));
      if(texval < region_sign*d)
        row[__x] = region_sign*(d + sqrtf(3.0));

      // If all the neighbours had Inf values, we reach here without assigning any value to row[__x] and return.
    }

Many thanks,

Ojaswa

(unsigned int)floorf((float)blockIdx.x/(float)pitchz); should give the same result as just blockIdx.x/pitchz, but without the casts to float; maybe it will be cheaper.

You could also try writing two versions of this code, one for region_sign == 1 and another for region_sign == -1, inside two big ifs. That should save you one register.

You could also try rewriting it to load the data plane by plane: store a plane in shared memory, process it, load the next plane, and repeat. That probably won't save you registers, but it might give you a speedup, as each value would be read fewer times. The loaded planes would be a little bigger than the calculated volume, and maybe some threads would compute useless data, but I think it is worth a try. The technique is described in detail in the convolutionSeparable whitepaper from the NVIDIA SDK (2D version only, but the same idea applies to 3D).
I tried this for computing a 3D convolution and got a great speedup compared to the CPU :).

Thanks frea!

That helps lower the reg count to 18, but the kernel still cannot be launched. I also suspect that, by using the MIN macro, I'm calling tex3D() twice as many times as required, since the macro expands (and thus evaluates) an argument twice. But if I use the fminf function or my custom function:

    __device__ float mymin(float a, float b)
    {
      return (a < b) ? a : b;
    }

the reg count increases. Is it because of the copying of function arguments? But if the GPU is happy fetching the texture twice, I have no problem with that.

Note that I would like to be able to use 3D texture for volume processing in this application.

Any other pointers?
Many thanks :)

Sometimes "__syncthreads" helps the compiler to optimize…

Just introduce a __syncthreads in a path that all threads execute.

Check it out.

No, that didn’t really help!

Oh…ok. And, this is my final tip – use it when you have no options –

  1. Split the kernel into two parts. (Kernel launch overhead is usually very, very minimal. I myself have seen a 112x speedup even with multiple (10) kernel launches…)

  2. Use the GTX 2xx series: NVIDIA increased the register count per multiprocessor in one of their latest hardware generations. I am sure they did, but not certain it was the GTX 2xx series (90% sure).

This might be a long shot, but for me it works:

Introduce the keyword “volatile” in front of any local variable (that you would expect the compiler to put into a register).

Often if you declare local variables without “volatile”, the compiler chooses to inline the expression by computing the value only when the variable is actually used. I found that this may lead to some redundant computation when the variable is used in several places.

Use of volatile prevents this and makes the compiler allocate the register right away and assign its value. In many instances I found it to reduce the amount of overall PTX code generated, as well as reducing the overall register use.

Christian

@cbuchner1,

That seems like a good observation. But how does making a variable volatile, which actually forces a register to be allocated for it, decrease the register count?

Many thanks Christian,

It worked like a charm. My reg count is now 12 after making x, y, z, and texval volatile and the kernel executes now :)

Thanks to everyone for helping me out.

Cheers,

Ojaswa

See http://forums.nvidia.com/index.php?showtopic=74752&hl=

Edit: well, I see now that you've actually participated in it! That's as good an answer as I have found, for now anyway.

And volatile doesn't force the variable into lmem or anything? (I would have thought the 'register' keyword would do this… stupid compiler.) Very cool, thanks for the PTX-massaging tip.

EDIT: I see I participated in that thread too. (wow, I have such great memory.) My question from there still stands. It looked like the trick was that the variable was being assigned a constant. ‘volatile’ made the value get retrieved from the constant address space each time instead of being stored in a register (as is the meaning of the keyword in other areas of C programming). I don’t think it’d work in the general case if you just want a register to hang around with one value the whole time.

Ailleur,

I see I have participated as well… My memory…

But there are still a lot of open questions out there.

Anyway, as long as it works, I really won't complain :-)

Thanks.

It doesn't seem that the volatile keyword forces the variable into local mem! Unfortunately, the keyword is poorly documented.

Looking at the OP source code… Hmm. Yeah, looks like it just forces a value to hang around in its own register. I’ll remember to give this a try.

I wonder whether this is intentional on NVIDIA's part or just a strange artifact of the compiler. I.e., is this going to break in the future?

It's good to ask questions as long as you don't expect an answer… Anyway, you won't get one here…