Minimizing register usage: compiler uses more registers than necessary

Hello,

I have a very frustrating problem, in which the compiler decides to keep some intermediate results in registers, while I’m trying to lower the register usage (because of occupancy issues).

At the beginning of the kernel I am copying a 2-D tile from global memory to shared memory.

    #define g_ofs ((((bid.y << columndiml2) + bid.x) << bdiml2) + (tid.y << columndiml2) + tid.x)
    #define s_ofs ((tid.y << bdiml2) + tid.x)
    #define xstep (1 << (bdiml2 - 1))
    #define systep (xstep << bdiml2)
    #define gystep (xstep << columndiml2)

    shared[s_ofs]                  = data[g_ofs];
    shared[s_ofs + xstep]          = data[g_ofs + xstep];
    shared[s_ofs + systep]         = data[g_ofs + gystep];
    shared[s_ofs + systep + xstep] = data[g_ofs + gystep + xstep];

    __syncthreads();

Then come the internals of the kernel, which do not use g_ofs/s_ofs etc. at all. At the end, after processing, I write the tile back:

    data[g_ofs]                  = shared[s_ofs];
    data[g_ofs + xstep]          = shared[s_ofs + xstep];
    data[g_ofs + gystep]         = shared[s_ofs + systep];
    data[g_ofs + gystep + xstep] = shared[s_ofs + systep + xstep];

I would have expected my macros to recalculate the expressions both times. But too bad: the compiler is “smart” enough to keep the results of the offset calculations around, wasting six registers along the way.

Is there some compiler flag, pragma, or other way of hinting to the compiler that I don’t want to keep these (trivial) results in registers, but just want to recalculate them?
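For the record, nvcc does expose a blunt instrument here: the `-maxrregcount` switch caps the registers allotted per thread. Note that the excess is spilled to (slow) local memory rather than recomputed, so it is not quite the “please recalculate” hint being asked for; a hypothetical invocation (file name is made up):

```shell
# Cap register usage at 16 per thread; whatever does not fit is
# spilled to local memory (not recomputed), so benchmark carefully.
nvcc -maxrregcount=16 -o mykernel mykernel.cu
```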

The compiler’s smartness seems to be based on literal text (or some intermediate representation) comparisons of code fragments. As other people on this board have already noticed, you can quite easily fool the common subexpression elimination just by reordering the text. Example:

#define g_ofs ((((bid.y << columndiml2) + bid.x) << bdiml2) + (tid.y<<columndiml2) + tid.x)

shared[s_ofs] = data[g_ofs];

.......

#define g_ofs_out (tid.x + (((bid.y << columndiml2) + bid.x) << bdiml2) + (tid.y<<columndiml2))

data[g_ofs_out] = shared[s_ofs];

Peter

We already thought about that here, but it seems like a bit of an ugly hack that could stop working at any moment with the next CUDA release. It would be nice if there were some official way to reduce register usage without exploiting holes in the compiler logic.

I agree. And it seems that nvopencc already supports different optimization levels; there is probably one that trades code size against registers against speed, or similar. Unfortunately the PTX is always generated with the same switches, as the comment at the top of each .ptx file suggests. I would expect NVIDIA to improve this in coming releases.
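At least the resulting register count can be checked on every build: asking ptxas to be verbose prints the per-kernel resource usage at compile time, which makes it easy to see whether a new compiler release has changed the allocation. A sketch of the invocation (file name is made up):

```shell
# Ask ptxas to report per-kernel resource usage
# (registers, lmem, smem, cmem) at compile time.
nvcc --ptxas-options=-v -c mykernel.cu
```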

Peter

I confirm, it is a big problem, and recombining the code doesn’t help.
I have noticed something similar: not all registers used for the intermediate results of step i are reused for the intermediate results of step i+j…
Enlarging the code of the __global__ function forces me to reduce the number of threads per block… :(