Questioning compiler's use of registers, how to get compiler to use registers more efficiently

I have a kernel that is running slower than I would like. I believe it has to do with its use of registers, which is limiting occupancy. Here is the output from the compiler:

ptxas info : Compiling entry function '_Z21fcttrc_ijksw_Kernel_2iiiifiiPiPfS0_S0_S0_iiifffS0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0'
ptxas info : Used 60 registers, 232+224 bytes smem, 20 bytes cmem[1], 20 bytes cmem[14]

Now maybe 60 registers is simply what this kernel needs, but I don't believe it; I don't think the compiler is reusing registers properly. The code is below, but basically it looks like this:

for (ns = 1; ns <= ntr; ns++) {
    // chunk of code 1
    // chunk of code 2
    // chunk of code 3
    // chunk of code 4
    // chunk of code 5
    // chunk of code 6
}

The only thing shared between the chunks is one accumulator variable (fsl_temp) that each chunk sums into.

I commented out parts of the code to see how the registers were being used, this is what I came up with:

Parts included    Register Use

      1                 22
     1-2                33
     1-3                40
     1-4                41
     1-5                49
     1-6                58

I think the problem has something to do with the macros I am using to index the multi-dimensional arrays. However, other kernels haven't shown such poor register use. If I comment out the calculation of sflxph and sflxpl, then register use does not increase as I include more code chunks.

Any ideas on exploring this?

Thanks,
Craig

Code snippet:

#define FTNREF4D(i_index,j_index,k_index,l_index,i_size,j_size,k_size,i_lb,j_lb,k_lb,l_lb) \
  ((i_size)*(j_size)*(k_size)*(l_index-l_lb)+(i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+i_index-i_lb)

#define FTNREF5D(i_index,j_index,k_index,l_index,m_index,i_size,j_size,k_size,l_size,i_lb,j_lb,k_lb,l_lb,m_lb) \
  ((i_size)*(j_size)*(k_size)*(l_size)*(m_index-m_lb)+(i_size)*(j_size)*(k_size)*(l_index-l_lb)+(i_size)*(j_size)*(k_index-k_lb)+(i_size)*(j_index-j_lb)+i_index-i_lb)
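
For reference, the first FTNREF4D call in chunk 1 expands (by straight macro substitution) to:

// FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1) becomes
//   (nx)*(nx)*(npp)*(ivl-1) + (nx)*(nx)*(p1-1) + (nx)*(j-1) + i-1
// and each FTNREF5D reference adds one more (nx)*(nx)*(npp)*(nvl)*(ns-1) term,
// so every array access carries its own chain of integer multiplies and adds.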

// Code Chunk 1
p1=1;
p2=(p1+3-1)%6+1;
flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2=flxp_ij[FTNREF4D(i - 1,j,p2,ivl,nx,nx,npp,1,1,1,1)];
v1=flxp_ij_v1+fabs(flxp_ij_v1);
v2=flxp_ij_v2+fabs(flxp_ij_v2);

      sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]) 
                    - (v2) * (trce_ij[FTNREF5D(i - 1,j,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i - 1,j,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
      sflxpl = 0.5 * ((v1) * trc1
                    - (v2) * trc_ij[FTNREF4D(i - 1,j,ivl,ns,nx,nx,nvl,1,1,1,1)]);

      adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;

      fsl_temp += sflxpl;

// Code Chunk 2
p1=2;
p2=(p1+3-1)%6+1;
flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2=flxp_ij[FTNREF4D(i,j-1,p2,ivl,nx,nx,npp,1,1,1,1)];
v1=flxp_ij_v1+fabs(flxp_ij_v1);
v2=flxp_ij_v2+fabs(flxp_ij_v2);

      sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]) 
                    - (v2) * (trce_ij[FTNREF5D(i,j-1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j-1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
      sflxpl = 0.5 * ((v1) * trc1
                    - (v2) * trc_ij[FTNREF4D(i,j-1,ivl,ns,nx,nx,nvl,1,1,1,1)]);

      adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;
      fsl_temp += sflxpl;

// Code Chunk 3
p1=3;
p2=(p1+3-1)%6+1;
flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2=flxp_ij[FTNREF4D(i+1,j-1,p2,ivl,nx,nx,npp,1,1,1,1)];
v1=flxp_ij_v1+fabs(flxp_ij_v1);
v2=flxp_ij_v2+fabs(flxp_ij_v2);

      sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]) 
                    - (v2) * (trce_ij[FTNREF5D(i+1,j-1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i+1,j-1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
      sflxpl = 0.5 * ((v1) * trc1
                    - (v2) * trc_ij[FTNREF4D(i+1,j-1,ivl,ns,nx,nx,nvl,1,1,1,1)]);

      adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;
      fsl_temp += sflxpl;

// Code Chunk 4
p1=4;
p2=(p1+3-1)%6+1;
flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2=flxp_ij[FTNREF4D(i+1,j,p2,ivl,nx,nx,npp,1,1,1,1)];
v1=flxp_ij_v1+fabs(flxp_ij_v1);
v2=flxp_ij_v2+fabs(flxp_ij_v2);

      sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]) 
                    - (v2) * (trce_ij[FTNREF5D(i+1,j,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i+1,j,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
      sflxpl = 0.5 * ((v1) * trc1
                    - (v2) * trc_ij[FTNREF4D(i+1,j,ivl,ns,nx,nx,nvl,1,1,1,1)]);

      adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;
      fsl_temp += sflxpl;

// Code Chunk 5
p1=5;
p2=(p1+3-1)%6+1;
flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2=flxp_ij[FTNREF4D(i,j+1,p2,ivl,nx,nx,npp,1,1,1,1)];
v1=flxp_ij_v1+fabs(flxp_ij_v1);
v2=flxp_ij_v2+fabs(flxp_ij_v2);

     sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]) 
                    - (v2) * (trce_ij[FTNREF5D(i,j+1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j+1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
      sflxpl = 0.5 * ((v1) * trc1
                    - (v2) * trc_ij[FTNREF4D(i,j+1,ivl,ns,nx,nx,nvl,1,1,1,1)]);

      adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;
      fsl_temp += sflxpl;

// Code Chunk 6

      p1=6;
      p2=(p1+3-1)%6+1;
      flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
      flxp_ij_v2=flxp_ij[FTNREF4D(i-1,j+1,p2,ivl,nx,nx,npp,1,1,1,1)];
      v1=flxp_ij_v1+fabs(flxp_ij_v1);
      v2=flxp_ij_v2+fabs(flxp_ij_v2);


      sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]) 
                    - (v2) * (trce_ij[FTNREF5D(i-1,j+1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i-1,j+1,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
      sflxpl = 0.5 * ((v1) * trc1
                    - (v2) * trc_ij[FTNREF4D(i-1,j+1,ivl,ns,nx,nx,nvl,1,1,1,1)]);

      adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;
      fsl_temp += sflxpl;

      fsl_ij[FTNREF5D(i,j,ivl,nf,ns,nx,nx,nvl,nabl,1,1,1,1,1)] = fsl_temp;

After a 10-second look-over… I have a few questions:

// Code Chunk 1

p1=1;

p2=(p1+3-1)%6+1;

Why are you computing those values like that? They are both always going to be constant (at least, from the code you posted)…perhaps you could either just hard-code the values, or do a little macro for them or something if their values are in fact changing somewhere else. That might save you a register or two.
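
For instance, a minimal sketch of what hard-coding the first two chunks would look like (the p2 values just come from evaluating your (p1+3-1)%6+1 by hand, so check them against your intent):

// Code Chunk 1: p1 = 1, p2 = (1+3-1)%6+1 = 4
flxp_ij_v1 = flxp_ij[FTNREF4D(i,j,1,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2 = flxp_ij[FTNREF4D(i-1,j,4,ivl,nx,nx,npp,1,1,1,1)];

// Code Chunk 2: p1 = 2, p2 = 5
flxp_ij_v1 = flxp_ij[FTNREF4D(i,j,2,ivl,nx,nx,npp,1,1,1,1)];
flxp_ij_v2 = flxp_ij[FTNREF4D(i,j-1,5,ivl,nx,nx,npp,1,1,1,1)];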

Also, what is your memory layout like for the arrays? How much data is in the arrays? Are there any constraints on the sizes of each dimension?

I have a feeling a fair amount of this code could be optimized further based on those answers, since it looks fairly redundant. Depending on them, perhaps you could make a struct to hold your array indices together (and another one for the array lengths in each dimension)…like a uint4 does. You'd have to use something bigger for the 5-D case, like a struct with 6 or 8 ints (since you're going to be issuing 2 load instructions at that point, you might as well be able to handle 6-D up to 8-D). If your arrays are not that big, you could use a ushort8 vector type (or even a byte8 if they are really small) to get them down to one load instruction.

It would be a lot easier to read/maintain/debug if you could just pass the macro 1-2 arguments (the index and size structs that I mentioned). It also might help the compiler optimize the reuse of the registers, since I imagine that all the multiplications in the index calculation are causing it to eat up registers.

Gotcha. I did it that way as a balance of readability, maintainability, and performance. I will hard-code the values.

The arrays are multidimensional and come from Fortran. The sizes do vary, but we have them laid out so that memory accesses are coalesced. In this case, the 4D flxp arrays are 32*32*50*6 (1.17 MB), and the 5D trce arrays are 32*32*50*6*4 (4.68 MB). I am not sure what you mean by constraints. Do you mean are they less than 256 so I can fit them in bytes? If yes, I am not sure I can do that; globally they would be much bigger (this represents one task of a larger MPI program). If the indexing is always local, then that should work.

I agree that the multiplications in the index calculation are eating up registers. The question is, why doesn't the compiler reuse those registers as it moves to the next chunk of code? I have been looking at the ptxas options, but I can't find anything that helps me tell what is going on (as an aside, the kernel is also using some lmem, but I haven't figured out where that is coming from). I will try the optimizations you suggest and see how far that gets.

Thanks,

Craig

Another thing you might try is adding braces to limit scope, and adding const wherever you can.

For example
// Code Chunk 1
{
const int p1=1;
const int p2=(p1+3-1)%6+1;
const float flxp_ij_v1=flxp_ij[FTNREF4D(i,j,p1,ivl,nx,nx,npp,1,1,1,1)];
const float flxp_ij_v2=flxp_ij[FTNREF4D(i - 1,j,p2,ivl,nx,nx,npp,1,1,1,1)];
const float v1=flxp_ij_v1+fabs(flxp_ij_v1);
const float v2=flxp_ij_v2+fabs(flxp_ij_v2);

const float sflxph = 0.5 * ((v1) * (trce_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)])
                          - (v2) * (trce_ij[FTNREF5D(i - 1,j,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] - deltrc_ij[FTNREF5D(i - 1,j,p2,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)]));
const float sflxpl = 0.5 * ((v1) * trc1
                          - (v2) * trc_ij[FTNREF4D(i - 1,j,ivl,ns,nx,nx,nvl,1,1,1,1)]);

adfs_ij[FTNREF5D(i,j,p1,ivl,ns,nx,nx,npp,nvl,1,1,1,1,1)] = sflxph - sflxpl;

fsl_temp += sflxpl;
}

Sometimes (historically, with dumber compilers) this can help tell the compiler that there are no dependencies.

It’s possible that the compiler already knows about the lack of dependencies and is aggressively reordering the statements. If that’s what’s causing the register usage, it might not make any difference, but it’s fairly easy to try.

Be careful using the const modifier though…if it’s not properly scoped (like Jamie K wrote), you’ll end up using MORE registers.

What I was saying about the array ‘limits’ was that if you know the size of your arrays (whether they are a fixed size, or you have imposed some kind of a limit on how large they may be), you can decide what array ‘indexing’ method to use.

What about something like this:

struct __align__(8) uchar8
{
  unsigned char x, y, z, w, a, b, c, d;
};

#define ARRAY_SIZE_4D_X 32
#define ARRAY_SIZE_4D_Y 32
#define ARRAY_SIZE_4D_Z 50
#define ARRAY_SIZE_4D_W 6

#define BLOCK_SIZE_4D_X 1
#define BLOCK_SIZE_4D_Y 1
#define BLOCK_SIZE_4D_Z 1
#define BLOCK_SIZE_4D_W 1

const int array_size_4d_xy  = ARRAY_SIZE_4D_X * ARRAY_SIZE_4D_Y;   // Get the size of the XY 'step'
const int array_size_4d_xyz = array_size_4d_xy * ARRAY_SIZE_4D_Z;  // Get the size of the XYZ 'step'; reusing the '_xy' value saves one extra multiplication

// The index argument can be either the built-in uchar4 type or our declared uchar8 type (5-D and higher must use uchar8)
#define CALCULATE_4D_ARRAY_INDEX(index) ((array_size_4d_xyz * (index.w - BLOCK_SIZE_4D_W)) + (array_size_4d_xy * (index.z - BLOCK_SIZE_4D_Z)) + (ARRAY_SIZE_4D_X * (index.y - BLOCK_SIZE_4D_Y)) + index.x - BLOCK_SIZE_4D_X)

Then, you could use this new struct/macro like so:

uchar8 index;	// Using uchar8 here means the variable can be reused for the 5-D type as well, which might be useful when looping

index.x = i;
index.y = j;
index.z = p1;
index.w = ((p1 + 2) % 6) + 1;	// Can this 6 be replaced by ARRAY_SIZE_4D_W? Also, the mod operator is really slow -- can it be replaced with something else?

float flxp_ij_v1 = flxp_ij[CALCULATE_4D_ARRAY_INDEX(index)];

...

-- Rest of code would go here --

I wasn’t sure what your macro parameters ending in _lb were, so I guessed and assumed they had something to do with your data block sizes. If not, just change the names of the #define constants.

There should also be some other ways to compute the ‘p2’ value without the mod operator (ternary perhaps?)…it’s one of the slowest operators in CUDA.
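
For example, since p1 only ever takes the values 1 through 6 in your chunks, a ternary form gives the same result as (p1+3-1)%6+1 without the mod (just a sketch -- verify against your values):

// (p1 + 2) % 6 + 1 maps 1,2,3,4,5,6 -> 4,5,6,1,2,3, i.e. "rotate by 3"
const int p2 = (p1 <= 3) ? (p1 + 3) : (p1 - 3);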

I also wonder, given that your code is operating on fixed-size arrays, if it wouldn't be faster to break this kernel up into some kind of 'blocked' method (similar to how CUBLAS does its thing). I'm not sure what kind of speedup it would get you, but it might do a little better at keeping the card busy, especially as faster hardware comes out.

Have you tried simply setting -maxrregcount=58?
The compiler seems to be pretty good at detecting places where it can recompute values to save registers. However, this means it may end up doing a lot of duplicate work, so keep an eye on your PTX and use volatile where needed.
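
For reference, the relevant nvcc switches look something like this (just a sketch -- adjust -arch for whatever card you are on, and the file name is a guess from your kernel name):

# --ptxas-options=-v prints the register/smem report quoted above,
# --maxrregcount caps registers per thread (any spill goes to lmem),
# and -keep leaves the intermediate .ptx/.cubin files around for inspection.
nvcc -arch=sm_13 --ptxas-options=-v --maxrregcount=58 -keep -c fcttrc_ijksw.cu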

Thanks for the idea. I tried it, but the kernel still uses 60 registers.

Craig

I tried a little experiment with an int4 to see what happened. For the 5D array, when I packed the full array sizes into one struct, the kernel still wanted to use 60 registers…

I also tried changing the macro to something similar in instruction count to what you have here (saving multiplications); it went slower but did not use any extra registers. I will go through and try this out completely, though it will take some time.

I also changed the macro so that the 1 was assumed, and not passed to the macro. This did not help.

I also created a __noinline__ function to calculate the indices. Register usage was the same. What I did note is that if I made the index calculation less complex, register use decreased dramatically (even when the result was wrong).

The _lb values are offsets. In Fortran, array indices start at 1, not 0, so the code indexes arrays from 1; that is why the _lb values are included. This is the point where you ask why I would do that. It is because we are translating the code from Fortran to C to CUDA (we are building a translator). Any code targeted at the GPU has to be portable to other (including non-x86) systems, and maintaining two code bases is a recipe for failure. I know the translation will never be perfect, but we have to minimize the changes.
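
To make the _lb business concrete, here is the mapping as I understand it (the Fortran declaration is written from memory, and only the first three extents actually appear in the macro):

// A Fortran array declared roughly as  flxp_ij(nx, nx, npp, nvl)  and referenced as
// flxp_ij(i, j, p1, ivl) -- 1-based in every dimension, column-major -- maps to the
// flat C index
//   flxp_ij[FTNREF4D(i, j, p1, ivl, nx, nx, npp, 1, 1, 1, 1)]
// which is why every call site passes 1 for all of the *_lb lower bounds.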

This isn’t the problem, but I have just hard-coded p1 and p2 into all of the array references. That is how it comes from Fortran anyway.

If you are implying multiple kernels, that might work. If you are thinking about moving small sections of data into shared memory, operating on them, then moving on, that could work as well; I have tried that with some of my other routines, and it can be especially helpful when a single point has to access its neighbors. The problem is that since I am using too many registers, my occupancy is low and I am limited in my block size. Since the block size is low, I have to call the kernel more often, reloading data because each block is too small. From my other kernels, I would have to get the block size up to 384 to see a benefit. That means I need to cut my register usage roughly in half.

Why 58? I would really like it to be around 32 (which it seems like it should be).
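
The rough arithmetic I am using (assuming a 16K-register-per-SM register file, i.e. a GT200-class card, which is an assumption):

// 60 regs/thread:  16384 / 60 ≈ 273 threads per SM -> only one 256-thread block fits
// 32 regs/thread:  16384 / 32  = 512 threads per SM -> two 256-thread blocks,
//                  or one 384-thread block with room to spare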

When I tried setting maxrregcount to 58, it worked, but that is not small enough to make a difference to occupancy. As I lowered maxrregcount in steps down to 48, the kernel started to use local memory and kept getting slower.

I guess the real question I have is: how do I figure out how the registers are actually being used? Reading PTX code can help, but ptxas still does more optimization afterwards. The question is whether the problem is at the PTX level, where the register reuse should already be happening, or whether the assembler is not detecting the scope of the registers correctly.

Or do I just have bad code??? I don’t mind this last answer, but I want to be able to prove it.

Craig

Use decuda to disassemble your compiled kernel (.cubin) file…it will show you what the real registers are doing (not the virtual PTX registers):

http://www.cs.rug.nl/~wladimir/decuda/
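
Typical usage is roughly the following (a sketch -- the exact invocation depends on your decuda version, and the file names are placeholders):

# build a .cubin for the kernel, then disassemble it to see the real register allocation
nvcc -arch=sm_13 -cubin fcttrc_ijksw.cu
python decuda.py fcttrc_ijksw.cubin > fcttrc_ijksw.dis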

Great idea. I have already downloaded it and am using it now.