Reducing the number of registers to improve occupancy

Can anyone give any concrete tips on how to reduce the number of registers that are used for a program?

For a particular program I have, since it uses 14 registers, I can never get better than 2/3 occupancy. If I can get that down to 10, then I can get 100% occupancy.

I just don’t know how I can change my code (short of trial and error) to reduce them.
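(For what it's worth, the count above comes from the compiler; asking ptxas for verbose output reports the registers and shared memory each kernel uses, at least on the toolkit version I'm using:

    nvcc --ptxas-options=-v mykernel.cu

mykernel.cu is just a placeholder name here. I'm aware of the -maxrregcount=N switch, but as far as I understand it that only forces spills to local memory rather than genuinely lowering register pressure, hence the question about source-level changes.)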

My experience has been limited, so take these comments with a grain of salt :)

  • higher occupancy does not guarantee better performance

  • register usage is just one of the tradeoffs. For example, in my kernel, I increased register usage to load some of the global memory into the registers. This decreased occupancy but my kernel ran faster because of more GPU-friendly global memory accesses.

  • nvcc does not appear to optimize common expressions, so you may want to try manually pulling out all the common expressions and calculating them once explicitly (see the sketch below). In my case, this increased register usage but gave me more performance by (probably) reducing global memory accesses.
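Roughly what I mean by the last two points, as a minimal sketch (the kernel and the names in it are made up for illustration, not taken from my actual code):

    // Toy kernel: each thread reads two neighbouring elements and writes two results.
    // It only exists to illustrate (a) keeping global loads in registers and
    // (b) hoisting a common expression by hand.
    __global__ void sketch(const float *in, float *outA, float *outB, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n - 1) return;

        // (a) read each global value once into a register and reuse it,
        //     instead of indexing global memory inside every expression
        float a = in[i];
        float b = in[i + 1];

        // (b) compute the common expression once, explicitly
        float sum = a + b;

        outA[i] = sum * 0.5f;
        outB[i] = sum * sum;
    }

Whether this kind of change helps or hurts the register count depends on the kernel, which is really the point of the first bullet.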

Spencer

Thanks for the info. I understand that occupancy isn’t everything. In my case, I lowered occupancy but loaded data in a way that allowed more coalescing and got faster performance.

The thing that seems to be missing is some way to find out the relative time spent waiting versus the time spent computing.

Could someone from Nvidia comment on the relative merits of occupancy, register usage, coalescing, etc?

Spencer, can you provide an example of this issue? Note that not all optimizations show up in the PTX, because PTX is a virtual machine. Machine-optimized code is generated after PTX and stored in binary form (we don’t expose the actual machine assembly languages, because they will likely gradually change from GPU to GPU).

But it’s possible there are still compiler bugs with optimizations (and there have already been a lot of compiler performance improvements that will be in the next major CUDA release).

If you have an example, we’d like to see it!

As I’ve said before, higher occupancy does not always translate to higher performance. Having more active threads per multiprocessor can help hide memory latency if you are memory bound, and can also better fill the instruction pipeline so there are no bubbles. That said, if you have to make your code bend over backward to reduce register count, it can cause your instruction count to increase, which may be an overall loss if you are not memory bound.

Here’s a rule to live by:

The NUMBER ONE optimization you can make is to ensure your memory accesses are coalesced. This can mean the difference between hundreds of cycles of latency PER THREAD and hundreds of cycles of latency PER WARP. In other words it can mean ORDERS OF MAGNITUDE!
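To put the simplest possible picture on it (a toy illustration, not a benchmark): coalescing is largely about whether consecutive threads in a warp touch consecutive addresses.

    // Illustration only: both kernels move the same data, with very different memory behaviour.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread k of a warp reads element k
        if (i < n)
            out[i] = in[i];                              // consecutive threads -> consecutive addresses
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // threads land 'stride' elements apart
        if (i < n)
            out[i] = in[i];                              // accesses scatter, so they cannot be combined
    }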

Mark

I’ve seen nvcc do common expression optimizations in several cases. Actually, in those cases register use was lower when nvcc compiled the code than when I stored a common expression result in a temporary variable myself. It appears that nvcc has an easier time optimizing common expressions when the common part is written in the same way (character-wise) and is the first part of the expression. For example, (1) seems to be preferred over (2):

(1):

x=a[dimx*4+offset+i];

y=b[dimx*4+offset+j];

(2):

x=a[dimx*4+offset+i];

y=b[j+offset+dimx*4];

These are by no means guidelines, just something I noticed when rearranging my source code and checking the effect on the resulting assembly. Also, the code above is just a representative illustration.

Paulius

Hi Mark,

One of my earlier attempts at the kernel had code that looked like this

    uint8_t *src0 = &plane0[2*i_line*i_stride];
    uint8_t *src1 = src0 + i_stride;
    uint8_t *src2 = src1 + i_stride;
    uint8_t *dst0 = &lowres[0][i_line*i_stride2];
    uint8_t *dsth = &lowres[1][i_line*i_stride2];
    uint8_t *dstv = &lowres[2][i_line*i_stride2];
    uint8_t *dstc = &lowres[3][i_line*i_stride2];

    if (x != i_width2-1) {
        dst0[x] = (src0[2*x]   + src0[2*x+1] + src1[2*x]   + src1[2*x+1] + 2) >> 2;
        dsth[x] = (src0[2*x+1] + src0[2*x+2] + src1[2*x+1] + src1[2*x+2] + 2) >> 2;
        dstv[x] = (src1[2*x]   + src1[2*x+1] + src2[2*x]   + src2[2*x+1] + 2) >> 2;
        dstc[x] = (src1[2*x+1] + src1[2*x+2] + src2[2*x+1] + src2[2*x+2] + 2) >> 2;
    }

Where plane0 is stored in global memory. When I looked at the PTX output, it looked like the global memory reads were not being optimized: e.g. “src1[2*x] + src1[2*x+1]” occurs twice in the code, and the PTX output shows each such read, e.g. src1[2*x], being loaded from global memory twice.
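(For reference, the PTX I'm looking at is just what nvcc writes out when asked, e.g.

    nvcc -ptx kernel.cu    # emits kernel.ptx alongside the source

with kernel.cu standing in for the real file name; the duplicated reads show up there as repeated ld.global instructions for the same address.)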

When I rewrote the code, I did this (taking into account that I am having problems with mis-aligned 32-bit reads, which you know about):

    const uint16_t *i16p_src0 = (uint16_t *) &src0[xx];
    const uint16_t *i16p_src1 = (uint16_t *) &src1[xx];
    const uint16_t *i16p_src2 = (uint16_t *) &src2[xx];

    data0.i32 = i16p_src0[0] + (i16p_src0[1] << 16);
    data1.i32 = i16p_src1[0] + (i16p_src1[1] << 16);
    data2.i32 = i16p_src2[0] + (i16p_src2[1] << 16);

    // precompute common expressions and load them into registers
    const unsigned int t0  = data0.plane[0] + data0.plane[1];
    const unsigned int t0a = data0.plane[1] + data0.plane[2];
    const unsigned int t1  = data1.plane[0] + data1.plane[1];
    const unsigned int t1a = data1.plane[1] + data1.plane[2];
    const unsigned int t2  = data2.plane[0] + data2.plane[1];
    const unsigned int t2a = data2.plane[1] + data2.plane[2];

    if (thread_id < max_threads) {
        if (x != i_width2-1) {
            dst0[x] = (t0  + t1  + 2) >> 2;
            dsth[x] = (t0a + t1a + 2) >> 2;
            dstv[x] = (t1  + t2  + 2) >> 2;
            dstc[x] = (t1a + t2a + 2) >> 2;
        }

        [...]

This code ran much faster, though it looks a lot uglier. Some of it may be related to how I re-ordered the global memory accesses, but some of it is likely related to the fact that I calculated the intermediate expressions explicitly and did only one read from global memory for each access to srcX. [Note there were other intermediate versions where the kernel ran slower when I did not do the “i16p_*” code changes.]

You will have to ask your compiler guys as to whether nvcc can safely do what I did by hand.

Spencer