Fixed size array, registers and function call

Hi,

I am trying to optimize a kernel. One of its problems is low occupancy, caused by the amount of shared memory it uses. On the other hand, each block uses very few registers, so I was thinking of using registers to reduce the shared memory requirements. To do that, I define a fixed-size array and access it only within loops marked with #pragma unroll. Some of the computations happen within a function. An example would be:


__device__ void doStuffOnData(unsigned int data[64]) {
   #pragma unroll
   for(int i = 0; i < 2; i++) {
     
     #pragma unroll
     for(int j = 0; j < 32; j++) {
       int k = i*32 + j;
       data[k] = (data[k] | 0xFF)*2 //or any other arbitrary thing.
     }
   }

}

kernel(unsigned int *globalMemory) {
  unsigned int data[64];
  #pragma unroll
  for(int j = 0; j < 64; j++) {
    data[j] = globalMemory[j];
  }

  doStuffOnData(unsigned int data[64]);

  #pragma unroll
  for(int j = 0; j < 64; j++) {
    globalMemory[j] = data[j];
  }
}

However, when I use Nsight Compute to check the register usage, the kernel still uses the same number of registers as the shared memory version (and performance actually drops considerably; it seems to be using global memory instead of registers). I am not sure whether I am doing something wrong, making wrong assumptions, or whether there is a problem with how I am passing the array to the function.

Many thanks in advance for any insight on this.

Best,

I believe that even with fixed size arrays and full unrolling, arrays above a certain size will end up in local memory. There seems to be some heuristic in the compiler that enforces a size limit.

Try with multiple arrays of size 16, maybe?

Also pass an explicit unrolling factor to #pragma unroll, because the compiler may sometimes decide to unroll only partially, which keeps the accessed array in local memory instead of registers.

#pragma unroll n

    You can use this pragma to control how many times a loop should be unrolled.
    n is an integer constant expression specifying the unrolling factor.
    The values of 0 and 1 block any unrolling of the loop.
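As a rough sketch of what I mean, here is the same kind of per-thread array with an explicit factor equal to the trip count on every loop. The kernel name, the host helper and the per-thread slice layout are my own assumptions for illustration, not taken from your code, and whether the array really ends up in registers still depends on the compiler version and target architecture.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread works on its own 64-element slice (assumed layout).
__global__ void unrollFactorKernel(unsigned int *globalMemory) {
    unsigned int data[64];
    unsigned int *slice = globalMemory + (blockIdx.x * blockDim.x + threadIdx.x) * 64;

    #pragma unroll 64   // explicit factor equal to the trip count
    for (int j = 0; j < 64; j++) data[j] = slice[j];

    #pragma unroll 64
    for (int j = 0; j < 64; j++) data[j] = (data[j] | 0xFF) * 2;

    #pragma unroll 64
    for (int j = 0; j < 64; j++) slice[j] = data[j];
}

// Hypothetical host helper: returns 0 on success, 1 on wrong results, -1 on a CUDA error.
int runUnrollFactorDemo() {
    const int threads = 32, n = threads * 64;
    std::vector<unsigned int> host(n);
    for (int i = 0; i < n; i++) host[i] = (unsigned int)i;

    unsigned int *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(unsigned int)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    unrollFactorKernel<<<1, threads>>>(dev);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(host.data(), dev, n * sizeof(unsigned int),
                                     cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1;

    // Every element should have been transformed exactly once.
    for (int i = 0; i < n; i++)
        if (host[i] != (((unsigned int)i | 0xFF) * 2)) return 1;
    return 0;
}
```

Compiling with nvcc -Xptxas -v prints the register count and any stack frame or spill usage per kernel, which is a quick way to see whether the array stayed out of local memory.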

Oh, OK. So it is not because of how I am calling the function; it may instead be a problem with the unrolling and/or the size of the array.

Many thanks for the quick answer. I will try your suggestions.

Best,

The code you have posted won’t compile.
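For reference, this is roughly what I would expect a compilable version to look like. It is only my reading of what you intended: the kernel needs a __global__ void declaration, the statement inside the inner loop needs a semicolon, and the call should pass the array by name. The host helper at the end is hypothetical and launches a single thread, since every thread in your kernel touches the same 64 elements.

```cuda
#include <cuda_runtime.h>

__device__ void doStuffOnData(unsigned int data[64]) {
    #pragma unroll
    for (int i = 0; i < 2; i++) {
        #pragma unroll
        for (int j = 0; j < 32; j++) {
            int k = i * 32 + j;
            data[k] = (data[k] | 0xFF) * 2;  // fixed: missing semicolon
        }
    }
}

__global__ void kernel(unsigned int *globalMemory) {  // fixed: missing __global__ void
    unsigned int data[64];
    #pragma unroll
    for (int j = 0; j < 64; j++) {
        data[j] = globalMemory[j];
    }

    doStuffOnData(data);  // fixed: pass the array by name, not as a declaration

    #pragma unroll
    for (int j = 0; j < 64; j++) {
        globalMemory[j] = data[j];
    }
}

// Hypothetical host helper: returns 0 on success, 1 on wrong results, -1 on a CUDA error.
int runFixedKernelDemo() {
    unsigned int host[64];
    for (int j = 0; j < 64; j++) host[j] = (unsigned int)j;

    unsigned int *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    kernel<<<1, 1>>>(dev);  // one thread, so there are no races on the shared 64 elements
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1;

    for (int j = 0; j < 64; j++)
        if (host[j] != (((unsigned int)j | 0xFF) * 2)) return 1;
    return 0;
}
```

Note that the array parameter of doStuffOnData is just a pointer. As long as the function is inlined and every index is known at compile time, passing the array this way should not by itself force it into local memory.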

However, I think you’re going to have trouble with trying to come up with a contrived example like that, which actually demonstrates local memory usage. The compiler aggressively optimizes out the need for local memory in any situation that it can precompute or simplify the computation of results. There are a few recent examples of this here and here and here. None of those codes, when the compiler was allowed to optimize, produced machine code that was anything like what the source code suggested.

Actually demonstrating local memory usage would be the first step to actually demonstrating the movement of local memory into registers.
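One way to get there, as a sketch only: index the per-thread array with a value that is known only at run time, so the compiler cannot resolve the access to a fixed register. All names below are hypothetical. I would expect nvcc -Xptxas -v to report a non-zero stack frame for this kernel, but that depends on the compiler version and architecture, so check the output rather than taking my word for it.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: the final read uses an index loaded from memory at run time.
__global__ void dynamicIndexKernel(const unsigned int *in, unsigned int *out,
                                   const int *idx) {
    unsigned int data[64];
    #pragma unroll
    for (int j = 0; j < 64; j++) data[j] = in[j] * 3u + 1u;

    // The compiler cannot know this index when it generates code.
    out[threadIdx.x] = data[idx[threadIdx.x] & 63];
}

// Hypothetical host helper: returns 0 on success, 1 on wrong results, -1 on a CUDA error.
int runDynamicIndexDemo() {
    const int threads = 32;
    unsigned int hIn[64], hOut[threads];
    int hIdx[threads];
    for (int j = 0; j < 64; j++) hIn[j] = (unsigned int)(j + 10);
    for (int t = 0; t < threads; t++) hIdx[t] = (t * 7) % 64;

    unsigned int *dIn = nullptr, *dOut = nullptr;
    int *dIdx = nullptr;
    if (cudaMalloc(&dIn, sizeof(hIn)) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, sizeof(hOut)) != cudaSuccess) { cudaFree(dIn); return -1; }
    if (cudaMalloc(&dIdx, sizeof(hIdx)) != cudaSuccess) {
        cudaFree(dIn); cudaFree(dOut); return -1;
    }
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);

    dynamicIndexKernel<<<1, threads>>>(dIn, dOut, dIdx);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dIdx);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1;

    // Each thread should return the transformed element at its own index.
    for (int t = 0; t < threads; t++)
        if (hOut[t] != hIn[hIdx[t]] * 3u + 1u) return 1;
    return 0;
}
```

Once a kernel like this shows local memory in the ptxas output, you have a baseline against which changes such as full unrolling or smaller arrays can actually be measured.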