Non-stride-1 accesses for array - dummy test

Hi all,

I’m trying to find a solution to a problem in my code, so I re-created a dummy version of it in order to understand the output coming from the compiler.

Here is the code:

#include <stdlib.h>

#define DIM 4

int main(int argc, char* argv[])
{
   float* samples;
   int i, threads;

   threads = 1024;  /* number of threads */

   samples = (float*) malloc(sizeof(float) * DIM * threads);

   #pragma acc region copyin(samples[0:(DIM * threads)-1])
   {
      #pragma acc for
      for (i=0; i<threads; i++) {
         int j;
         
         for (j=0; j<3; j++) {
            float sum;

            sum = samples[i * DIM + 0] + samples[i * DIM + 1] + samples[i * DIM + 2];
         }
      }
   }
}

The second, inner loop is just there to show that every thread reads a different sample from the array. In my original code the algorithm is more complicated, but the output from the compiler is the same:

Non-stride-1 accesses for array ‘samples’

Can anyone explain to me how to avoid the “Non-stride-1 accesses for array samples” message?

Cheers
A.

Hi A.

Here’s my general response to what a ‘Non-stride-1’ message means:

‘Non-stride-1 accesses’ means that the GPU threads won’t be accessing contiguous data from your array. Although sometimes unavoidable, this can cause performance problems, and you should review your code to determine whether it can be restructured or whether a different GPU schedule will help.

The details are a bit lengthy, but I’ll try to be brief. An NVIDIA GPU is composed of several multiprocessors (MIMD). Each multiprocessor is in turn composed of several thread processors (SIMD). On my Tesla, I have 30 multiprocessors, each with 8 thread processors, for a total of 240 thread processors. This varies from card to card; for details about your card, please run the utility ‘pgaccelinfo’.

SIMD stands for ‘Single Instruction Multiple Data’, which means that all the threads running on the same multiprocessor need to execute the same instruction at the same time, although each performs that instruction on different data. Note that a group of threads run together on a single multiprocessor is called a ‘warp’.

So what happens when all the threads in a warp try to access memory? If the memory is contiguous, the hardware is optimized so that the threads can all bring in their data at the same time. If the memory is not contiguous, then only one thread can access memory at a time while the other threads wait.
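
To make that concrete with your own names (samples, DIM, and the loop index i standing in for the thread number), here’s a small host-only sketch, just an illustration and nothing the compiler generates, of which element each of the first few threads in a warp reads at one step with your current layout:

#include <stdio.h>

#define DIM 4

/* With the current layout, thread i keeps its DIM values back to back,
   so at any one step the threads of a warp read elements that are DIM
   floats apart -- the stride-DIM (non-stride-1) pattern the compiler
   is warning about. */
int main(void)
{
   int i;

   for (i = 0; i < 4; i++)
      printf("thread %d reads samples[%d]\n", i, i * DIM + 0);

   return 0;
}

Consecutive threads end up DIM floats apart, so the hardware can’t combine their reads into one contiguous fetch.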

The fix in your case is to rearrange how your data is stored in samples so that thread 0 accesses elements 0, 1024, 2048, and 3072, thread 1 accesses elements 1, 1025, 2049, and 3073, and so on. This way, when memory is fetched from the GPU’s global memory into local memory, it’s done in a single instruction for all the threads in a warp.
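
Here’s a minimal sketch of one way to lay the data out like that on the host before copying it in. The helper name make_transposed_samples is just for illustration and the stored value (i) is a placeholder; the point is only that value j of thread i goes to samples[j*threads + i] instead of samples[i*DIM + j]:

#include <stdlib.h>

#define DIM 4

/* Transposed (stride-1) layout: value j of thread i lives at
   samples[j*threads + i], so for a fixed j consecutive threads
   read consecutive elements of the array. */
static float* make_transposed_samples(int threads)
{
   float* samples = (float*) malloc(sizeof(float) * DIM * threads);
   int i, j;

   for (j = 0; j < DIM; j++)
      for (i = 0; i < threads; i++)
         samples[j * threads + i] = (float) i;   /* placeholder value */

   return samples;
}

The rewritten test program below reads the array with exactly this indexing.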

I rewrote your test program with this in mind. Granted, I changed your algorithm a bit, so it may not be exactly what you want. If you do need to store the data so that it’s non-sequential for the threads, the code will still parallelize; you’ll just be giving up some performance. It’s up to you whether the loss in performance outweighs the cost of modifying your program.

 % cat test.c
#include <malloc.h>

#define DIM 4

int main(int argc, char* argv[])
{
   float* samples;
   int i,j, threads;
   float sum;

   threads = 1024;  /* number of threads */

   samples = (float*) malloc(sizeof(float) * DIM * threads);

   #pragma acc region copyin(samples[0:(DIM * threads)-1])
   {
      #pragma acc for kernel
       for (i=0; i<threads; i++) {
         sum = 0;
         for (j=0; j<DIM; j++) {
//            sum = samples[i * DIM + 0] + samples[i * DIM + 1] + samples[i * DIM + 2];
            sum += samples[i+(threads*j)];
         }
      }
   }
}
% pgcc -V10.8 -ta=nvidia -Minfo=accel test.c -Mfcon
main:
     15, Generating copyin(samples[:threads*4-1])
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, #pragma acc for parallel, vector(256)
             CC 1.0 : 1 registers; 20 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 1 registers; 20 shared, 28 constant, 0 local memory bytes; 100 occupancy
     20, Loop is parallelizable

Hope this helps,
Mat

Thanks mkcolg, it worked as you suggested!

However, your solution led to a new problem. As you can see from the following code, the compiler gives me a “Loop carried dependence” and a “Loop carried backward dependence” on the ‘indices’ array that I can’t resolve. Can you give me some advice?

Here is the compiler output:

pgcc -ta=nvidia,time -Minfo=loop,accel,vect,mp -Msafeptr -mp -O3 -fast -I./inc -c src/test.c -o obj/test.o
main:
     56, Generating copyin(samples[:threads*4-1])
         Generating copyout(indices[:threads*2-1])
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     61, Loop carried dependence of 'indices' prevents parallelization
         Loop carried backward dependence of 'indices' prevents vectorization
         Accelerator kernel generated
         61, #pragma acc for vector(32)
             Cached references to size [1056] block of 'indices'
     63, Loop is parallelizable

And here is the dummy code, based on your previous tips:

#include <stdlib.h>

#define DIM 4

int main(int argc, char* argv[])
{
   float* samples;
   int* indices;
   int i,j, threads;
   float sum;

   threads = 1024;  /* number of threads */

   samples = (float*) malloc(sizeof(float) * DIM * threads);
   indices = (int*) malloc(sizeof(int) * 2 *  threads);

   #pragma acc region \
		copyin(samples[0:(DIM * threads)-1]) \
		copyout(indices[0:(2 * threads)-1])
   {
      #pragma acc for vector
       for (i=0; i<threads; i++) {
         sum = 0;
         for (j=0; j<DIM; j++) {
            sum += samples[i+(threads*j)];
         }

         indices[0 * threads + i] = 0;
         indices[1 * threads + i] = 0;
      }
   }
}

Cheers
A.

Hi A.

Currently, the compiler’s dependency analyzer is not able to determine that all values of indices are independent, since you’re using a computed index. This is a known limitation that we hope to lift in a future release. The workaround is to use the independent clause to assert that your loop is indeed independent.

For example:

 % cat test.c
#include <malloc.h>

#define DIM 4

int main(int argc, char* argv[])
{
   float* samples;
   int* indices;
   int i,j, threads;
   float sum;

   threads = 1024;  /* number of threads */

   samples = (float*) malloc(sizeof(float) * DIM * threads);
   indices = (int*) malloc(sizeof(int) * 2 *  threads);


   for (i=0; i<threads; i++) {
         for (j=0; j<DIM; j++) {
             samples[i+(threads*j)]=i;
         }
   }

   #pragma acc region \
      copyin(samples[0:(DIM * threads)-1]) \
      copyout(indices[0:(2 * threads)-1])
   {
      #pragma acc for independent parallel, vector(32)
       for (i=0; i<threads; i++) {
         sum = 0;
         for (j=0; j<DIM; j++) {
            sum += samples[i+(threads*j)];
         }
         indices[0 * threads + i] = sum;
         indices[1 * threads + i] = 1;
      }
   }

   printf("%d %d %d %d\n", indices[0],indices[1],indices[255],indices[511]);
}
% pgcc -ta=nvidia -Minfo=accel test.c -V10.9
main:
     24, Generating copyin(samples[:threads*4-1])
         Generating copyout(indices[:threads*2-1])
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     29, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc for parallel, vector(32)
             Cached references to size [1056] block of 'indices'
             CC 1.0 : 104 registers; 4244 shared, 432 constant, 0 local memory bytes; 4 occupancy
             CC 1.3 : 104 registers; 4244 shared, 432 constant, 0 local memory bytes; 6 occupancy
     31, Loop is parallelizable
% a.out
0 4 1020 2050
%

Hope this helps,
Mat