Hi A.
Here’s my general response to what a ‘Non-stride-1’ message means:
‘Non-stride-1 accesses’ means that the GPU threads won’t be accessing contiguous data from your array. Although sometimes unavoidable, this can cause performance problems, so you should review your code to determine whether it can be restructured or whether a different GPU schedule will help.
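To make that concrete, here’s a minimal sketch contrasting the two access patterns from a thread’s point of view. The names and dimensions (NTHREADS, NELEMS, read_strided, read_contiguous) are made up for illustration and are not part of your program:

```c
#define NTHREADS 4   /* hypothetical number of threads */
#define NELEMS   3   /* hypothetical elements per thread */

/* Non-stride-1: thread i reads NELEMS consecutive elements starting
   at i*NELEMS, so adjacent threads are NELEMS floats apart in memory. */
float read_strided(const float *a, int i, int j)
{
    return a[i * NELEMS + j];
}

/* Stride-1: for a fixed j, thread i reads element i + NTHREADS*j,
   so adjacent threads touch adjacent floats. */
float read_contiguous(const float *a, int i, int j)
{
    return a[i + NTHREADS * j];
}
```

Both functions read the same logical element j of thread i; only the memory layout they assume differs.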
The details are a bit lengthy, but I’ll try to be brief. An NVIDIA GPU is composed of several multiprocessors (MIMD). Each multiprocessor in turn contains several thread processors (SIMD). On my Tesla, I have 30 multiprocessors, each with 8 thread processors, for a total of 240 thread processors. This varies from card to card; for details about your card, please run the utility ‘pgaccelinfo’.
SIMD stands for ‘Single Instruction Multiple Data’, which means that all the threads running on the same multiprocessor need to execute the same instruction at the same time, although they each perform the instruction on different data. Note that a group of threads being run on a single multiprocessor is called a ‘warp’.
So what happens if all the threads in a warp try to access memory? If the memory is contiguous, the hardware is optimized so that the threads can all bring in their memory at the same time. If the memory is not contiguous, then only one thread at a time can access memory while the other threads wait.
The fix in your case is to rearrange how your data is stored in samples so that thread 0 accesses elements 0, 1024, 2048, and 3072, thread 1 accesses elements 1, 1025, 2049, and 3073, etc. This way, when memory is fetched from the GPU’s global memory to the local memory, it’s done in a single instruction for all threads in a warp.
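If your data already sits in thread-major order and you can’t change how it’s produced, one option is to repack it on the host before entering the accelerator region. This is just a sketch; transpose_samples and its argument names are illustrative, not part of your original program:

```c
/* Repack from thread-major order (element j of thread i at in[i*dim + j])
   to the coalesced-friendly order (element j of thread i at
   out[i + threads*j]).  out must hold threads*dim floats. */
void transpose_samples(const float *in, float *out, int threads, int dim)
{
    int i, j;
    for (i = 0; i < threads; i++)
        for (j = 0; j < dim; j++)
            out[i + threads * j] = in[i * dim + j];
}
```

The repack itself is an O(threads*dim) pass on the host, so it only pays off if the kernel reads the data enough times to amortize it.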
I rewrote your test program with this in mind. Granted, I changed your algorithm a bit, so it may not be exactly what you want. If you do need to store data where it’s non-sequential for the threads, the code will still parallelize; you’ll just be giving up some performance. It’s up to you whether the performance loss outweighs the cost of modifying your program.
% cat test.c
#include <malloc.h>
#define DIM 4

int main(int argc, char* argv[])
{
    float* samples;
    int i, j, threads;
    float sum;

    threads = 1024;  /* number of threads */
    samples = (float*) malloc(sizeof(float) * DIM * threads);

#pragma acc region copyin(samples[0:(DIM * threads)-1])
    {
#pragma acc for kernel
        for (i = 0; i < threads; i++) {
            sum = 0;
            for (j = 0; j < DIM; j++) {
                /* sum = samples[i * DIM + 0] + samples[i * DIM + 1] + samples[i * DIM + 2]; */
                sum += samples[i + (threads * j)];
            }
        }
    }
}
% pgcc -V10.8 -ta=nvidia -Minfo=accel test.c -Mfcon
main:
15, Generating copyin(samples[:threads*4-1])
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
18, Loop is parallelizable
Accelerator kernel generated
18, #pragma acc for parallel, vector(256)
CC 1.0 : 1 registers; 20 shared, 28 constant, 0 local memory bytes; 100% occupancy
CC 1.3 : 1 registers; 20 shared, 28 constant, 0 local memory bytes; 100% occupancy
20, Loop is parallelizable
Hope this helps,
Mat