I’m getting some rather odd behaviour which feels like a bug, though it may be a misunderstanding on my part.
I don’t understand why the following code could ever need syncing at the indicated points. The second argument to the device function is never written to inside the device function, so the __syncthreads() calls should be redundant, right?
(NB: this isn’t the actual code, so it may contain slight syntax errors)
__device__ void devFunc(double *a1, double *a2, int n) {
// Write to a1 based on a2, but DO NOT ALTER a2.
}
__global__ void function(double *x, int *lengths, int n) {
unsigned int index = threadIdx.x;
int length = lengths[blockIdx.x]; // Guaranteed to be > index
extern __shared__ double a[];
double *b = &a[length * n];
double *c = &b[length * n];
double *d = &c[length * n];
__syncthreads();
devFunc(&x[index * n], &a[index * n], n);
// __syncthreads(); (This is needed for some reason)
devFunc(&b[index * n], &a[index * n], n);
// __syncthreads(); (This is needed for some reason)
devFunc(&c[index * n], &a[index * n], n);
}
The actual code is a bit longer than this example - if it is indeed a bug I’ll see if I can write a proper example to reproduce it. I’m slightly concerned it may be due to a compiler flag I had to set to get the compiler to work through the rather long set of device functions needed for my kernel (--opencc-options -OPT:Olimit=0).
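As a starting point for such a reproducer, here is a minimal self-contained sketch of the pattern (not the real code). The body of devFunc, the names reproKernel and runRepro, the way a is initialised, the write-back into x, and the single-block launch are all assumptions made up for illustration. In this form each thread only ever reads the slice of a that it wrote itself, so it should give correct results with no __syncthreads() at all; a barrier should only matter once a thread reads shared memory that a different thread wrote.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real device function:
// writes a1 based on a2 and never writes to a2.
__device__ void devFunc(double *a1, const double *a2, int n) {
    for (int i = 0; i < n; ++i) {
        a1[i] = 2.0 * a2[i];
    }
}

__global__ void reproKernel(double *x, const int *lengths, int n) {
    unsigned int index = threadIdx.x;
    int length = lengths[blockIdx.x];
    if (index >= (unsigned int)length) return;  // safe: no barriers below

    extern __shared__ double a[];
    double *b = &a[length * n];
    double *c = &b[length * n];

    // Assumed initialisation: each thread fills only its own slice of a.
    for (int i = 0; i < n; ++i) {
        a[index * n + i] = (double)(index * n + i);
    }

    // No __syncthreads() here: this thread reads only the slice it wrote.
    devFunc(&b[index * n], &a[index * n], n);
    devFunc(&c[index * n], &a[index * n], n);

    // Copy a result out to global memory so the host can check it.
    for (int i = 0; i < n; ++i) {
        x[index * n + i] = b[index * n + i] + c[index * n + i];
    }
}

// Launches one block and returns the number of mismatching elements,
// or -1 if a CUDA call failed.
int runRepro(int threads, int n) {
    std::vector<double> hX((size_t)threads * n, 0.0);
    double *dX = nullptr;
    int *dLengths = nullptr;
    if (cudaMalloc(&dX, hX.size() * sizeof(double)) != cudaSuccess) return -1;
    if (cudaMalloc(&dLengths, sizeof(int)) != cudaSuccess) {
        cudaFree(dX);
        return -1;
    }
    cudaMemcpy(dLengths, &threads, sizeof(int), cudaMemcpyHostToDevice);

    // Three shared arrays (a, b, c) of threads * n doubles each.
    size_t sharedBytes = 3 * (size_t)threads * n * sizeof(double);
    reproKernel<<<1, threads, sharedBytes>>>(dX, dLengths, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(hX.data(), dX, hX.size() * sizeof(double),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dX);
    cudaFree(dLengths);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // b = 2a and c = 2a, so every output element should equal 4 * its flat index.
    int mismatches = 0;
    for (size_t i = 0; i < hX.size(); ++i) {
        if (hX[i] != 4.0 * (double)i) ++mismatches;
    }
    return mismatches;
}
```

Calling runRepro(32, 4) from main() and checking for a return value of 0 gives a baseline. If the real kernel only misbehaves without the barriers, the difference from this sketch (for example, a devFunc that reads a neighbouring thread's slice) would be the first place to look.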
Bug or misunderstanding?
(BTW - I can’t seem to get indentation working in the code blocks - is there a way?)