Shared memory and __syncthreads

I’m getting some rather odd behaviour which feels like a bug, though it may be a misunderstanding on my part.

I don’t understand why the following code could ever need syncing at the indicated points. The second argument to the device function is never written to inside the device function, so the __syncthreads() calls should be redundant, right?

(NB. this isn’t the actual code, so it may have slight syntax errors)

__device__ void devFunc(double *a1, double *a2, int n) {

	// Write to a1 based on a2, but DO NOT ALTER a2.

}

__global__ void function(double *x, int *lengths, int n) {

	unsigned int index = threadIdx.x;

	int length = lengths[blockIdx.x]; // Guaranteed to be < index

	extern __shared__  double a[];

	double *b = &a[length * n];

	double *c = &b[length * n];

	double *d = &c[length * n];


	devFunc(&x[index * n], &a[index * n], n);	

	// __syncthreads(); (This is needed for some reason)

	devFunc(&b[index * n], &a[index * n], n);

	// __syncthreads(); (This is needed for some reason)

	devFunc(&c[index * n], &a[index * n], n);

}

The actual code is a bit longer than this example - if it is indeed a bug I’ll see if I can write a proper example to reproduce it. I’m slightly concerned it may be due to a compiler flag I had to set to get the compiler to work through the rather long set of device functions needed for my kernel (--opencc-options -OPT:Olimit=0).

Bug or misunderstanding?

(BTW - I can’t seem to get indentation working in the code blocks - is there a way?)


I have had the same problem in a similar case. It is evidently because you index shared memory with a variable; in my case, the compiler also needed too many registers for the kernel to run. Adding __syncthreads() solved it for me too, and it also decreased the number of registers used.
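
One more thing that may be worth ruling out for the kernel in the question is aliasing between the shared-memory regions. The pointers a, b, c and d each cover length * n doubles, so &a[index * n] only stays inside a while index < length. The comment in the kernel says length is less than index; taken literally, a thread's "read-only" slice of a would then land in b or c, which other threads write through &b[index * n] and &c[index * n], and a barrier between the calls would be required. The sketch below is host-only code that just redoes the offset arithmetic. The helper names and the values length = 4, n = 2 are made up for illustration, and it assumes length and n are both positive.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers mirroring the layout in the question:
//   a = [0, length*n), b = [length*n, 2*length*n), c, d follow in the same way.
// Returns which region (0 = a, 1 = b, 2 = c, 3 = d) a given element offset
// into the extern __shared__ array falls in. Assumes length > 0 and n > 0.
constexpr std::size_t regionOf(std::size_t offset, std::size_t length, std::size_t n) {
    return offset / (length * n);
}

// Region in which the slice &a[index * n] starts.
constexpr std::size_t regionReadViaA(std::size_t index, std::size_t length, std::size_t n) {
    return regionOf(index * n, length, n);
}

// Region in which the slice &b[index * n] starts.
constexpr std::size_t regionWrittenViaB(std::size_t index, std::size_t length, std::size_t n) {
    return regionOf(length * n + index * n, length, n);
}

// True if the slice thread `reader` reads through &a[reader * n] is exactly
// the slice thread `writer` writes through &b[writer * n].
constexpr bool readAliasesWrite(std::size_t reader, std::size_t writer,
                                std::size_t length, std::size_t n) {
    return reader * n == length * n + writer * n;
}

// Made-up example values: length = 4, n = 2.
static_assert(regionReadViaA(3, 4, 2) == 0, "index < length: the read stays inside a");
static_assert(regionReadViaA(4, 4, 2) == 1, "index == length: the read lands in b");
static_assert(readAliasesWrite(4, 0, 4, 2), "thread 4 reads the slice thread 0 writes");
```

If the real kernel guarantees index < length for every thread, none of these overlaps can happen and this explanation does not apply. If it does not, a thread can read a slice that another thread wrote in an earlier devFunc call, and that is exactly the situation __syncthreads() exists for.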