help with kernel synchronization?

No time to look at your code in detail, but triple check that you aren’t accessing shared memory out of bounds - Fermi will report an unspecified launch failure if you do so. Running your kernel through Ocelot’s PTX emulator is another way to test this.

I allocate threadsPerblock elements of shared memory before I call my kernel (Do_Scan_block), so each thread has 1 element to load from device memory to shared memory.

When I debug, it looks like the shared memory position belonging to the next thread block gets overwritten with a zero after scan_block finishes and control returns to Do_scan_block. I cannot see why this happens.

EDIT: It seems that the result varies with the amount of shared memory given to the kernel. Isn't (number of threads per block)*sizeof(T) the right way to specify the amount of shared memory for a thread block, given that each thread is responsible for one element?

This error is driving me nuts!

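EDIT: I found the error. It was as MisterAnderson42 said: I was indexing shared memory with the thread's index over the entire grid (the variable tid in my code) rather than its index within the block, so every block after the first accessed shared memory out of bounds.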
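For anyone who hits the same thing, here is a minimal sketch of the indexing fix. It is not my actual scan code: the kernel name stage_in_shared, the helper run_check, the choice of int for T and the sizes are all made up for illustration. The point is only that global memory is indexed with the grid-wide tid, while shared memory is indexed with threadIdx.x, and that the launch passes threadsPerBlock * sizeof(T) bytes of shared memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

using T = int;  // assumption: the element type; the original code is templated on T

// Hypothetical kernel (not the original Do_Scan_block): each block stages its
// slice of `in` in shared memory, then each thread adds its left neighbour
// within the block.
__global__ void stage_in_shared(const T* in, T* out, int n)
{
    // Dynamic shared memory; its size is the third launch parameter.
    extern __shared__ T smem[];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide index: global memory only
    int lid = threadIdx.x;                            // block-local index: shared memory

    // Writing smem[tid] here would be out of bounds for every block except block 0.
    smem[lid] = (tid < n) ? in[tid] : T(0);
    __syncthreads();

    if (tid < n) {
        T left = (lid > 0) ? smem[lid - 1] : T(0);
        out[tid] = smem[lid] + left;
    }
}

// Hypothetical host-side check: fills the input with ones and verifies the result.
bool run_check()
{
    const int n = 1000;
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    std::vector<T> h_in(n, 1), h_out(n, 0);
    T* d_in = nullptr;
    T* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(T)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(T)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(T), cudaMemcpyHostToDevice);

    // One shared element per thread: threadsPerBlock * sizeof(T) bytes per block.
    stage_in_shared<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(T)>>>(d_in, d_out, n);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), d_out, n * sizeof(T), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // The first thread of each block has no left neighbour, so it keeps 1; the rest get 2.
    for (int i = 0; i < n; ++i) {
        T expected = (i % threadsPerBlock == 0) ? 1 : 2;
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

Calling run_check() from main should return true on a working device. So (number of threads per block)*sizeof(T) was the right size all along; the out-of-bounds access came purely from using tid as the shared memory index.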
