Problems with shared memory protection/multiple active thread blocks per multiprocessor

Hi everyone,

I’ve got problems with the following code:

__global__ void myKernel(..............)
{
	__shared__ unsigned int POINTS[26][26];

	__shared__ int changed;

	..............

	#pragma unroll 1	// no unroll
	do {
		..............

		if (!(threadIdx.y) && !(threadIdx.x))
			changed = 0;

		..............

		if ((POINTS[letter1][LETTER] ^ POINTS[LETTER][letter1])) {
			POINTS[letter1][LETTER] = 1;
			POINTS[letter2][pos2] = 1;
			POINTS[LETTER][letter1] = 1;
			POINTS[pos2][letter2] = 1;
			changed = 1;
		}

		..............

	} while (changed);



The above code works nicely if there is only one active thread block per multiprocessor. Otherwise problems appear: the loop may terminate earlier than expected, or eventually hang the whole computer. I suspect one of two causes:

  • Out-of-bound writes to shared memory: every multiprocessor has a single 16 KB on-chip memory that is partitioned among the active blocks as shared memory, so an out-of-bound write from one thread block may corrupt the data of other "concurrently" active blocks. I am not sure whether the shared memory partitions are protected by hardware (another topic has discussed this). I've reviewed my program carefully for possible out-of-bound writes but found nothing.

  • How the multiprocessor handles __syncthreads(): the loop above contains some __syncthreads() calls and conditional paths; I am not sure whether __syncthreads() inside divergent branches confuses the GPU hardware.
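If out-of-bound writes are the suspect, one way to hunt them down is to funnel every shared-memory access through a checked accessor. A minimal sketch (all names here are my own; SM 1.x has no device-side assert, so the first violation is recorded in a global flag the host can read back after the kernel):

```cuda
// Hypothetical debug helper for catching out-of-bound POINTS accesses.
__device__ int g_oobFlag = 0;   // host copies this back and checks it

__device__ unsigned int *checkedCell(unsigned int (*p)[26], int r, int c)
{
	if (r < 0 || r >= 26 || c < 0 || c >= 26)
		g_oobFlag = 1;          // remember that a bad index occurred
	// clamp so the faulty access cannot stray into another block's memory
	r = min(max(r, 0), 25);
	c = min(max(c, 0), 25);
	return &p[r][c];
}

// usage inside the kernel:
//   *checkedCell(POINTS, letter1, LETTER) = 1;
```

Once the kernel runs clean with the checks in place, the accessor can be removed again.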

The program runs on a GTX 260 card (compute capability 1.3). The kernel is launched with 26x26 blocks of Nx26 threads (N < 17). For now I use a workaround: each block occupies more than 8192 bytes of shared memory, which prevents multiple blocks from being active on one multiprocessor.
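For reference, the workaround amounts to padding the block's shared-memory footprint past half of the 16 KB per multiprocessor, so the occupancy calculation only admits one block. A sketch (the padding array and the dummy touch are my own; the touch is there because the compiler may discard a shared array it can prove is never used):

```cuda
__global__ void myKernel(..............)
{
	__shared__ unsigned int POINTS[26][26];   // 26*26*4 = 2704 bytes
	__shared__ unsigned char occupancyPad[8192]; // total > 8192 bytes

	// dummy touch so occupancyPad is not optimised away
	if (threadIdx.x == 0xFFFF)
		occupancyPad[0] = 0;
	...
}
```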

Anyone interested in the full source code can have a look at my whole project here:

I appreciate everyone’s help on this issue.


You should not be modifying and reading the same variables in one "time segment" (a block of code between __syncthreads() calls). Try adding a __syncthreads() at the beginning of the if statement. Also, I believe the loop won't be unrolled anyway ("the compiler unrolls small loops with a known trip count"), so you don't need the pragma. Last, you should add some bounds-checking code.
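Concretely, the advice above means separating the reset of `changed`, the test-and-write phase, and the loop-condition check with barriers. A sketch using the names from the original fragment (letter1/LETTER/letter2/pos2 are assumed to be computed earlier in the kernel):

```cuda
do {
	if (threadIdx.x == 0 && threadIdx.y == 0)
		changed = 0;
	__syncthreads();    // all threads see changed == 0 before testing

	if (POINTS[letter1][LETTER] ^ POINTS[LETTER][letter1]) {
		POINTS[letter1][LETTER] = 1;
		POINTS[letter2][pos2]   = 1;
		POINTS[LETTER][letter1] = 1;
		POINTS[pos2][letter2]   = 1;
		changed = 1;    // benign race: every writer stores the same 1
	}
	__syncthreads();    // all writes visible before evaluating the condition
} while (changed);
```

Note that both __syncthreads() calls sit outside the if, so every thread in the block reaches them; a __syncthreads() inside a divergent branch has undefined behaviour.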

hope it helps,

Thanks for your help. I had thought about mixing writes and reads of shared memory in the same "time segment", and I designed the algorithm so that the order of writes and reads (across threads) does not affect the final result (although the number of loop iterations may vary). The program does work anyway in the case of a single active block per multiprocessor.

I’m going to be working on improving Michael Boyer’s tool (which automatically checks for RW hazards) so that it works on real-world code; I’ll post it in this forum if I get anything done.

Waiting for your work :thumbup:

I’ve declared the shared-memory variables as volatile and exhaustively checked for out-of-bound reads and writes. Unfortunately, the problem is still there. I’m starting to think this is a bug in CUDA.