Getting a deadlock while calling __syncthreads()


I have the following code fragment for an all-to-one reduction:

__shared__ __device__ void reduction_add(int* array)
{
	int nTotalThreads = blockDim.x;
	int tx = threadIdx.x;

	while (nTotalThreads > 1)
	{
		int half_point = (nTotalThreads >> 1);

		if (tx < half_point)
		{
			*(array+tx) += *(array+tx+half_point);
		}
		__syncthreads();

		nTotalThreads = (nTotalThreads >> 1);
	}
}

Initially nTotalThreads = 56. The block runs 56 concurrent threads, each working on one element of the array, and the array has 56 elements.
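A side note on the arithmetic for 56 elements: with plain halving, the counts go 56, 28, 14, 7, 3, 1. At a count of 7 the midpoint is 3, so element 6 is never added, and at a count of 3 element 2 is skipped. The sketch below is one possible variant that rounds the midpoint up so no element is dropped. It is only an illustration and is not offered as an explanation of the cuda-gdb behaviour. The names `reduction_add_any_size`, `sum_kernel` and `reduce_on_device` are hypothetical, the `__shared__` qualifier is dropped from the function on the assumption that it was meant for the array, and the array is assumed to live in shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 56  // array size and block size, as in the question

// Hypothetical variant of the halving loop: the midpoint is rounded up,
// so the live count goes 56 -> 28 -> 14 -> 7 -> 4 -> 2 -> 1.
__device__ void reduction_add_any_size(int* array)
{
    int n = blockDim.x;   // number of elements still to be combined
    int tx = threadIdx.x;

    while (n > 1)
    {
        int half_point = (n + 1) >> 1;   // ceil(n / 2)

        // Only threads with a partner in the upper part do an add.
        if (tx + half_point < n)
        {
            array[tx] += array[tx + half_point];
        }
        // Outside the if, so every thread in the block reaches it.
        __syncthreads();

        n = half_point;
    }
}

__global__ void sum_kernel(const int* in, int* out)
{
    __shared__ int buf[N];            // assumed: the array is in shared memory
    buf[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    reduction_add_any_size(buf);

    if (threadIdx.x == 0)
    {
        *out = buf[0];
    }
}

// Host helper (hypothetical): copies N ints to the device, launches one
// block of N threads, and returns the reduced sum.
int reduce_on_device(const int* h_in)
{
    int h_out = 0;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    sum_kernel<<<1, N>>>(d_in, d_out);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    int h_in[N];
    for (int i = 0; i < N; ++i) h_in[i] = 1;
    printf("sum = %d\n", reduce_on_device(h_in));
    return 0;
}
```

With an input of all ones, the result can be compared against N to check that every element was counted.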

While debugging the program in cuda-gdb, the first iteration (nTotalThreads = 56, half_point = 28) runs fine. In the second iteration, when execution reaches __syncthreads() (I step through with a sequence of next commands), the program hangs: it blocks and never returns to the gdb prompt. Surprisingly, if I compile the program and run it from bash, it runs to completion. Even in gdb, if no breakpoint is set and I type “run” at the prompt, there is no problem. I cannot figure out what the reason for this might be.

Any help would be really appreciated.


PS: The other parts of the code are not relevant, so I have not included them here. I compiled the program using:

nvcc -g -G -o lookup.o

Inside gdb, I set the breakpoint using