Problem in EmuDebug mode

I am running a program written in CUDA, built with MSVS 2008.

The program ran, but the result was obviously wrong.

Then I built it in EmuDebug mode and ran it again.

The program seems to be stuck in a deadlock.

I used 255 threads. When I set a breakpoint and stepped through the code, I found that all threads run sequentially and load their data successfully.

Each finished thread waits at __syncthreads(). However, when the last thread reached that statement, the program hung.
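For reference, this is a minimal sketch of the barrier pattern as I understand it should work: the guarded load sits inside the `if`, while `__syncthreads()` sits outside it, so every thread of the block reaches the same barrier. The names `loadThenSync`, `runLoadThenSync`, `src`, `dst` and `n` are hypothetical and are not taken from my program.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: threads with index < n copy one element into shared
// memory; every thread, whether it copied or not, then reaches the barrier.
__global__ void loadThenSync(const int* src, int* dst, int n)
{
    extern __shared__ int buf[];
    int tid = threadIdx.x;

    if (tid < n) {
        buf[tid] = src[tid];          // guarded load only
    }
    __syncthreads();                  // outside the branch: all threads arrive

    // Read a neighbour's element, which is only safe after the barrier.
    if (tid < n) {
        dst[tid] = buf[(tid + 1) % n];
    }
}

// Hypothetical host helper: launches one block of `threads` threads and
// returns true if dst[i] == src[(i + 1) % n] for every i < n.
bool runLoadThenSync(int n, int threads)
{
    if (n <= 0 || n > threads) return false;

    std::vector<int> hostSrc(n), hostDst(n, -1);
    for (int i = 0; i < n; ++i) hostSrc[i] = 10 * i;

    int* devSrc = 0;
    int* devDst = 0;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc((void**)&devSrc, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void**)&devDst, bytes) != cudaSuccess) {
        cudaFree(devSrc);
        return false;
    }
    cudaMemcpy(devSrc, &hostSrc[0], bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory backing buf[].
    loadThenSync<<<1, threads, bytes>>>(devSrc, devDst, n);

    cudaError_t err =
        cudaMemcpy(&hostDst[0], devDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devSrc);
    cudaFree(devDst);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (hostDst[i] != hostSrc[(i + 1) % n]) return false;
    }
    return true;
}
```

Calling `runLoadThenSync(48, 255)` uses the same 255-thread block as my program, with only the first 48 threads loading data.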

I am not sure what happened and do not know what else to try.

What I am trying to do in the following code fragment is copy data from global memory to shared memory.

I also found that data inside the vertices array gets changed, for example vertices[8], but in different runs it is changed by a different thread.
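One thing I can still check is whether the dynamic shared memory size given at kernel launch covers every sub-array that the fragment below carves out of `array`. The sketch below only adds up the offsets used there. The helper names `zetaOffset` and `sharedBytesNeeded` are hypothetical, and the length of `zeta` is left as a parameter because the offsets do not fix it. I have not confirmed that an undersized allocation is the cause; it is just a guess that writes past the end might explain the changing values.

```cuda
#include <cstddef>

// Index (in ints) at which zeta starts, summed from the pointer offsets:
// vertices 48, adjIndex 96, adjcent 255, triangleTable 255,
// cmNbrs 48, sortedCS 48, unsortedCS 48, eta 255.
__host__ __device__ inline int zetaOffset()
{
    return 48 + 96 + 255 + 255 + 48 + 48 + 48 + 255;
}

// Bytes of dynamic shared memory needed for all sub-arrays.
// zetaLen is a hypothetical parameter: the number of ints stored in zeta.
__host__ __device__ inline size_t sharedBytesNeeded(int zetaLen)
{
    return (size_t)(zetaOffset() + zetaLen) * sizeof(int);
}
```

The result would be passed as the third launch parameter, for example `<<<1, 255, sharedBytesNeeded(zetaLen)>>>`. If the value actually passed is smaller than this, the later sub-arrays point outside the allocated shared memory.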

Can anyone with experience enlighten me?

Thank you very much in advance.

int threadId = threadIdx.x;

	// load vertices_size from GPU global memory into a register

	int vcnt = (*devVertices_size);

	// shared memory declaration

	extern __shared__ int array[];

	int* vertices = (int*)&array;

	int* adjIndex = (int*)&vertices[48];

	int* adjcent = (int*)&adjIndex[96];

	int* triangleTable = (int*)&adjcent[255];

	int* cmNbrs = (int*)&triangleTable[255];

	int* sortedCS = (int*)&cmNbrs[48];

	int* unsortedCS = (int*)&sortedCS[48];

	int* eta = (int*)&unsortedCS[48];

	int* zeta = (int*)&eta[255];

	// Load data from GPU global memory to on-chip shared memory

	if (threadId < vcnt) {

		vertices[threadId] = devVerticesArray[threadId];

		adjIndex[2*threadId] = devAdjIndexArrayInOneGrid[2*threadId];

		adjIndex[2*threadId+1] = devAdjIndexArrayInOneGrid[2*threadId+1];


	adjcent[threadId] = devAdjcentArrayInOneGrid[threadId];

	triangleTable[threadId] = devLocal_TT[threadId];