getting random output for different runs of the same program

Hi,

This code fragment

#define convert_thread_index_to_table_row(bx, tx) bx*blockDim.x+tx;

__global__ void MatCopyKernel(int* table, int* candidate, int crows, int ccols, int* another, int* count)
{
    int i  = threadIdx.x;
    int bx = blockIdx.x;
    //int j = threadIdx.y;
    int j, k, l;
    int item;

    for (j = 0; j < crows; j++)
    {
        for (k = 0; k < ccols; k++)
        {
            item = candidate[j*ccols+k];
            __syncthreads();
            int row = convert_thread_index_to_table_row(bx, i);
            if (table[row*ITEMS+(item-1)] == 1)
            {
                if (k == 0)
                    another[row] = 1;
                else
                    another[row] *= 1;
            }
            else
            {
                another[row] = 0;
            }
            __syncthreads();
        }
        __syncthreads();

        //reduction
        l = 2;
        while ((l/2) < NUMRECORDS)
        {
            __syncthreads();
            int row = convert_thread_index_to_table_row(bx, i);
            if ((row % l) == 0 && row + l/2 < NUMRECORDS)
            {
                another[row] = another[row] + another[(row+(l/2))];
            }
            __syncthreads();
            l *= 2;
        }
        //reduction ends

        count[j] = another[0];
        __syncthreads();
    }
}

The output is random for each run. I don't understand why the output would differ between runs: if the calculation were simply wrong, it should produce the same incorrect result every time. I guess it is due to synchronization, but there are (redundant) __syncthreads() calls at the beginning and end of each iteration of the loops. So how can the output differ?

Wrong, random output can also be caused by array overflows: if you read or write a cell beyond the array bounds, you get random rubbish.
If you could also post the host code where you allocate memory and call the kernel, we could say more.
Also, what are ITEMS and NUMRECORDS?
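
For reference, the host side for this kernel might look roughly like the sketch below. ITEMS, NUMRECORDS, crows, ccols and the launch configuration are my assumptions, not your actual code; the comment next to each cudaMalloc shows how the kernel indexes that array, which is the size it has to have to avoid exactly the kind of out-of-bounds garbage described above.

#include <cuda_runtime.h>

#define ITEMS      32        // assumed: number of distinct items per table row
#define NUMRECORDS 1024      // assumed: number of table rows

// kernel from the post above, assumed to be defined in the same .cu file
__global__ void MatCopyKernel(int* table, int* candidate, int crows, int ccols, int* another, int* count);

int main(void)
{
    int crows = 4, ccols = 3;               // assumed candidate dimensions
    int threads = 256;
    int blocks  = NUMRECORDS / threads;     // one thread per table row

    int *d_table, *d_candidate, *d_another, *d_count;
    cudaMalloc((void **)&d_table,     NUMRECORDS * ITEMS * sizeof(int)); // read as table[row*ITEMS + item-1]
    cudaMalloc((void **)&d_candidate, crows * ccols * sizeof(int));      // read as candidate[j*ccols + k]
    cudaMalloc((void **)&d_another,   NUMRECORDS * sizeof(int));         // written as another[row]
    cudaMalloc((void **)&d_count,     crows * sizeof(int));              // written as count[j]

    // ... cudaMemcpy the table and candidate data to the device here ...

    // Note: the posted kernel writes another[row] without checking row < NUMRECORDS,
    // so blocks*threads must not exceed NUMRECORDS or d_another overflows.
    MatCopyKernel<<<blocks, threads>>>(d_table, d_candidate, crows, ccols, d_another, d_count);
    cudaDeviceSynchronize();

    // ... cudaMemcpy d_count back to the host and inspect it ...
    return 0;
}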

__syncthreads() only syncs the threads within a thread block, as far as I can tell from the manual and various posts on this site. I had similar trouble assuming that __syncthreads() synced all threads; it does not. Some threads in other blocks may run ahead of the sync, and that is where the trouble begins. I have not found a good solution for this yet, though there are some academic papers on the web describing a gpu_sync function.
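
One workaround that needs no special barrier at all is to end the kernel at the point where the grid-wide sync is required and launch a second kernel: launches issued to the same stream execute in order, so every block of the first kernel has finished, and its global-memory writes are visible, before the second one starts. A rough sketch, with made-up names (phase1, phase2, d_data) purely for illustration:

#include <cuda_runtime.h>

// hypothetical split of the work at the point where a grid-wide barrier is needed
__global__ void phase1(int *d_data, int n) { /* ... per-block work ... */ }
__global__ void phase2(int *d_data, int n) { /* ... work that depends on all of phase1 ... */ }

void run_both_phases(int *d_data, int n, int blocks, int threads)
{
    phase1<<<blocks, threads>>>(d_data, n);
    // kernels in the same stream run in order, so phase2 only starts
    // once every block of phase1 has finished
    phase2<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();   // only needed before the host reads the results
}

The cost is the extra launch overhead, but a kernel launch boundary is a synchronization point across blocks that the programming model actually guarantees.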

Google this one:

"Inter-Block GPU Communication via Fast Barrier Synchronization", Shucai Xiao and Wu-chun Feng, Department of Computer Science, Virginia Tech.

However, I have not yet gotten this to work.
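
For what it's worth, the "simple" barrier in that paper boils down to an arrival counter in global memory that thread 0 of every block bumps with an atomic and then spins on. This is only my rough sketch of the idea, not their code, and it deadlocks unless every block of the grid is resident on the GPU at the same time:

__device__ volatile int g_mutex = 0;   // arrival counter in global memory

// goalVal must be gridDim.x times the number of times the barrier has been used,
// since the counter is never reset
__device__ void gpu_sync(int goalVal)
{
    __threadfence();                       // make this block's global writes visible first
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);     // announce that this block has arrived
        while (g_mutex != goalVal) { }     // spin until all blocks have arrived
    }
    __syncthreads();                       // release the remaining threads of the block
}

Given how easy it is to get this wrong, splitting the work into two kernel launches (as sketched above) is usually the safer option.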