Problem with __syncthreads()

As the topic title says, I've got a big problem.

This function seems not to work on my machine. I use MS Visual Studio 2010 Ultimate with Parallel Nsight.

First of all, programs using __syncthreads() do compile and run, but in the editor the function is always underlined, and when I hover the mouse pointer over it, it says it is undefined. Strangely, it still compiles…

And furthermore, this instruction seems not to work at all.

__global__ void oceniajDev(float *tabDev, int row, int col, float *wynikDev)
{
	int bid = blockIdx.x;
	int tid = threadIdx.x;
	if(tid < col)
	{
		//atomicAdd(&wynikDev[bid], tabDev[bid*row+tid]);
		wynikDev[bid] = tabDev[bid*row+tid];
		__syncthreads();
	}
}

I need to avoid atomicAdd because I need to use floats, and there is no atomic operation that sums floats.

So I figured I would synchronize the threads in each block. Obviously there is a thread conflict: threads read and write the same memory location at the same time. __syncthreads() should make each thread wait until it has finished all of its operations before stepping to another thread. Well, after printing my results, it seems not to work.

Is there some additional header/library I should include? What am I doing wrong? Any help would be welcome.

Oh no. What happens if tid >= col? Those threads will never reach the sync instruction. When only part of a block reaches a __syncthreads(), the behavior is undefined and your kernel can hang.

Put __syncthreads() outside of the if. Anyway, for this kernel you do not need it.

__syncthreads() does not do what you think it does. It only says: stop here until all threads of the block have arrived here. It does not make the threads take turns. Use shared memory to save intermediate results, and then collect the data in each block.
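
To make the barrier behavior concrete, here is a minimal sketch of my own (not taken from the kernel in the question). It reverses a short array inside one block: each thread writes one value to shared memory, every thread meets at the barrier, and only then does each thread read a value that a different thread wrote. The names reverseInBlock, reverseOnGpu and BLOCK are hypothetical, and the sketch assumes the array fits in a single block.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 64   // hypothetical block size chosen for this sketch

// Reverses the first n elements (n <= BLOCK) of data in place, using one block.
__global__ void reverseInBlock(float *data, int n)
{
    __shared__ float tile[BLOCK];
    int tid = threadIdx.x;

    if (tid < n)
        tile[tid] = data[tid];          // only some threads have work to do...

    __syncthreads();                    // ...but every thread reaches the barrier

    if (tid < n)
        data[tid] = tile[n - 1 - tid];  // reads a value written by another thread
}

// Hypothetical host helper: runs the kernel on a copy of v and returns the result.
std::vector<float> reverseOnGpu(std::vector<float> v)
{
    int n = (int)v.size();
    assert(n > 0 && n <= BLOCK);
    size_t bytes = n * sizeof(float);

    float *d = 0;
    cudaError_t err = cudaMalloc((void **)&d, bytes);
    assert(err == cudaSuccess);
    err = cudaMemcpy(d, &v[0], bytes, cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    reverseInBlock<<<1, BLOCK>>>(d, n);

    // cudaMemcpy waits for the kernel to finish before copying back.
    err = cudaMemcpy(&v[0], d, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(d);
    return v;
}
```

Note that the barrier sits between the two if statements, not inside either of them, so the threads with tid >= n still arrive at it.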

A suggestion (works only when blockDim.x is a power of 2):

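__global__ void oceniajDev(float *tabDev, int row, int col, float *wynikDev)
{
        // Dynamic shared memory: pass blockDim.x * sizeof(double) bytes
        // as the third <<<...>>> launch argument.
        extern __shared__ double temp[];
        int bid = blockIdx.x;
        int tid = threadIdx.x;

        // Threads past the end of the data contribute 0 to the sum.
        temp[tid] = (tid < col) ? tabDev[bid*row+tid] : 0.0;
        __syncthreads();        // all loads are done before the reduction starts

        // Tree reduction: half as many threads add at each step.
        for(int ofs=blockDim.x/2; ofs>0; ofs=ofs/2)
        {
                if(tid < ofs)
                {
                        temp[tid]=temp[tid]+temp[tid+ofs];
                }
                __syncthreads();        // outside the if, so every thread reaches it
        }

        if(tid==0)
        {
                wynikDev[bid]=temp[0];
        }
}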
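
If col can be larger than blockDim.x, one possible variation is to let each thread first add up a strided slice of its row and only then run the tree reduction. The following is a rough sketch under my own assumptions, not a drop-in replacement: the names rowSumKernel and rowSumsOnGpu are hypothetical, the block size of 128 is arbitrary (it still has to be a power of 2), and the data is assumed to be stored row by row with rowLen values per row, so row b starts at index b*rowLen. That indexing differs from the bid*row+tid used above.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One block per row; blockDim.x must be a power of 2.
__global__ void rowSumKernel(const float *data, int rowLen, float *out)
{
    extern __shared__ float partial[];   // blockDim.x floats, sized at launch
    int tid = threadIdx.x;
    const float *rowPtr = data + (size_t)blockIdx.x * rowLen;

    // Each thread sums elements tid, tid + blockDim.x, ... of its row.
    float sum = 0.0f;
    for (int i = tid; i < rowLen; i += blockDim.x)
        sum += rowPtr[i];
    partial[tid] = sum;
    __syncthreads();                     // all partial sums are written

    // Tree reduction over the per-thread partial sums.
    for (int ofs = blockDim.x / 2; ofs > 0; ofs /= 2) {
        if (tid < ofs)
            partial[tid] += partial[tid + ofs];
        __syncthreads();                 // reached by every thread
    }

    if (tid == 0)
        out[blockIdx.x] = partial[0];
}

// Hypothetical host helper: returns one sum per row of the rows x rowLen matrix m.
std::vector<float> rowSumsOnGpu(const std::vector<float> &m, int rows, int rowLen)
{
    const int threads = 128;             // power of 2, arbitrary choice
    assert(rows > 0 && rowLen > 0 && (int)m.size() == rows * rowLen);

    float *dIn = 0, *dOut = 0;
    cudaError_t err = cudaMalloc((void **)&dIn, m.size() * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMalloc((void **)&dOut, rows * sizeof(float));
    assert(err == cudaSuccess);
    err = cudaMemcpy(dIn, &m[0], m.size() * sizeof(float), cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    // Third launch argument: bytes of dynamic shared memory per block.
    rowSumKernel<<<rows, threads, threads * sizeof(float)>>>(dIn, rowLen, dOut);
    err = cudaGetLastError();
    assert(err == cudaSuccess);

    std::vector<float> sums(rows);
    err = cudaMemcpy(&sums[0], dOut, rows * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dIn);
    cudaFree(dOut);
    return sums;
}
```

Calling rowSumsOnGpu(m, rows, rowLen) on a host vector gives one float per row. Neither kernel needs atomicAdd, because only thread 0 of each block writes to the output.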