__threadfence() problem

Greetings!

I have a problem where __threadfence() doesn't seem to do what it's supposed to. I suspect the fault is in my understanding rather than in the command, but I'm totally out of ideas.

What I'm trying to achieve is that every block copies data to a reserved area of global memory, then each block reads the data placed there by its neighbour. But it's unreliable; my best guess is a synchronization error, but if __threadfence() works the way I read it does, that shouldn't be possible…

Here's my method; any help or ideas are appreciated :)

template <typename T>
__device__ void Migrate(float* fitnes, T* data, unsigned indivsize, T* d_array, unsigned migrates){

	const unsigned tid = threadIdx.x;

	Sort<T>(fitnes, data, indivsize);

	if(tid < migrates){

		T* gl_indiv = &d_array[blockIdx.x*indivsize*migrates + tid*indivsize];                //global memory slot for storing this block's best individuals
		T* gl_nghb = &d_array[((blockIdx.x+1)%gridDim.x)*indivsize*migrates + tid*indivsize]; //pointer to the neighbour's global memory slot
		T* loc_indivB = &data[tid*indivsize];                                                 //pointer to the best individuals in shared memory, starting from the first
		T* loc_indivW = &data[(blockDim.x-migrates)*indivsize + tid*indivsize];               //pointer to the worst individuals in the block

		for(unsigned i = 0; i < indivsize; i++)
			gl_indiv[i] = loc_indivB[i];  //moving the best individuals to global memory

		__threadfence();

		for(unsigned i = 0; i < indivsize; i++)
			loc_indivW[i] = gl_nghb[i];   //overwriting the worst individuals with the neighbour's best
	}

	__syncthreads();
}

And the problem is: sometimes some blocks read the original contents of the neighbour's global array instead of the data written there by the corresponding block, which in my understanding means that the read happens before the write.

Oh, and my card is compute capability 1.1.

__threadfence() is not a global barrier. It doesn’t guarantee anything about scheduling.
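To make that concrete, here is a minimal sketch of what the fence does guarantee (the names payload, flag and fence_demo are hypothetical, not from the code above): it orders one thread's writes, so a reader that sees the flag is guaranteed to also see the data. It does not make anybody wait, and nothing forces the writing block to ever run alongside the reading one:

__device__ int payload;
__device__ volatile unsigned flag = 0;

__global__ void fence_demo(int* out){
	if(blockIdx.x == 0 && threadIdx.x == 0){
		payload = 42;      // (1) write the data
		__threadfence();   // (2) ensures (1) is visible device-wide before (3)
		flag = 1;          // (3) publish
	}
	if(blockIdx.x == 1 && threadIdx.x == 0){
		while(flag == 0){} // can spin forever if block 0 is never resident
		                   // at the same time; the fence makes no one wait
		__threadfence();   // order the flag read before the payload read
		*out = payload;    // once flag == 1 was seen, this reads 42
	}
}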

Well, that's bad news. From reading the manual, I thought it waits until all threads on the device see the change in memory, i.e. that it's global.

Is there any way to implement what I'm trying to do, i.e. a global sync on global memory access (other than stopping and restarting the kernel, because that's not an option), besides moving the data to global memory via atomicExch()? That would hurt performance very badly.

I guess __threadfence_system() wouldn't do the trick either.

Also, I guess the scheduler uses some time-multiplexing method, so if I can delay the data read long enough, that might just do what I want. Am I correct?
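If restarting the kernel is really off the table, the usual workaround (in the style of Xiao and Feng's lock-free GPU barrier) is a counter in global memory, sketched below with a hypothetical d_arrived variable that has to be zeroed before launch. The hard assumption is that every block of the grid is resident on the GPU at the same time; if the grid has more blocks than the card can run concurrently, the resident blocks spin forever waiting for blocks that will never be scheduled, so delaying the read cannot fix that case either:

__device__ volatile unsigned d_arrived = 0; // hypothetical counter, zeroed before launch

__device__ void global_barrier_sketch(){
	__syncthreads();                          // the whole block arrives
	if(threadIdx.x == 0){
		__threadfence();                      // publish this block's global writes first
		atomicAdd((unsigned*)&d_arrived, 1u); // announce arrival (32-bit global atomics exist on CC 1.1)
		while(d_arrived < gridDim.x){}        // spin until every block has arrived
	}
	__syncthreads();                          // release the rest of the block
}

As written it is single-use per launch; making it reusable needs sense reversal, and even then the whole construction is outside what the programming model officially guarantees.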

Not even the atomic functions did the trick. I tried this after the __threadfence(), but the results are still unreliable.

__threadfence();
unsigned value = atomicInc(&count, migrates*gridDim.x);
unsigned cnt = 0;
while(value != (migrates*gridDim.x - 1)){ cnt++; } //just to make sure the loop doesn't get optimized out of the final code

I know this is very inefficient, and it doesn't work anyway. Is there any way, on any card, to do this at all?
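Two things are going on in that last attempt: value is a private copy that never changes inside the loop, so every thread except the one that drew the last ticket spins forever; and even spinning on the shared counter itself would only work when all blocks are co-resident, as sketched above. On a compute 1.1 card there is no supported way to do a grid-wide barrier. On compute capability 6.0+ with CUDA 9 or later there finally is one: cooperative groups. A sketch of the shape the migration step takes (the kernel name and parameters are placeholders; the kernel must be compiled with relocatable device code and launched through cudaLaunchCooperativeKernel, and the whole grid must fit on the device at once):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void migrate_kernel(float* d_array /* ... */){
	cg::grid_group grid = cg::this_grid();

	// ... every block writes its best individuals to its slot in d_array ...

	grid.sync(); // true grid-wide barrier: every global write above is
	             // visible to every block below this line

	// ... every block reads its neighbour's slot ...
}

On the host, cudaOccupancyMaxActiveBlocksPerMultiprocessor can be used to pick a grid size that is guaranteed to be co-resident before launching.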