Greetings!
I have a problem which looks as the __threadfence() command would’t do what it supposed to. Guess the fault is in my understanding not the command, but im totally out of ideas.
What I try to achieve is that every block copys data to a global memory reserve, than each of them reads the one put by their neighbour, but its unreliable, my best guess is sync error, but if the threadfence() works as I read, it couldnt be…
Heres my method, any help or idea is appreciated :)
template <typename T>
__device__ void Migrate(float * fitnes, T* data, unsigned indivsize, T* d_array, unsigned migrates){
const unsigned tid=threadIdx.x;
Sort<T>(fitnes,data,indivsize);
if(tid<migrates){
T* gl_indiv=&d_array[blockIdx.x*indivsize*migrates+tid*indivsize]; //global memory pointer for storing best indivs
T* gl_nghb=&d_array[((blockIdx.x+1)%gridDim.x)*indivsize*migrates+tid*indivsize]; //pointer to the neighbours global memory storage
T* loc_indivB=&data[tid*indivsize]; //pointer to the individuals in shared memory, starting from the first
T* loc_indivW=&data[(blockDim.x-migrates)*indivsize+tid*indivsize]; //pointer to the worst indivs in the block
for(int i=0;i<indivsize;i++)
gl_indiv[i]=loc_indivB[i]; //moving best individuals to global memory
__threadfence();
for(int i=0;i<indivsize;i++)
loc_indivW[i]=gl_nghb[i]; //overwriting worst indivs whith the neigbours bests
}
__syncthreads();
}
And the problem is: sometimes some blocks read the original contents of the neighbours global array, instead of the stuff written there by the corresponding block. Which in my understanding means, that the read happens prior to the write.
Oh, and I have compute capability 1.1 with my card.