I have two matrices of different sizes on the CUDA device.
The input matrix is 512x62, launched with dimBlock(512,1,1) and dimGrid(1,64), and the output matrix is 150x150.

I compute an index into the output matrix based on the index in the input matrix and write the value to that index.

Before any thread writes its computed value to the output matrix, it checks that its value is larger than the current value at that position in the output matrix, and that the computed index is within the bounds of the output matrix.
My question is: how can I prevent other threads from writing to the matrix while one thread is writing its value (a data race)?

I would suggest splitting your program into two steps: first compute all the values, then perform a reduction. You could also consider using shared memory and splitting the image into blocks. It very much depends on the amount of computation per thread.
By the way, why do you iterate over the source data rather than the result data?

Thanks for your quick reply. I am trying to implement the kernel as you advised: computing the values first and then performing the reduction. But I am really stuck on the implementation of the reduction algorithm; the output is always incorrect.

Could you help me with some pseudocode for the reduction?

Here is what I have done:

I have a shared struct variable holding the column, row, and value of the new matrix for each block,

and later I try to read the values in this shared variable to perform the reduction.
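
A common shape for a block-level max-reduction over such a shared array is a tree reduction, halving the number of active threads each step. This is only a sketch under assumed names (the `Candidate` struct and its fields are placeholders, not your exact code):

```cuda
#include <cfloat>

// Placeholder for the per-thread (row, col, value) candidate you described.
struct Candidate { int row, col; float val; };

__global__ void blockMaxReduce(const Candidate *in, Candidate *out, int n)
{
    __shared__ Candidate s[512];            // one slot per thread in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Load into shared memory; pad out-of-range threads with a minimal value.
    s[tid].val = -FLT_MAX;
    if (i < n) s[tid] = in[i];
    __syncthreads();

    // Tree reduction: each step, the lower half keeps the larger candidate.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && s[tid + stride].val > s[tid].val)
            s[tid] = s[tid + stride];
        __syncthreads();                    // must run on every step, all threads
    }

    if (tid == 0) out[blockIdx.x] = s[0];   // thread 0 writes the block winner
}
```

A frequent cause of "output is always incorrect" here is a `__syncthreads()` that is skipped by some threads or placed inside the `if`; it must be reached by every thread of the block on every iteration.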

In short, the first stage is to write the modified 512x62 matrix to global memory, and the next stage is a reduction to the smaller size. A new kernel does this; each thread there represents one pixel of the small matrix.
You also need to check data locality: whether it is possible to split the matrices into blocks and use shared memory.
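
A rough sketch of that second-stage "gather" kernel, where one thread owns one output pixel and therefore no two threads ever write the same cell (`mapsTo` is a hypothetical placeholder for the actual index computation):

```cuda
#include <cfloat>

// Hypothetical predicate: does input cell (inRow, inCol) map to
// output cell (outRow, outCol)? Replace with the real index math.
__device__ bool mapsTo(int inRow, int inCol, int outRow, int outCol);

__global__ void gatherMax(const float *in, float *out)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;
    int oy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ox >= 150 || oy >= 150) return;     // output is 150x150

    float best = -FLT_MAX;
    for (int r = 0; r < 62; ++r)            // input is 512x62
        for (int c = 0; c < 512; ++c)
            if (mapsTo(r, c, oy, ox)) {
                float v = in[r * 512 + c];
                if (v > best) best = v;
            }

    // Only this thread writes this cell, so there is no data race.
    out[oy * 150 + ox] = best;
}
```

The brute-force double loop is just to make the ownership idea concrete; in practice you would invert the index mapping so each output thread reads only the few input cells that can map to it.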

Once again, thanks for your quick reply. After discussing with my teammate, we found that the best solution is, as you suggested, to compute from the result (output) to the source (input), because doing a reduction would grow memory usage as the input data increases.

You may also try global atomic writes; they should help, but they may be slow. They can be useful on compute capability 2.0 hardware, which has fast global atomics. Your matrix may also fit in the L2 cache.
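
For integer values, the atomic route is a one-liner with `atomicMax`, which makes the read-compare-write an indivisible step. A sketch under assumed names (`outRow`/`outCol` stand in for the real index mapping; note `atomicMax` has no float overload, so float values would need an `atomicCAS` loop instead):

```cuda
// Hypothetical index mapping from input (r, c) to the output matrix.
__device__ int outRow(int r, int c);
__device__ int outCol(int r, int c);

__global__ void scatterAtomicMax(const int *in, int *out)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column, 0..511
    int r = blockIdx.y;                              // row; grid is 1x64, so check
    if (c >= 512 || r >= 62) return;

    int oy = outRow(r, c), ox = outCol(r, c);
    if (ox < 0 || ox >= 150 || oy < 0 || oy >= 150) return;

    // The compare and the write happen as one atomic operation,
    // so concurrent threads targeting the same cell cannot race.
    atomicMax(&out[oy * 150 + ox], in[r * 512 + c]);
}
```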