Compress Data Image Processing

Dear all,

I have two matrices of different sizes on the CUDA device.
The input matrix is (512x62), launched with dimBlock(512,1,1) and dimGrid(1,64), and the output matrix is (150,150).

I compute an index in the output matrix based on the index in the input matrix and write the value to that index.

Before any thread writes its computed data to the output matrix, it checks that the value it holds is bigger than the current value in the output matrix, and that the computed indices are within the range of the output matrix.
My question is: how can I prevent other threads from writing to the matrix while a thread is currently writing its value (data race)?

Many thanks for your reply.

Willer

May I suggest splitting your program into two steps: first compute all the values, and afterwards perform the reduction. You could also consider using shared memory and splitting the image into blocks. It very much depends on the amount of computation per thread.
By the way, why do you go from the source data rather than the result data?
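
For the reduction step, a block-wide maximum in shared memory could be sketched roughly like this. The kernel name, the input pointer, and the padding value are placeholders for illustration, and the 512-thread block size is taken from your launch configuration:

[codebox]
#include <float.h>

// One block reduces 512 candidate values to a single maximum in shared memory.
// A second, tiny kernel (or a host-side loop) would then combine the per-block maxima.
__global__ void blockMaxKernel(const float *in, float *blockMax, int n)
{
    __shared__ float s[512];                     // one slot per thread in the block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : -FLT_MAX;         // pad out-of-range threads with a very small value
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        blockMax[blockIdx.x] = s[0];             // one result per block
}
[/codebox]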

Atomic writes could help, but they are relatively slow on compute 1.x hardware.

Like the previous responder, may I also suggest switching to a "gather" approach rather than doing "scattered" writes.
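
As an illustration of the gather idea, here is a rough sketch in which each thread owns one pixel of the 150x150 result and scans the source cells that map onto it, so every output element has exactly one writer and no atomics are needed. The inline mapping is only a stand-in for your mathematical function, and the full scan over the input is for clarity; in practice you would invert the mapping so each thread reads only the cells that can land on its pixel:

[codebox]
#include <float.h>

__global__ void gatherMaxKernel(const float *in, float *out,
                                int inRows, int inCols,
                                int outRows, int outCols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column owned by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // output row owned by this thread
    if (row >= outRows || col >= outCols) return;

    float best = -FLT_MAX;
    for (int r = 0; r < inRows; ++r) {
        for (int c = 0; c < inCols; ++c) {
            // Stand-ins for the real mathematical mapping from (r, c) to output coordinates:
            int orow = r;
            int ocol = c;
            if (orow == row && ocol == col)
                best = fmaxf(best, in[r * inCols + c]);
        }
    }
    out[row * outCols + col] = best;                   // single writer per output element
}
[/codebox]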

Dear Lev,

Thanks for your quick reply. I am trying to implement the kernel as you advised: first computing the values and then making the reduction. But I am really stuck with the implementation of the reduction algorithm; the output is always incorrect.

Please, can you help me with some pseudocode showing how to make a reduction?

Here is what I have done:

I have a shared struct variable holding the column, row, and value of the new matrix for each block,

and later I try to read the values in this shared variable to make the reduction.

Many thanks for your reply.

In short, the first stage is to write the modified 512x62 matrix to global memory, and the next stage is to complete the reduction down to the smaller size. A new kernel does that; each thread there represents one pixel of the small matrix.
You also need to check data locality, to see whether it is possible to split the matrices into blocks and use shared memory.
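
A rough sketch of those two stages might look like the following: stage one stores, for every input cell, the output coordinates and the value it wants to write, and stage two launches one thread per pixel of the small matrix to pick the maximum that targets it. The Candidate struct and the inline mappings are assumptions for illustration, and the linear scan in stage two is only to keep the sketch short; splitting it into blocks with shared memory, as suggested above, would cut the global reads:

[codebox]
#include <float.h>

struct Candidate { int row, col; float val; };

// Stage 1: one thread per input cell; each thread writes its own slot, so there is no race.
__global__ void stage1(const float *in, Candidate *cand, int inRows, int inCols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // e.g. dimBlock(512,1,1)
    int r = blockIdx.y;                              // e.g. dimGrid(1,64)
    if (r >= inRows || c >= inCols) return;

    Candidate out;
    out.row = r;                    // stand-in for the real mapping to the output row
    out.col = c;                    // stand-in for the real mapping to the output column
    out.val = in[r * inCols + c];   // stand-in for the computed value
    cand[r * inCols + c] = out;
}

// Stage 2: one thread per pixel of the small matrix performs the reduction.
__global__ void stage2(const Candidate *cand, int n,
                       float *out, int outRows, int outCols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= outRows || col >= outCols) return;

    float best = -FLT_MAX;
    for (int i = 0; i < n; ++i)                      // n = inRows * inCols candidates
        if (cand[i].row == row && cand[i].col == col)
            best = fmaxf(best, cand[i].val);
    out[row * outCols + col] = best;
}
[/codebox]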

Dear Lev,

Once again, thanks for your quick reply. After discussing with my teammate, we found that the best solution is, as you suggested, to compute from the result (output) to the source (input), because doing the reduction would grow the memory usage as the input data increases.

Many thanks.

You may also try using global atomic writes; they should help, but they may be slow. They can be useful on compute 2.0 hardware, which has fast global atomics. Your matrix may also fit in the L2 cache.
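
If the values can be stored as integers, the compare-and-write can be done as a single global atomicMax, which is atomic with respect to every thread on the device. A rough sketch, with the inline mapping again standing in for your mathematical function:

[codebox]
// Scatter with a global atomic maximum; assumes the output values are stored as ints.
__global__ void scatterAtomicMax(const int *in, int *out,
                                 int inRows, int inCols,
                                 int outRowMax, int outColMax, int outCols)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // e.g. dimBlock(512,1,1)
    int r = blockIdx.y;                              // e.g. dimGrid(1,64)
    if (r >= inRows || c >= inCols) return;

    // Stand-ins for the real mapping from the input cell to output coordinates:
    int orow  = r;
    int ocol  = c;
    int value = in[r * inCols + c];

    if (orow >= 0 && orow <= outRowMax && ocol >= 0 && ocol <= outColMax)
        atomicMax(&out[orow * outCols + ocol], value);   // keep the larger value atomically
}
[/codebox]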

I have just tried using atomicExch for testing, but I still get corrupted results.

My question is: does an atomic operation restrict memory access only for threads within a block, or for all threads in any block?

What I have done is:

  • compute the row and col of each cell in the input matrix

    __syncthreads()

  • check that the computed row and column are within the range of the output matrix

    then use atomicExch to write the data

[codebox]

// input matrix is [64][512]; each thread handles one input cell

// compute the output row, column, and value for this cell from the mathematical function
int   row    = ...;   // target row in the output matrix
int   column = ...;   // target column in the output matrix
float value  = ...;   // value computed from input[blockIdx.y][threadIdx.x]

__syncthreads();

// write only if the target lies inside the output matrix
if ((row >= 0) && (row <= row_max) && (column >= 0) && (column <= col_max))
{
    if (outputmatrix[row][column].value < value)               // read and compare...
        atomicExch(&outputmatrix[row][column].value, value);   // ...then exchange
}

[/codebox]
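
Two notes, offered as suggestions rather than a definitive diagnosis. A global atomic operation is atomic with respect to all threads on the device, not just the threads of one block. The likely cause of the corruption, however, is that the comparison and the atomicExch are two separate operations: another thread can update the cell between your read of outputmatrix[row][column].value and your exchange, so the pair as a whole is not atomic. For integer values, atomicMax folds the compare and the write into one step; if the .value field is a float, a common pattern is to build the maximum from atomicCAS, roughly like this:

[codebox]
// Atomic maximum for a float, built from atomicCAS: the compare and the write
// happen as one atomic step, unlike a separate "if (old < value) atomicExch(...)".
__device__ float atomicMaxFloat(float *addr, float value)
{
    int *addr_as_int = (int *)addr;
    int  old = *addr_as_int;
    int  assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= value)        // already at least as large: nothing to do
            break;
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(value));       // swap only if nobody changed it meanwhile
    } while (assumed != old);
    return __int_as_float(old);
}
[/codebox]

With that helper, the bounds check stays as it is and the body of the if becomes a single call, e.g. atomicMaxFloat(&outputmatrix[row][column].value, value).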