Several threads writing to the same position: superposing contributions at that position.

Hello everyone,

I’m working on the following problem:

I have two matrices (images), A and B, where each pixel of matrix B can affect several pixels of matrix A. A is the resulting image.

When I launch a kernel, each thread performs an operation for one pixel of B, and the result affects the corresponding pixels of A. The problem is that a pixel of A can receive contributions from several pixels of B. When this happens, the contributions do not accumulate: each thread, when it finishes its operation, writes its result to the corresponding pixel of A, overwriting the current value at that pixel (an expected result, since all the threads are working in parallel).

The aim is for the contributions of all the threads to be taken into account when computing matrix A.

I know this problem is inherent to the nature of a parallel process, but I think it should somehow be possible to solve, perhaps using shared memory or some ordering of the writes… Does anybody have an idea?

Thanks.

Hi,
I had a very similar problem, and I will explain it.

An array B where each cell holds an elevation on the side of a hill, and water may flow from it to any of its 8 neighbouring cells.
The required output is a second array A, with each cell containing information about which cells flow into it.
The problem is that several cells may all flow into the same downslope cell, so it exhibits the same issue you have.
e.g. A[n,m], A[n,m+1], A[n,m+2] and A[n+1,m] might all flow into A[n+1,m+1]

Some ways of overcoming this:

  1. Use a shared array for the downslope cells and do ‘atomic’ updates on it; once all the atomic updates are done, copy the result back to the global array. The problem is that atomic updates are expensive. (You could also do atomic updates directly on the global array.)

  2. Have each thread calculate which cell(s) it will flow to and store that in a shared array (non-atomically),
    then have each thread read its neighbours’ results from the shared array to see which flow into it,
    and then update the global array.
    NB: use 8 bits, with each bit indicating a flow to a different neighbour.

  3. Ensure that two cells (threads) never write to the same cell at the same time by staggering the times at which cells do their updates,
    i.e. have cell [n,m] do all its updates, then cell [n,m+1], etc.
    This implies a bit of code divergence; however, if the updating of neighbours is very fast that’s OK, and there is a more complicated design that reduces that problem as well.

  4. Have each cell pretend it is each of its 8 neighbours in turn, do the calculations of what would happen at that cell,
    and then update its own cell accordingly.

Another way of thinking of these is that:

  1. pushes data to its neighbours, using atomic updates to ensure no two threads push to the same cell at the same time.
  2. pulls data from its neighbours, with each cell doing its own calculations and a shared array storing those results.
  3. pushes data to its neighbours, using a fixed order to ensure no two threads push at the same time.
  4. pulls data from its neighbours by doing the calculations each neighbour would.

I like the ones that pull the data.
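
To make the push-vs-pull distinction concrete, here is a minimal CUDA sketch of approaches 1 and 4 applied to the original image problem. The 3×3 footprint, the uniform 0.1f weight, and the kernel names scatterAtomic/gatherPull are all hypothetical placeholders for the real operation:

```cuda
// Approach 1 (push): each thread scatters its B pixel's contribution to
// the affected pixels of A with atomicAdd, so concurrent writes to the
// same pixel accumulate instead of overwriting each other.
__global__ void scatterAtomic(const float *B, float *A, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float v = B[y * width + x];
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            atomicAdd(&A[ny * width + nx], v * 0.1f);  // placeholder weight
        }
}

// Approach 4 (pull): each thread redoes the calculation its neighbours
// would have done and keeps only the part that lands on its own pixel.
// Each thread writes exactly one pixel of A, so no conflicts can occur.
__global__ void gatherPull(const float *B, float *A, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            sum += B[ny * width + nx] * 0.1f;  // neighbour's contribution here
        }
    A[y * width + x] = sum;  // single uncontended write
}
```

The pull version trades some redundant computation for conflict-free writes, which is often the better deal on a GPU.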

Besides kbam’s neat answer: is the problem perhaps similar to a convolution? If so, there is a useful SDK example to work from.

Thanks a lot, kbam, for your reply…

I like the ones that pull the data too. Looking at the atomic functions, I see that they mostly operate on integer data types, but in this case atomicAdd() is just fine for what I need. The question is:

Is there a limit on the data size when using atomic functions? i.e., will an array of floats of size 2048×2048 work fine with this function?

Thanks again for your help.

Thanks, jan.heckman.

I wouldn’t say this problem is similar to a convolution… but just in case, what’s the name of the SDK example?

I think atomic functions will work with far, far larger arrays than that; I’m not sure there is an upper limit, if any.
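
For what it’s worth, each atomic operates on a single 32-bit word, so the size of the surrounding array should not matter to the atomic itself. One caveat for float: native atomicAdd(float*, float) requires compute capability 2.0 or later. On older devices it can be emulated with the well-known atomicCAS pattern from the CUDA programming guide (a sketch, with a hypothetical function name):

```cuda
// Emulated float atomicAdd for pre-Fermi (compute capability < 2.0)
// devices, built on the 32-bit integer atomicCAS. The loop retries
// until no other thread has modified the word in the meantime.
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);  // another thread got in first: retry
    return __int_as_float(old);
}
```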

ConvolutionSeparable. The ‘separable’ refers to the fact that the 2D convolution kernel can be factored into a row filter and a column filter, which reduces the complexity and improves performance.

Replacing the row and column kernels with a single block kernel should be easy, and there are many other helpful aspects.

The example is documented and discusses the optimization steps.
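
The separable trick itself is simple: when the 2D kernel is the outer product of a row filter and a column filter, running the two 1D passes in sequence gives the same result with O(2k) instead of O(k²) taps per pixel. A minimal sketch of the row pass, without the SDK example’s shared-memory tiling (RADIUS, d_h, and the clamped border handling are assumptions):

```cuda
#define RADIUS 2

__constant__ float d_h[2 * RADIUS + 1];  // 1D horizontal filter taps

// Row pass of a separable convolution: each thread filters one pixel
// along x. A column pass stepping in y with its own taps completes
// the 2D convolution.
__global__ void convRows(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xx = min(max(x + k, 0), width - 1);  // clamp at the image border
        sum += in[y * width + xx] * d_h[k + RADIUS];
    }
    out[y * width + x] = sum;
}
```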
