Several threads writing to the same position: superposing contributions at that position.

Hello everyone,

I’m working on the following problem:

I have two matrices (images), A and B, where each pixel of matrix B can affect several pixels of matrix A. A is the resulting image.

When I launch a kernel, each thread performs an operation for one pixel of B, and the result affects the corresponding pixels of A. The problem is that a pixel of A can receive contributions from several pixels of B. When this happens, the contributions do not accumulate: each thread, when it finishes its operation, writes its result to the corresponding pixel of A, overwriting the current value at that pixel (an expected result, since all the threads are working in parallel).

The aim is for the contributions of all the threads to be taken into account when computing matrix A.

I know this problem is inherent to the nature of a parallel process, but I think it should somehow be possible to solve, perhaps using shared memory or some ordering of the writes… Does anybody have an idea?

Thanks.

Hi,
I had a very similar problem, and I will explain it.

An array B where each cell holds an elevation on the side of a hill, and water may flow from it to any of its 8 neighbouring cells.
The required output is a second array A, with each cell containing information about which cells flow into it.
The problem is that several cells may all flow into the same downslope cell, so it exhibits the same issue you have.
e.g. A[n,m], A[n,m+1], A[n,m+2] and A[n+1,m] might all flow into A[n+1,m+1]

Some ways of overcoming this:

  1. Use a shared array for the downslope cells and do ‘atomic’ updates on it; once all the atomic updates are done, copy the result back to the global array. The problem is that atomic updates are expensive. (You could also do atomic updates directly on the global array.)

  2. Have each thread calculate which cell(s) it will flow to and store that in a shared array (non-atomically),
    then have each thread read its neighbours’ results from the shared array to see which flow into it,
    and then update the global array.
    NB: use 8 bits, with each bit indicating a flow to a different neighbour.

  3. Ensure that two cells (threads) never write to the same cell at the same time by staggering the times at which cells do their updates,
    i.e. have cell [n,m] do all its updates, then cell [n,m+1], etc.
    This implies a bit of code divergence; however, if the updating of neighbours is very fast that’s OK, and there is a more complicated design that reduces that problem as well.

  4. Have each cell pretend it is each of its 8 neighbours in turn, do the calculations of what would happen at that cell,
    and then update its own cell accordingly.

Another way of thinking of these is that:

  1. pushes data to its neighbours, using atomic updates to ensure no two threads push to the same cell at the same time.
  2. pulls data from its neighbours, with each cell doing its own calculations and a shared array storing those results.
  3. pushes data to its neighbours, using a fixed order to ensure no two threads push at the same time.
  4. pulls data from its neighbours by doing the calculations each neighbour would.

I like the ones that pull the data.
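
To make the push-vs-pull distinction concrete, here is a minimal CUDA sketch of approaches 1 and 4 applied to the original image problem. The 3×3 footprint, the uniform 0.1f weight, and the kernel names scatterAtomic/gatherPull are all hypothetical placeholders for the real operation:

```cuda
// Approach 1 (push): each thread scatters its B pixel's contribution to
// the affected pixels of A with atomicAdd, so concurrent writes to the
// same pixel accumulate instead of overwriting each other.
__global__ void scatterAtomic(const float *B, float *A, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float v = B[y * width + x];
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            atomicAdd(&A[ny * width + nx], v * 0.1f);  // placeholder weight
        }
}

// Approach 4 (pull): each thread redoes the calculation its neighbours
// would have done and keeps only the part that lands on its own pixel.
// Each thread writes exactly one pixel of A, so no conflicts can occur.
__global__ void gatherPull(const float *B, float *A, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            sum += B[ny * width + nx] * 0.1f;  // neighbour's contribution here
        }
    A[y * width + x] = sum;  // single uncontended write
}
```

The pull version trades some redundant computation for conflict-free writes, which is often the better deal on a GPU.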

Besides kbam’s neat answer: is the problem perhaps similar to a convolution? If so, there is a useful SDK example to work from.

Thanks a lot, kbam, for your reply…

I like the ones that pull the data too. Looking at the atomic functions, I see that they mostly operate on integer data types, but in this case atomicAdd() is just fine for what I need. The question is:

Is there a limit on the data size when using atomic functions? i.e., will an array of floats of size 2048×2048 work fine with this function?

Thanks again for your help.

Thanks, jan.heckman.

I wouldn’t say this problem is similar to a convolution… but just in case, what’s the name of the SDK example?

I think atomic functions will work with far, far larger arrays than that; I’m not sure there is an upper limit, if any.
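
For what it’s worth, each atomic operates on a single 32-bit word, so the size of the surrounding array should not matter to the atomic itself. One caveat for float: native atomicAdd(float*, float) requires compute capability 2.0 or later. On older devices it can be emulated with the well-known atomicCAS pattern from the CUDA programming guide (a sketch, with a hypothetical function name):

```cuda
// Emulated float atomicAdd for pre-Fermi (compute capability < 2.0)
// devices, built on the 32-bit integer atomicCAS. The loop retries
// until no other thread has modified the word in the meantime.
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);  // another thread got in first: retry
    return __int_as_float(old);
}
```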

ConvolutionSeparable. The ‘separable’ refers to the fact that the 2D convolution kernel can be factored into a row filter and a column filter, which reduces the complexity and improves performance.

Replacing the row and column kernels with a single block kernel should be easy, and there are many other helpful aspects.

The example is documented and discusses the optimization steps.
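
The separable trick itself is simple: when the 2D kernel is the outer product of a row filter and a column filter, running the two 1D passes in sequence gives the same result with O(2k) instead of O(k²) taps per pixel. A minimal sketch of the row pass, without the SDK example’s shared-memory tiling (RADIUS, d_h, and the clamped border handling are assumptions):

```cuda
#define RADIUS 2

__constant__ float d_h[2 * RADIUS + 1];  // 1D horizontal filter taps

// Row pass of a separable convolution: each thread filters one pixel
// along x. A column pass stepping in y with its own taps completes
// the 2D convolution.
__global__ void convRows(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xx = min(max(x + k, 0), width - 1);  // clamp at the image border
        sum += in[y * width + xx] * d_h[k + RADIUS];
    }
    out[y * width + x] = sum;
}
```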
