Several threads writing to the same position (superposition at that position)

Hello everyone,

I’m working on the following problem:

I have two matrices (images), A and B, where each pixel of matrix B can affect several pixels of matrix A. A is the resulting image.

When I call the kernel, each thread is responsible for carrying out an operation for one pixel of B, and the result affects the corresponding pixels of A. The problem is that a pixel of A can receive contributions from several pixels of B, and when this occurs the contributions do not accumulate: when each thread finishes its operation, it writes its result to the corresponding pixel of A, overwriting whatever value is already there (an obvious result, since all the threads are working in parallel).

The aim is that all the contributions of all the threads are taken into account in the calculation of the matrix A.

I know this problem is inherent to the nature of the process (a parallel process), but I think there should be some way to make it work, using shared memory, some latency trick… Does anybody have an idea?

Thanks.

Either you can get each thread to handle several pixels in such a way that there is no overlap (and hopefully you still get some sort of coalescing), or you need to use atomic operations (assuming they exist on the hardware you are using).
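
For example, here is a minimal sketch of the atomic scatter approach. It assumes float images and a compute 2.x card (where atomicAdd() works on floats); the kernel name, the 2x2 footprint, and the 0.25f weights are made-up placeholders for whatever your real B-to-A mapping is:

[codebox]
// Scatter: one thread per B pixel; overlapping writes to A are made
// safe by atomicAdd, so contributions accumulate instead of overwriting.
__global__ void scatter_kernel(const float *B, float *A,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float contribution = B[y * width + x];

    // Hypothetical footprint: each B pixel affects a 2x2 patch of A.
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            int ax = min(x + dx, width - 1);
            int ay = min(y + dy, height - 1);
            // The read-modify-write is indivisible, so no update is lost.
            atomicAdd(&A[ay * width + ax], 0.25f * contribution);
        }
}
[/codebox]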

Thanks for your answer, I really needed it…

While I was reading the information about atomic functions in the programming guide, I realized that these functions only work with data of type unsigned int or int (except for atomicAdd()). Is this true, or am I out of date?

Thanks again.

It depends on your compute capability and the type of memory (which would probably be global in this case).

I don’t have the specs on me at the moment; I’ll update tomorrow unless someone beats me to it.

Section B.11 of the CUDA C Programming Guide:

[codebox]An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. For example, atomicAdd() reads a 32-bit word at some address in global or shared memory, adds a number to it, and writes the result back to the same address. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads. In other words, no other thread can access this address until the operation is complete.

Atomic operations only work with signed and unsigned integers with the exception of atomicAdd() for devices of compute capability 2.x and atomicExch() for all devices, which also work for single-precision floating-point numbers.

Atomic functions can only be used in device functions and are only available for devices of compute capability 1.1 and above.

Atomic functions operating on shared memory and atomic functions operating on 64-bit words are only available for devices of compute capability 1.2 and above.

Atomic functions operating on 64-bit words in shared memory are only available for devices of compute capability 2.x and higher.

Atomic functions operating on mapped page-locked memory (Section 3.2.6.3) are not atomic from the point of view of the host or other devices.[/codebox]

I think I used atomicCAS() to implement atomic updates (the swap only happens if the value currently in memory matches the expected one), but it was slow. If you need speed, you should probably avoid atomics.
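
For reference, the usual compare-and-swap retry loop for a float add looks something like this (a sketch of the standard pattern, not necessarily the exact code I used):

[codebox]
// Emulate atomicAdd on float with atomicCAS: reinterpret the 32-bit word
// as int, compute the new value, and swap it in only if nobody else
// changed the word in the meantime; otherwise retry with the fresh value.
__device__ float atomicAddViaCAS(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);  // previous value, like atomicAdd
}
[/codebox]

Each failed iteration means another thread got in first, which is exactly where the serialization (and the slowness) comes from.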

The speed problem with atomic functions is that one thread waits for another to finish, so the process becomes partially serial. But in my case this serialization is not very significant, so I can use it.

Thanks for your contributions…

Looking at the headers, you have these two:

[codebox]

/usr/local/cuda/include/sm_20_atomic_functions.h:66:static __inline__ __device__ float atomicAdd(float *address, float val)

/usr/local/cuda/include/sm_11_atomic_functions.h:132:static __inline__ __device__ float atomicExch(float *address, float val)

[/codebox]

So all cards except compute capability 1.0 have atomicExch() for floats; Fermi also has atomicAdd(). It won’t be efficient, but you can implement an atomic add using atomicExch() and a loop, as sketched below.
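
Something like this (a sketch; it assumes the accumulator cell is only ever updated through this function, since the cell transiently holds 0.0f between the two exchanges):

[codebox]
// Emulate a float atomicAdd with atomicExch: swap the cell out, fold in
// our pending contribution, and deposit the sum. If another thread
// deposited a value between our two exchanges, we get it back from the
// second exchange and loop to re-deposit it.
__device__ void atomicFloatAdd(float *address, float val)
{
    float pending = val;
    do {
        float sum = atomicExch(address, 0.0f) + pending;
        pending = atomicExch(address, sum);
    } while (pending != 0.0f);
}
[/codebox]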

Another option that may be relevant is to do some sort of reduction.
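
In the simplest case the reduction can be a gather: invert the mapping so that each thread owns exactly one pixel of A and sums every B pixel that contributes to it. Since no two threads write the same output, no atomics are needed at all. A sketch, reusing the hypothetical 2x2 footprint from above:

[codebox]
// Gather: one thread per A pixel; each thread privately reduces the
// B pixels whose (hypothetical) 2x2 footprint covers its output pixel.
__global__ void gather_kernel(const float *B, float *A,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            int bx = max(x - dx, 0);
            int by = max(y - dy, 0);
            sum += 0.25f * B[by * width + bx];
        }
    A[y * width + x] = sum;  // single writer per pixel: no race
}
[/codebox]

Whether this is feasible depends on how cheap it is to invert your B-to-A mapping.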

Thanks, I’m gonna implement it right now…

Be careful with “atomicExch” and a loop ---- it can lead you to “deadlock”.
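
Presumably (my interpretation) the danger is with lock-style loops, like this sketch with a hypothetical lock variable:

[codebox]
// DANGEROUS pattern: a spinlock built on atomicExch. On SIMT hardware,
// if one thread of a warp acquires the lock, the other threads of the
// same warp can keep spinning while the owner is never scheduled to
// reach the release, and the kernel hangs.
__device__ int lock = 0;  // 0 = free, 1 = held

__device__ void unsafe_locked_add(float *address, float val)
{
    while (atomicExch(&lock, 1) != 0)
        ;                        // lanes in the owner's warp spin forever
    *address += val;             // critical section
    atomicExch(&lock, 0);        // release
}
[/codebox]

The exchange-and-retry add sketched earlier holds no lock, so it should not be subject to this particular failure, though it can still crawl under heavy contention.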

Hi Sarnath,

What do you mean by “deadlock”?
