Several threads writing to the same position (superposition at that position)

Hello everyone,

I’m working on the following problem:

I have two matrices (images), A and B, where each pixel of matrix B can affect several pixels of matrix A. A is the resulting image.

When I call the kernel, each thread is responsible for carrying out an operation for one pixel of B, and the result affects the corresponding pixels of A. The problem is that a pixel of A can receive contributions from several pixels of B, and when this occurs the contributions do not accumulate: when each thread finishes its operation, it writes its result to the corresponding pixel of A, overwriting whatever value is already there (an obvious result, since all the threads are working in parallel).

The aim is that all the contributions of all the threads are taken into account in the calculation of the matrix A.

I know this problem is inherent to the nature of the process (a parallel process), but I think there should be some way to make it work, using shared memory, some latency trick… Does anybody have an idea?

Thanks.

Either you can get each thread to handle several pixels in such a way that there is no overlap (and hopefully you still get some sort of coalescing), or you need to use atomic operations (assuming they exist on the hardware you are using).
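
For example, here is a minimal sketch of the atomic scatter approach. It assumes float images and a compute 2.x card (where atomicAdd() works on floats); the kernel name, the 2x2 footprint, and the 0.25f weights are made-up placeholders for whatever your real B-to-A mapping is:

[codebox]
// Scatter: one thread per B pixel; overlapping writes to A are made
// safe by atomicAdd, so contributions accumulate instead of overwriting.
__global__ void scatter_kernel(const float *B, float *A,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float contribution = B[y * width + x];

    // Hypothetical footprint: each B pixel affects a 2x2 patch of A.
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            int ax = min(x + dx, width - 1);
            int ay = min(y + dy, height - 1);
            // The read-modify-write is indivisible, so no update is lost.
            atomicAdd(&A[ay * width + ax], 0.25f * contribution);
        }
}
[/codebox]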

Thanks for your answer, I really needed it…

While I was reading the information about atomic functions in the programming guide, I realized that these functions only work with data of type unsigned int or int (except for atomicAdd()). Is this true, or am I out of date?

Thanks again.

It depends on your compute capability and the type of memory (which would probably be global in this case).

I don’t have the specs on me at the moment; I’ll update tomorrow unless someone beats me to it.

Section B.11 of the CUDA C Programming Guide:

[codebox]An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. For example, atomicAdd() reads a 32-bit word at some address in global or shared memory, adds a number to it, and writes the result back to the same address. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads. In other words, no other thread can access this address until the operation is complete.

Atomic operations only work with signed and unsigned integers with the exception of atomicAdd() for devices of compute capability 2.x and atomicExch() for all devices, which also work for single-precision floating-point numbers.

Atomic functions can only be used in device functions and are only available for devices of compute capability 1.1 and above.

Atomic functions operating on shared memory and atomic functions operating on 64-bit words are only available for devices of compute capability 1.2 and above.

Atomic functions operating on 64-bit words in shared memory are only available for devices of compute capability 2.x and higher.

Atomic functions operating on mapped page-locked memory (Section 3.2.6.3) are not atomic from the point of view of the host or other devices.[/codebox]

I think I used atomicCAS() to implement atomic updates (the swap only happens if the value currently in memory matches the expected one), but it was slow. If you need speed, you should probably avoid atomics.
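
For reference, the usual compare-and-swap retry loop for a float add looks something like this (a sketch of the standard pattern, not necessarily the exact code I used):

[codebox]
// Emulate atomicAdd on float with atomicCAS: reinterpret the 32-bit word
// as int, compute the new value, and swap it in only if nobody else
// changed the word in the meantime; otherwise retry with the fresh value.
__device__ float atomicAddViaCAS(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);  // previous value, like atomicAdd
}
[/codebox]

Each failed iteration means another thread got in first, which is exactly where the serialization (and the slowness) comes from.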

The speed problem with atomic functions is that one thread waits for another to finish, so the process becomes partially serial. But in my case this serialization is not very significant, so I can use it.

Thanks for your contributions…

Looking at the headers, you have these two:

[codebox]

/usr/local/cuda/include/sm_20_atomic_functions.h:66:static __inline__ __device__ float atomicAdd(float *address, float val)

/usr/local/cuda/include/sm_11_atomic_functions.h:132:static __inline__ __device__ float atomicExch(float *address, float val)

[/codebox]

So all cards except compute capability 1.0 have atomicExch() for floats; Fermi also has atomicAdd(). It won’t be efficient, but you can implement an atomic add using atomicExch() and a loop, as sketched below.
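
Something like this (a sketch; it assumes the accumulator cell is only ever updated through this function, since the cell transiently holds 0.0f between the two exchanges):

[codebox]
// Emulate a float atomicAdd with atomicExch: swap the cell out, fold in
// our pending contribution, and deposit the sum. If another thread
// deposited a value between our two exchanges, we get it back from the
// second exchange and loop to re-deposit it.
__device__ void atomicFloatAdd(float *address, float val)
{
    float pending = val;
    do {
        float sum = atomicExch(address, 0.0f) + pending;
        pending = atomicExch(address, sum);
    } while (pending != 0.0f);
}
[/codebox]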

Another option that may be relevant is to do some sort of reduction.
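
In the simplest case the reduction can be a gather: invert the mapping so that each thread owns exactly one pixel of A and sums every B pixel that contributes to it. Since no two threads write the same output, no atomics are needed at all. A sketch, reusing the hypothetical 2x2 footprint from above:

[codebox]
// Gather: one thread per A pixel; each thread privately reduces the
// B pixels whose (hypothetical) 2x2 footprint covers its output pixel.
__global__ void gather_kernel(const float *B, float *A,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            int bx = max(x - dx, 0);
            int by = max(y - dy, 0);
            sum += 0.25f * B[by * width + bx];
        }
    A[y * width + x] = sum;  // single writer per pixel: no race
}
[/codebox]

Whether this is feasible depends on how cheap it is to invert your B-to-A mapping.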

Thanks, I’m gonna implement it right now…

Be careful with “atomicExch” and a loop ---- it can lead you to “deadlock”.
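
Presumably (my interpretation) the danger is with lock-style loops, like this sketch with a hypothetical lock variable:

[codebox]
// DANGEROUS pattern: a spinlock built on atomicExch. On SIMT hardware,
// if one thread of a warp acquires the lock, the other threads of the
// same warp can keep spinning while the owner is never scheduled to
// reach the release, and the kernel hangs.
__device__ int lock = 0;  // 0 = free, 1 = held

__device__ void unsafe_locked_add(float *address, float val)
{
    while (atomicExch(&lock, 1) != 0)
        ;                        // lanes in the owner's warp spin forever
    *address += val;             // critical section
    atomicExch(&lock, 0);        // release
}
[/codebox]

The exchange-and-retry add sketched earlier holds no lock, so it should not be subject to this particular failure, though it can still crawl under heavy contention.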

Hi Sarnath,

What do you mean by “deadlock”?
