Atomic operations for multi-GPU: is it possible to do that?

I have an algorithm that uses atomic operations. I have more than one device (one 295GTX, to be exact). Can I do atomic operations across two GPUs? Does anyone have experience with this? Atomic operations are really convenient.

Oh, good question.

As far as I understand, atomic operations cannot be used across multiple GPUs.

But you can use atomic operations on each GPU separately and then copy the data back to the host.

Finally, host functions can manipulate that data.

A trick that might work is to use CUDA 2.2+'s powerful zero-copy memory, which does work with atomics.

Now, the first question is whether you can specify the same range of host memory as zero-copy for more than one CUDA context at once. And even if you can, you would need to actually try it to see whether the atomic support works across multiple GPUs.

It would be painfully inefficient, but for some very rare work-allocation scheme it might be useful.
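As a concrete starting point, here is a hedged, untested sketch of what the experiment above might look like. It assumes `cudaHostAllocPortable` makes the one pinned allocation visible to both contexts and uses the modern runtime API so a single host thread can drive both halves of a GTX 295 (in the CUDA 2.2 era you would need one host thread per context instead). Whether the `atomicAdd` calls from the two devices remain atomic with respect to *each other* is exactly the open question in this thread.

```cuda
// Hedged sketch (untested): map one pinned host buffer into two devices
// via zero-copy and have both atomically bump the same counter.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void bump(unsigned int *counter)
{
    // One atomic increment per block, aimed at the mapped host buffer.
    if (threadIdx.x == 0)
        atomicAdd(counter, 1u);
}

int main(void)
{
    unsigned int *h_counter = NULL;

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaSetDeviceFlags(cudaDeviceMapHost);  // before the context does real work
    }

    // Portable: the pinned range is usable from every context in the process.
    cudaHostAlloc((void **)&h_counter, sizeof(*h_counter),
                  cudaHostAllocMapped | cudaHostAllocPortable);
    *h_counter = 0;

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        unsigned int *d_counter = NULL;
        cudaHostGetDevicePointer((void **)&d_counter, h_counter, 0);
        bump<<<4, 32>>>(d_counter);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    // If cross-device atomicity held, this would print 8; if not, anything less.
    printf("counter = %u\n", *h_counter);
    return 0;
}
```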

Atomics work with a single GPU only.

Lord Tim dashes our hopes! But where’s the evil chortle?

Well, at least for 3 full minutes after posting, I was able to be happy thinking about how I might use multi-GPU atomics for work queue coordination…

my job description is actually “crusher of hopes and dreams”

as far as I know there’s no way to do this.

Strange that someone has this requirement… But you can still work it out…

If you can (?) map the same pinned host memory to multiple GPU devices (this should be possible), you could have the CPU broker between the multiple GPUs.

You just need a per-GPU request and response queue for each shared resource, and a dedicated CPU thread to monitor the request locations.

The GPUs would place their requests for a resource. The CPU thread monitoring the memory locations would arbitrate between the multiple GPUs and grant them access to the resource…

The GPUs have to spin on a location… one thread per GPU should do it…

If all the threads do it, your PCIe bus will be jammed…

I can't vouch that this will work, but one can at least try it out, especially if you are up against a roadblock.
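The broker idea above might be sketched roughly like this. All names (`Mailbox`, `use_resource`, `broker`) are illustrative, and the whole thing is untested. Note that `__threadfence_system()` requires compute capability 2.0; on older parts the ordering guarantees over PCIe are weaker, which is part of the risk tmurray's quote below alludes to.

```cuda
// Hedged sketch (untested) of the CPU-broker protocol: one spinning
// thread per GPU raises a request flag in zero-copy memory, and a host
// thread arbitrates, granting one GPU at a time.
#include <cuda_runtime.h>

struct Mailbox {              // one per GPU, living in pinned zero-copy memory
    volatile int request;     // GPU sets to 1 to ask for the shared resource
    volatile int grant;       // CPU sets to 1 to hand the resource over
};

__global__ void use_resource(Mailbox *box)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {  // exactly ONE spinning thread,
        box->request = 1;                        // as suggested above
        __threadfence_system();                  // push the write out to host RAM
        while (box->grant == 0)                  // spin until the CPU grants;
            ;                                    // many spinners would jam PCIe
        /* ... critical section: touch the shared resource ... */
        box->request = 0;                        // release
        __threadfence_system();
    }
}

// Host-side arbiter thread: polls the mailboxes and serializes access.
void broker(Mailbox *boxes, int ngpus)
{
    for (;;) {
        for (int i = 0; i < ngpus; ++i) {
            if (boxes[i].request) {
                boxes[i].grant = 1;              // grant the resource
                while (boxes[i].request)         // wait for the GPU to release
                    ;
                boxes[i].grant = 0;              // revoke before moving on
            }
        }
    }
}
```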

Thank you for the suggestion. I’ll try it and get back to you with the results later if I am able to.

To my understanding, zero-copy is some automatic synchronization of GPU and CPU memory. Since using two GPUs requires two threads, it seems zero-copy might not work, based on what tmurray said in that thread:

“The only thing we guarantee is that if you write to a PCIe location in one thread and read it later from that same thread, you’ll see the updated value.”

Thanks, Sarnath. I asked this question because I want my program to scale across multiple GPUs while taking advantage of atomic operations. Your method looks a little complicated, but changing the algorithm might be the way to go.

And thanks to Quoc Vinh and tmurray too.

Zero-copy is NOT what you describe. Zero-copy is a way for the GPU to access host RAM directly. The application allocates a virtual address range in its address space (like malloc, but one that is guaranteed to be physically contiguous and pinned, i.e. the OS will NOT swap it out). Capable devices (that is, kernels running on them) can write to that system RAM directly, and when a device writes, the application sees the update through that VA range. Since the virtual address space is a property of the process, NOT of individual threads, there is no problem here: just as CPU threads can share C global variables, this will work fine.

Now, if one GPU can write, the other can also write, so there won't be any issue with the CPU failing to see values written by multiple GPUs. Intel CPUs are cache coherent.

It will definitely be a tough thing to make work, but I believe it should work.
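To illustrate the "VA belongs to the process, not the thread" point above: here is a hedged, untested sketch of the CUDA 2.2-era arrangement, where each GPU gets its own host thread and context, yet both hand the SAME portable pinned buffer to their device. All structure beyond the documented `cudaHostAlloc` / `cudaHostGetDevicePointer` calls is assumption.

```cuda
// Hedged sketch (untested): two host threads, one context per GPU,
// sharing a single portable pinned allocation through process-wide VA.
#include <cuda_runtime.h>
#include <pthread.h>

static unsigned int *h_buf;  // C global: visible to both worker threads

static void *worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);                        // this thread's own context
    cudaSetDeviceFlags(cudaDeviceMapHost);
    unsigned int *d_buf = NULL;
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);
    /* ... launch kernels that read/write d_buf ... */
    cudaThreadSynchronize();                   // the CUDA 2.x-era sync call
    return NULL;
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_buf, 4096,       // portable: valid in every context
                  cudaHostAllocMapped | cudaHostAllocPortable);

    pthread_t t[2];
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```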


Atomics across GPUs do not sound like a good idea. If you can avoid them, you always should. Maybe there is a smarter workaround.