Atomic operations for multi-GPU: is it possible to do that?

I have an algorithm that uses atomic operations. I have more than one device (one 295GTX, to be exact). Can I do atomic operations across two GPUs? Does anyone have experience with this? Atomic operations are really convenient.

Oh, good question.

As far as I understand, atomic operations cannot be used across multiple GPUs.

But you can use atomic operations on each GPU separately and then copy the data back to the host.

Finally, host functions can manipulate that data.

A trick that might work is to use CUDA 2.2+'s powerful zero-copy memory, which does work with atomics.

Now, the first question is whether you can specify the same range of host memory as zero-copy for more than one CUDA context at once. And even if you can, you would need to actually try it to see whether the atomic support works across multiple GPUs.

It would be painfully inefficient, but for some very rare work-allocation scheme it might be useful.
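As a concrete starting point, here is a hedged, untested sketch of what the experiment above might look like. It assumes `cudaHostAllocPortable` makes the one pinned allocation visible to both contexts and uses the modern runtime API so a single host thread can drive both halves of a GTX 295 (in the CUDA 2.2 era you would need one host thread per context instead). Whether the `atomicAdd` calls from the two devices remain atomic with respect to *each other* is exactly the open question in this thread.

```cuda
// Hedged sketch (untested): map one pinned host buffer into two devices
// via zero-copy and have both atomically bump the same counter.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void bump(unsigned int *counter)
{
    // One atomic increment per block, aimed at the mapped host buffer.
    if (threadIdx.x == 0)
        atomicAdd(counter, 1u);
}

int main(void)
{
    unsigned int *h_counter = NULL;

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaSetDeviceFlags(cudaDeviceMapHost);  // before the context does real work
    }

    // Portable: the pinned range is usable from every context in the process.
    cudaHostAlloc((void **)&h_counter, sizeof(*h_counter),
                  cudaHostAllocMapped | cudaHostAllocPortable);
    *h_counter = 0;

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        unsigned int *d_counter = NULL;
        cudaHostGetDevicePointer((void **)&d_counter, h_counter, 0);
        bump<<<4, 32>>>(d_counter);
    }
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    // If cross-device atomicity held, this would print 8; if not, anything less.
    printf("counter = %u\n", *h_counter);
    return 0;
}
```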

Atomics work with a single GPU only.

Lord Tim dashes our hopes! But where’s the evil chortle?

Well, at least for 3 full minutes after posting, I was able to be happy thinking about how I might use multi-GPU atomics for work queue coordination…

my job description is actually “crusher of hopes and dreams”

as far as I know there’s no way to do this.

Strange that someone has this requirement… But you can still work it out…

If you can (?) map the same pinned host memory to multiple GPU devices (this should be possible), you could have the CPU broker between the multiple GPUs.

You just need a per-GPU request and response queue for each shared resource, and a dedicated CPU thread to monitor the request locations.

The GPUs would place their requests for a resource. The CPU thread monitoring the memory locations would arbitrate between the multiple GPUs and grant them access to the resource…

The GPUs have to spin on a location… one thread per GPU should do it…

If all the threads do it, your PCIe bus will be jammed…

I can't vouch that this will work, but one can at least try it out, especially if you are up against a roadblock.
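The broker idea above might be sketched roughly like this. All names (`Mailbox`, `use_resource`, `broker`) are illustrative, and the whole thing is untested. Note that `__threadfence_system()` requires compute capability 2.0; on older parts the ordering guarantees over PCIe are weaker, which is part of the risk tmurray's quote below alludes to.

```cuda
// Hedged sketch (untested) of the CPU-broker protocol: one spinning
// thread per GPU raises a request flag in zero-copy memory, and a host
// thread arbitrates, granting one GPU at a time.
#include <cuda_runtime.h>

struct Mailbox {              // one per GPU, living in pinned zero-copy memory
    volatile int request;     // GPU sets to 1 to ask for the shared resource
    volatile int grant;       // CPU sets to 1 to hand the resource over
};

__global__ void use_resource(Mailbox *box)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {  // exactly ONE spinning thread,
        box->request = 1;                        // as suggested above
        __threadfence_system();                  // push the write out to host RAM
        while (box->grant == 0)                  // spin until the CPU grants;
            ;                                    // many spinners would jam PCIe
        /* ... critical section: touch the shared resource ... */
        box->request = 0;                        // release
        __threadfence_system();
    }
}

// Host-side arbiter thread: polls the mailboxes and serializes access.
void broker(Mailbox *boxes, int ngpus)
{
    for (;;) {
        for (int i = 0; i < ngpus; ++i) {
            if (boxes[i].request) {
                boxes[i].grant = 1;              // grant the resource
                while (boxes[i].request)         // wait for the GPU to release
                    ;
                boxes[i].grant = 0;              // revoke before moving on
            }
        }
    }
}
```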

Thank you for the suggestion. I’ll try it and get back to you with the results later if I am able to.

To my understanding, zero-copy is some automatic synchronization of GPU and CPU memory. Since using two GPUs requires two threads, it seems zero-copy might not work, based on what tmurray said in that thread:

“The only thing we guarantee is that if you write to a PCIe location in one thread and read it later from that same thread, you’ll see the updated value.”

Thanks, Sarnath. I asked this question because I want my program to scale across multiple GPUs while taking advantage of atomic operations. Your method looks a little complicated, but changing the algorithm might be the way to go.

And thanks to Quoc Vinh and tmurray too.

Zero-copy is NOT what you describe. Zero-copy is a way for the GPU to access host RAM directly. The application allocates a virtual address range in its address space (like malloc, but one that is guaranteed to be physically contiguous and pinned, i.e. the OS will NOT swap it out). Capable devices (that is, kernels running on them) can write to that system RAM directly, and when a device writes, the application sees the update through that VA range. Since the virtual address space is a property of the process, NOT of individual threads, there is no problem here: just as CPU threads can share C global variables, this will work fine.

Now, if one GPU can write, the other can also write, so there won't be any issue with the CPU failing to see values written by multiple GPUs. Intel CPUs are cache coherent.

It will definitely be a tough thing to make work, but I believe it should work.
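To illustrate the "VA belongs to the process, not the thread" point above: here is a hedged, untested sketch of the CUDA 2.2-era arrangement, where each GPU gets its own host thread and context, yet both hand the SAME portable pinned buffer to their device. All structure beyond the documented `cudaHostAlloc` / `cudaHostGetDevicePointer` calls is assumption.

```cuda
// Hedged sketch (untested): two host threads, one context per GPU,
// sharing a single portable pinned allocation through process-wide VA.
#include <cuda_runtime.h>
#include <pthread.h>

static unsigned int *h_buf;  // C global: visible to both worker threads

static void *worker(void *arg)
{
    int dev = (int)(long)arg;
    cudaSetDevice(dev);                        // this thread's own context
    cudaSetDeviceFlags(cudaDeviceMapHost);
    unsigned int *d_buf = NULL;
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);
    /* ... launch kernels that read/write d_buf ... */
    cudaThreadSynchronize();                   // the CUDA 2.x-era sync call
    return NULL;
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_buf, 4096,       // portable: valid in every context
                  cudaHostAllocMapped | cudaHostAllocPortable);

    pthread_t t[2];
    for (long i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```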


Atomics across GPUs do not sound like a good idea. If you can avoid them, you always should. Maybe there is a smarter workaround.