Question about multi-GPU programming: Memory accesses and sharing

jack · January 10, 2009, 8:29pm

I’m thinking about getting one of the nifty new GTX295 cards when I build a new dev box this summer, so I’ve been looking at some of the multi-GPU examples. In the CUDA driver API, there are functions to get the properties of a device (compute capability, total memory, etc.). If there were two cards in the machine, then each would return its own memory count, etc. When there are two GPUs on one card (a la GTX295), is the memory split in half, or can they each access the entirety of the memory on the card? If it is the latter, is the driver API “smart” enough to know when memory is allocated for a CUDA application running on one of the GPUs so that it is not re-allocated and overwritten by a program running on the second GPU? What about memory transfers from the card to the host?

tmurray · January 10, 2009, 8:37pm

In CUDA terms, the GTX 295 is two cards with 240 SPs and 896 MB of memory each, period. There is no fancy magic on that card to help you make a multi-GPU app.

jack · January 10, 2009, 8:52pm

Ah ok, thanks. I was hoping that’s how it would be, versus allowing each GPU to access the entire memory. That makes things a lot simpler.

SPWorley · January 10, 2009, 10:56pm

What we all want is “backchannel” communication between GPUs though… for a 295, it’d be cool to have a device-to-device copy over its own private oncard link. (Such hardware connections may not exist on the 295 though… SLI is not a high bandwidth connection) Much more practical and possible is device-to-device copies over the PCIE bus without the CPU’s involvement… just queue up the transfer in a stream and let the exchange happen at the driver level.
These are not new feature requests… but as soon as you start programming multiGPU you immediately think about this ability.

It’s likely that the device to device PCIE bus transfer ability is possible now with today’s hardware, but would need significant new driver and CUDA API support. Such updates aren’t trivial of course but I think all of us multiGPU coders like to bring it up just so the NVidia folks know we’re still eager…

tmurray · January 10, 2009, 11:22pm

But then wouldn’t a stream need to encapsulate multiple contexts in order to facilitate multi-GPU synchronization (which is what you’re after, really)?

SPWorley · January 10, 2009, 11:31pm

Yeah, it would, and it all starts becoming messy since you’re trying to peel back a corner of the abstraction that contexts give you.

And maybe it’s not a big deal for a couple devices.

But what about a FASTRA-like box with 8 devices? And all the devices need to sync with each other? That’s 112 transfers the CPU has to organize (8*7 ordered source/destination pairs, times 2 since each exchange needs a device->host copy and then a second host->device copy).
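The 112 figure can be checked with a few lines of host code. This is only a sketch: `staged_transfer_count` is a made-up name, and it assumes every device sends to every other device and that every copy is staged through host memory because no direct device-to-device path is available.

```cpp
#include <cassert>

// Hypothetical helper: number of copies the CPU must issue for an
// all-to-all exchange among n_devices GPUs when each exchange is
// staged through host memory.
long staged_transfer_count(int n_devices) {
    assert(n_devices >= 0);
    // Ordered (source, destination) pairs: n * (n - 1).
    long ordered_pairs = static_cast<long>(n_devices) * (n_devices - 1);
    // Each pair costs two copies: device->host, then host->device.
    return ordered_pairs * 2;
}
```

Calling `staged_transfer_count(8)` gives 112, and the count grows quadratically with the number of devices, which is why the larger boxes are the painful case.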

This isn’t a new problem either, and you’re really getting to supercomputer issues of intercommunication.

A lot of tasks don’t need much intercommunication, but some are dominated by it.

Admittedly in my own coding I have never dealt with more than 2 devices, I’m just extrapolating to the trickier cases that the supercomputer guys must swim in all the time.

Sarnath · January 11, 2009, 3:25am

I think it is just a question of extending “cudaMemcpyKind” to include inter-device copies… Although there is a kind called cudaMemcpyDeviceToDevice – it only covers copies within a single device… So, we need one more kind for copies between two devices to make things simple.

I hope these cards are PCI-master capable… Then, it is just a question of programming the DMA registers and kick-starting the operation…

E.D_Riedijk · January 11, 2009, 6:59am

I believe they are all DMA masters. If my memory serves me well, an NVIDIA employee explained that it is the card that performs the actual DMA transfer.

Sarnath · January 11, 2009, 2:45pm

Yeah… Right… GPU-initiated master DMA would be the case for “pinned” memory for sure…

So, the multi-GPU copying case could be a slight extension of “cudaMemcpy”.

I am not sure how this would work on GTX295… Are these cards 2 PCI functions in a single device? So, if 1 PCI function masters data from/to the other PCI function, will it work over the PCI interface (which is shared by both PCI functions)? Ideally, it should though… I am just fantasizing a situation…

jack · January 12, 2009, 4:21pm

Just curious… would there be any performance benefit to adding a driver function that would allow us to tell if two “devices” (e.g. GTX295) were on the same physical card? I was thinking that this might allow people to optimize memory transfers so that they don’t overlap (otherwise the two GPUs could run independently, but they’d share a common PCI-E interface, and could possibly be transferring data at the same time).

I don’t know a whole lot about the PCI-E bus and memory transfer stuff, so I apologize if this is a stupid question. ;)

Sarnath · January 13, 2009, 5:08am

I can’t tell if you can optimize memory transfers this way…

But I can tell you how to identify 2 cards sharing the same PCI-E interface – just examine the PCI tuple <Bus#, Slot#, Function#>. If the 2 CUDA devices share the same <Bus#, Slot#> and differ only in <Function#> – then they must be sharing the same PCI-E slot.

Anyway, I have no practical experience with this card. I am just dishing this out from my knowledge of PCI. Not sure if something changed with PCI-E though…
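The tuple comparison described above could look roughly like this on the host side. It is a sketch under assumptions: `PciAddress`, `shares_slot` and `group_by_slot` are hypothetical names, and how the bus/slot/function values are obtained for each CUDA device (for example from OS tools such as `lspci`) is left out. Whether the two GPUs of a GTX295 actually show up as two functions of one slot is not confirmed anywhere in this thread.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical record of one device's PCI address. Filling it in is
// outside this sketch; the values are assumed to come from the OS.
struct PciAddress {
    int bus;
    int slot;
    int function;
};

// True when two devices have the same bus and slot numbers, i.e. they
// differ at most in their function number.
bool shares_slot(const PciAddress& a, const PciAddress& b) {
    return a.bus == b.bus && a.slot == b.slot;
}

// Group device indices by (bus, slot). Any group with more than one
// entry holds devices that sit behind the same slot.
std::map<std::pair<int, int>, std::vector<int>>
group_by_slot(const std::vector<PciAddress>& devices) {
    std::map<std::pair<int, int>, std::vector<int>> groups;
    for (int i = 0; i < static_cast<int>(devices.size()); ++i) {
        // Conventional PCI function numbers are 0..7.
        assert(devices[i].function >= 0 && devices[i].function < 8);
        groups[{devices[i].bus, devices[i].slot}].push_back(i);
    }
    return groups;
}
```

With the groups in hand, an application could choose to avoid issuing host transfers to two devices of the same group at the same time, though whether that helps is exactly the open question here.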
