I’m thinking about getting one of the nifty new GTX295 cards when I build a new dev box this summer, so I’ve been looking at some of the multi-GPU examples. In the CUDA driver API, there are functions to get the properties of a device (compute capability, total memory, etc.). If there were two cards in the machine, then each would return its own memory count, etc. When there are two GPUs on one card (a la GTX295), is the memory split in half, or can they each access the entirety of the memory on the card? If it is the latter, is the driver API “smart” enough to know when memory is allocated for a CUDA application running on one of the GPUs so that it is not re-allocated and overwritten by a program running on the second GPU? What about memory transfers from the card to the host?
In CUDA terms, the GTX 295 is two cards with 240 SPs and 896 MB of memory each, period. There is no fancy magic on that card to help you make a multi-GPU app.
Ah ok, thanks. I was hoping that’s how it would be, versus allowing each GPU to access the entire memory. That makes things a lot simpler.
What we all want is “backchannel” communication between GPUs though… for a 295, it’d be cool to have a device-to-device copy over its own private on-card link. (Such hardware connections may not exist on the 295 though… SLI is not a high-bandwidth connection.) Much more practical and possible are device-to-device copies over the PCIE bus without the CPU’s involvement… just queue up the transfer in a stream and let the exchange happen at the driver level.
These are not new feature requests… but as soon as you start programming multiGPU you immediately think about this ability.
It’s likely that device-to-device transfers over the PCIE bus are possible now with today’s hardware, but would need significant new driver and CUDA API support. Such updates aren’t trivial of course, but I think all of us multiGPU coders like to bring it up just so the NVIDIA folks know we’re still eager…
But then wouldn’t a stream need to encapsulate multiple contexts in order to facilitate multi-GPU synchronization (which is what you’re after, really)?
Yeah, it would, and it all starts becoming messy since you’re trying to peel back a corner of the abstraction that contexts give you.
And maybe it’s not a big deal for a couple devices.
But what about a FASTRA-like box with 8 devices, where all the devices need to sync with each other? That’s 112 transfers the CPU has to organize (8*7 ordered pairs, *2 since each needs a device->host copy and then a second host->device copy).
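To make the arithmetic above concrete, here is a minimal host-side sketch of the count. The function names (`orderedPairs`, `stagedCopies`, `directCopies`) are made up for illustration and are not part of any CUDA API; `directCopies` models the hypothetical direct device-to-device copy being wished for in this thread.

```cpp
#include <cassert>

// Ordered (source, destination) device pairs in an all-to-all exchange.
int orderedPairs(int devices) {
    assert(devices >= 1);  // at least one device
    return devices * (devices - 1);
}

// Copies needed when every exchange is staged through host memory:
// one device->host copy plus one host->device copy per ordered pair.
int stagedCopies(int devices) {
    return 2 * orderedPairs(devices);
}

// Copies needed if a direct device->device copy existed (hypothetical):
// one per ordered pair.
int directCopies(int devices) {
    return orderedPairs(devices);
}
```

For 8 devices, `stagedCopies(8)` gives the 112 mentioned above, while `directCopies(8)` would be 56. The count grows quadratically either way, so a direct copy halves the work but doesn’t change the scaling.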
This isn’t a new problem either, and you’re really getting to supercomputer issues of intercommunication.
A lot of tasks don’t need much intercommunication, but some are dominated by it.
Admittedly, in my own coding I have never dealt with more than 2 devices; I’m just extrapolating to the trickier cases that the supercomputer guys must swim in all the time.
I think it is just a question of extending “cudaMemcpyKind” to include inter-device copies… Although there is a kind called cudaMemcpyDeviceToDevice – it covers only copies within a single device… So, we need one more kind for copies from one device to another to make things simple.
I hope these cards are PCI-master capable… Then, it is just a question of programming the DMA registers and kick-starting the operation…
I believe they are all DMA-master capable. If my memory serves me well, an NVIDIA employee explained that it is the card that performs the actual DMA transfer.
Yeah… Right… GPU-initiated master DMA would be the case for “pinned” memory for sure…
So, the multi-GPU copying case could be a slight extension of “cudaMemcpy”.
I am not sure how this would work on GTX295… Are these cards 2 PCI functions in a single device? So, if one PCI function masters data from/to the other PCI function, will it work over the PCI interface (which is shared by both PCI functions)? Ideally, it should though… I am just fantasizing a situation…
Just curious… would there be any performance benefit to adding a driver function that would allow us to tell if two “devices” (e.g. GTX295) were on the same physical card? I was thinking that this might allow people to optimize memory transfers so that they don’t overlap (otherwise the two GPUs could run independently, but they’d share a common PCI-E interface, and could possibly be transferring data at the same time).
I don’t know a whole lot about the PCI-E bus and memory transfer stuff, so I apologize if this is a stupid question. ;)
I can’t tell if you can optimize memory transfers this way…
But I can tell you how to identify 2 cards sharing the same PCI-E interface – just examine the PCI tuple <Bus#, Slot#, Function#>. If the 2 CUDA devices share the same <Bus#, Slot#> and differ only in <Function#> – then they must be sharing the same PCI-E slot.
Anyway, I have no practical experience with this card. I am just dishing this out from my knowledge of PCI. Not sure if something changed with PCI-E though…
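The tuple check described above could be sketched like this. `PciAddress` and `sharesSlot` are hypothetical names, and how you actually obtain the bus/slot/function numbers for a CUDA device is left out. This only encodes the rule as stated in the post; it hasn’t been verified against a real GTX295, which may not present its two GPUs as two functions of one device.

```cpp
#include <cassert>

// Hypothetical holder for a device's PCI tuple <Bus#, Slot#, Function#>.
struct PciAddress {
    int bus;
    int slot;
    int function;
};

// True when two devices sit on the same bus and slot but are distinct
// functions, i.e. the "same PCI-E slot" condition described in the post.
bool sharesSlot(const PciAddress& a, const PciAddress& b) {
    assert(a.bus >= 0 && a.slot >= 0 && a.function >= 0);
    assert(b.bus >= 0 && b.slot >= 0 && b.function >= 0);
    return a.bus == b.bus && a.slot == b.slot && a.function != b.function;
}
```

With this rule, `sharesSlot(PciAddress{3, 0, 0}, PciAddress{3, 0, 1})` is true, and two devices on different buses compare false. An app could use that to serialize host transfers for a pair that reports true.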