CUDA 2.0 QUESTION

Could you tell me whether CUDA 2.0 can support “many-GPU” computing (like the S1070 or S870)? Thank you!

Strange question actually, but yes.

I’m also wondering what this question actually means… “Will the CUDA driver/runtime scale to many GPUs?”, or simply “does it support multi-GPU?”.

I’m actually a bit curious about CUDA runtime/driver scalability… For now, are there already systems with more than 1x4 GPUs (I can’t imagine a PCI bus that would survive beyond that)? Nevertheless, one must admit that multi-GPU support is still really poor, at least with regard to memory transfers. Has anyone actually experienced “scalability” issues (in the runtime/driver) when dealing with 4 GPUs, for instance?

There is a guy from Belgium who put 4 GX2 cards in a PC, so he has 8 GPUs. No scalability issues as far as I understood, but it of course depends heavily on the algorithm you are accelerating. http://fastra.ua.ac.be/en/index.html

And I am also hoping we can soon have at least multicast writes to GPU memory, so I can transfer the same data to more than one GPU. That would make CUDA twice as useful for real-time processing.

A multicast write would only be a benefit if it were supported by PCI-E. Well, I know little about PCIe, though. Maybe you were referring to the kind of support that PCI-E already has.

No, I actually don’t have a clue whether it is supported by PCI-e ;)

And it will probably not work at all: the current way data is transferred is by DMA, where the GPU is the device that does the transfer, so it is probably impossible anyway… (makes a note to recalculate some stuff when back from holiday; holiday really clears your mind)

I thought that was only for pinned memory transfers. When you do a “cudaMemcpy”, I thought the CPU does PIO (i.e. the CPU reads/writes from the application address space to device memory). Are you sure it is DMA? It would take a lot to support DMA (scatter-gather support plus non-page-aligned accesses as well). Thanks!

And yes, holidays really clear your mind… It is good in a sense…

It indeed seems that CUDA is doing PIO for non-pinned memory, which is actually a serious issue, given that you cannot pin existing memory and that CUDA’s current interface does not allow sharing such a pinned buffer between contexts. So, considering that CUDA currently fails at handling DMA properly, if you have a really large number of GPUs you would just kill your CPUs by performing PIO… isn’t that a real scalability issue?
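For illustration, here is a minimal sketch (not from the original posts, error checking omitted) contrasting the two host-memory paths being discussed: a regular malloc’d buffer, where the copy ties up the CPU, versus a page-locked buffer from cudaMallocHost, which the GPU’s DMA engine can read directly and which is required for asynchronous copies:

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 * 1024 * 1024;   /* 64 MB test buffer */

    /* Pageable host memory: the runtime has to stage or drive the copy
       itself, so the transfer keeps the CPU busy. */
    float *pageable = (float *)malloc(bytes);

    /* Page-locked (pinned) host memory: the GPU's DMA engine can read it
       directly, and it is a prerequisite for cudaMemcpyAsync. */
    float *pinned = NULL;
    cudaMallocHost((void **)&pinned, bytes);

    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, bytes);

    /* Synchronous copy from pageable memory. */
    cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);

    /* Asynchronous copy from pinned memory; the call returns immediately
       and the copy overlaps with CPU work until we synchronize. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```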

Now, concerning memory transfers, who would really need broadcast? Point-to-point transfers (inter-GPU, or even GPU<->network card, let’s be crazy!) would be much more interesting imho… and they are also perfectly doable on a PCI Express bus.

I have a related, possibly trivial question (I am totally new to all this parallel computing, GPU and CUDA)! Thanks in advance for your patience and response!

If I have two GTX cards (or one GTX 280 and a Tesla board) on a single machine and I run a job, would it automatically distribute the job between the cards? How does it work in various circumstances?

The reason for my question is that we have a 280 on a machine now and we plan to add another graphics card to it: perhaps another 280 or a Tesla. Some pointers to any online documentation would be very helpful.

No, CUDA does not automatically distribute the work between multiple cards. If you don’t explicitly pick a device, it will always run your code on device 0.

One way to divide the work between the cards is to run two copies of your program (if you don’t need the tasks to communicate at all, as in batch processing) and have each copy select a different device at startup. The cudaSetDevice(int dev) function is used for this, as shown in the sketch below.
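A minimal sketch of that “one process per GPU” approach, assuming (just as an example) that the device ordinal is passed on the command line:

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    /* Each copy of the program is launched with a different device
       ordinal, e.g. "./worker 0" and "./worker 1". */
    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    cudaSetDevice(dev);

    /* From here on, all allocations and kernel launches in this process
       go to the selected device. */
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    printf("process bound to device %d\n", dev);
    cudaFree(d_buf);
    return 0;
}
```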

If you want to use two cards from one instance of your program, then you need to start two threads and have each thread call cudaSetDevice() with a different device number. After that point, when each thread calls a CUDA function, it will automatically go to the device that was selected initially.
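A minimal sketch of that thread-per-GPU pattern, using pthreads and with error checking omitted; the key point is that cudaSetDevice() is called in each thread before any other CUDA call:

```
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

/* Each worker thread binds itself to one device and then does all of its
   CUDA work on that device. */
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);   /* must come before any other CUDA call in this thread */

    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    /* ... launch kernels, copy data, etc. ... */
    printf("thread bound to device %d\n", dev);
    cudaFree(d_buf);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count > 16) count = 16;

    pthread_t threads[16];
    int ids[16];
    for (int i = 0; i < count; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < count; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```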

No free lunch :)

At least from the driver API’s point of view, you have to create separate contexts and distribute tasks between GPUs by hand. Basically, this is the same problem as having multiple tasks to be performed on multiple GPUs; there is just a huge number of techniques for balancing the load or, more generally, for scheduling those tasks across various computing resources.
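Roughly, the driver-API version of “one context per GPU” looks like the sketch below (error checking omitted; in a real multi-GPU program each context would typically live in its own host thread rather than a loop):

```
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    cuInit(0);

    int count = 0;
    cuDeviceGetCount(&count);

    /* One context per device. A context is current only for the thread
       that created it, so real programs would create these in separate
       host threads. */
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, i);
        cuCtxCreate(&ctx, 0, dev);

        CUdeviceptr d_buf;
        cuMemAlloc(&d_buf, 1024 * sizeof(float));
        /* ... cuModuleLoad / kernel launches would target this context ... */
        cuMemFree(d_buf);
        cuCtxDestroy(ctx);
    }
    return 0;
}
```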

Thanks! Precisely what I was looking for :).