CUDA 4.0

Now here’s a fun question (or a feature request) for tmurray: given a single node without an IB device but with a bunch of C20xx cards, can I still use the new CUDA+MPI features? This would be great, since I usually develop on a single multicore, multi-GPU node and scale out afterwards. If this worked, I could verify that my code runs by doing a (suboptimal) MPI+CUDA run even on my devel box. This would obviously be convenient for a lot of people like me, since such a box is pretty common and this scaling-out strategy matches my workflow perfectly. And I can’t imagine I’m the only one, since nowadays even laptops are fast enough for MPI (with shmem devices via hwloc).

There have been CUDA helper functions in hwloc for maybe half a year, and (if I understand correctly) all they do is provide a uniform answer to the host CPU-GPU affinity question, i.e. “which processor is closest to which GPU?”. I don’t believe hwloc has anything to do with the actual messaging parts of MPI; it is just a way of providing an architecture-independent topology graph of all the hardware in the communicator. But I am probably wrong too :)

Sure. The modifications are basically “MPI takes GPU pointers and does the copying as required itself.” The backend for the transfer doesn’t really matter.

CUDA 4.0 is great!

Fantastic. Will you provide, together with the CUDA 4.0 SDK, any documents that explain the ideas and how this integration was done?

Awesome job to Tim and the CUDA team! This makes my life a lot easier.

For those who haven’t seen, there will be a webinar and conference call tomorrow on Cuda 4.0 that all are free to attend:
https://www2.gotomeeting.com/register/394445946

It’s from 10am PST - 11am PST, Friday March 4th.

I’ll be around to answer questions during that, actually.

Hi,

I have a question about UVA. The fact that the virtual address space is unified still doesn’t remove the copy from “actual” CPU memory to “actual” GPU memory over PCIe, right? Does the driver underneath take care of all of this?

What happens if I change some value in GPU RAM and then have the CPU read it through the unified address: does the CPU automatically see the change, i.e. will the driver “auto-magically” copy the data over PCIe back from the GPU to the CPU to reflect it?

thanks

Like I’ve said previously, you can’t just dereference a GPU pointer from the CPU or vice versa. You still have to do explicit copies.
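To make that concrete, here is a minimal sketch (my own illustration, not from the slides; buffer names are made up). Even with UVA the copy is explicit; UVA just means the driver can tell host and device pointers apart, so you can pass cudaMemcpyDefault and let it infer the direction:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *h_buf = NULL, *d_buf = NULL;

    cudaMallocHost((void **)&h_buf, n * sizeof(float));  /* pinned host memory */
    cudaMalloc((void **)&d_buf, n * sizeof(float));      /* device memory */

    h_buf[0] = 42.0f;

    /* With UVA the driver knows which pointer lives where, so
       cudaMemcpyDefault works, but the copy itself is still explicit. */
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyDefault);

    /* Dereferencing d_buf from the CPU here would still be invalid.
       To see changes made on the GPU, you copy back explicitly: */
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDefault);

    printf("h_buf[0] = %f\n", h_buf[0]);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}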

We just posted some CUDA 4.0 overview slides:

A question regarding the GPU Computing SDK: in 3.2 the SDK included CUDPP, which seems to have been removed in 4.0 (I guess because Thrust is included in the toolkit now). My code relies heavily on CUDPP’s scan and (radix) sort, and last year these were a lot faster with CUDPP than with Thrust. Can the Thrust in 4.0 compete in these disciplines? And generally: do you know if both are under active development, especially with regard to new iterations of CUDA and NVIDIA hardware?
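For context, what I use from CUDPP boils down to an exclusive scan and a radix sort over device arrays; my understanding is that the Thrust equivalents would look roughly like this (just an API sketch with made-up names, not a performance claim):

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sort.h>

void scan_and_sort(thrust::device_vector<unsigned int>& keys)
{
    // Exclusive prefix sum, analogous to cudppScan
    thrust::device_vector<unsigned int> sums(keys.size());
    thrust::exclusive_scan(keys.begin(), keys.end(), sums.begin());

    // Sort; I believe Thrust dispatches to a radix sort for primitive
    // key types, analogous to cudppSort
    thrust::sort(keys.begin(), keys.end());
}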

Is it limited to Tesla T20-series only? Or all Fermi’s?

All Fermis on 64-bit non-WDDM platforms.

Awesome, thanks Tim

Hello there,

It was a great webinar, thanks a lot! And wow, so many new features, you must be busy out there :)

MPI support is extremely interesting. I was actually considering porting one of the simpler MPI implementations intended for shared memory to CUDA. Are you certain that an OpenMPI extension is the best way to achieve this? I’ve been having intermittent problems with the sm and openib transports in OpenMPI; it’s surprisingly buggy. Well, at least it’s got the framework :) At any rate, please make sure that you have an extensive test suite that covers different kinds of applications (iterative matrix calculations, asynchronous computations, fine-grained algorithms, coarse-grained algorithms, etc.), to be on the safe side if you are using OpenMPI.

Just to be absolutely sure: is this MPI support for communication across CPU threads or GPU threads? Or am I missing the point entirely? :) I was a bit late to the webinar, I have to watch it from the beginning.

I am looking forward to trying out CUDA 4.0 right away. Some of the memory allocation bugs I reported were supposed to be fixed in this release; I must check if my test cases work now.

Best Regards,

Eray

Yet more well well well deserved kudos for 4.0!

A quick question about GPU to GPU transfers… even though they avoid the CPU, do the transfers always hop all the way down to the main PCIe bus and back, or can they be routed at a higher switch level?

My specific question: if you have a motherboard like the Asus P6T7 Supercomputer with an NF200 switch to support more PCIe slots, will a memory transfer between 64-bit Fermi devices that share that NF200 switch be routed through the switch, to the motherboard, and then back through the switch again? Or would the transfer avoid the motherboard bus hop?

This matters even more when thinking about the still-hypothetical dual-GF110 boards, which would likely have an NF200 onboard. If the two GPUs could share memory via the switch and not even touch the main PCIe bus, it’d be rather nifty!

They don’t necessarily have to go back to the chipset. If you have two cards behind a switch, P2P will not go through the root chipset and instead stay entirely behind the switch.
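For anyone who wants to try this, here is a minimal sketch of setting up and using the P2P copy path between devices 0 and 1 (my own example; device numbering and sizes are assumptions). Whether the resulting copy then stays behind the switch is decided by the PCIe topology, as described above:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);

    if (canAccess01 && canAccess10) {
        size_t bytes = 1 << 20;
        void *d0 = NULL, *d1 = NULL;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   /* flags must be 0 */
        cudaMalloc(&d0, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMalloc(&d1, bytes);

        /* Direct GPU0 -> GPU1 copy; no host staging buffer involved */
        cudaMemcpyPeer(d1, 1, d0, 0, bytes);

        cudaFree(d1);
        cudaSetDevice(0);
        cudaFree(d0);
    } else {
        printf("P2P not supported between devices 0 and 1\n");
    }
    return 0;
}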

See a question on the first page of this thread: it’s for CPU threads (MPI processes). It simply allows you to pass GPU pointers to the MPI communication routines (recv, send, broadcast) and thus gets rid of the need to copy to the host first (that copy is done implicitly).

So basically there are now three possibilities for data transfers between GPUs on different hosts with MPI (HostI and GPUI being pointers to memory on the host and the GPU of node I).


  1. The old way (and the only option if you don’t have a 64-bit system):

On Host1:

cudaMemcpy(Host1, GPU1, size, cudaMemcpyDeviceToHost);
MPI_Send(Host1, ...);

On Host2:

MPI_Recv(Host2, ...);
cudaMemcpy(GPU2, Host2, size, cudaMemcpyHostToDevice);

This involves the CPU touching the data.


  2. GPU-Direct 1.0, only available with special drivers for InfiniBand cards (Mellanox, and I think QLogic or so?):

On Host1:

cudaMemcpy(Host1, GPU1, size, cudaMemcpyDeviceToHost);
MPI_Send(Host1, ...);

On Host2:

MPI_Recv(Host2, ...);
cudaMemcpy(GPU2, Host2, size, cudaMemcpyHostToDevice);

The calls are the same as in the old way, but the InfiniBand driver can share the pinned host buffer with the CUDA driver, so both the GPU and the network card access it via DMA and the CPU does not have to touch the transfer buffer.


  3. GPU-Direct 2.0:

On Host1:

MPI_Send(GPU1, ...);

On Host2:

MPI_Recv(GPU2, ...);

Implicitly this works like version 1: the API/driver transfers the data from the GPU to the host first and then back again on the other host. This can be used on normal systems (as long as they support UVA, i.e. are 64-bit systems), and I guess they are also working on combining it with the GPU-Direct feature of direct DMA access by both the GPU and the network card. But as far as I understood it they don’t have that working yet; maybe tmurray can correct me on this (and other misunderstandings which might plague me ;-)).
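To make option 3 concrete, here is a minimal sketch of what I mean (assuming a CUDA-aware MPI build such as the announced OpenMPI extension; counts, tags and variable names are made up):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        /* The device pointer goes straight into MPI_Send; the CUDA-aware
           MPI layer does the device/host staging (or DMA) internally. */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}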

Ceearem

Oh nifty nifty nifty!

This looks great! The direct memory access and thread access features look ideal for multi-GPU development.

But… [s]What about support for VS 2010? I installed the 4.0 SDK tonight, but it looks like I’m going to have to back it out and revert to 3.2 until the relevant props files are available, etc.

Any idea when this will happen? I’d really like to start using the 4.0 features.[/s]

Oh… Looks like the props files just need to be dropped into the correct MSBuild folder, and a few changes to the vcxproj file and some paths get things going. Excellent news. The README should mention this, I guess; it implies there’s only support for 2005 & 2008.

Ade