CUDA 4.0

See what tmurray has been hinting at for months:

Registered developers will get to use it on March 4.

Biggest win seems to be all around easier handling of multiple GPUs, including direct GPU-to-GPU copies which has been a common request.

We’ve been working on this for 7-8 months now. Multi-GPU was a big focus, as was simplifying the API and removing a lot of the weird corner cases.

There are a lot of features–time to spoil most things.

  • UVA–globally accessible pointers don’t alias, whether they’re in GPU memory (any UVA GPU’s memory) or the CPU. also, want to copy data between GPUs? cudaMemcpy(src, dst, size)

  • multiple threads just work with multiple GPUs–you want to use the same GPU from multiple threads? call cudaSetDevice from multiple threads and they’ll all share the same context. you want to use many GPUs from one thread? call cudaSetDevice repeatedly from the same thread and things will work exactly how you’d imagine they should work

  • P2P–direct load/store to another GPU’s memory within a kernel as well as improved inter-GPU copy performance (a lot of chipsets seem to be not too great at concurrent DtoH and HtoD bandwidth to different cards but good at P2P)

  • launching a kernel from the driver API is no longer a big affair; it’s one call that looks a lot like the runtime API kernel launch

  • 3D grids on Fermi

  • inter-GPU synchronization–record an event on one GPU, do a cudaStreamWaitEvent on another GPU

  • pin arbitrary memory–get some page-aligned memory, call cudaHostRegister, congrats! it’s pinned.

So you can do some snazzy things like



cudaEventRecord(eventDev0, stream0);



cudaEventRecord(eventDev1, stream1); 


cudaStreamWaitEvent(eventDev1, stream0);

cudaMemcpyAsync(dev1Ptr, dev0Ptr, size, stream1);


cudaStreamWaitEvent(eventDev0, stream1);

cudaMemcpyAsync(dev0Ptr, dev1Ptr, size, stream0);

With something like that, you can kick off a lot of work on multiple GPUs that is launched from a single thread and never uses the CPU for inter-GPU synchronization (so you could launch that in a for loop without any sort of need to call cudaThreadSynchronize until the very end).

Hopefully it’s obvious now, but forum posts complaining about lack of functionality or the difficulty to accomplish some reasonable task do not go into a vacuum.

Ah this is going to make a lot of things easier for me :-).
I am especially curious for the direct MPI support. Can I assume that it basically works the way MPI always works, but that you just give GPU Memory pointers to the send and recv functions? And is it a completely seperate implementation of MPI or an addition to openmpi or any other of the available solutions?


That’s basically how it works. We’re not releasing our own MPI implementation; we’re working with existing implementers to add this support. You’ll hear more about it during the next few months (hopefully sooner rather than later).

Will this be supported across all CUDA devices and compute capabilities? I’m curious if this requires hardware features that only devices with canMapHostMemory have, or something like that.

This will be huge. I assume this means that repeat cudaSetDevice() will be quick after the initial contexts are created? I want to know if I pay some penalty for rapid-fire switching between GPUs from a single thread.

In the multiple host thread, one GPU case, is it important to use streams to keep the host threads independent? I’m thinking of a situation where I have multiple threads firing off kernels that don’t relate to each other, and using the implicit block in cudaMemcpy().

For example:

void ComputeSomething(int n, float *inbuffer, float *device_memory, float *outbuffer)


  cudaMemcpy(device_memory, inbuffer, sizeof(float) * n, cudaMemcpyHostToDevice);

  my_kernel<<<gridDim, blockDim>>>(n, device_memory);

  cudaMemcpy(outbuffer, device_memory, sizeof(float) *n, cudaMemcpyDeviceToHost);


If I call ComputeSomething() from two host threads at the same time such that my_kernel is queued one after the other by each thread, will the second cudaMemcpy wait until both kernels finish, or just the kernel associated with that host thread. (OK, that’s a little confusing. I guess I’m asking if stream 0 for a given device is shared by all host threads, or if each host thread on the same device gets its own?)

Once PyCUDA incorporates the new functionality, it will finally remove the potential barrier preventing me from expanding some current code to multi-GPU ability. I’m looking forward to it!

64-bit Fermi only. Pre-Fermi cards have a 32-bit address space, and we’d blow through that trivially. Same reason we don’t support UVA on 32-bit platforms.

Very quick; you’ll be using the same contexts from all threads using the runtime. (you can use the same context from multiple threads simultaneously, in case that wasn’t obvious)

The null stream remains per context rather than per thread.

Duh, right. That seems obvious now. :)

Great, thanks!

Two thirds of my work in the past two years was due to gpu-gpu transfers and multi-gpu barriers… Could you please release all these nice featuresin jan, 2012, so that I don’t have to rewrite my phd thesis? ;)

There’s still no way to do a barrier within a kernel. Between kernels, it’s easy–use cudaStreamWaitEvent.

Congratulations and thank you, your work is very appreciated!

Inter-GPU communications, great!
And with that future CUDAized MPI it’ll be an absolute killer-feature (CUDA on clusters is sometimes quite heavy to deal with, and maintain)

Kudos to Nvidia! This could have huge impacts in our Tsubame 2 machine.

Does MPI transfer from GPU memory over Inifiband go directly to Infiniband interface? Or Does that involve copying to CPU memory?

Another question is on how the new GPU direct is implemented. Does it require a kernel patch as the previous version?

Copying to CPU memory. Doesn’t require a kernel patch, either.

Thanks for the info. Glad to know no kernel patch needed.

If NVIDIA could release the “interface” for writing GPUDirect Support, One can implement it for any device that one wants.
That would be a cool thing.

For example: Many people had earlier wanted to DMA Ethernet packets directly onto GPU. If GPUDirect interface is made public, I can write one for the card that I possess… and so on.

Is CUDA-MPI implementation based on MPICH, OpenMPI, something else?

Hello, I need to know if CUDA SDK 4.0 supports GPU Direct v2.0 on G200 consumer devices (GX260, GX275, GX280, GX295).

In particular I need information regarding CUDA SDK 4.0 features supported on GX275 specifically in near term for current application development needs and planning.

Also G200 support on the rest of the features in CUDA SDK 4.0 while you are at it – I’m thinking a comprehensive feature matrix for SDK4 - vs - H/W Generation. But please for near term just concern with addressing the intention to support GPU Direct v2.0 on G200 devices in CUDA SDK 4.0 presently.

Many nonscientific nonworkstation applications can benefit from consumer G200 hardware, if you want it. If you don’t believe me then think about Microsoft Kinect and potentials for large-volume-sales mobile and desktop vision applications.

Thanks for information!

Let me clarify to help keep focus on possible responses to previous questions. Maybe I gave too many potential distractions in my original question.

Of course GPU Direct is not cleanly possible 32-bit GPU to 64-bit HOST (or other 64-bit device). Let’s not talk about it further.

The true question that remains is, whether NVIDIA CUDA SDK 4.0 is providing support to G200 devices to perform GPU Direct from 32-bit G200 consumer GPU device to another identical 32-bit G200 consumer GPU device. Please notice the same bit-ness on both devices.

Thanks for information!

Unified addressing and peer-to-peer are limited to Fermi GPUs on 64-bit platforms.

Also, let me be clear–we’re not shipping a proprietary MPI implementation. We’re working with existing MPI implementations to add support for this.

On the subject of getting existing MPI implementations to add support, is there a way to jump on this particular train and help? I reckon this is the missing piece in the puzzle which would allow bridging two of my most favorite tools - pycuda and mpi4py - and allow direct MPI mapping and manipulation of pycuda’s gpuarray type. For those of us who work with distributed memory clusters and like python, this is potentially nirvana. I am very keen to do this myself if the necessary technical information can be made available.

Looks promising:

svn co ompi-trunk
grep -r -i cuda *

update: mpich2 (svn co mpich2-trunk) and openmpi have CUDA-support in the HWLOC component in their svn trunks. Which makes sense, based on my humble understanding of the role of this particular component: mpich2-hydra and ompi encapsulate node heterogeneities through HWLOC. mvapich2 doesn’t seem to have, which is surprising since, to ignorant me, mvapich2 is basically mpich2 with support for faster interconnects like IB or Myrinet.

I tried configuring the ompi checkout, but there’s no ./configure --with-cuda or --with-hwloc=cuda or anything remotely related available yet, at least I didn’t find anything despite excessive grep stunts.

In summary: Either the HWLOC folks are pursuing their own CUDA thing, or we can expect usable CUDA4+MPI soon. Or I am on w wild goose chase, which is just as probable :)