Tips on writing safe parallel code with CUDA?


This may be a difficult question to answer, but how does one write safe, complex parallel code with CUDA?

I am using cudaStreamAddCallback with multiple GPUs and streams to drive maximum efficiency out of a multi-GPU setup of NVIDIA Titans, and I will be expanding into dynamic parallelism. It is getting very hard for me to keep track of everything that might cause race conditions, threads overwriting each other's values, memory leaks, and so on.
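To give a concrete idea of what I mean by "multiple GPUs, streams, and callbacks", here is a minimal sketch of the pattern, with every runtime call wrapped in an error check (the CUDA_CHECK macro name and the callback body are my own, not from any library):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures surface immediately,
// with file and line, instead of silently corrupting later results.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err__), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Host callback fired once all preceding work in the stream is done.
// Note: callbacks must not make CUDA API calls; keep them short.
void CUDART_CB onStreamDone(cudaStream_t stream, cudaError_t status,
                            void* userData) {
    printf("stream finished, status = %d\n", (int)status);
}

int main() {
    int deviceCount = 0;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    for (int d = 0; d < deviceCount; ++d) {
        CUDA_CHECK(cudaSetDevice(d));       // one stream per device
        cudaStream_t s;
        CUDA_CHECK(cudaStreamCreate(&s));
        // ... enqueue kernels / async copies on s here ...
        CUDA_CHECK(cudaStreamAddCallback(s, onStreamDone, nullptr, 0));
        CUDA_CHECK(cudaStreamSynchronize(s));
        CUDA_CHECK(cudaStreamDestroy(s));
    }
    return 0;
}
```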

So I am looking for guidelines (tools, error handling, references, etc.) on how to manage this complexity and minimize anything that might creep into production code. So yes, the question may be a bit open-ended.
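For what it's worth, the memory checker that ships with the toolkit can already flag some of these bug classes at runtime; typical invocations look like this (the binary name ./myapp is a placeholder):

```shell
# Detect out-of-bounds and misaligned device memory accesses
cuda-memcheck ./myapp

# Detect shared-memory data races between threads in a block
cuda-memcheck --tool racecheck ./myapp
```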

Thanks for any help,

#define SAFE 1

It seems to me that what is missing from the CUDA toolset is a set of generic, higher-level algorithms that abstract away the details of threads, blocks, SMXs, and GPUs, similar to the STL but for parallel algorithms. You could then introduce policies (etc.) that override the defaults, so you could still tune for optimal performance. And of course, all of this would have to be thread-safe.

Mucking around in C is a great distraction of detail that keeps me from concentrating on what I really want to do: implementing algorithms that make the best use of the GPU hardware.

I’m sure the clever people at NVIDIA can come up with such an abstracted template library! ;)


What about the Thrust library that comes with the CUDA SDK? It is modeled directly on the STL, and its sort is very fast (assuming you do NOT use their device_vector, but rather raw pointers).
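For example, a sketch of sorting through a raw device pointer rather than a device_vector (assumes the Thrust headers that ship with the toolkit; the function name is my own):

```cuda
#include <thrust/sort.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>

// Sort n ints already resident in device memory, in place.
void sort_on_device(int* d_data, size_t n) {
    // Wrap the raw pointer so Thrust knows it lives in device memory.
    thrust::device_ptr<int> begin(d_data);
    thrust::sort(thrust::device, begin, begin + n);
}
```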

cuBLAS also lets you avoid such issues.

C (and its superset C++) is also the fastest language by a large margin, so it is the appropriate language for high-performance code. Is C really that much more difficult?

You can always use Google to find other CUDA libraries, such as MAGMA, CULA, etc.

Some new frameworks (written by NVIDIA people) are here:

  1. Hemi: CUDA Portable C/C++ Utilities
  2. CUB: CUDA Unbound

Yes, I use Thrust. I wish it could do a lot more. I think the problem is that it is rooted in the serial STL world, and provides an interface that is restrictive because of that.

For example, reduction is a really useful algorithm. But why not expand upon it into a more parallel paradigm?
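To be fair, the reduction itself does map well onto Thrust's STL-style interface; a sketch of the fused transform-plus-reduce variant, for instance (the function name is my own):

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// Sum of squares in a single fused pass over device memory:
// each element is squared, then all squares are added up.
float sum_of_squares(const thrust::device_vector<float>& v) {
    return thrust::transform_reduce(v.begin(), v.end(),
                                    thrust::square<float>(),  // x -> x*x
                                    0.0f,                     // initial value
                                    thrust::plus<float>());   // reduction op
}
```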

A couple of examples:

Use case 1

Apply algorithm(s) to combinations. That is, for N elements, apply the same algorithm to every unique group of M elements as input, where M <= N. Why not have ‘combination’ as an algorithm?
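To illustrate what I mean, here is a hypothetical sketch of such a ‘combination’ primitive for M = 2: each thread unranks a linear index into one of the N(N-1)/2 unique pairs. The kernel name, the unranking formula, and the multiply standing in for a user-supplied operator are all my own illustration, not an existing API:

```cuda
// For each unique pair (i, j) with i < j drawn from n elements,
// apply a binary operation and write one result per pair.
__global__ void for_each_pair(const float* x, float* out, int n) {
    long long k = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long total = (long long)n * (n - 1) / 2;
    if (k >= total) return;

    // Unrank the linear index k into a pair (i, j), i < j:
    // row i starts at offset i*(2n-i-1)/2 in the flattened pair list.
    int i = (int)((2.0 * n - 1 -
                   sqrt((2.0 * n - 1) * (2.0 * n - 1) - 8.0 * k)) / 2);
    long long j = k - (long long)i * (2 * n - i - 1) / 2 + i + 1;

    out[k] = x[i] * x[(int)j];  // a user-supplied functor would go here
}
```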

Use case 2

A grid of elements, with the ability to apply an algorithm to a group of tiles of elements at the same time, for example in blurring/filtering. Why not have ‘tile’ as another algorithm?

I heard in a GTC talk, the one on Kepler, that NVIDIA captures a lot of different programs to see how well they run on their GPU architecture. Surely, then, they are able to pick out common parallel programming patterns that could be abstracted into a modern template library?

In other words, shift the programming paradigm forward. NVIDIA is surely the best placed to take a lead on this, and I’d love to see it happen.


Thanks, the links look interesting!