Multiple GPUs with Runtime API and OpenMP

Currently, our simulation code uses CUDA and MPI for running on clusters. The CPU-only version uses an MPI+OpenMP hybrid. I’d like to adopt the latter to allow tighter communication between GPUs on the same node. The typical OpenMP program has the model “spawn, work, join; spawn, work, join; …” Can this be made compatible with the CUDA Runtime API? Currently, the runtime API specifically disallows context migration. It was mentioned elsewhere in the forum that this would be fixed in a future release, but without a timeframe. Is the only option for the Runtime API + OpenMP to keep all the threads live for the duration of the program?

In my application, spawning and joining are done very infrequently, so I am not concerned about overhead. Can the CUDA contexts be destroyed before the thread joins when using the runtime API, then recreated after the next spawn? Alternatively, is there a tentative timeframe for supporting context migration in the runtime API?
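
To make the question concrete, this is roughly the pattern I have in mind (just an untested sketch; gpu_phase() and the work inside it are placeholders for our real code):

#include <cuda_runtime.h>
#include <omp.h>

/* hypothetical sketch of one "spawn, work, join" phase */
void gpu_phase(int ngpus)
{
    #pragma omp parallel num_threads(ngpus)
    {
        cudaSetDevice(omp_get_thread_num());  /* bind this thread to one GPU;
                                                 the context is created lazily here */
        /* ... allocate device memory, launch kernels, copy results back ... */

        cudaThreadExit();  /* destroy this thread's context before the join? */
    }
    /* OpenMP threads join here; the next phase would spawn a new team
       and repeat the cudaSetDevice()/cudaThreadExit() dance. */
}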

Thanks!

All the OpenMP compilers I know of don't actually spawn new worker threads for each parallel region; they keep the threads idling and reuse them to avoid the thread-creation overhead.

So, as long as you just want each thread to have some context and don't mind which one it is (and don't mind relying on behavior the OpenMP specification doesn't guarantee), I think you could just create the contexts once and reuse them in each parallel section.
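
In other words, something like this (untested sketch; the work in the second region is a placeholder):

#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* First parallel region: create each thread's context once. */
    #pragma omp parallel num_threads(ngpus)
    {
        cudaSetDevice(omp_get_thread_num());
        cudaFree(0);  /* no-op call that forces context creation now */
    }

    /* ... serial work ... */

    /* Later parallel regions: rely on the implementation reusing the same
       worker threads, so each thread still owns the context it created above. */
    #pragma omp parallel num_threads(ngpus)
    {
        /* launch kernels, do memcpys, etc. on "this thread's" GPU */
    }
    return 0;
}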

I experimented with this once, and it seemed to work, but I ultimately abandoned the approach because I was uncomfortable relying on unspecified OpenMP behavior. (And it turned out that I didn’t need it for the project.) It is certainly true that any fast OpenMP implementation needs to reuse threads in different parallel sections because on most platforms, spawning a thread is a slow operation. However, I don’t think the standard actually specifies the lifetime of a thread, so you are not guaranteed to get the same set of threads in every parallel section.

I think the “right” way to do this would require the driver API. You’d make an array of structures holding the information for each GPU (the CUcontext plus any device pointers). Upon entering a parallel section, each thread would index into that array by its thread ID and grab its GPU’s information. Then it would call cuCtxPushCurrent to attach that context to its CPU thread, do some work, and at the end of the parallel section call cuCtxPopCurrent so the GPU context can be attached to a different thread in the next parallel section.
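
Roughly like this (untested sketch; MAX_GPUS, GpuSlot, init_gpus() and gpu_phase() are just placeholder names, and error checking is omitted):

#include <cuda.h>
#include <omp.h>

#define MAX_GPUS 8  /* placeholder */

typedef struct {
    CUcontext   ctx;
    CUdeviceptr d_data;  /* plus whatever else each GPU needs */
} GpuSlot;

static GpuSlot slots[MAX_GPUS];

/* one-time setup, done by the master thread */
void init_gpus(int ngpus)
{
    cuInit(0);
    for (int i = 0; i < ngpus; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        cuCtxCreate(&slots[i].ctx, 0, dev);  /* the new context is current on this thread */
        /* cuMemAlloc(&slots[i].d_data, nbytes); */
        cuCtxPopCurrent(NULL);               /* detach it so any thread can grab it later */
    }
}

void gpu_phase(int ngpus)
{
    #pragma omp parallel num_threads(ngpus)
    {
        GpuSlot *s = &slots[omp_get_thread_num()];
        cuCtxPushCurrent(s->ctx);  /* attach this GPU's context to this CPU thread */
        /* ... launches and memcpys using s->d_data ... */
        cuCtxPopCurrent(NULL);     /* detach so a different thread can use it next time */
    }
}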

If you frequently enter and exit parallel sections, this is likely to be so slow that it kills performance (despite being correct no matter how the OpenMP implementation works). You might want to try the simpler approach first and choose to “live dangerously.” :)

Another option would be not to spawn and join, but to embed everything into one long-lived parallel region. The single-threaded parts would then become omp single regions. That would conform to OpenMP, and in practice it should be about what every implementation does anyway.
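
Sketch of what I mean (the time-step loop and nsteps are placeholders):

#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    const int nsteps = 100;  /* placeholder */

    #pragma omp parallel num_threads(ngpus)
    {
        cudaSetDevice(omp_get_thread_num());  /* each thread keeps its context for the whole run */

        for (int step = 0; step < nsteps; ++step) {
            /* multi-GPU work: every thread drives its own device here */

            #pragma omp barrier
            #pragma omp single
            {
                /* formerly serial parts (I/O, bookkeeping, ...) run on one thread */
            }  /* implicit barrier at the end of the single region */
        }
    }
    return 0;
}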

Thanks. After some experiments, I suspected this, since doing the totally naive thing seemed to work, but I wasn’t sure. I agree that the proper thing to do is either use the driver API or maintain one omp parallel section for the lifetime of the code, but both would require some significant restructuring. Since the number of compilers we currently use with CUDA is countable on two fingers, I will likely choose to “live dangerously” for the moment. Thanks for the suggestions!

I’ve had problems with this approach and the MS compiler. If you change the number of threads between parallel sections, it recreates the whole team of threads. Say you have 3 Teslas and 8 CPU cores: part of your work is CPU-only and uses 8 OpenMP threads, and part of it runs on the GPUs, where you use 3 OpenMP threads to control the Teslas. This works perfectly when compiled with GCC, but under MSVC the GPU threads lose their contexts. So you are better off using your own threads to control the GPUs. There is a nice example of how to do it by MisterAnderson42 here: http://forums.nvidia.com/index.php?showtopic=66598
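
(Not his code, but the basic idea is one persistent worker thread per GPU that never gives up its context. A bare-bones pthreads sketch, without a real work queue:)

#include <cuda_runtime.h>
#include <pthread.h>

typedef struct {
    int             device;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             state;  /* 0 = idle, 1 = work pending, -1 = shut down */
} Worker;

static void *worker_main(void *arg)
{
    Worker *w = (Worker *)arg;
    cudaSetDevice(w->device);  /* the context lives on this thread forever */

    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (w->state == 0)
            pthread_cond_wait(&w->cond, &w->lock);
        if (w->state < 0)
            break;  /* shutdown requested */

        /* ... launch kernels / copies for this GPU here ... */

        w->state = 0;
        pthread_cond_signal(&w->cond);  /* tell the submitting thread we are done */
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

The submitting thread just sets state to 1, signals the condition variable, and waits for it to drop back to 0. Since the GPU contexts never move between CPU threads, it no longer matters what MSVC does with the OpenMP team.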
