Multiple GPUs with Runtime API and OpenMP

Currently, our simulation code uses CUDA and MPI for running on clusters. The CPU-only version uses an MPI+OpenMP hybrid. I’d like to adopt the latter to allow tighter communication between GPUs on the same node. The typical OpenMP program has the model “spawn, work, join; spawn, work, join; …” Can this be made compatible with the CUDA Runtime API? Currently, the runtime API specifically disallows context migration. It was mentioned elsewhere in the forum that this would be fixed in a future release, but without a timeframe. Is the only option for the Runtime API + OpenMP to keep all the threads live for the duration of the program?

In my application, spawning and joining are done very infrequently, so I am not concerned about overhead. Can the CUDA contexts be destroyed before the thread joins when using the runtime API, then recreated after the next spawn? Alternatively, is there a tentative timeframe for supporting context migration in the runtime API?
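
To make the question concrete, this is roughly the pattern I have in mind (just an untested sketch; gpu_phase() and the work inside it are placeholders for our real code):

#include <cuda_runtime.h>
#include <omp.h>

/* hypothetical sketch of one "spawn, work, join" phase */
void gpu_phase(int ngpus)
{
    #pragma omp parallel num_threads(ngpus)
    {
        cudaSetDevice(omp_get_thread_num());  /* bind this thread to one GPU;
                                                 the context is created lazily here */
        /* ... allocate device memory, launch kernels, copy results back ... */

        cudaThreadExit();  /* destroy this thread's context before the join? */
    }
    /* OpenMP threads join here; the next phase would spawn a new team
       and repeat the cudaSetDevice()/cudaThreadExit() dance. */
}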

Thanks!

All the OpenMP compilers I know of don't actually spawn new worker threads for each parallel region; they keep the threads idling and reuse them to avoid the thread-creation overhead.

So, as long as you just want each thread to have some context and don't mind which one it is (and don't mind relying on behavior the OpenMP specification doesn't guarantee), I think you could just create the contexts once and reuse them in each parallel section.
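
In other words, something like this (untested sketch; the work in the second region is a placeholder):

#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    /* First parallel region: create each thread's context once. */
    #pragma omp parallel num_threads(ngpus)
    {
        cudaSetDevice(omp_get_thread_num());
        cudaFree(0);  /* no-op call that forces context creation now */
    }

    /* ... serial work ... */

    /* Later parallel regions: rely on the implementation reusing the same
       worker threads, so each thread still owns the context it created above. */
    #pragma omp parallel num_threads(ngpus)
    {
        /* launch kernels, do memcpys, etc. on "this thread's" GPU */
    }
    return 0;
}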

I experimented with this once, and it seemed to work, but I ultimately abandoned the approach because I was uncomfortable relying on unspecified OpenMP behavior. (And it turned out that I didn’t need it for the project.) It is certainly true that any fast OpenMP implementation needs to reuse threads in different parallel sections because on most platforms, spawning a thread is a slow operation. However, I don’t think the standard actually specifies the lifetime of a thread, so you are not guaranteed to get the same set of threads in every parallel section.

I think the “right” way to do this would require the driver API. You’d make an array of structures holding the information for each GPU (the CUcontext plus any device pointers). Upon entering a parallel section, each thread would index into that array by its thread ID and grab its GPU’s information. Then it would call cuCtxPushCurrent to attach that context to its CPU thread, do some work, and at the end of the parallel section call cuCtxPopCurrent so the GPU context can be attached to a different thread in the next parallel section.
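
Roughly like this (untested sketch; MAX_GPUS, GpuSlot, init_gpus() and gpu_phase() are just placeholder names, and error checking is omitted):

#include <cuda.h>
#include <omp.h>

#define MAX_GPUS 8  /* placeholder */

typedef struct {
    CUcontext   ctx;
    CUdeviceptr d_data;  /* plus whatever else each GPU needs */
} GpuSlot;

static GpuSlot slots[MAX_GPUS];

/* one-time setup, done by the master thread */
void init_gpus(int ngpus)
{
    cuInit(0);
    for (int i = 0; i < ngpus; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        cuCtxCreate(&slots[i].ctx, 0, dev);  /* the new context is current on this thread */
        /* cuMemAlloc(&slots[i].d_data, nbytes); */
        cuCtxPopCurrent(NULL);               /* detach it so any thread can grab it later */
    }
}

void gpu_phase(int ngpus)
{
    #pragma omp parallel num_threads(ngpus)
    {
        GpuSlot *s = &slots[omp_get_thread_num()];
        cuCtxPushCurrent(s->ctx);  /* attach this GPU's context to this CPU thread */
        /* ... launches and memcpys using s->d_data ... */
        cuCtxPopCurrent(NULL);     /* detach so a different thread can use it next time */
    }
}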

If you frequently enter and exit parallel sections, this is likely to be so slow that it kills performance (despite being correct no matter how the OpenMP implementation works). You might want to try the simpler approach first and choose to “live dangerously.” :)

Another option would be not to spawn and join, but to embed everything into one long-lived parallel region. The single-threaded parts would then become omp single regions. That would conform to OpenMP, and in practice it should be about what every implementation does anyway.
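
Sketch of what I mean (the time-step loop and nsteps are placeholders):

#include <cuda_runtime.h>
#include <omp.h>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    const int nsteps = 100;  /* placeholder */

    #pragma omp parallel num_threads(ngpus)
    {
        cudaSetDevice(omp_get_thread_num());  /* each thread keeps its context for the whole run */

        for (int step = 0; step < nsteps; ++step) {
            /* multi-GPU work: every thread drives its own device here */

            #pragma omp barrier
            #pragma omp single
            {
                /* formerly serial parts (I/O, bookkeeping, ...) run on one thread */
            }  /* implicit barrier at the end of the single region */
        }
    }
    return 0;
}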

Thanks. After some experiments, I suspected this, since doing the totally naive thing seemed to work, but I wasn’t sure. I agree that the proper thing to do is either use the driver API or maintain one omp parallel section for the lifetime of the code, but both would require some significant restructuring. Since the number of compilers we currently use with CUDA is countable on two fingers, I will likely choose to “live dangerously” for the moment. Thanks for the suggestions!

I’ve had problems with this approach and the MS compiler. If you change the number of threads between parallel sections, it recreates the whole team of threads. Say you have 3 Teslas and 8 CPU cores: part of your work is CPU-only and uses 8 OpenMP threads, and part of it runs on the GPUs, where you use 3 OpenMP threads to control the Teslas. This works perfectly when compiled with GCC, but under MSVC the GPU threads lose their contexts. So you are better off using your own threads to control the GPUs. There is a nice example of how to do it by MisterAnderson42 here: http://forums.nvidia.com/index.php?showtopic=66598
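
(Not his code, but the basic idea is one persistent worker thread per GPU that never gives up its context. A bare-bones pthreads sketch, without a real work queue:)

#include <cuda_runtime.h>
#include <pthread.h>

typedef struct {
    int             device;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             state;  /* 0 = idle, 1 = work pending, -1 = shut down */
} Worker;

static void *worker_main(void *arg)
{
    Worker *w = (Worker *)arg;
    cudaSetDevice(w->device);  /* the context lives on this thread forever */

    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (w->state == 0)
            pthread_cond_wait(&w->cond, &w->lock);
        if (w->state < 0)
            break;  /* shutdown requested */

        /* ... launch kernels / copies for this GPU here ... */

        w->state = 0;
        pthread_cond_signal(&w->cond);  /* tell the submitting thread we are done */
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

The submitting thread just sets state to 1, signals the condition variable, and waits for it to drop back to 0. Since the GPU contexts never move between CPU threads, it no longer matters what MSVC does with the OpenMP team.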
