Concurrent kernels and CPU threads (Not answered question)

I hope that someone from the NVIDIA team can provide an answer to my question; this is really important for the speedup comparison between CPUs and GPUs (if I cannot launch different kernels concurrently from different CPU threads, this will significantly reduce the speedup of using GPUs instead of CPUs for square SIMD programs). Indeed, I think that for concurrent computation on different streams, the kernels must come from the same context. If that is the case, how can each thread acquire and release the context (like a critical section) while the kernels still execute concurrently on the GTX 480?

Thank you again
Lokman

It is technically possible to get concurrent kernel execution and full-speed performance on the GTX 4x0 or C20x0 (480/2050); see the Kappa library at psilambda.com for examples and timing results. The approach that is needed (and used by Kappa) for guaranteed concurrent kernel execution is to prepare the kernels for launch and then issue the asynchronous kernel launches (on separate streams) in batches, with no intervening CUDA API (driver) calls on the host thread associated with the CUDA GPU context (for multiple GPUs, use multiple host thread/context pairings, one pair per GPU). This lets the GPU behave much like a CPU in that the device stays fully occupied as long as it has the resources and the requests to execute (it may actually be better than a CPU, in that there can be less overhead for mixing different instruction requests, give or take cache usage).
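
To make the batching idea concrete, here is a minimal runtime-API sketch (not Kappa itself; the kernel and variable names are made up and error checking is omitted). The point is simply that the asynchronous launches on the different streams go out back-to-back, with no other CUDA call in between, and the host synchronizes only after the whole batch:

```
// Minimal sketch of batched asynchronous launches on separate streams.
// Kernel and variable names are illustrative; error checking is omitted.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 0.999f + 1.0f;   // just burn some time
}

int main()
{
    const int nStreams = 4;
    const int n = 1024;                          // small grids so several kernels can share the GPU
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        cudaMemset(buf[s], 0, n * sizeof(float));
    }

    // The batch: nothing but asynchronous launches between these calls.
    for (int s = 0; s < nStreams; ++s)
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);

    cudaDeviceSynchronize();                     // synchronize only after the whole batch

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    printf("done\n");
    return 0;
}
```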

Wait, cuCtxPopCurrent and cuCtxPushCurrent, both driver calls, will prevent concurrent launches?

I was planning on having a critical section where the context gets transferred to a thread long enough for it to do its async calls, and then the context gets made floating again. If this design pattern prevents concurrent execution, though, I'll have to rethink it. Having a control thread and multiple feeder threads that pass messages to the control thread would then be the other alternative.
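
For what it is worth, here is a minimal sketch of that pattern as I read it (POSIX threads plus the driver API; the names are mine and error checking is omitted; whether the pop/push pair costs you concurrency is exactly the open question in this thread):

```
// Sketch of the "floating context + critical section" pattern.
// Link against the CUDA driver library (-lcuda) and pthreads.
#include <cuda.h>
#include <pthread.h>
#include <stdio.h>

static CUcontext g_ctx;                           // one context, left floating
static pthread_mutex_t g_ctxLock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    CUstream stream;
    CUdeviceptr buf;

    // --- critical section: grab the floating context for this thread ---
    pthread_mutex_lock(&g_ctxLock);
    cuCtxPushCurrent(g_ctx);

    cuStreamCreate(&stream, 0);
    cuMemAlloc(&buf, 1 << 20);
    // Issue this thread's asynchronous work; a real application would queue
    // its own kernel launches (cuLaunchKernel) on `stream` here.
    cuMemsetD32Async(buf, 0, (1 << 20) / 4, stream);

    cuCtxPopCurrent(NULL);                        // make the context floating again
    pthread_mutex_unlock(&g_ctxLock);
    // --- end critical section; the GPU keeps working on `stream` ---

    return NULL;
}

int main(void)
{
    CUdevice dev;
    pthread_t threads[4];

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev);                  // created current on this thread...
    cuCtxPopCurrent(NULL);                        // ...then popped so it floats

    for (int i = 0; i < 4; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(threads[i], NULL);

    // Reattach to wait for the queued work; cuCtxDestroy releases the
    // streams and allocations the workers left behind.
    cuCtxPushCurrent(g_ctx);
    cuCtxSynchronize();
    cuCtxDestroy(g_ctx);

    printf("done\n");
    return 0;
}
```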

My understanding is that any API calls that involve events or synchronization will prevent concurrent kernel execution. My further understanding is that the cuCtxPopCurrent and cuCtxPushCurrent calls are in this class of API calls. Somebody please correct me if either of these is not true.

Of course, the definitive answer is to try it and see (that really even trumps what NVIDIA documentation or employees say, although they are more authoritative about what will be in the future :unsure: ).

cuCtxPush/PopCurrent are not synchronization primitives (except sort of on WDDM).

I know that they are not synchronization primitives; the question is whether they involve any synchronization, or anything else, that prevents concurrent kernel execution. I suppose that, absent a reply from NVIDIA or a clear example, it has to be assumed that they do not prevent concurrent kernel execution. Even better would be if NVIDIA or somebody else had an example demonstrating that they do not prevent it.

They have absolutely nothing to do with synchronization except on WDDM.

Well, it takes some thought and I would definitely want to see a simple test example to come to a definite conclusion, but I do not think that the topic’s “Not answered question” subtitle is fair any more.
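
In that spirit, an untried, hypothetical test might time two small launches on separate streams with and without a cuCtxPopCurrent/cuCtxPushCurrent pair in between; if the pop/push serializes the launches, the second timing should come out at roughly twice the first:

```
// Hypothetical test: does a cuCtxPopCurrent/cuCtxPushCurrent pair between two
// asynchronous launches stop them from overlapping?  Build: nvcc test.cu -lcuda
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles)
        ;                                        // one thread spinning for ~cycles clocks
}

static float timePair(bool popPushBetween, cudaStream_t s1, cudaStream_t s2,
                      long long cycles)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    spinKernel<<<1, 1, 0, s1>>>(cycles);         // tiny grid: leaves room to overlap
    if (popPushBetween) {
        CUcontext ctx;
        cuCtxPopCurrent(&ctx);                   // the calls under suspicion
        cuCtxPushCurrent(ctx);
    }
    spinKernel<<<1, 1, 0, s2>>>(cycles);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

int main()
{
    cuInit(0);
    cudaFree(0);                                 // force runtime context creation
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    long long cycles = 100000000;                // tens of milliseconds on a GTX 480
    spinKernel<<<1, 1>>>(cycles);                // warm-up launch
    cudaDeviceSynchronize();

    float plain = timePair(false, s1, s2, cycles);
    float withPopPush = timePair(true, s1, s2, cycles);
    // Roughly equal times => the kernels overlapped either way;
    // withPopPush ~ 2x plain => the pop/push serialized them.
    printf("no pop/push: %.1f ms, with pop/push: %.1f ms\n", plain, withPopPush);
    return 0;
}
```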

I hope that people keep pursuing different approaches to using concurrent kernels. I have my own K.I.S.S. approach but would like to see other approaches succeed. In my experience, concurrent kernels can make development for the GPU much more general purpose, much more like how we use multiple CPU processor cores.
