Concurrent kernel execution using CUDA graphs

Hi,

Is concurrent kernel execution now possible with CUDA graphs?

After a graph is built manually or captured from streams, it should contain “enough” information to determine which kernels can run at the same time.
Page 20 (graph execution semantics) of this slide deck, http://on-demand.gputechconf.com/gtc-kr/2018/pdf/HPC_Minseok_Lee_NVIDIA.pdf, says “Branches in graph still execute concurrently even though graph is launched into a stream”.
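To be concrete, the fork/join stream-capture pattern I mean is roughly like the sketch below (the kernels and names here are placeholders for illustration, not my real code); the two captured branches should become independent nodes in the graph:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels for illustration only
__global__ void kernelA(float *x) { x[0] += 1.0f; }
__global__ void kernelB(float *x) { x[0] += 2.0f; }

int main() {
  float *dA, *dB;
  cudaMalloc(&dA, sizeof(float));
  cudaMalloc(&dB, sizeof(float));

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaEvent_t fork, join;
  cudaEventCreate(&fork);
  cudaEventCreate(&join);

  // Everything between Begin/EndCapture is recorded into the graph
  // instead of executing immediately.
  cudaGraph_t graph;
  cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
  cudaEventRecord(fork, s1);
  cudaStreamWaitEvent(s2, fork, 0);   // fork: s2 joins the capture
  kernelA<<<1, 32, 0, s1>>>(dA);      // branch 1
  kernelB<<<1, 32, 0, s2>>>(dB);      // branch 2
  cudaEventRecord(join, s2);
  cudaStreamWaitEvent(s1, join, 0);   // join back into s1
  cudaStreamEndCapture(s1, &graph);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, s1);          // single launch stream
  cudaStreamSynchronize(s1);
  return 0;
}
```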

I tested some code on a Kepler GPU (GTX 680) using the latest CUDA 10.1, and did not see any concurrent kernel execution. In fact, a previous multi-stream application that did exhibit concurrent kernel execution became serialized after I added the graph APIs (using stream capture), as seen in NVVP.

If I build a graph manually, can CUDA graphs automatically produce a multi-stream implementation?
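By “manually build” I mean something like the following sketch (again with placeholder kernels), where the two kernel nodes are added with no dependency between them, so the graph itself encodes that they may run concurrently even though it is launched into a single stream:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels for illustration only
__global__ void kernelA(int *out) { out[0] = 1; }
__global__ void kernelB(int *out) { out[0] = 2; }

int main() {
  int *dA, *dB;
  cudaMalloc(&dA, sizeof(int));
  cudaMalloc(&dB, sizeof(int));

  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  // Node A
  void *argsA[] = { &dA };
  cudaKernelNodeParams pA = {};
  pA.func         = (void *)kernelA;
  pA.gridDim      = dim3(1);
  pA.blockDim     = dim3(1);
  pA.kernelParams = argsA;

  // Node B: same shape, different kernel and argument
  void *argsB[] = { &dB };
  cudaKernelNodeParams pB = pA;
  pB.func         = (void *)kernelB;
  pB.kernelParams = argsB;

  // No dependencies passed for either node, so neither waits on the other.
  cudaGraphNode_t nA, nB;
  cudaGraphAddKernelNode(&nA, graph, nullptr, 0, &pA);
  cudaGraphAddKernelNode(&nB, graph, nullptr, 0, &pB);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
  cudaGraphLaunch(exec, 0);  // one launch stream; branches may still overlap
  cudaDeviceSynchronize();
  return 0;
}
```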

Thanks in advance.

Yes, concurrent kernel execution is possible using graphs; it has been possible since graphs were first introduced. I’m unable to say why you didn’t witness it for code you haven’t shown.

Witnessing kernel concurrency can be difficult in any setting. Kernel design itself can preclude concurrent execution.
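For example, a kernel whose grid is large enough to occupy every SM leaves no room for a second kernel to run alongside it, regardless of graphs or streams. When I want to make overlap visible in a profiler, I use small, long-running kernels along these lines (an illustrative sketch, not your code):

```cuda
// Spin for a fixed number of clock cycles. The tiny grid (one block,
// one thread) leaves almost every SM free, so a second instance can
// run concurrently if the scheduler allows it.
__global__ void spin(long long cycles) {
  long long start = clock64();
  while (clock64() - start < cycles) { /* busy-wait */ }
}

// Launching two of these into independent branches (or streams), e.g.
//   spin<<<1, 1, 0, s1>>>(100000000LL);
//   spin<<<1, 1, 0, s2>>>(100000000LL);
// gives the hardware room to overlap them. Kernels that saturate the
// GPU on their own will appear serialized in the profiler no matter
// how the dependencies are expressed.
```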