Is concurrent kernel execution possible with CUDA graphs?
After a graph is built manually or captured by streams, it should have “enough” information to determine which kernels can run at the same time.
Page 20 (graph execution semantics) of this slide deck, http://on-demand.gputechconf.com/gtc-kr/2018/pdf/HPC_Minseok_Lee_NVIDIA.pdf, says "Branches in graph still execute concurrently even though graph is launched into a stream".
I tested some code on a Kepler GPU (GTX 680) with the latest cuda-10.1, and I did not observe any concurrent kernel execution. In fact, a previous multi-stream application that did show concurrent kernel execution became serialized after I added the graph APIs (using stream capture), as seen in NVVP.
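My test looked roughly like the following sketch: two independent kernels captured on streams forked from the capture stream, so they should end up as parallel branches of the graph. The kernel `busy` and the launch sizes are placeholders I'm using here for illustration, not the exact code from my application.

```cuda
// Minimal sketch (placeholder kernel/sizes): fork/join stream capture so
// two kernels land on independent branches of one CUDA graph.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busy(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            p[i] = p[i] * 1.0001f + 1.0f;  // keep the SMs occupied a while
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t fork, join;
    cudaEventCreate(&fork);
    cudaEventCreate(&join);

    // Fork s2 from the capture stream s1 so both kernels are captured
    // into the same graph as independent (potentially concurrent) branches.
    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
    cudaEventRecord(fork, s1);
    cudaStreamWaitEvent(s2, fork, 0);
    busy<<<n / 256, 256, 0, s1>>>(a, n);
    busy<<<n / 256, 256, 0, s2>>>(b, n);
    cudaEventRecord(join, s2);
    cudaStreamWaitEvent(s1, join, 0);  // join s2 back before ending capture
    cudaGraph_t graph;
    cudaStreamEndCapture(s1, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);
    cudaGraphLaunch(exec, s1);  // the whole graph is launched into one stream
    cudaStreamSynchronize(s1);
    printf("done\n");
    return 0;
}
```

Even with this fork/join structure, NVVP shows the two kernel nodes running back to back rather than overlapping on my GTX 680.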
Also, if I build a graph manually with the graph APIs, will CUDA graphs automatically produce a multi-stream execution for the independent branches?
Thanks in advance.