I am using CUPTI and noticed that cuCtxCreate takes around 14 seconds to create a new context for CUPTI to use in another thread.
If I have a separate CPU thread, do I need to create a new CUDA context for it? The context is used to activate event groups.
Could you clarify this, and how can I reduce the latency of cuCtxCreate?
That’s a lot. Do you mean cuCtxCreate takes longer when the application is profiled with CUPTI? Do you observe the same behavior outside CUPTI? Which GPU and CUDA toolkit are you using?
It’s not required to create a separate CUDA context for event profiling. The CUDA context created by the application can be used. See the CUPTI sample callback_event for how this can be done.
Note that all events and metrics except NVLink metrics are collected at the context level. That means events configured for a context can profile kernels running in that same context; they can’t observe other contexts.
I have put a barrier before the MPI ranks start to use the GPU, and I create the context on a different thread. It’s the Longhorn system. NVIDIA-SMI 440.33.01 - Driver Version: 440.33.01 CUDA Version: 10.2
I want to get the NVLink metrics at the user level (not total bytes sent). That’s why I am creating a new context. In CUDA 10 I noticed that I need to create a new context and cannot push and pop the context to reuse it for profiling. Has anything changed? Should I use cuDevicePrimaryCtxRetain?