I found that there is a lock contended between threads while launching kernels into the GPU’s execution engine queue.
However, in the figure, while “max_pool_forward_nchw” and “generateWinogradTilesKernel” were waiting to obtain the lock, the checked kernel (the sky-blue kernel) obtained the lock first and completed its launch.
I am using a Quadro RTX 6000 and using multithreading to run torchvision’s Densenet201, Resnet152, Alexnet, and vgg16 simultaneously, one model per thread.
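To make the setup concrete, here is a minimal stdlib-only sketch of the one-model-per-thread structure described above. `run_inference` is a hypothetical stand-in for the real per-thread work (building a torchvision model and running it on the GPU); only the model names come from the post.

```python
import threading

def run_inference(model_name, results):
    # Hypothetical stand-in: in the real application this thread would
    # build the named torchvision model and run inference on the GPU,
    # so every kernel launch from this thread goes through the driver.
    results[model_name] = f"{model_name} done"

models = ["densenet201", "resnet152", "alexnet", "vgg16"]
results = {}
threads = [threading.Thread(target=run_inference, args=(m, results))
           for m in models]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

With this structure, all four threads submit kernel launches to the same device concurrently, which is where the contention in the profile comes from.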
In addition, the figure is a snapshot of my application profiled with Nsight Systems.
Only the kernel launcher’s process is shown in the figure.
Regardless of kernel launch details, only one kernel can be launched at a time per device.
With multithreading, if one thread is launching a kernel and another thread requests a launch, the second kernel has to wait until its thread acquires the mutex lock.
Am I right?
What I’m curious about is the order in which the mutex lock is obtained.
In the profiler, the longest-waiting kernel does not appear to be the first to get the mutex lock; instead, the lock seems to be granted in a seemingly random order before the kernel is launched.
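This matches how ordinary mutexes behave in general: when several waiters are blocked on a lock, the wakeup order is decided by the OS/runtime and is not guaranteed to be FIFO. A small stdlib sketch (not the CUDA driver’s actual lock, which is unspecified) can record the order in which staggered threads actually win a contended `threading.Lock`:

```python
import threading
import time

lock = threading.Lock()
acquired_order = []

def worker(tid, start_delay):
    # Stagger arrival so the threads queue on the lock in a known arrival order.
    time.sleep(start_delay)
    with lock:
        acquired_order.append(tid)
        time.sleep(0.01)  # hold the lock briefly so later arrivals must wait

# Hold the lock first so every worker blocks on acquire.
lock.acquire()
threads = [threading.Thread(target=worker, args=(i, 0.01 * i)) for i in range(4)]
for t in threads:
    t.start()
time.sleep(0.2)  # let all workers reach the contended acquire
lock.release()   # the scheduler now picks the winner -- not necessarily thread 0
for t in threads:
    t.join()

print(acquired_order)  # arrival order was 0,1,2,3; wakeup order may differ
```

Every thread eventually gets the lock exactly once, but the order they get it in is scheduler-dependent, which is consistent with the apparently random launch order seen in the profile.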
That looks like a “kernel launch detail” to me. I’m sure you can study and reverse-engineer some of it if you want. I’m not able to comment on it directly. As far as I know, it is not specified.
If you observe that the longest-waiting kernel is somehow not the next one to be processed, doesn’t that mean that FIFO is not a very good description of the process?