Multithreading has a lock problem

I found that there is a lock contended between threads while launching kernels to the GPU’s execution engine queue.

However, in the figure, while “max_pool_forward_nchw” and “generateWinogradTilesKernel” were waiting to obtain the lock, the checked kernel (the sky-blue one) obtained the lock first and completed its launch.

How is this possible?

Doesn’t the mutex queue follow FIFO order?

I don’t think details of kernel launch processing are specified anywhere.

Wouldn’t a strict FIFO guarantee be a bad idea if you wanted to enable out-of-order processing, such as what you might have with CUDA streams?

I am using a Quadro RTX 6000 and using multithreading to run Torchvision’s Densenet201, Resnet152, Alexnet, and vgg16 simultaneously, one model per thread.
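For reference, here is a minimal sketch of the kind of setup I mean (not my exact code; the input shape and iteration count are just placeholders): four Python threads, each running forward passes of a different Torchvision model on the same GPU.

```python
# Minimal sketch of the setup described above (input shape / iteration count are placeholders).
import threading
import torch
import torchvision.models as models

def run_model(name, net, iterations=10):
    net = net.cuda().eval()
    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        for _ in range(iterations):
            net(x)            # each forward pass launches many kernels from this thread
    torch.cuda.synchronize()
    print(name, "finished")

nets = {
    "densenet201": models.densenet201(),
    "resnet152": models.resnet152(),
    "alexnet": models.alexnet(),
    "vgg16": models.vgg16(),
}

threads = [threading.Thread(target=run_model, args=(n, m)) for n, m in nets.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```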

In addition, the figure is a snapshot from one point in time while my application was being profiled with Nsight Systems.

Only the kernel launch processing is shown in the figure.

Regardless of kernel launch details, only one kernel can be launched at a time per device.

With multithreading, if one kernel is being launched and another thread requests a launch, the second kernel has to wait until its thread obtains the ‘mutex lock’.

Am I right?

What I’m curious about is the order in which the ‘mutex lock’ is obtained.

In the profiler, the longest-waiting kernel does not seem to be the first to get the ‘mutex lock’; instead, the lock appears to be granted in a seemingly random order, and then that kernel launches.

Shouldn’t this be first-in, first-out?

That looks like “kernel launch detail” to me. I’m sure you can study and reverse-engineer some of it if you want. I’m not able to comment on it directly. As far as I know it is not specified.

If you observe that the longest-waiting kernel is somehow not the next one to be processed, doesn’t that mean that FIFO is not a very good description of the process?
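As a general illustration only (a CPU-only Python sketch, with the assumption, purely for illustration, that the launch path behaves like an ordinary mutex, which as stated above is not specified anywhere): a plain mutex makes no FIFO guarantee, so the order in which threads begin waiting need not match the order in which they acquire the lock.

```python
# Illustration of ordinary mutex semantics, NOT of CUDA's actual internal launch lock.
import threading
import time

launch_lock = threading.Lock()   # stand-in for a hypothetical per-device launch lock
wait_order, acquire_order = [], []
record = threading.Lock()        # protects the two bookkeeping lists

def worker(tid):
    with record:
        wait_order.append(tid)   # this thread is now about to wait for the launch lock
    with launch_lock:            # which waiter wins next is up to the OS scheduler
        with record:
            acquire_order.append(tid)
        time.sleep(0.001)        # hold the lock briefly, like one launch being processed

launch_lock.acquire()            # hold the lock so every worker queues up behind it
threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
time.sleep(0.1)                  # give all workers time to reach the wait
launch_lock.release()
for t in threads:
    t.join()

print("began waiting :", wait_order)
print("acquired lock :", acquire_order)   # may differ: no FIFO order is guaranteed
```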

If I use MPS instead of multithreading, can I avoid these delays?
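To clarify what I mean by that (a sketch only; whether this actually removes the launch-side waiting is exactly my question): with MPS, each model would run in its own process instead of its own thread, with the MPS control daemon started beforehand (e.g. `nvidia-cuda-mps-control -d`), so the processes share the GPU through the MPS server.

```python
# Sketch of the multi-process variant that MPS would serve (structure is illustrative).
import torch
import torch.multiprocessing as mp
import torchvision.models as models

def run_model(name, iterations=10):
    net = getattr(models, name)().cuda().eval()
    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        for _ in range(iterations):
            net(x)
    torch.cuda.synchronize()

if __name__ == "__main__":
    mp.set_start_method("spawn")   # each process must initialize CUDA on its own
    procs = [mp.Process(target=run_model, args=(n,))
             for n in ("densenet201", "resnet152", "alexnet", "vgg16")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```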