Implicit synchronization in host API calls: cudaLaunch and cudaMemcpyAsync?

I am currently trying to add CUDA computations to a multithreaded application that mainly runs two types of computation: one on a large dataset (2.6 MB) and one on a small dataset (330 KB).

There is basically one OS thread for the large computation and one OS thread for the small ones, and each of them implements the following model: copy to device -> call kernel -> copy to host
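For readers who want to reproduce the setup, here is a minimal sketch of that model, not the application's actual code. The kernel `scale`, the helpers `run_pipeline` and `run_both`, and the use of pinned host memory are all assumptions made for illustration; only the thread/stream layout and the buffer sizes come from the description above.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <thread>

// Hypothetical kernel standing in for the real computation: doubles each element.
__global__ void scale(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// One worker: its own stream, then copy to device -> kernel -> copy to host.
// Returns true if the stream completed without error and the result is as expected.
bool run_pipeline(size_t n) {
    const size_t bytes = n * sizeof(float);
    float* host = nullptr;
    float* dev = nullptr;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    // Assumption: pinned host memory, so that cudaMemcpyAsync can return
    // without waiting for the transfer (pageable memory may make it block).
    if (cudaMallocHost(&host, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return false;
    }
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaFreeHost(host);
        cudaStreamDestroy(stream);
        return false;
    }
    for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<blocks, threads, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Wait for this stream only, then check the first and last elements.
    bool ok = cudaStreamSynchronize(stream) == cudaSuccess;
    ok = ok && host[0] == 2.0f && host[n - 1] == 2.0f;

    cudaFree(dev);
    cudaFreeHost(host);
    cudaStreamDestroy(stream);
    return ok;
}

// Two OS threads, each driving its own stream: one large and one small workload.
bool run_both() {
    bool large_ok = false, small_ok = false;
    std::thread large([&] { large_ok = run_pipeline(2600 * 1024 / sizeof(float)); });
    std::thread small([&] { small_ok = run_pipeline(330 * 1024 / sizeof(float)); });
    large.join();
    small.join();
    return large_ok && small_ok;
}
```

Calling `run_both()` from `main` and watching the two threads in the profiler should give a timeline comparable to the one discussed in this thread, although the timings will depend on the hardware and driver.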

When the “little” thread makes its API calls alone, it is “quite” fast: around 150 µs for each copy and 60 µs for the kernel launch.
The first thing I don’t understand is why the API calls are so slow: meanwhile, in the GPU profiling window, the kernel executes in only 10 µs and the copy is performed in only 30 µs.
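To see the two numbers side by side without the profiler, one option is to time the host-side call with a CPU clock and the GPU-side work with CUDA events. The sketch below is only an illustration under assumptions: `noop_kernel`, `Timing`, and `time_launch` are hypothetical names, and the event interval also includes some event overhead, so it is an upper bound on the kernel time rather than an exact figure.

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Hypothetical empty kernel: we only care about launch cost here.
__global__ void noop_kernel() {}

struct Timing {
    double host_us;  // time spent inside the launch call on the CPU
    float gpu_us;    // time between the two events on the GPU
    bool ok;         // true if the measurement completed without error
};

Timing time_launch() {
    Timing t{0.0, 0.0f, false};
    cudaStream_t stream;
    cudaEvent_t start, stop;
    if (cudaStreamCreate(&stream) != cudaSuccess) return t;
    if (cudaEventCreate(&start) != cudaSuccess) { cudaStreamDestroy(stream); return t; }
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        cudaStreamDestroy(stream);
        return t;
    }

    // Warm-up launch so one-time initialization is not counted below.
    noop_kernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);

    cudaEventRecord(start, stream);
    auto t0 = std::chrono::steady_clock::now();
    noop_kernel<<<1, 1, 0, stream>>>();          // host-side API call being timed
    auto t1 = std::chrono::steady_clock::now();
    cudaEventRecord(stop, stream);

    t.ok = cudaEventSynchronize(stop) == cudaSuccess;
    float ms = 0.0f;
    t.ok = t.ok && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;
    t.gpu_us = ms * 1000.0f;
    t.host_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    return t;
}
```

Comparing `host_us` with `gpu_us` over many iterations shows how much of the cost is driver and launch overhead on the CPU side rather than work on the GPU.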

But when both the little and large threads make their calls at the same time, although they target different streams, the little thread's API calls take even longer (1000 µs), just as if they were waiting for the large thread's API calls to finish (cudaMemcpyAsync in the picture).

I was hoping someone could tell me whether this situation is an occurrence of the implicit synchronization that is described here

And what are the workarounds? I have already reduced the number of threads in my application to the minimum, and that helped me save a bit of time on other calls.

I recreated a very simple example that shows a strange situation where launching two workloads sequentially in a single OS thread and a single GPU stream is a lot faster than launching two workloads from two CPU threads into two GPU streams:

the code:

Looks like both threads are busy-waiting for the kernel to finish. Could it be the CPU load?
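If busy-waiting is the suspect, one thing that may be worth trying is asking the runtime to block the host thread on synchronization instead of spinning. This is only a sketch, not a confirmed fix for this case: `touch`, `enable_blocking_sync`, and `run_and_wait` are hypothetical names, and on older CUDA versions the flag has to be set before the first call that creates the context.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes a marker so we can check that it ran.
__global__ void touch(int* out) { *out = 1; }

// Ask the runtime to put the calling thread to sleep while it waits on the GPU,
// rather than spinning on a CPU core. Call this before other CUDA work.
bool enable_blocking_sync() {
    return cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) == cudaSuccess;
}

// Launch a kernel and wait for it; with the flag above, the wait should not
// burn CPU time that the other thread could use.
bool run_and_wait() {
    int* dev = nullptr;
    int host = 0;
    if (cudaMalloc(&dev, sizeof(int)) != cudaSuccess) return false;
    touch<<<1, 1>>>(dev);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(&host, dev, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok && host == 1;
}
```

Blocking synchronization usually trades a little wake-up latency for lower CPU load, so it is worth measuring both ways before keeping it.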

I finally understood what my problem was: in my first example, I had not noticed that each time the first OS thread's cudaLaunch API call was blocked, a memcpy from the other OS thread was in progress at the same time.

I assume that while the PCIe bus is transferring data, it cannot let the launch command through, which is why there was a “synchronization”.

The second example I gave was in fact different: the two-thread version was slower, most likely because of CPU load or overly “nice” thread scheduling, as pruby said.

By launching my second example as root, this is what I got: