What could cause kernel execution to be serialized on two separate streams? I’m using CUDA 8.0 on a K40c with MPS running, and there are enough resources on the device to accommodate both kernels, but I’m not observing any overlap. I create two streams in two different host threads and launch one kernel in each stream, without doing any work on the default stream. I tried compiling with and without `--default-stream per-thread`, and both resulted in the same behavior. Running with and without profiling also gives similar timings. The following is the timeline from nvvp (the purple kernel belongs to another program; I’m concerned with the overlap of the instances of the green kernel):
The yellow rectangle in the Runtime API row of the host thread launching the second kernel shows the launch of the kernel. However, it’s not until the first kernel finishes that the second one starts execution (that tiny line in the bottom right of the image in stream 16-11152). Both launches are instances of the same kernel and are supposed to work on the same data, so when the first one finishes there’s no work left for the second one to do and it finishes very quickly. They grab blocks of work by atomically adding to the same global memory location.
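To make the setup concrete, here is a minimal sketch of the pattern I described, with placeholder names and sizes (`worker`, `work_counter`, the grid/block dimensions, and the "work" itself are all illustrative, not my actual code): each host thread creates its own non-default stream and launches one instance of the same kernel, and blocks grab work by atomically incrementing a shared global counter.

```cuda
#include <cuda_runtime.h>
#include <thread>

__device__ unsigned int work_counter;  // shared global work index

// Placeholder kernel: threads repeatedly grab the next work item by
// atomically incrementing a global counter until all work is consumed.
__global__ void worker(float *data, unsigned int total)
{
    for (;;) {
        unsigned int i = atomicAdd(&work_counter, 1u);
        if (i >= total) return;   // no work left
        data[i] += 1.0f;          // placeholder work on the shared data
    }
}

// Each host thread creates its own stream and launches one kernel in it.
static void launch_in_thread(float *d_data, unsigned int total)
{
    cudaStream_t s;
    cudaStreamCreate(&s);                  // non-default stream
    worker<<<52, 128, 0, s>>>(d_data, total);  // sizes are illustrative
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}

int main()
{
    const unsigned int total = 1u << 20;
    float *d_data;
    cudaMalloc(&d_data, total * sizeof(float));
    cudaMemset(d_data, 0, total * sizeof(float));

    unsigned int zero = 0;
    cudaMemcpyToSymbol(work_counter, &zero, sizeof(zero));

    std::thread t1(launch_in_thread, d_data, total);
    std::thread t2(launch_in_thread, d_data, total);
    t1.join();
    t2.join();

    cudaFree(d_data);
    return 0;
}
```

With this structure I would expect the two `worker` instances to run concurrently in their two streams (resources permitting), splitting the work between them via the atomic counter; instead, the timeline shows the second instance waiting until the first one has drained all the work.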
Thanks in advance for your help.