What do the gaps in the nvvp timeline mean? And how can I shrink the gap size?

As the graph shows, there are gaps between kernel executions. I guess they correspond to time consumed by CPU code, but my CPU code between the kernel calls is not very complicated.

I also found that the gaps are nearly uniform. Is there a minimum time gap between adjacent kernel calls? How can I shrink the gap?

Is it possible your application is consuming CPU time in between kernel launches?

There is definitely some time used by CUDA Runtime API calls, but not enough to fully account for those gaps.

Have you used a profiler such as VTUNE to verify that assumption?

There is not nearly enough detail provided here to come to any firm conclusions. With the speed of GPUs these days, it is not an uncommon occurrence for GPU-accelerated applications to become (partially) bottlenecked on serial host code [¹]. Possible remedies:

(1) Move more of the host code to the GPU
(2) Aggressively optimize the host code
(3) Use a faster host system

Note: If you are using a Windows system with the (default) WDDM driver, CUDA performance artifacts are to be expected and unfortunately, pretty much unavoidable. If possible, switch to the TCC driver, or run on a Linux system to avoid this issue.

[¹] A recent example:
Acun, B., D. J. Hardy, Laxmikant V. Kale, K. Li, J. C. Phillips, and J. E. Stone. “Scalable molecular dynamics with NAMD on the Summit system.” IBM Journal of Research and Development 62, no. 6 (2018): 4-1.

Your suggestion is very useful. The bottleneck is indeed CPU code and I/O. The unified memory was copied between CPU and GPU many times, and the gaps correspond to those transfers. After I changed the unified memory to device memory, the gaps became much shorter.

However, there is still time consumed by data transfers. The data is only 4 bytes, namely the length of an array, which has to be transferred between host and device. On the host it is used to allocate memory for the next kernel launch, and on the device it is set while the kernel is running. I want to reduce the number of memcpy calls, or even eliminate them. Could you give me some advice?
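Roughly, the pattern looks like this (a minimal sketch with made-up kernel and variable names, not my actual code):

```
// Sketch of the round trip: the kernel writes the element count to d_len,
// the host copies those 4 bytes back, allocates a buffer of that size,
// and only then can it launch the next kernel.
#include <cuda_runtime.h>

__global__ void count_kernel(int *d_len) {
    // ... real work that decides how many elements the next stage needs ...
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_len = 1024;  // placeholder
}

__global__ void next_kernel(float *d_buf, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) d_buf[i] = 0.0f;  // placeholder work
}

int main() {
    int *d_len = nullptr;
    cudaMalloc((void**)&d_len, sizeof(int));

    count_kernel<<<1, 32>>>(d_len);

    int h_len = 0;
    // This 4-byte device-to-host copy stalls the host until the kernel is done.
    cudaMemcpy(&h_len, d_len, sizeof(int), cudaMemcpyDeviceToHost);

    float *d_buf = nullptr;
    cudaMalloc((void**)&d_buf, h_len * sizeof(float));   // host needs h_len here
    next_kernel<<<(h_len + 255) / 256, 256>>>(d_buf, h_len);

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaFree(d_len);
    return 0;
}
```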

Is that 4-byte buffer that holds the length of the array allocated in page-locked memory? That would speed things up a bit…
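Something along these lines for the staging variable (a rough sketch, since I don't know your actual code):

```
// Allocate the 4-byte staging variable in page-locked (pinned) host memory
// once, up front. The driver can then DMA straight into it instead of going
// through an internal pageable staging buffer.
#include <cuda_runtime.h>

int main() {
    int *h_len = nullptr;
    cudaMallocHost((void**)&h_len, sizeof(int));   // pinned host allocation

    int *d_len = nullptr;
    cudaMalloc((void**)&d_len, sizeof(int));

    // ... launch the kernel that writes *d_len ...

    // Copies to/from pinned memory skip the extra pageable staging copy,
    // and pinned memory is also required for cudaMemcpyAsync to be truly
    // asynchronous.
    cudaMemcpy(h_len, d_len, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_len);
    cudaFreeHost(h_len);
    return 0;
}
```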

Yes, it gets a bit better. I think the bandwidth is not the bottleneck, but the overhead of each call is. So it is hard to reduce the time consumed by a single memcpy call. I have to figure out how to reduce the number of memcpy calls.
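The only idea I have so far is mapped (zero-copy) pinned memory, so the kernel writes the length directly into host-visible memory and the explicit memcpy call disappears. Roughly what I mean (made-up names, and I have not verified that it helps on my card):

```
// Sketch: zero-copy (mapped) pinned memory for the 4-byte length. The kernel
// writes directly into host-visible memory, so no explicit cudaMemcpy is
// needed afterwards -- only a synchronization so the host sees the value.
#include <cuda_runtime.h>

__global__ void count_kernel(int *len) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *len = 1024;  // placeholder
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation

    int *h_len = nullptr;
    cudaHostAlloc((void**)&h_len, sizeof(int), cudaHostAllocMapped);

    int *d_len = nullptr;                    // device-side alias of h_len
    cudaHostGetDevicePointer((void**)&d_len, h_len, 0);

    count_kernel<<<1, 32>>>(d_len);
    cudaDeviceSynchronize();                 // make the kernel's write visible

    int next_size = *h_len;                  // read directly, no cudaMemcpy
    (void)next_size;

    cudaFreeHost(h_len);
    return 0;
}
```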

I tried to use an asynchronous memcpy in a different stream to overlap I/O with kernel execution, but on my GPU (GTX 1030) it does not run as I expected.
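Roughly what I tried, simplified (not my exact code; the event is there so the copy sees the kernel's result):

```
// Simplified sketch of the attempted overlap: kernel in one stream, the
// 4-byte copy issued in another stream, with an event so the copy only
// starts once the kernel has produced the value it needs.
#include <cuda_runtime.h>

__global__ void count_kernel(int *d_len) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_len = 1024;  // placeholder
}

int main() {
    int *d_len = nullptr;
    cudaMalloc((void**)&d_len, sizeof(int));
    int *h_len = nullptr;
    cudaMallocHost((void**)&h_len, sizeof(int));   // pinned, so the copy can be async

    cudaStream_t s_kernel, s_copy;
    cudaStreamCreate(&s_kernel);
    cudaStreamCreate(&s_copy);
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    count_kernel<<<1, 32, 0, s_kernel>>>(d_len);
    cudaEventRecord(done, s_kernel);

    // The copy sits in a different stream, but it still has to wait for the
    // kernel, because it copies the value that kernel writes -- so it can
    // never overlap with that same kernel.
    cudaStreamWaitEvent(s_copy, done, 0);
    cudaMemcpyAsync(h_len, d_len, sizeof(int), cudaMemcpyDeviceToHost, s_copy);
    cudaStreamSynchronize(s_copy);

    cudaEventDestroy(done);
    cudaStreamDestroy(s_kernel);
    cudaStreamDestroy(s_copy);
    cudaFreeHost(h_len);
    cudaFree(d_len);
    return 0;
}
```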

I think there is at least one reason: the kernel and the memcpy have to join at the end of each cycle. That is,

(KERNEL | I/O) || (KERNEL | I/O) || ...

So while the kernel is running, the memcpyAsync is ready to start, but by the time the memcpyAsync starts, the kernel has already reached its end. Is there a theoretical explanation for this? I can only see it in the nvvp graph.

My other guess is that the kernel updates the device memory that will be copied to the host later, so when the memcpyAsync tries to copy that device memory, it always finds it locked until the kernel finishes, which would also explain why the memcpyAsync always runs after the kernel has finished.
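If that is the reason, then I suppose the copy can only overlap with a kernel that works on a different buffer, something like this double-buffered sketch (made-up names, and it only applies if consecutive iterations do not need each other's lengths):

```
// Sketch: double buffering, so the device-to-host copy of iteration i's
// result overlaps with the kernel of iteration i+1, which writes a
// *different* buffer in a *different* stream.
#include <cuda_runtime.h>

__global__ void produce(int *d_len, int iter) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_len = 1000 + iter;  // placeholder
}

int main() {
    const int ITERS = 8;
    int *d_len[2] = {nullptr, nullptr};
    int *h_len = nullptr;
    cudaMalloc((void**)&d_len[0], sizeof(int));
    cudaMalloc((void**)&d_len[1], sizeof(int));
    cudaMallocHost((void**)&h_len, ITERS * sizeof(int));   // pinned destination

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < ITERS; ++i) {
        int b = i & 1;                          // alternate buffer/stream
        produce<<<1, 32, 0, s[b]>>>(d_len[b], i);
        // Same stream as its producer kernel, so correctly ordered after it,
        // but free to overlap with the other stream's kernel.
        cudaMemcpyAsync(&h_len[i], d_len[b], sizeof(int),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h_len);
    cudaFree(d_len[0]);
    cudaFree(d_len[1]);
    return 0;
}
```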