What do the gaps in the nvvp timeline mean? And how can I shrink the gap size?

As the graph shows, there are gaps between kernel executions. I guess they correspond to time consumed by CPU code, but my CPU code between the kernel calls is not very complicated.

I also found that the gaps are nearly uniform. Is there a minimum time gap between adjacent kernel calls? How can I shrink the gap?

Is it possible your application is consuming CPU time in between kernel launches?

There is definitely some time used by CUDA Runtime API calls, but not enough to fully account for those gaps.

Have you used a profiler such as VTUNE to verify that assumption?

There is not nearly enough detail provided here to come to any firm conclusions. With the speed of GPUs these days, it is not an uncommon occurrence for GPU-accelerated applications to become (partially) bottlenecked on serial host code [¹]. Possible remedies:

(1) Move more of the host code to the GPU
(2) Aggressively optimize the host code
(3) Use a faster host system

Note: If you are using a Windows system with the (default) WDDM driver, CUDA performance artifacts are to be expected and unfortunately, pretty much unavoidable. If possible, switch to the TCC driver, or run on a Linux system to avoid this issue.

[¹] A recent example:
Acun, B., D. J. Hardy, Laxmikant V. Kale, K. Li, J. C. Phillips, and J. E. Stone. “Scalable molecular dynamics with NAMD on the Summit system.” IBM Journal of Research and Development 62, no. 6 (2018): 4-1.

Your suggestion is very useful. The bottleneck is indeed CPU code and I/O. The unified memory was copied between CPU and GPU many times, and the gaps correspond to those transfers. After I changed the unified memory to device memory, the gaps became much shorter.

However, there is still time consumed by data transfers. The data is only 4 bytes, namely the length of an array, which has to be transferred between host and device. On the host it is used to allocate memory for the next kernel launch, and on the device it is set while the kernel is running. I want to reduce the number of memcpy calls, or even eliminate them. Could you give me some advice?
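Roughly, the pattern looks like this (a minimal sketch with made-up kernel and variable names, not my actual code):

```
// Sketch of the round trip: the kernel writes the element count to d_len,
// the host copies those 4 bytes back, allocates a buffer of that size,
// and only then can it launch the next kernel.
#include <cuda_runtime.h>

__global__ void count_kernel(int *d_len) {
    // ... real work that decides how many elements the next stage needs ...
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_len = 1024;  // placeholder
}

__global__ void next_kernel(float *d_buf, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) d_buf[i] = 0.0f;  // placeholder work
}

int main() {
    int *d_len = nullptr;
    cudaMalloc((void**)&d_len, sizeof(int));

    count_kernel<<<1, 32>>>(d_len);

    int h_len = 0;
    // This 4-byte device-to-host copy stalls the host until the kernel is done.
    cudaMemcpy(&h_len, d_len, sizeof(int), cudaMemcpyDeviceToHost);

    float *d_buf = nullptr;
    cudaMalloc((void**)&d_buf, h_len * sizeof(float));   // host needs h_len here
    next_kernel<<<(h_len + 255) / 256, 256>>>(d_buf, h_len);

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaFree(d_len);
    return 0;
}
```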

Is that 4-byte buffer that holds the length of the array allocated in page-locked memory? That would speed things up a bit…
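Something along these lines for the staging variable (a rough sketch, since I don't know your actual code):

```
// Allocate the 4-byte staging variable in page-locked (pinned) host memory
// once, up front. The driver can then DMA straight into it instead of going
// through an internal pageable staging buffer.
#include <cuda_runtime.h>

int main() {
    int *h_len = nullptr;
    cudaMallocHost((void**)&h_len, sizeof(int));   // pinned host allocation

    int *d_len = nullptr;
    cudaMalloc((void**)&d_len, sizeof(int));

    // ... launch the kernel that writes *d_len ...

    // Copies to/from pinned memory skip the extra pageable staging copy,
    // and pinned memory is also required for cudaMemcpyAsync to be truly
    // asynchronous.
    cudaMemcpy(h_len, d_len, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_len);
    cudaFreeHost(h_len);
    return 0;
}
```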

Yes, it gets a bit better. I think the bandwidth is not the bottleneck, but the overhead of each call is. So it is hard to reduce the time consumed by a single memcpy call. I have to figure out how to reduce the number of memcpy calls.
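The only idea I have so far is mapped (zero-copy) pinned memory, so the kernel writes the length directly into host-visible memory and the explicit memcpy call disappears. Roughly what I mean (made-up names, and I have not verified that it helps on my card):

```
// Sketch: zero-copy (mapped) pinned memory for the 4-byte length. The kernel
// writes directly into host-visible memory, so no explicit cudaMemcpy is
// needed afterwards -- only a synchronization so the host sees the value.
#include <cuda_runtime.h>

__global__ void count_kernel(int *len) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *len = 1024;  // placeholder
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation

    int *h_len = nullptr;
    cudaHostAlloc((void**)&h_len, sizeof(int), cudaHostAllocMapped);

    int *d_len = nullptr;                    // device-side alias of h_len
    cudaHostGetDevicePointer((void**)&d_len, h_len, 0);

    count_kernel<<<1, 32>>>(d_len);
    cudaDeviceSynchronize();                 // make the kernel's write visible

    int next_size = *h_len;                  // read directly, no cudaMemcpy
    (void)next_size;

    cudaFreeHost(h_len);
    return 0;
}
```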

I tried to use an asynchronous memcpy in a different stream to overlap I/O with kernel execution, but on my GPU (GTX 1030) it does not run as I expected.
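Roughly what I tried, simplified (not my exact code; the event is there so the copy sees the kernel's result):

```
// Simplified sketch of the attempted overlap: kernel in one stream, the
// 4-byte copy issued in another stream, with an event so the copy only
// starts once the kernel has produced the value it needs.
#include <cuda_runtime.h>

__global__ void count_kernel(int *d_len) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_len = 1024;  // placeholder
}

int main() {
    int *d_len = nullptr;
    cudaMalloc((void**)&d_len, sizeof(int));
    int *h_len = nullptr;
    cudaMallocHost((void**)&h_len, sizeof(int));   // pinned, so the copy can be async

    cudaStream_t s_kernel, s_copy;
    cudaStreamCreate(&s_kernel);
    cudaStreamCreate(&s_copy);
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    count_kernel<<<1, 32, 0, s_kernel>>>(d_len);
    cudaEventRecord(done, s_kernel);

    // The copy sits in a different stream, but it still has to wait for the
    // kernel, because it copies the value that kernel writes -- so it can
    // never overlap with that same kernel.
    cudaStreamWaitEvent(s_copy, done, 0);
    cudaMemcpyAsync(h_len, d_len, sizeof(int), cudaMemcpyDeviceToHost, s_copy);
    cudaStreamSynchronize(s_copy);

    cudaEventDestroy(done);
    cudaStreamDestroy(s_kernel);
    cudaStreamDestroy(s_copy);
    cudaFreeHost(h_len);
    cudaFree(d_len);
    return 0;
}
```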

I think there is at least one reason: the kernel and the memcpy have to join at the end of each cycle. That is,

(KERNEL | I/O) || (KERNEL | I/O) || ...

So while the kernel is running, the memcpyAsync is ready to start, but by the time the memcpyAsync starts, the kernel has already reached its end. Is there a theoretical explanation for this? I can only see it in the nvvp graph.

My other guess is that the kernel updates the device memory that will be copied to the host later, so when the memcpyAsync tries to copy that device memory, it always finds it locked until the kernel finishes, which would also explain why the memcpyAsync always runs after the kernel has finished.
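If that is the reason, then I suppose the copy can only overlap with a kernel that works on a different buffer, something like this double-buffered sketch (made-up names, and it only applies if consecutive iterations do not need each other's lengths):

```
// Sketch: double buffering, so the device-to-host copy of iteration i's
// result overlaps with the kernel of iteration i+1, which writes a
// *different* buffer in a *different* stream.
#include <cuda_runtime.h>

__global__ void produce(int *d_len, int iter) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *d_len = 1000 + iter;  // placeholder
}

int main() {
    const int ITERS = 8;
    int *d_len[2] = {nullptr, nullptr};
    int *h_len = nullptr;
    cudaMalloc((void**)&d_len[0], sizeof(int));
    cudaMalloc((void**)&d_len[1], sizeof(int));
    cudaMallocHost((void**)&h_len, ITERS * sizeof(int));   // pinned destination

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < ITERS; ++i) {
        int b = i & 1;                          // alternate buffer/stream
        produce<<<1, 32, 0, s[b]>>>(d_len[b], i);
        // Same stream as its producer kernel, so correctly ordered after it,
        // but free to overlap with the other stream's kernel.
        cudaMemcpyAsync(&h_len[i], d_len[b], sizeof(int),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFreeHost(h_len);
    cudaFree(d_len[0]);
    cudaFree(d_len[1]);
    return 0;
}
```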