Edit: it would appear that the effect I describe here is mostly a phantom caused by having the profiler enabled. See http://forums.nvidia.com/index.php?showtopic=190368&view=findpost&p=1175891 for a discussion.
Background: In my app, I run a sequence of ~5 different kernel calls back to back - over and over again, hundreds of millions of times. All of these operate on data resident on the GPU, and no large host<->device transfers are ever made. However, at one point in that sequence I have to read back a flag from the GPU. The value of that flag (0/1) determines whether the typical sequence continues without interruption, or an additional kernel call needs to be made first. In a recent week-long micro-optimizing session, I discovered that the biggest bottleneck wasn’t the length of the kernel calls themselves, but rather the idle time in between them. After much fun optimizing, I’ve removed all of the idle-time gaps except for one: the one where this flag is transferred back to the host.
There are two ways to transfer this flag: 1) cudaMemcpy it (which inserts an implicit sync), or 2) have the kernel write it to host-mapped memory and insert an explicit cudaThreadSynchronize() before reading it. In the app, I find that the faster method is (2), and that it causes a delay of about 56 microseconds. For small problem sizes, this can result in a massive 20% overhead. 56 microseconds is a lot higher than the typical 2-10 microsecond latency that is often quoted on the forums (and found by many other benchmarks).
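For reference, here is roughly what the two variants look like in code (a minimal sketch, not the attached benchmark - set_flag, d_flag and h_mapped are made-up names, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void set_flag(int *flag) { *flag = 1; }

int main()
{
    // Mapped memory requires this flag to be set before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Method 1: device memory + cudaMemcpy (carries an implicit sync).
    int *d_flag, h_flag = 0;
    cudaMalloc((void **)&d_flag, sizeof(int));
    set_flag<<<1, 1>>>(d_flag);
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);

    // Method 2: mapped (zero-copy) host memory + an explicit sync before reading.
    int *h_mapped, *d_mapped;
    cudaHostAlloc((void **)&h_mapped, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);
    set_flag<<<1, 1>>>(d_mapped);
    cudaThreadSynchronize();   // make sure the kernel’s write has landed
    printf("flags: %d %d\n", h_flag, *h_mapped);
    return 0;
}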
Here, I’m posting a microbenchmark (Linux only) that you can run and see for yourselves. It emulates what my real app does: it calls kernelB, then kernelA a number of times (N), and then either does a cudaMemcpy of a 4-byte value or calls cudaThreadSynchronize(). The kernels do nothing more than loop to waste a configurable amount of time. The GPU idle-time gaps are measured by enabling the CUDA profiler and recording the gpustarttimestamp field - a high-resolution clock on the GPU itself that records the start time of every single kernel launch. A simple Python script combs through the resulting data file and computes all of the idle-time gaps.
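In case you don’t want to grab the tarball right away, the core of the benchmark looks roughly like this (a simplified, from-memory sketch of test_cudathreadsync.cu - busy_wait, one_iteration and the delay scaling are approximations; the profiler itself is enabled by the script via COMPUTE_PROFILE=1 and a config file requesting gpustarttimestamp, if memory serves):

#include <cuda_runtime.h>

// Waste a configurable amount of time; the volatile accumulator and the
// never-taken store keep the compiler from deleting the loop.
__device__ void busy_wait(int *dummy, int delay)
{
    volatile int acc = 0;
    for (int i = 0; i < delay * 1000; i++)
        acc += i;
    if (acc == -1) *dummy = acc;
}

__global__ void kernelA(int *dummy, int delay) { busy_wait(dummy, delay); }
__global__ void kernelB(int *dummy, int delay) { busy_wait(dummy, delay); }

// One benchmark iteration: kernelB, then N launches of kernelA, then either
// a 4-byte D2H copy (flag=0) or a bare cudaThreadSynchronize (flag=1).
void one_iteration(int *d_buf, int N, int delay, bool use_sync)
{
    kernelB<<<1, 1>>>(d_buf, delay);
    for (int i = 0; i < N; i++)
        kernelA<<<1, 1>>>(d_buf, delay);

    if (use_sync) {
        cudaThreadSynchronize();
    } else {
        int flag;
        cudaMemcpy(&flag, d_buf, sizeof(int), cudaMemcpyDeviceToHost);
    }
}

Note that the <<<1, 1>>> launches do no useful work by design - the point is to isolate the launch/sync overhead from the kernel execution time.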
Results (on a GTX 480, x86_64 Linux, CUDA 3.2, driver 260.19.21): for kernelA called back to back a number of times, the idle-time gaps are only ~2.3 microseconds. I get the same value in the full app.
When using cudaMemcpy, I get the following gap times (in microseconds) for N = 1 through 20:
$ ./bmark_run.sh 50 0
Running sequence of gpu idle time gap measurements with delay=50 and flag=0
1 25.504
2 25.536
3 25.728
4 25.344
5 25.856
6 25.536
7 25.76
8 25.536
9 25.6
10 25.504
11 25.984
12 25.632
13 25.568
14 25.632
15 25.6
16 25.632
17 25.728
18 25.504
19 25.536
20 25.632
A constant 25.5 microseconds - not bad.
When using cudaThreadSynchronize, there is an interesting behavior: as N is increased, the idle-time gap increases as well.
$ ./bmark_run.sh 50 1
Running sequence of gpu idle time gap measurements with delay=50 and flag=1
1 18.752
2 21.792
3 24.512
4 26.784
5 29.44
6 32.064
7 34.176
8 36.544
9 40.352
10 43.136
11 44.832
12 47.84
13 50.752
14 53.184
15 54.976
16 57.344
17 59.136
18 61.504
19 65.824
20 67.392
It starts below 20 microseconds, nice! But then goes up to 67 when syncing after launching 20 kernels, ouch!
Not shown here are a bunch of other tests I ran using different delay factors to make the kernels take longer. The idle-time gaps remained rock solid at the values shown, no matter how long each kernel execution lasted (even 1 ms+).
If you want to run the benchmark yourself, be my guest - I’m attaching the necessary files. Just compile the test program
nvcc -o test_cudathreadsync test_cudathreadsync.cu
then run the bmark_run.sh script as shown above. You need python and numpy installed for the analysis to work.
Has anyone else looked deeply into the performance considerations when transferring small flag values device->host? What is the best method you have come up with? Any ideas why CUDA is so slow at this?
Surely, I don’t expect the gap to be 0 - but the results where cudaThreadSynchronize inserts an increasingly large gap are worrisome. It is really illustrative to load these results up in computeprof and display the timeline (sorry, too lazy to post a screen capture). The idle-time gap shows up as a space between the end of the last call to kernelA and the start of kernelB. Since kernel launches are asynchronous, cudaThreadSynchronize() would have been called on the timeline shortly after the start of the first kernelA, and the host would then be spin-waiting for the last one to finish… which makes it puzzling why the gap between that and the start of kernelB gets longer and longer.
Anyways, enough musing, maybe someone here can point out something that I missed, or has already figured out the solution.
– P.S., if the name looks familiar, you aren’t mistaken. I had to create a new account because I couldn’t get into the old one and the associated e-mail address was deleted long ago :(
test_cudathreadsync.tar.gz (1.8 KB)