massive hiccups when transferring flags back to host challenging the commonly quoted 2-10us latency

Edit: it would appear that the effect I describe here is mostly a phantom caused by having the profiler enabled. See for a discussion

Background: In my app, I run a sequence of ~5 different kernel calls back to back in sequence - over and over again 100’s of millions of times. All of these operate with on-GPU data, and no large transfers are ever made host<->device. However, at one point in that sequence I have to read back a flag from the GPU. The value of that flag (0/1) determines whether the typical sequence continues without interruption, or an additional kernel call needs to be made first. In a recent week-long microptimizing session, I discovered that the biggest bottleneck wasn’t the length of the kernel calls themselves - rather the idle time in between them. After much fun optimizing, I’ve removed all of the idle time gaps except for one - the one where this flag is transferred back to the host.

There are two ways to transfer this flag: 1) cudaMemcpy it (which inserts an implicit sync), or 2) write it to host mapped memory, and insert an explicit cudaThreadSynchronize before reading it. In app, I find that the fastest method is (2), and that it causes a delay of about 56 microseconds. For small problem sizes, this can result in a massive 20% overhead. 56 is a lot higher than the typical value of 2-10 microsecond latency that is quoted often on the forums (and found by many other benchmarks).

Here, I’m posting a microbenchmark (linux only) that you can run and see for yourselves. It operates by emulating what my real app does, it calls kernelB, then kernelA a number (N) of times. It then either does a cudaMemcpy of a 4-byte value or calls cudaThreadSynchronize(). The kernels do nothing more than loop to waste a configurable amount of time. The GPU idle time gaps are measured by enabling the CUDA profiler and recording the gpustarttimestamp field. This is a high resolution clock on the GPU itself that records the start time of every single kernel launch. A simple python script combs through the resulting data file, and computes all of the idle time gaps.

Results (on GTX 480, x86_64 linux, CUDA 3.2, drivers 260.19.21): For kernelA called back to back a number of times, the idle time gaps are only ~2.3 microseconds. I get the same value in the full app.

When using cudaMemcpy, I get the following gap times (in microseconds) for various values of N 1 through 20.

$ ./ 50 0

Running sequence of gpu idle time gap measurements with delay=50 and flag=0

1 25.504

2 25.536

3 25.728

4 25.344

5 25.856

6 25.536

7 25.76

8 25.536

9 25.6

10 25.504

11 25.984

12 25.632

13 25.568

14 25.632

15 25.6

16 25.632

17 25.728

18 25.504

19 25.536

20 25.632

A constant 25.5 microseconds - not bad.

When using cudaThreadSynchronize, there is an interesting behavior. As N is increased, the idle time gap increases as well.

$ ./ 50 1

Running sequence of gpu idle time gap measurements with delay=50 and flag=1

1 18.752

2 21.792

3 24.512

4 26.784

5 29.44

6 32.064

7 34.176

8 36.544

9 40.352

10 43.136

11 44.832

12 47.84

13 50.752

14 53.184

15 54.976

16 57.344

17 59.136

18 61.504

19 65.824

20 67.392

It starts below 20 microseconds, nice! But then goes up to 67 when syncing after launching 20 kernels, ouch!

Not shown here are a bunch of other tests I ran using different delay factors to make the kernels take longer. The idle time gaps seemed to remain rock solid stable at the shown values, no matter how long (even 1ms+) each kernel execution lasted.

If you want to run the benchmark yourself, be my guest. I’m attaching the files necessary. Just compile the test program

nvcc -o test_cudathreadsync

then, run the script as I show above. You need to have python and numpy installed for the analysis to work.


Has anyone else looked deeply into the performance considerations when transferring small flag values device->host? What is the best method you have come up with? Any ideas why CUDA is so slow at this?

Surly, I don’t expect the gap to be 0 - but the results where cudaThreadSynchronize inserts an increasingly large gap are worrisome. It is really illustrative to load up computeprof and display the timeline to look at these results (sorry, too lazy to post a screen capture). The idle time gap shows up as a space between the end of the last call of kernelA and the start of kernelB. From the async kernel launches, the cudaThreadSynchronize() would have been called on the timeline shortly after the start of the first kernelA, and then the host would be spin waiting for the last one to finish… which makes it puzzling why the gap between that and the start of the kernelB gets longer and longer.

Anyways, enough musing, maybe someone here can point out something that I missed, or has already figured out the solution.

– P.S., if the name looks familiar, you aren’t mistaken. I had to create a new account because I couldn’t get into the old one and the associated e-mail address was deleted long ago :(
test_cudathreadsync.tar.gz (1.8 KB)

I’ll try to find some time to take a look at this.

Hey, Dr. 42!

As a quick idea that may give you an immediate boost…

Remove the CPU flag logic entirely. Make the decision on the GPU.
Yes, I know the GPU can’t schedule itself but you just hack it.

Modify the “optional” kernel, the one that should be run if the flag is set, and skipped if it’s not.
Always call it. But at the start of every block, just test the flag (in GPU memory) and exit immediately if it’s false. So if the kernel shouldn’t be run, it will quickly exit.

This seems wasteful and expensive (you’re wasting a kernel launch which makes itself a no-op!) but by removing the CPU latency, that kernel no-op may be much much faster than the CPU memcopy/launch latency.

A typical RTOS might have a response time of 25 usec, your real-time might be very elastic. So to me this seem like a strange exercise unless you are running a RTOS?

Great tip.

I might try it in my app as well.

Thanks, Tim.

In many applications where this type of decision logic appears, your idea will probably work quite well.

In this particular case, however - I’m kind of stuck. The 2 optional kernels that need to be called both must also transfer flags back to the host. Depending on the state of the particles in memory, the associated data structures may need to be resized and there is no way to know a priori what minimum size they need without running the kernels first :(. Its kind of annoying. The only alternative is to only check for these overflow flags every once in a while and force the user to choose the correct size up front - that brings back memories of my days running simulations with a FORTRAN app that hardcoded the max size in an #include file. Overflow errors loved to show up 12 hours into a long simulation, forcing one to recompile the program and re-run! I guess it could be implemented as an optional mode that users could enable if they really wanted that last ounce of performance. I’ll give it a try and see how it works out.

Nah, I’m running something much more demanding than an RTOS: Molecular dynamics simulations in an HPC environment. In a fairly typical setup, one may run simulations of a mere 10,000 particles requiring 250 microseconds per step of processing time. The 25 microsecond delay is thus slowing performance by 10%. Another typical simulation with 60,000 particles would require ~1 millisecond per step where the 25 microsecond delay results in a performance degradation of only 3%.

OK, 10% isn’t exactly huge, but I did say that I was micro-optimizing every last ounce of performance out of my code :) And what happens when Maxwell comes out - if this latency remains the same and the processing time is cut in half, then the overhead becomes 20%, then 40% or more for Kepler!

That and there people a lot more obsessed with HPC than I am demand a LOT more - they look at their host-to-host transfers across Infiniband with latencies of ~1 microsecond, which involve 2 machines OS’s, infiniband cards on PCIe, cabling, and an infiniband switch, and they wonder why CUDA’s latencies are so high.

Mr to Dr? Congratulations! btw, I had PMed you on your old ID. Can you please reply? Thanks!

Did you try “__threadfence_system()” to flush out the memory-mapped thing to the host RAM? May be, that closes the gap a bit. Just a wild thought.

Hi, I understand the type of problem you are working with. However, i was not comparing molecular dynamics simulation workloads with that of running a real-time operating system.

My notion was rather that you can’t expect that level of precision without making sure you are running a real-time operating system. Typically a real-time response can be maybe 25 us. Thus your 10% awould be basically give or take dependant on your OS and not CUDA. I think this image is a bit descriptive: of real-time versus an elastic real-time.

Perhaps there is a real-time expert who could comment?

RedHawk (, a linux RTOS that has Cuda support, has a process dispatch latency of 1 usec.

Also, try the different scheduling flags to cudaSetDeviceFlags()

Ahh, I see what you mean now. Sorry for the misunderstanding.

At least in my simple view of the OS, when CUDA is spin-locking to wait on the GPU - why should the real-time capability of the OS matter at all? Sure, the occasional hiccup may occur from an OS process switch, but shouldn’t a while(!done) loop that lasts for ~200us be able to continue without any process switches?

Interesting, but not free which makes it harder to test out. Do you think that RT kernel patches would also work?

Good idea, it should default to spin, but one should never trust the default:

Timings for cudaMemcpy 4 bytes on the same GTX 480:

Spin: 25 us

Yield: 26 us

Blocking: 90 us

On a Tesla S1070 cluster node:

Spin: 71 us (ouch!)

Yield: 72 us

Blocking: 226 us

And again for the cudaThreadSynchronize:

GTX 480:

Spin: 19-68 us

Yield: 19-68 us

Blocking: 53-177 us

Tesla S1070:

Spin: 40-131 us

Yield: 40-124 us

Blocking: 128-238 us

… hmmm … yield = spin in these benchmarks. Maybe the spin sync method is interrupting the process back to the OS.

Not possible, sorry.

A roll-your-own sync? That is an interesting idea. And it nicely avoids Tim’s rule to not communicate between the host and a running kernel - I just want to get a flag set when the kernel is done.

Here is a really simple way:

__global__ void set_flag(unsigned int *d_out)


    *d_out = 1;




unsigned int *h_flag;

    cudaHostAlloc(&h_flag, sizeof(unsigned int), cudaHostAllocMapped);

    unsigned int *d_flag;

    cudaHostGetDevicePointer(&d_flag, h_flag, 0);


*h_flag = 0;


        while(! (*h_flag))



It performs quite nicely :) - I get only a 10 us delay factor on the GTX 480, add in the 2us delay before set_flag and the 2 us call in the GPU and it is only 14 additional us - better than the 25 I get with memcpy or the 19-68 I get with cudaThreadSynchronize().

Alternately, one could try events. Ohh! - I ran into a nice tidbit in the docs for cudaEventCreate():

In code:

cudaEvent_t evt;

    cudaEventCreate(&evt, cudaEventDisableTiming);




I get 14us delay on the GTX 480 - the same as above. Not bad! cudaThreadSynchronize must employ significant extra overhead - Tim can tell us for sure, but maybe it’s delay increases with time because it has to go back and check error codes for all the previous launches? That seems to be the most serious discrepancy that has come out of these discussions.

Now, does the performance of cudaEventSynchronize() in the microbenchmark translate to the full app? We’ll see…

I think there should be no difference between cudaThreadSynchronize and cudaEventSynchronize, even with an event that uses the cudaEventDisableTiming flag. I’ll have to ask the guys who know all about events and tracking work to be sure.

Also, don’t roll your own synchronization mechanisms. Bad things lie down that path.

I think you are right. If I change the benchmark to just measure the period per step with a high res system timer (gettimeofday) and disable the profiler, I get no difference between cudaThreadSynchronize and cudaEventSynchronize. A 4-byte cudaMemcpy adds about 5 us/step more. The weird effect where cudaThreadSynchronize took longer depending on the number of kernels called before it issue also goes away.

It would seem that I put too much faith in the profiler… After all, it has to transfer the timing results back to the host at some time to ship them off to the disk. The only thing that makes sense to me at this point is that those transfers occur during a cudaThreadSynchronize call (thus explaining the increasing time).

In summary: it would seem that on more careful inspection, the large gap is a phantom. As close as I can tell using gettimeofday (i.e. - a real world situation), each cudaThreadSynchronize call adds about 3-5 microseconds of gap, and a 4-byte cudaMemcpy adds 10-12. These results are +/- 5 microseconds. I think that settles the debate and reconfirms that host mapped memory combined with cudaThread/EventSynchronize is the fastest way to get these flags back to the host by a slight margin over cudaMemcpy.

Profiler has to insert a lot of additional timestamps into the GPU’s work queue in order to track when things begin/end, so that might be a lot of the phantom time (when you’re on these timescales, a single PCIe transaction starts to matter).

In the research poster above, presented @ NVIDIA GTC 2010, the authors implement a synchronization mechanism based on 0-copy and polling - which is same as the lowest-latency code produced by Dr.Anderson.

The poster is co-authored by an NVIDIA employee.

Is this type of synchronization bad? Can we use it in our projects?

Best Regards,