How to detect async event without polling

Isn’t this behavior controlled by the cudaSetDeviceFlags() function? It looks like you have a choice between spin, yield (presumably inside a polling loop), and blocking on a CPU synchronization primitive. I haven’t seen anyone benchmark the latency on these, so I’m not sure how the last two compare.
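
Something like this is what I have in mind (just a minimal sketch, not benchmarked; the flag has to be set before anything creates the CUDA context):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Pick ONE of the scheduling flags before the context is created:
    //   cudaDeviceScheduleSpin         - busy-wait on the CPU (lowest latency)
    //   cudaDeviceScheduleYield        - poll, but yield the core between checks
    //   cudaDeviceScheduleBlockingSync - block on an OS synchronization primitive
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... launch kernels as usual ...

    // With the blocking-sync flag, the host thread sleeps here instead of spinning.
    cudaDeviceSynchronize();
    return 0;
}
```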

Excellent catch! And I had even just re-read the docs for the synchronize() functions to make sure I wasn’t missing anything before posting. I’m surprised those options aren’t simply given right in the synchronize() call itself.

You also ask a good question about how the extra options behave… it’d be interesting to measure the latency and CPU use of all three options.

For ktashiro’s problem, he has multiple streams to wait for, so the manual polling method is likely still appropriate.
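
For reference, a rough sketch of the kind of manual polling loop I mean (the helper name and the sleep interval are just placeholders):

```cpp
#include <cuda_runtime.h>
#include <unistd.h>  // usleep

// Block until every stream in the array has finished its queued work,
// sleeping between passes so the CPU stays (nearly) idle.
void waitForStreams(cudaStream_t* streams, int numStreams, unsigned sleepUs = 1000) {
    for (;;) {
        bool allDone = true;
        for (int i = 0; i < numStreams; ++i) {
            // cudaStreamQuery() returns cudaErrorNotReady while work is still pending.
            if (cudaStreamQuery(streams[i]) == cudaErrorNotReady) {
                allDone = false;
                break;
            }
        }
        if (allDone) return;
        usleep(sleepUs);  // give the core back to the OS between checks
    }
}
```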

In my testing, yield is about equal to spin. To really notice a difference, you would need a number of other host threads competing for CPU cores so that the CUDA thread has something to yield to. The blocking sync adds latency. How much? I didn’t benchmark that specifically, but HOOMD is about 5-7% slower with blocking sync instead of spin.

You’re right, it really depends on your app and kernel.

In my case, I have kernels that take 10-200 seconds and run in loops for hours or even days. The CPU has nothing to do. So the manual polling method doesn’t change runtime at all, but it has the advantage of reducing my CPU usage from 600% down to 0% for all those hours, saving power and heat and keeping the PC more responsive.

In ktashiro’s case, he may need as much CPU free as possible (for parallel computation) plus the flexibility of waiting for any number of streams, so manual polling should work best.

It does sound like in my case, I could replace my 4-5 lines of code with the cudaSetDeviceFlags() “block without polling” setting and let the CUDA runtime handle the waiting, though.

I certainly don’t care about even multiple millisecond latencies since my kernels are so long anyway.
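
For what it’s worth, a minimal sketch of what that replacement might look like (assumed structure, not my actual code; the flag has to be set before the context is created):

```cpp
#include <cuda_runtime.h>

int main() {
    // Set once at startup: synchronize calls then block on an OS primitive
    // instead of spinning/polling on the CPU.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int numIterations = 100000;  // hours or days of work
    for (int iter = 0; iter < numIterations; ++iter) {
        // longKernel<<<grid, block, 0, stream>>>(...);   // a 10-200 second kernel
        cudaStreamSynchronize(stream);  // host thread sleeps, ~0% CPU while waiting
        // ... host-side bookkeeping between iterations ...
    }

    cudaStreamDestroy(stream);
    return 0;
}
```

If I ever wanted blocking waits only in specific spots, an alternative would be to create events with the cudaEventBlockingSync flag and call cudaEventSynchronize() on those, leaving the device-wide scheduling flag alone.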

IIRC it’s about 40-50us for a blocking sync (could vary based on the vagaries of the OS thread scheduler).
