letting the host thread sleep in 2.2?

Can anyone tell me how to access this new functionality?

I thought maybe cudaThreadSynchronize() would just automatically do this now, but I tried it and my host CPU usage is still pinned at 100%, even when it’s spending most of its time just waiting for the GPU.

Read all the 2.2 new features threads, looked in the new manuals… can’t find it!


It’s in the reference manual; look for the function “cudaSetDeviceFlags”:

cudaError_t cudaSetDeviceFlags ( int flags )

Records flags as the flags to use when the active host thread executes device code. If the host thread has already initialized the CUDA runtime by calling non-device management runtime functions, this call returns cudaErrorSetOnActiveProcess.

The two LSBs of the flags parameter can be used to control how the CPU thread interacts with the OS scheduler when waiting for results from the device.

cudaDeviceScheduleAuto: The default value if the flags parameter is zero, uses a heuristic based on the number of active CUDA contexts in the process C and the number of logical processors in the system P. If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor.

cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for results from the device. This can decrease latency when waiting for the device, but may lower the performance of CPU threads if they are performing work in parallel with the CUDA thread.

cudaDeviceScheduleYield: Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.

cudaDeviceBlockingSync: Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.

cudaDeviceMapHost: This flag must be set in order to allocate pinned host memory that is accessible to the device. If this flag is not set, cudaHostGetDevicePointer() will always return a failure code.

flags - Parameters for device operation
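A minimal sketch of how the flag is applied (assuming a single-device system; note that the flag must be set before the first runtime call that initializes the context, or the call fails with cudaErrorSetOnActiveProcess):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Must come before any runtime call that creates the context
    // for this host thread.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceBlockingSync);
    if (err != cudaSuccess) {
        printf("cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The context picks up the flags on first real use.
    cudaSetDevice(0);

    // ... launch kernels here ...

    // With cudaDeviceBlockingSync, this now sleeps instead of spinning.
    cudaThreadSynchronize();
    return 0;
}
```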

Ah, so it is… I searched for “sleep”… should’ve tried “yield” too. Thanks!

cudaSetDeviceFlags() influences how a context waits when you call cudaThreadSynchronize(). For finer control of blocking syncs, CUDA 2.2 also supports events that block instead of polling when you call cudaEventSynchronize().

To create such an event, you have to call cudaEventCreateWithFlags with the cudaEventBlockingSync flag.

Events that can do blocking sync may not be used for timing. (cudaEventRecord will return an error.)

Would you mind providing an example of cudaEventCreateWithFlags() with cudaEventBlockingSync and cudaEventSynchronize()?

I am confused about the following:

  1. What is the official way of detecting kernel failure in this usage pattern? Will cudaEventSynchronize() return an error if my kernel fails for some reason?

  2. In the above quote nwilt writes “cudaEventRecord will return an error [presumably if called on such an event]”. However, CUDA_Reference_Manual_2.2.pdf, section 3.6, page 19 reads “[cudaEventSynchronize] Blocks until the event has actually been recorded. If cudaEventRecord() has not been called on this event, the function returns cudaErrorInvalidValue”. Does this contradiction imply that cudaEventRecord() is called on the event automatically once the kernel has completed?

  3. Do I understand correctly that if the cudaEventBlockingSync flag is used, the CPU thread waiting for the event will actually yield rather than spin? That is, the waiting thread’s CPU usage will be very low while the kernel it is waiting on runs on the GPU?

If my assumptions above are correct, then the use of cudaEventSynchronize() as suggested by nwilt might become fairly standard for situations in which the GPU kernel does a relatively large amount of computation, which is probably a very common CUDA scenario. If so, this usage pattern deserves a clear and detailed example.


A little clarification:

The typical default is to spin-wait = 100% CPU usage

to “yield” is to spin-wait with a yield inside the wait = 100% CPU usage, but other running threads have a higher chance of running

to “block” is to wait for an interrupt (or something) = 0% CPU usage.

If you do cudaSetDeviceFlags(cudaDeviceBlockingSync) before initializing your context, then any of the implicit synchronizations or calls to cudaThreadSynchronize will “block”. I tried this in my app and CPU usage reduces to 10% while it runs :) Of course, performance also decreases by about 10% due to the higher latencies in detecting when kernels complete.

I have always had the best luck with:

if (gpu_error_checking)
	{
	// cudaGetLastError() returns (and clears) the sticky error
	// left by the most recent launch
	cudaError_t error = cudaGetLastError();
	// handle the error value
	}
The documentation for cudaThreadSynchronize states that it will return an error code if the kernel had an error, but in my experience this is not always the case (especially in older versions of CUDA). The behavior of either type of error checking will not change no matter what device mode you use. The device mode is just changing the way the driver waits for kernels to finish.
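A hedged sketch of checking both error sources after a launch, since (as noted above) relying on the synchronize return value alone has not always been reliable; the helper name is mine, not part of the API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Check both sources of kernel errors after a launch (sketch).
static void checkKernel(const char *name)
{
    // Launch errors (bad configuration, etc.) show up immediately
    // in the sticky last-error state.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("%s launch failed: %s\n", name, cudaGetErrorString(err));

    // Execution errors only surface once we synchronize.
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("%s execution failed: %s\n", name, cudaGetErrorString(err));
}
```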

I don’t use events, so I can’t comment on the event API in particular, but I assume it works in a similar way.

Mr Anderson,

Thank you for the clarification of the yield term. I was indeed understanding it incorrectly. Apparently, I’m more interested in executing a blocking call than a yielding one.

I understand the cudaSetDeviceFlags(cudaDeviceBlockingSync) usage pattern, and I can probably live with it. However, I’d also like to understand how the blocking events work (if they can be used for arranging the blocking wait), because those would allow me to encapsulate the logic better: just create/destroy the event around the kernel call and don’t mess with the more global device flags.

Thanks again!

Events let you synchronize with a certain point in the stream.

cudaEventCreateWithFlags(&my_event, cudaEventBlockingSync);

kernel1<<<grid, block>>>(/* ... */);
kernel2<<<grid, block>>>(/* ... */);

cudaEventRecord(my_event, 0);

kernel3<<<grid, block>>>(/* ... */);
kernel4<<<grid, block>>>(/* ... */);

cudaEventSynchronize(my_event);
After the call to cudaEventSynchronize() (which will block to wait), at least kernel1 and kernel2 will have completed on the GPU. kernel3 and kernel4 could still be running.

If you use streams, events can be recorded in a given stream, too.
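For example, to block only on the work queued in one particular stream (a sketch; the stream, event, and kernel names are placeholders):

```cuda
#include <cuda_runtime.h>

int main()
{
    cudaStream_t stream;
    cudaEvent_t done;

    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    // my_kernel<<<grid, block, 0, stream>>>(...);  // work queued in 'stream'

    // The event captures only work issued to 'stream' before this record.
    cudaEventRecord(done, stream);
    cudaEventSynchronize(done);  // sleeps until 'stream' reaches the event

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
}
```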