What happens to my CPU thread?

Hi, I was wondering if anybody could tell me, or point me to a resource that explains, how CUDA blocks on GPU calls. I’m assuming, though I haven’t measured it yet, that it takes somewhere in the region of thousands of cycles to transfer data, execute a call on the GPU, and get the results back. Obviously, since I’m using the GPU, cycles are very precious to me, and the only CUDA calls I’ve seen so far are synchronous. Are there async calls? While it waits, does the CPU go into an efficient sleep, or does it do something silly like spin?

Kernel launches are asynchronous with respect to the host CPU, so you can do other work on the host side while waiting for the GPU kernel to complete. The standard host-side GPU memory operations are all synchronous. There are asynchronous versions of the copy functions which operate on page-locked memory, but they cannot overlap with a running kernel. Whenever an operation is synchronous (or forced to be synchronous), the host thread sits in a spinlock. There might be a mechanism for adjusting the spinlock polling frequency in the newest release of CUDA, but that might also be a figment of my imagination…
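
For anyone who wants to try this, here is a rough sketch of the pattern described above: queue an asynchronous copy and a kernel launch, keep the host thread busy, and only block at the end. It is not taken from any particular codebase. The kernel `scale` and the helper `run_overlap_demo` are made-up names for illustration. The `cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)` call assumes a CUDA release recent enough to offer it; on such releases it asks the runtime to block the host thread on a synchronization primitive rather than spin, and it can simply be deleted on releases that lack it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: returns the number of wrong results, or -1 on a CUDA error.
int run_overlap_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Assumption: this flag is available in the installed CUDA release.
    // It has to be set before the first call that creates the context.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    float *h = NULL, *d = NULL;
    // Page-locked host memory is required for truly asynchronous copies.
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream;
    cudaEvent_t done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    // All four calls return to the host without waiting for the GPU.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);

    // The host is free here. This loop stands in for real CPU work and
    // polls the event instead of blocking on it.
    long host_work = 0;
    while (cudaEventQuery(done) == cudaErrorNotReady) ++host_work;

    // Blocking wait; also surfaces any error from the queued operations.
    if (cudaStreamSynchronize(stream) != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 2.0f) ++wrong;
    printf("host iterations while GPU was busy: %ld, wrong results: %d\n",
           host_work, wrong);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return wrong;
}

int main() {
    return run_overlap_demo() == 0 ? 0 : 1;
}
```

The copies and the kernel all go into the same stream, so they run in order on the GPU; the only overlap this shows is between GPU work and the host thread. Polling with `cudaEventQuery` is one option when there is real CPU work to interleave. If there is nothing else to do, calling `cudaStreamSynchronize` directly is simpler.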