100% CPU usage when running CUDA code

__global__ void do_something(float* p, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float x = 1.0f / (i + 1);
    // Busy-work loop: n*n cosine evaluations keep the GPU occupied
    // for several seconds so the host-side wait is easy to observe.
    for (int j = 0; j < n * n; ++j)
        x = cos(x);
    if (i < n)
        p[i] = x;
}


int main()
{
    int num_threads = 1024;
    int n = 8 * num_threads;

    float* p;
    cudaMalloc(&p, n * sizeof(float));

    // Launch 8 blocks of 1024 threads (one thread per element of p).
    do_something<<<n / num_threads, num_threads>>>(p, n);

    // Block the host until the kernel finishes; this is where the
    // CPU usage shows up.
    cudaDeviceSynchronize();

    cudaFree(p);
}

When I run this, top and htop show 99-101% CPU usage:

$ time ./a.out
real    0m12.921s
user    0m10.819s
sys     0m1.996s

Is the CPU really busy while it’s waiting for the kernel to finish, or is this a measurement artifact? If it really is busy, what is it doing (and does it matter)?

Yes, it’s actually busy in a polling loop inside the driver function associated with cudaDeviceSynchronize(), waiting for the GPU to finish. In a single-threaded scenario it probably doesn’t matter (what else were you going to have that core do?), but in a multi-threaded scenario you may prefer some other control-relinquishing scheme, and in theory CUDA offers these via cudaSetDeviceFlags(); take a look at the documentation:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g18074e885b4d89f5a0fe1beab589e0c8
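
For example, here is a minimal sketch of the blocking-sync variant. cudaDeviceScheduleBlockingSync is one of the flags documented on that page; how completely the host thread actually sleeps is driver- and platform-dependent:

#include <cuda_runtime.h>

int main()
{
    // Must run before the first call that initializes the CUDA context.
    // Asks the runtime to block the host thread on a synchronization
    // primitive instead of spin-waiting.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // ... allocate memory and launch kernels as before ...

    cudaDeviceSynchronize();   // host thread should now sleep, not spin
    return 0;
}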

In practice, I haven’t observed much difference between the flag settings. YMMV. It may also depend on the platform, the CUDA version, and the host threading model.

(Ultimately, it may not be easy to tell whether your host CPU core is busy spinning in a wait loop inside the driver, or busy because it was released to run work queued up by some other thread.)
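
A per-call alternative, again only a sketch, is to wait on an event created with cudaEventBlockingSync instead of synchronizing the whole device:

#include <cuda_runtime.h>

int main()
{
    cudaEvent_t done;
    // cudaEventBlockingSync makes cudaEventSynchronize() block the host
    // thread on the event rather than busy-wait; cudaEventDisableTiming
    // skips the timestamp overhead since we only use the event to wait.
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync | cudaEventDisableTiming);

    // ... launch the kernel as before ...

    cudaEventRecord(done);       // enqueue the event behind the kernel
    cudaEventSynchronize(done);  // blocking wait on this event only
    cudaEventDestroy(done);
    return 0;
}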

Thanks – that was the part of the documentation I was looking for!

Unfortunately, I also see little difference after adding

cudaSetDeviceFlags(cudaDeviceScheduleYield);

at the beginning of main: htop still shows 100% usage, although time shows an improvement in user time:

$ time ./a.out
real    0m12.957s
user    0m7.216s
sys     0m5.642s
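
I’d guess the yield path trades user-space spinning for repeated yield system calls, which would explain the time shifting from user to sys while the core still reads 100% busy. One thing worth ruling out (an assumption on my part, not something verified above) is that the flag was never applied: on some CUDA versions, cudaSetDeviceFlags() fails with cudaErrorSetOnActiveProcess if a context already exists when it runs. A minimal check:

// Requires <cstdio>; prints a diagnostic if the flag was rejected.
cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
if (err != cudaSuccess)
    printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));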