__global__ void do_something(float* p, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float x = 1.0f / (i + 1);
    // Busywork so the kernel runs long enough to observe the wait.
    for (int j = 0; j < n * n; ++j)
        x = cos(x);
    if (i < n)
        p[i] = x;
}
float* p;                        // device buffer
int num_threads = 1024;
int n = 8 * num_threads;
cudaMalloc(&p, n * sizeof(float));
do_something<<<n / num_threads, num_threads>>>(p, n);
cudaDeviceSynchronize();         // host waits here for the kernel
When I run this, top and htop show 99-101% CPU usage:
$ time ./a.out
Is the CPU really busy while it's waiting for the kernel to finish, or is this a measurement artifact? And if it really is busy, what is it doing (and does it matter)?
Yes, it's genuinely busy: it sits in a polling loop inside the driver code backing cudaDeviceSynchronize(), waiting for the GPU to finish. In a single-threaded scenario that probably doesn't matter (what else were you going to have that core do?), but in a multi-threaded scenario you may prefer some other, control-relinquishing waiting scheme, and in theory CUDA offers these; take a look at the documentation:
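For later readers: as far as I understand, the scheme is selected with cudaSetDeviceFlags(). A minimal sketch (untested here; the call must happen before anything creates the CUDA context, i.e. before the first cudaMalloc() or kernel launch in the process):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // cudaDeviceScheduleBlockingSync: block the host thread on a
    // synchronization primitive while waiting for the GPU, instead of
    // spinning. cudaDeviceScheduleYield is a middle ground: keep
    // polling, but yield the core between polls.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));

    // ... cudaMalloc(), kernel launch, cudaDeviceSynchronize() as before ...
    return 0;
}
```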
In practice, I haven't observed much difference between the flag settings. YMMV; it may also depend on the platform, the CUDA version, and the host threading model.
(Ultimately, it may not be easy to tell whether your host CPU core is busy spinning in a wait loop inside the driver, or busy because it was released to run work you had queued on some other thread.)
Thanks – that was the part of the documentation I was looking for!
Unfortunately, I also see little difference after adding
at the beginning of main: htop still shows 100% usage, although time shows an improvement in user time:
$ time ./a.out