CPU Spins while waiting for GPU to finish computation

I simulate a long-running GPU kernel with the code below:

#include <stdio.h>
#include <stdlib.h>

__global__ void saxpy(int n, float *x)
{
    // Busy-wait for ~9 billion GPU clock cycles to simulate a long-running kernel.
    long long start = clock64();
    long long now;
    for (;;) {
        now = clock64();
        long long cycles = now - start; // clock64() is a 64-bit counter, so wraparound is not a practical concern
        if (cycles >= 9000000000LL) {
            break;
        }
    }
    // Write something derived from the loop so it isn't optimized away.
    x[1] = now;
    x[10] = blockIdx.x * blockDim.x + threadIdx.x;
}

int main(void)
{
    int N = 1 << 10;
    float *x, *d_x;
    x = (float *)malloc(N * sizeof(float));
    cudaSetDeviceFlags(cudaDeviceScheduleYield); // select the host-side wait ("scheduling") mode
    cudaMalloc(&d_x, N * sizeof(float));

    for (int i = 0; i < N; i++)
    {
        x[i] = 1.0f;
    }

    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaStream_t stream_1;
    cudaStreamCreate(&stream_1);
    saxpy<<<(N + 255) / 256, 256, 0, stream_1>>>(N, d_x);
    
    printf("Hello world\n");
    cudaMemcpy(x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Hi");
    
    cudaStreamDestroy(stream_1);
    cudaFree(d_x);
    free(x);
}

Here is the nsys timeline:

[nsys timeline screenshot]

Here, we see that the CPU spins while waiting for the saxpy kernel to complete and then performs the device-to-host data transfer.
Questions

  • Why does the CPU spin when it could be doing other things? Is there a reason behind this?
  • If there is a way to yield the host thread, how can one do it on a Linux machine?

Thanks

@jcohen

@hwilper Is there a resolution to the above issue?

I’m getting you an update now.

Hi Puneeth,

There are two things to note here… the fact that cudaMemcpy is blocking, and the way it is blocking.

The cudaMemcpy call blocks because it is the synchronous version of the memcpy API. The asynchronous version, cudaMemcpyAsync, will return immediately once the memcpy command has been enqueued to the GPU, which would allow that CPU thread to do other tasks while the GPU is working through the commands you’ve sent to it. Then you’d eventually call cudaDeviceSynchronize or cudaStreamSynchronize to make the CPU thread wait for all the GPU work to finish. Note that the <<< >>> syntax for kernel launches is asynchronous – that’s why the synchronous cudaMemcpy call right after the launch is blocking until both the saxpy kernel and the DtoH memcpy are completed. You can think of cudaMemcpy as shorthand for cudaMemcpyAsync followed by cudaDeviceSynchronize. You are correctly using the stream argument to the <<< >>> launch, so you just need to change your cudaMemcpy calls to cudaMemcpyAsync, and pass stream_1 to them as well. Fixing up your example, it would look like:

    cudaStream_t stream_1;
    cudaStreamCreate(&stream_1);

    cudaMemcpyAsync(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice, stream_1);
    saxpy<<<(N + 255) / 256, 256, 0, stream_1>>>(N, d_x);
    cudaMemcpyAsync(x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost, stream_1);
    
    printf("Now the CPU can do other things while the GPU is working...\n");
    printf("Then we wait to make sure the GPU is finished before using its output.\n");
    cudaStreamSynchronize(stream_1);
    printf("Now it's safe for the CPU to read data from buffer x\n");

As for why the CPU is spinning during cudaMemcpy while waiting for the kernel to complete, note that you’ve chosen cudaDeviceScheduleYield for the scheduling mode. There are 3 different modes here:

  • cudaDeviceScheduleSpin
  • cudaDeviceScheduleYield
  • cudaDeviceScheduleBlockingSync

There’s also an “auto” option that tries to cleverly select one of the above modes based on your situation. The best documentation of the AUTO mode I’ve found is here:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html#group__CUDA__CTX_1g65dc0012348bc84810e2103a40d8e2cf
…for the CU_CTX_SCHED_AUTO context create flag.
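As a quick sketch, requesting AUTO explicitly through the driver API looks like the following (dev is assumed to be a CUdevice you’ve already obtained via cuInit/cuDeviceGet; the runtime-API equivalent is cudaSetDeviceFlags(cudaDeviceScheduleAuto)):

CUcontext ctx;
cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, dev);  /* let the driver pick spin vs. yield vs. block */

But the three fundamental modes work as follows: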

“Spin” means the driver does a busy-wait, e.g.:

while (!GpuWorkFinished()) {}

This means the thread will cause one CPU core to go to 100% utilization while waiting. The OS may move that thread from one core to another, but one core will stay busy until the loop exits. If your CPU has other cores that aren’t busy, this doesn’t prevent the CPU from doing other work in the meantime. Even if the cores are oversubscribed (e.g. you have 10 active threads on an 8-core CPU), the OS will ensure the threads take turns on the cores and can all make forward progress.

“Yield” is similar to spin, except inside the loop, a call is made to the OS to suggest it schedule other threads instead, e.g.:

while (!GpuWorkFinished()) { sched_yield(); }

This means you’re still busy-waiting, so this loop will still cause one CPU core to go to 100% utilization until the loop exits. The only difference is that in an oversubscribed situation (more active threads than CPU cores), the OS won’t spend full timeslices running the loop. Instead, it will immediately give priority to any other active thread. So this avoids taking CPU core time away from a fully utilized system when other threads could be making progress. However, there may be higher latency for the waiting thread to detect that the work is finished and proceed, since the waiting thread isn’t checking as frequently – it has to be scheduled back in for each check.

“BlockingSync” means the thread will not busy-wait – it will use an OS kernel notification mechanism like sem_wait to suspend itself, and then have the OS wake it up when the GPU work has finished. This means the thread will not use a CPU core while waiting. In an oversubscribed-CPU scenario, this ensures the waiting thread is the least disruptive to the other threads. But it is also the slowest for the waiting thread to resume from.
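In the same illustrative style as the loops above (the semaphore here is just a stand-in for whatever primitive the driver actually uses, not a real driver symbol):

sem_wait(&gpu_work_finished); /* thread sleeps here; the OS wakes it when the GPU signals completion */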

In general, if you want CUDA syncs (the cudaDeviceSynchronize or cudaStreamSynchronize calls, or any other implicitly synchronous calls like cudaMemcpy) to have the lowest latency of waking up after the GPU work completes, you should use Spin mode. If you expect to be in a scenario where all the CPU cores are fully utilized doing work for your application, where wasting a core on busy-waiting for long-running kernels would hurt overall throughput, then I’d recommend BlockingSync. Yield tends to have the disadvantages of both Spin (uses up a CPU core) and BlockingSync (higher latency since the thread has to be scheduled back onto a core), so it’s usually not the optimal choice.
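For reference, the wait mode is chosen with the same call your example already uses; it just needs to happen before the device is initialized. A minimal sketch:

cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync); /* or cudaDeviceScheduleSpin / ...Yield / ...Auto */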

Hope that helps!

Thanks, @jasoncohen, for the detailed answer. I am playing around with a PyTorch application, and AFAIK it does not let us set the wait mode. Then again, as you said, spin mode is preferable since the latency of waking up and doing further work is lowest with spin, and since most kernels are of short duration, spin makes sense as the default wait mode… Thanks, @hwilper, for the update.
