cudaStreamSynchronize is much slower than polling on a flag for kernel completion

nkwkelvin · January 14, 2023, 11:12pm

I am working on a project where I need to run many small kernels concurrently, and I found that cudaStreamSynchronize has become the bottleneck. To avoid it, I tried a flag-based approach: the kernel sets a flag in pinned memory as it completes, and the CPU code busy-polls the flag until it is set. Below is example code that compares the performance of cudaStreamSynchronize and the flag-based approach:

#include <cuda_runtime.h>

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <vector>

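// Empty kernel; the host detects completion with cudaStreamSynchronize.
__global__ void dummy_cuda_sync() {
}

// Signals completion by writing to a flag that lives in pinned host memory.
__global__ void dummy_flag(volatile int* flag) {
    *flag = 1;
}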

void run_cuda_sync(int num_iter, cudaStream_t stream) {
    for (int i = 0; i < num_iter; ++i) {
        dummy_cuda_sync<<<1, 1, 0, stream>>>();
        cudaStreamSynchronize(stream);
    }
}

void run_flag(int num_iter, cudaStream_t stream, volatile int* flag) {
    for (int i = 0; i < num_iter; ++i) {
        *flag = 0;

        dummy_flag<<<1, 1, 0, stream>>>(flag);

        while (*flag == 0);
    }
}

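int main(int argc, char** argv) {
    // Both arguments are required; atoi(argv[n]) would otherwise read a null pointer.
    if (argc < 3) {
        fprintf(stderr, "usage: %s <num_iter> <num_threads>\n", argv[0]);
        return 1;
    }
    int num_iter = atoi(argv[1]);
    int num_thrs = atoi(argv[2]);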

    std::vector<cudaStream_t> streams;
    streams.resize(num_thrs);
    for (int i = 0; i < num_thrs; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    volatile int* flags;
    cudaMallocHost(&flags, sizeof(int) * num_thrs);

    double time_cuda_sync;
    double time_flag;

    {
    std::vector<std::thread> thrs;
    auto start_time = std::chrono::steady_clock::now();
    for (int i = 0; i < num_thrs; ++i) {
        thrs.emplace_back(run_cuda_sync, num_iter, streams[i]);
    }
    for (auto& thr : thrs) {
        thr.join();
    }
    auto end_time = std::chrono::steady_clock::now();
    time_cuda_sync = std::chrono::duration<double, std::micro>(end_time - start_time).count();
    }

    {
    std::vector<std::thread> thrs;
    auto start_time = std::chrono::steady_clock::now();
    for (int i = 0; i < num_thrs; ++i) {
        thrs.emplace_back(run_flag, num_iter, streams[i], flags + i);
    }
    for (auto& thr : thrs) {
        thr.join();
    }
    auto end_time = std::chrono::steady_clock::now();
    time_flag = std::chrono::duration<double, std::micro>(end_time - start_time).count();
    }

    printf("%f,%f\n", time_cuda_sync, time_flag);
}

Here are the results:

I have two questions regarding the results:

Why is the time linear in the number of threads? My kernel is so small that even when I run 20 at a time, the GPU should be able to run them concurrently, so the total time should not increase.
Why is cudaStreamSynchronize so much slower than the flag-based approach?

Thank you.

nkwkelvin · January 23, 2023, 9:27pm

I tried adding __threadfence_system(); before *flag = 1; in dummy_flag(). Here are the results:

So even with __threadfence_system(), the flag-based approach is still much faster than cudaStreamSynchronize(), especially under high concurrency.
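A minimal sketch of the kernel change described above, assuming the rest of the program stays as in the first post. The name dummy_flag_fenced is hypothetical and only used here to distinguish it from the original dummy_flag:

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of dummy_flag: a system-wide fence is issued before
// the flag store, so that earlier writes made by this thread are ordered
// before the store the host is polling on.
__global__ void dummy_flag_fenced(volatile int* flag) {
    // (any real work done by the kernel would go here)
    __threadfence_system();
    *flag = 1;
}
```

The fence matters once the kernel produces results the host reads after seeing the flag; with an otherwise empty kernel it mostly adds a small fixed cost.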

Robert_Crovella · January 23, 2023, 9:58pm

It’s not the best design for CUDA work.

The CUDA runtime has known locking behavior when multiple threads use the CUDA runtime simultaneously. This could impact both kernel launches (which use the runtime under the hood) as well as explicit calls such as cudaDeviceSynchronize().

It’s possible they are not measuring the same thing. A kernel which has posted something in pinned memory is still running and has not released its resources back to the block scheduler/CWD. A kernel launched into a stream that has definitely reached a CUDA stream sync point has definitely done those things.

More generally, back to the original idea, kernels whose duration is long compared to the launch overhead are going to make more efficient use of the GPU than kernels that are short relative to the launch overhead. So you are definitely operating in a region here that is known to be non-optimal.

Also, launch overhead, launch efficiency, and even multi-stream behavior are things that get worked on by the CUDA dev team from time to time. Therefore, if by chance you happen to be working on an old CUDA version (9.x would be “old” for example) you may discover better behavior with the latest CUDA releases.

Finally, you can affect the CUDA behavior at a sync point. You might wish to experiment with spin vs. yield.
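One way to try spin vs. yield is cudaSetDeviceFlags with cudaDeviceScheduleSpin or cudaDeviceScheduleYield. The sketch below is only an illustration: the "spin"/"yield" command-line argument, the kernel name empty_kernel, and the iteration count are assumptions, not part of the original test program.

```cuda
#include <cuda_runtime.h>

#include <chrono>
#include <cstdio>
#include <cstring>

__global__ void empty_kernel() {}

int main(int argc, char** argv) {
    // Hypothetical command line: "spin" or "yield"; anything else keeps the default.
    unsigned int flags = cudaDeviceScheduleAuto;
    if (argc > 1 && strcmp(argv[1], "spin") == 0) flags = cudaDeviceScheduleSpin;
    if (argc > 1 && strcmp(argv[1], "yield") == 0) flags = cudaDeviceScheduleYield;

    // Set the scheduling policy early, before any other CUDA work in the process.
    cudaError_t err = cudaSetDeviceFlags(flags);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up launch so one-time initialization is not included in the timing.
    empty_kernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);

    const int num_iter = 10000;  // illustrative value
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < num_iter; ++i) {
        empty_kernel<<<1, 1, 0, stream>>>();
        cudaStreamSynchronize(stream);
    }
    auto end = std::chrono::steady_clock::now();

    double total_us = std::chrono::duration<double, std::micro>(end - start).count();
    printf("%f us per launch+sync\n", total_us / num_iter);

    cudaStreamDestroy(stream);
    return 0;
}
```

Running it once with each setting gives a rough per-iteration comparison; whether the policy makes a visible difference will depend on the system and CUDA version.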

nkwkelvin · January 26, 2023, 8:31pm

Thank you for your reply. In an effort to understand how the lock of the CUDA runtime affects the performance of cudaStreamSynchronize, I tried to use cudaLaunchHostFunc to register a callback after each kernel, so that there will be only one CPU thread monitoring all streams.

#include <cuda_runtime.h>

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>
#include <queue>
#include <atomic>

std::vector<cudaStream_t> streams;
std::queue<unsigned> noti_queue;
std::atomic_uint noti_queue_num;
std::mutex mtx;

__global__ void dummy() {
}

void callback(void* stream_id_) {
    unsigned stream_id = (unsigned long)stream_id_;
    std::unique_lock<std::mutex> lock(mtx);
    noti_queue.push(stream_id);
    lock.unlock();
    noti_queue_num.fetch_add(1, std::memory_order_release);
}

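int main(int argc, char** argv) {
    // Both arguments are required; atoi(argv[n]) would otherwise read a null pointer.
    if (argc < 3) {
        fprintf(stderr, "usage: %s <num_iter> <num_streams>\n", argv[0]);
        return 1;
    }
    unsigned num_iter = atoi(argv[1]);
    unsigned num_streams = atoi(argv[2]);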

    streams.resize(num_streams);
    for (unsigned i = 0; i < num_streams; ++i) {
        cudaStreamCreate(&streams[i]);
    }

    std::vector<unsigned> streams_finished;
    streams_finished.resize(num_streams);

    unsigned total_finished = 0;
    const unsigned total_num = num_iter * num_streams;

    noti_queue_num.store(0);

    auto start_time = std::chrono::steady_clock::now();

    for (unsigned i = 0; i < num_streams; ++i) {
        dummy<<<1, 1, 0, streams[i]>>>();
        cudaLaunchHostFunc(streams[i], callback, (void*)i);
    }

    while (total_finished < total_num) {
        while (noti_queue_num.load(std::memory_order_acquire) == 0);
        noti_queue_num.fetch_sub(1, std::memory_order_release);

        std::unique_lock<std::mutex> lock(mtx);
        unsigned stream_id = noti_queue.front();
        noti_queue.pop();
        lock.unlock();

        ++streams_finished[stream_id];
        ++total_finished;

        if (streams_finished[stream_id] < num_iter) {
            dummy<<<1, 1, 0, streams[stream_id]>>>();
            cudaLaunchHostFunc(streams[stream_id], callback, (void*)stream_id);
        }
    }

    auto end_time = std::chrono::steady_clock::now();

    double time_elapsed = std::chrono::duration<double, std::micro>(end_time - start_time).count();

    printf("%f\n", time_elapsed);
}

And below are the results:

It is actually even slower than cudaStreamSynchronize. Any insight into this?

Also, regarding the sync method, I tried using cudaSetDeviceFlags to set it to spin, but the results for cudaStreamSynchronize are similar.

Robert_Crovella · February 2, 2023, 4:22am

You’re adding more cumbersome weight to an already inefficient work issuance strategy; I’m not surprised it makes performance worse.

nkwkelvin · February 2, 2023, 4:25am

So you mean the slowness is caused by the call to cudaLaunchHostFunc instead of the synchronization?

Anyway, do you have any suggestions on how I should listen to multiple streams from only one CPU thread?

Robert_Crovella · February 2, 2023, 4:37am

I don’t know what the slowness is caused by. Your approach seems unwise to me, and adding additional CUDA runtime API calls to each kernel launch of tiny little kernels is only going to make the overhead problem worse.

I never had any trouble monitoring multiple streams from one CPU thread using cudaStreamQuery. I agree you may not like the performance of it if you use it to monitor kernels that have less than 50us duration, and I don’t have any further suggestions or comments on that topic.
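As a rough illustration of monitoring several streams from one CPU thread with cudaStreamQuery, here is a minimal sketch. The stream count, iteration count, and the relaunch-on-completion loop are assumptions chosen to mirror the test programs earlier in the thread, not code from any of the posts.

```cuda
#include <cuda_runtime.h>

#include <cstdio>
#include <vector>

__global__ void dummy() {}

int main() {
    const int num_streams = 4;   // illustrative values
    const int num_iter = 1000;

    std::vector<cudaStream_t> streams(num_streams);
    std::vector<int> finished(num_streams, 0);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Start with one kernel in flight per stream.
    for (auto& s : streams) dummy<<<1, 1, 0, s>>>();

    int remaining = num_streams;  // streams that still have iterations left
    while (remaining > 0) {
        for (int i = 0; i < num_streams; ++i) {
            if (finished[i] == num_iter) continue;  // this stream is done

            // cudaSuccess: all work in the stream has completed.
            // cudaErrorNotReady: work is still pending; check again later.
            cudaError_t status = cudaStreamQuery(streams[i]);
            if (status == cudaErrorNotReady) continue;
            if (status != cudaSuccess) {
                fprintf(stderr, "stream %d: %s\n", i, cudaGetErrorString(status));
                return 1;
            }

            // The single outstanding kernel finished: count it, then relaunch or retire.
            if (++finished[i] < num_iter) {
                dummy<<<1, 1, 0, streams[i]>>>();
            } else {
                --remaining;
            }
        }
    }

    for (auto& s : streams) cudaStreamDestroy(s);
    printf("all %d streams completed %d kernels each\n", num_streams, num_iter);
    return 0;
}
```

Each query is still a runtime API call, so for kernels this short the polling overhead itself is likely to be a significant part of the measured time.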

It’s a bad idea. Taking a bad idea and grafting additional scaffolding on top of a bad idea is not going to make the bad idea any better.

It’s a bad idea. I doubt I would be able to comment further.

If you have a fixed sequence of very short kernels you want to run, CUDA graphs may offer some improvement, but it’s not going to provide any benefit for your “monitoring”.
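A minimal sketch of the CUDA graphs idea, assuming the sequence of short kernels is fixed and can be recorded once with stream capture. The kernel name step, the sequence length, and the replay count are hypothetical; cudaGraphInstantiateWithFlags is used here on the assumption of CUDA 11.4 or newer.

```cuda
#include <cuda_runtime.h>

#include <cstdio>

__global__ void step() {}  // stand-in for one very short kernel

int main() {
    const int kernels_per_graph = 8;  // illustrative values
    const int num_replays = 1000;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the kernel sequence once; nothing executes during capture.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < kernels_per_graph; ++i) {
        step<<<1, 1, 0, stream>>>();
    }
    cudaStreamEndCapture(stream, &graph);

    // Turn the recorded graph into an executable object (CUDA 11.4+ assumed).
    cudaGraphExec_t graph_exec;
    cudaError_t err = cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "graph instantiate: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Each replay submits the whole sequence with a single launch call.
    for (int i = 0; i < num_replays; ++i) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    printf("replayed %d graphs of %d kernels\n", num_replays, kernels_per_graph);
    return 0;
}
```

This reduces the per-kernel launch overhead on the host side, but detecting completion of each graph still needs one of the mechanisms discussed above.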

nkwkelvin · February 2, 2023, 6:49pm

Thank you. I will just use the atomic flag approach then.
