How to make a kernel's execution wait for a signal from another thread

adeagle · October 28, 2024, 9:59am

tee cuda_event_test.cu<<-'EOF'
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <unistd.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA error in file '%s' in line %i: %s.\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

__global__ void dummyKernel(float *data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    atomicAdd(&data[idx], idx);
    if(idx==0)
    {
        printf("dummyKernel run\n");
    }
}

int main() {
    int devID0 = 0;
    int block_size=32;
    size_t dataSize = block_size * sizeof(float);
    float *data0_dev;

    CUDA_CHECK(cudaSetDevice(devID0));
    CUDA_CHECK(cudaMalloc(&data0_dev, dataSize));
    
    cudaStream_t stream0;
    CUDA_CHECK(cudaStreamCreate(&stream0));
    
    cudaEvent_t event;
    CUDA_CHECK(cudaEventCreate(&event));

    std::thread t([&]() {
            sleep(10);
            CUDA_CHECK(cudaEventRecord(event, stream0)); 
            printf("cudaEventRecord\n");
        });
    
    CUDA_CHECK(cudaStreamWaitEvent(stream0, event, 0));
    dummyKernel<<<1, block_size,0,stream0>>>(data0_dev); 

    CUDA_CHECK(cudaStreamSynchronize(stream0));
    CUDA_CHECK(cudaFree(data0_dev));
    t.join();
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -o cuda_event_test cuda_event_test.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64  -lcuda
./cuda_event_test

Curefab · October 28, 2024, 10:38am

Please be a bit more clear:

By “make a kernel’s execution wait”, do you mean that the whole kernel should wait, or that one thread within the kernel should wait?

By “from another thread”, do you mean another CPU thread, another GPU thread within that kernel, or another GPU thread within another simultaneous kernel?

adeagle · October 28, 2024, 10:42am

The whole dummyKernel should wait until thread ‘t’ is done.

Curefab · October 28, 2024, 10:55am

You could just use t.join() before invoking the dummyKernel?
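
A minimal sketch of that suggestion, assuming the launching thread is allowed to block on the host until thread t finishes. The helper name runJoinThenLaunch and the value 123 are hypothetical and only serve the illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,         \
                    __LINE__, cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void dummyKernel(int *out, int value) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = value;
}

// Hypothetical helper: returns true if the kernel saw the value that
// thread t produced, i.e. the launch happened after t finished.
bool runJoinThenLaunch() {
    int *out_dev = nullptr;
    CUDA_CHECK(cudaMalloc(&out_dev, sizeof(int)));

    int produced = 0;
    std::thread t([&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        produced = 123;  // stand-in for whatever thread t prepares
    });

    t.join();  // the launching thread blocks here until t is done
    dummyKernel<<<1, 32>>>(out_dev, produced);
    CUDA_CHECK(cudaGetLastError());

    int out_host = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(&out_host, out_dev, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(out_dev));
    return out_host == 123;
}

int main() {
    printf("%s\n", runJoinThenLaunch() ? "kernel ran after thread t"
                                       : "unexpected result");
    return 0;
}
```

The cost of this approach is that the launching thread itself stalls in t.join(), so nothing else can be enqueued from it in the meantime.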

Robert_Crovella · October 28, 2024, 1:33pm

A few ideas in addition to the one already suggested:

You could use an event-based system similar to what you have shown. It can’t be constructed the way you have shown, for at least two reasons. First, an event recorded into an “empty” stream completes immediately. Second, cudaStreamWaitEvent can’t be issued (properly) until the event has been recorded. This effectively imposes the need for additional host-based synchronization between your threads.
You could use cuStreamWaitValue32 from the driver API.
You could use a callback before the kernel that you want to wait. The callback would use ordinary host-based methods to wait for a signal from thread t telling it that it can complete. The kernel, launched after the callback, would not begin until the callback got that signal.
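
The callback idea could be sketched roughly as follows. This is an illustration, not a tested recipe from the thread: it assumes cudaLaunchHostFunc (CUDA 10 or later) as the callback mechanism and a std::promise/std::future pair as the host-side signal. The names waitForSignal and runCallbackGatedLaunch are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <future>
#include <thread>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,         \
                    __LINE__, cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void dummyKernel(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = 1;
}

// Executed in stream order on a CUDA-internal host thread.
// It must not call the CUDA API; it only blocks on the host-side future.
static void CUDART_CB waitForSignal(void *userData) {
    static_cast<std::future<void> *>(userData)->wait();
}

// Hypothetical helper: true if the stream was still pending before the
// signal and the kernel result is visible afterwards.
bool runCallbackGatedLaunch() {
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    int *out_dev = nullptr;
    CUDA_CHECK(cudaMalloc(&out_dev, sizeof(int)));
    CUDA_CHECK(cudaMemset(out_dev, 0, sizeof(int)));

    std::promise<void> signal;
    std::future<void> signalled = signal.get_future();

    // Enqueue the blocking host function first, then the kernel behind it.
    CUDA_CHECK(cudaLaunchHostFunc(stream, waitForSignal, &signalled));
    dummyKernel<<<1, 32, 0, stream>>>(out_dev);
    CUDA_CHECK(cudaGetLastError());

    // The host function is still blocked, so the stream cannot be idle yet.
    bool gated = (cudaStreamQuery(stream) == cudaErrorNotReady);

    std::thread t([&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        signal.set_value();  // releases the host function, then the kernel
    });

    CUDA_CHECK(cudaStreamSynchronize(stream));
    t.join();

    int out_host = 0;
    CUDA_CHECK(cudaMemcpy(&out_host, out_dev, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(out_dev));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return gated && out_host == 1;
}

int main() {
    printf("%s\n", runCallbackGatedLaunch() ? "kernel waited for the signal"
                                            : "unexpected result");
    return 0;
}
```

One thing to keep in mind with this sketch: the blocked host function occupies a runtime-internal thread, so other host functions queued elsewhere may be delayed while it waits.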

Sorry, I don’t have worked example/recipes/sample codes for all these ideas right at the moment, but all of them have been discussed in forum threads, which you can find with a bit of searching.
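
For the cuStreamWaitValue32 idea, a hedged sketch might look like the following. It assumes that the driver and platform support stream memory operations (the call returns an error otherwise), that the flag lives in mapped pinned host memory, and that the program is linked with -lcuda. The helper name runWaitValueGatedLaunch is hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <cuda.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,         \
                    __LINE__, cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void dummyKernel(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = 1;
}

// Hypothetical helper: true if the stream stayed pending until the host
// thread wrote the flag, and the kernel result is visible afterwards.
bool runWaitValueGatedLaunch() {
    CUDA_CHECK(cudaSetDevice(0));
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));  // also sets up the primary context
    int *out_dev = nullptr;
    CUDA_CHECK(cudaMalloc(&out_dev, sizeof(int)));
    CUDA_CHECK(cudaMemset(out_dev, 0, sizeof(int)));

    // Flag in mapped pinned host memory, plus its device-side address.
    unsigned int *flag_host = nullptr;
    CUDA_CHECK(cudaHostAlloc((void **)&flag_host, sizeof(unsigned int),
                             cudaHostAllocMapped));
    *flag_host = 0;
    void *flag_dev = nullptr;
    CUDA_CHECK(cudaHostGetDevicePointer(&flag_dev, flag_host, 0));

    // Work queued after this call does not start until *flag == 1.
    CUresult res = cuStreamWaitValue32(stream,
                                       reinterpret_cast<CUdeviceptr>(flag_dev),
                                       1, CU_STREAM_WAIT_VALUE_EQ);
    if (res != CUDA_SUCCESS) {
        fprintf(stderr, "cuStreamWaitValue32 failed (code %d)\n", (int)res);
        exit(EXIT_FAILURE);
    }
    dummyKernel<<<1, 32, 0, stream>>>(out_dev);
    CUDA_CHECK(cudaGetLastError());

    // The flag is still 0, so the stream cannot have completed yet.
    bool gated = (cudaStreamQuery(stream) == cudaErrorNotReady);

    std::thread t([&]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        *(volatile unsigned int *)flag_host = 1;  // the signal from thread t
    });

    CUDA_CHECK(cudaStreamSynchronize(stream));
    t.join();

    int out_host = 0;
    CUDA_CHECK(cudaMemcpy(&out_host, out_dev, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFreeHost(flag_host));
    CUDA_CHECK(cudaFree(out_dev));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return gated && out_host == 1;
}

int main() {
    printf("%s\n", runWaitValueGatedLaunch() ? "kernel waited for the flag"
                                             : "unexpected result");
    return 0;
}
```

Unlike the callback sketch, no host thread is parked here; the wait is resolved on the device side once the flag value changes.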

When people are working on cross-stream dependencies, I usually suggest they see if a refactoring can allow for the stream semantics to provide the ordering they are looking for. Although that description doesn’t exactly fit the suggestion already given by Curefab, it’s in the same vein.
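
Returning to the event-based idea, one way it might be arranged is sketched below. It assumes thread t has GPU work of its own in a separate stream (so the event is not recorded into an empty stream) and uses a std::promise as the extra host-side synchronization that guarantees cudaEventRecord is issued before cudaStreamWaitEvent. The kernel names, helper name, and values are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <future>
#include <thread>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,         \
                    __LINE__, cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void producerKernel(int *buf) { *buf = 7; }
__global__ void consumerKernel(const int *buf, int *out) { *out = *buf + 1; }

// Hypothetical helper: true if the consumer kernel observed the
// producer's result, i.e. the cross-stream ordering held.
bool runEventOrderedLaunch() {
    cudaStream_t producer, consumer;
    CUDA_CHECK(cudaStreamCreate(&producer));
    CUDA_CHECK(cudaStreamCreate(&consumer));
    cudaEvent_t event;
    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));

    int *buf_dev = nullptr, *out_dev = nullptr;
    CUDA_CHECK(cudaMalloc(&buf_dev, sizeof(int)));
    CUDA_CHECK(cudaMalloc(&out_dev, sizeof(int)));

    std::promise<void> recorded;
    std::future<void> recordIssued = recorded.get_future();

    std::thread t([&]() {
        CUDA_CHECK(cudaSetDevice(0));
        producerKernel<<<1, 1, 0, producer>>>(buf_dev);
        CUDA_CHECK(cudaGetLastError());
        // The event now marks real work in the producer stream.
        CUDA_CHECK(cudaEventRecord(event, producer));
        recorded.set_value();  // tell the main thread the record was issued
    });

    // Host-side synchronization: only issue the wait after the record.
    recordIssued.wait();
    CUDA_CHECK(cudaStreamWaitEvent(consumer, event, 0));
    consumerKernel<<<1, 1, 0, consumer>>>(buf_dev, out_dev);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaStreamSynchronize(consumer));
    t.join();

    int out_host = 0;
    CUDA_CHECK(cudaMemcpy(&out_host, out_dev, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(buf_dev));
    CUDA_CHECK(cudaFree(out_dev));
    CUDA_CHECK(cudaEventDestroy(event));
    CUDA_CHECK(cudaStreamDestroy(producer));
    CUDA_CHECK(cudaStreamDestroy(consumer));
    return out_host == 8;
}

int main() {
    printf("%s\n", runEventOrderedLaunch() ? "consumer ran after producer"
                                           : "unexpected result");
    return 0;
}
```

Here the main thread only waits on the host until the record call has been issued, not until the producer's GPU work has finished; the remaining ordering is enforced on the device by the event.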
