Using cudaEvents to synchronise with cudaStreamCallback

stebos100 · May 7, 2024, 2:55pm

Hi all, I was wondering if someone could help me regarding using CudaEvents in combination with using the cudaStreamAddCallback() function. As far as I understand it, the callback function is used to call a host function once the work in the specified stream has finished its tasks.

So my thinking was, I create a struct which has a cudaEvent for start and stop times, and once the relevant task has been completed by the cudaStream we can calculate the time taken for task execution. This however is producing some strange results… and I was wondering if someone could guide me in the right direction.

The line which prints "cpu side is : " prints a value in milliseconds. but when the callBack function is called, and the elapsed time is calculated, it is in the order of (some number) e^-28.

struct timeStamp {
    std::basic_string<char> functionId;
    cudaEvent_t start;
    cudaEvent_t end;
    cudaStream_t stream;
    float elapsedTimeMilli;

    timeStamp(std::basic_string<char> functionid, cudaEvent_t s, cudaEvent_t e): functionId(functionid), start(s), end(e) {}

    timeStamp(std::basic_string<char> functionid, cudaStream_t stream): functionId(functionid), stream(stream) {}

};


void streamCallback(cudaStream_t stream, cudaError_t status, void* data){

    timeStamp* hostData = static_cast<timeStamp*>(data);

    if(status == cudaSuccess) {
        cudaEventElapsedTime(&hostData->elapsedTimeMilli, hostData->start, hostData->event);
    }

    std::cout << "The total time taken for the stream execution is : " << hostData->elapsedTimeMilli << " milliseconds for task "<< std::endl;

}


void ExecuteItem(size_t p, size_t deviceId, std::shared_ptr<cuda_stream> stream) {

        float milliseconds;

        timeStamp ts = timeStamp(typeid(*this).name(), stream->GetStream());

        cudaEventCreate(&ts.start);
        cudaEventCreate(&ts.end);

        cudaEventRecord(ts.start, ts.stream);
		//execute the following function using the stream and device ID
        Execute(p, deviceId, stream);

        cudaEventRecord(ts.end, ts.stream);

        cudaEventSynchronize(ts.end);
        cudaStreamAddCallback(ts.stream, streamCallback, (void*)&(ts), 0);

        cudaEventElapsedTime(&milliseconds, ts.start,ts.end);

        std::cout << "cpu side is : " << milliseconds << std::endl;

}

Robert_Crovella · May 7, 2024, 3:43pm

calling the CUDA API from a host function/callback is illegal.

(also, cuda stream add callback has a suggested replacement: cudaLaunchHostFunc)

I think, at least based on what you have shown, there is no need for a host callback. Let’s declare some things about stream execution:

This is a CPU thread synchronizing operation:

  cudaEventSynchronize(ts.end);

That means that the CPU thread waits right there, in that function call, until that named event has been completed. This will occur after the device activity up through the Execute(...) call has completed, for the stream ts.stream. Likewise, since the CPU thread has reach the point of cudaEventSynchronize(...), we also know that CPU activity up to that point is complete.

So when you subsequently do this:

    cudaStreamAddCallback(ts.stream, streamCallback, (void*)&(ts), 0);

there is no prior work on the device, so that named function (streamCallback) should begin to execute immediately, barring the case where you have other callbacks registered. But whether you do or not, there is no particular need for a host callback - you can simply do the event elapsed time work right there.

Why are you doing this twice:

    cudaEventElapsedTime(&milliseconds, ts.start,ts.end);

(once after setting up the callback, and once in the callback)?

stebos100 · May 8, 2024, 7:21am

Hi @Robert_Crovella , thanks so much for the insight and for getting back to me.

So the actual goal that I have in mind is the following: Asynchronously perform stream operations, without blocking the CPU. But once a “task” has been executed and completed by a stream. I want to log the time that the stream execution took. Hence the cudaStreamAddCallback() function ( I have looked into the cudaLaunchHostFunc() as well thank you ! will maybe try and give this a go).

So I see now that using the:

 cudaEventSynchronize(ts.end);

will not achieve the desired outcome. Regarding the second call to →

 cudaEventElapsedTime();

I was just checking whether the elapsed time produced on in the code would be the same as the streamCallback function (which was not, asssuming this is becuase I am calling the CUDA API from a host function).

Do you have any suggestions on how I could achieve the asynchronous time logging using cuda events ? without blocking the CPU thread that is ?

striker159 · May 8, 2024, 2:35pm

You could use a stream callback / host function to pass the recorded events to a non-cuda thread which performs cudaEventElapsedTime()

Robert_Crovella · May 8, 2024, 2:55pm

assuming the callback does not wait on the non-cuda thread, you will lose synchronization that way. Which may be fine, for example if the events are only recorded once/not reused. But in that case, it seems like a simpler method would be just to wait until the elapsed time data is actually needed, then generate at that point.

Using a profiler is the only other idea that comes to mind. Conceptually the profiler is building a large table of everything that happened, and then doing post-processing on that table. You could do something conceptually similar by having a table of CUDA events, that are only each recorded once, and then do your post-processing later, i.e. when the data is needed. There is going to be some practical limit to this approach, as event creation presumably consumes resources and eventually you may run into a limit with a large number of created events.

CU_Steve · May 9, 2024, 11:58pm

I have done what you want to do.

My suggestions:

Create all events in advance and maintain a pool (vector or queue) of events that are available for use. Pop events from the pool as you need them.
Push your time stamps to a queue once their events are recorded to the stream.
You may want a separate queue for each stream you use or each kind of time measurement you make.
Poll the front of the time stamp queue(s), using cudaEventQuery() to test if ts.end is complete. If it is, call cudaEventElapsedTime() and save the result somewhere.
When you have made your time measurement and no longer need the time stamp or the events it contains, return its events to the event pool for reuse.

FYI, my benchmarking of cudaEventQuery() showed it was very fast, but see caveat below.

Beware:

If you are using Windows without HAGS enabled, cudaEventQuery() may interfere with concurrency. See Fewer concurrent kernels with Hardware Accelerated GPU Scheduling (HAGS).
If you use more than one stream, the time measurements may be confusing or of limited value, depending on what you are trying to measure. See the explanation at cudaEventElapsedTime().

Topic		Replies	Views
cudaLaunchHostFunc + cudaEventElapsedTime? CUDA Programming and Performance	4	903	August 3, 2022
Unexpected result with CudaEventRecord on stream CudaEventRecord on a stream waits for completion o CUDA Programming and Performance	0	1943	November 12, 2008
cudaEvntRecord for stream gives incorrect result CUDA Programming and Performance	0	2057	October 31, 2008
Very newbie questions on synchronisation between GPU & CPU, and time measurement CUDA Programming and Performance	4	516	December 17, 2017
Timer&Event CUDA Programming and Performance	3	3607	December 1, 2009
Measure execution time Multiple GPU on CUDA 4.0 with cudaEvents CUDA Programming and Performance	1	9821	March 9, 2011
Event Synchronization CUDA Programming and Performance	6	1797	February 8, 2019
cudaEventRecord functionality CUDA Programming and Performance	1	11480	March 10, 2009
Does cudaEventRecord time the cpu sentences? CUDA Programming and Performance	2	435	October 12, 2021
Timing cuda code I'm sorry for small for dÃ©ja-vu :-) CUDA Programming and Performance	12	36016	July 12, 2011

Using cudaEvents to synchronise with cudaStreamCallback

Related topics