Using cudaEvents to synchronise with cudaStreamCallback

Hi all, I was wondering if someone could help me regarding using CudaEvents in combination with using the cudaStreamAddCallback() function. As far as I understand it, the callback function is used to call a host function once the work in the specified stream has finished its tasks.

So my thinking was, I create a struct which has a cudaEvent for start and stop times, and once the relevant task has been completed by the cudaStream we can calculate the time taken for task execution. This however is producing some strange results… and I was wondering if someone could guide me in the right direction.

The line which prints "cpu side is : " prints a value in milliseconds. but when the callBack function is called, and the elapsed time is calculated, it is in the order of (some number) e^-28.

struct timeStamp {
    std::basic_string<char> functionId;
    cudaEvent_t start;
    cudaEvent_t end;
    cudaStream_t stream;
    float elapsedTimeMilli;

    timeStamp(std::basic_string<char> functionid, cudaEvent_t s, cudaEvent_t e): functionId(functionid), start(s), end(e) {}

    timeStamp(std::basic_string<char> functionid, cudaStream_t stream): functionId(functionid), stream(stream) {}

};


void streamCallback(cudaStream_t stream, cudaError_t status, void* data){

    timeStamp* hostData = static_cast<timeStamp*>(data);

    if(status == cudaSuccess) {
        cudaEventElapsedTime(&hostData->elapsedTimeMilli, hostData->start, hostData->event);
    }

    std::cout << "The total time taken for the stream execution is : " << hostData->elapsedTimeMilli << " milliseconds for task "<< std::endl;

}


void ExecuteItem(size_t p, size_t deviceId, std::shared_ptr<cuda_stream> stream) {

        float milliseconds;

        timeStamp ts = timeStamp(typeid(*this).name(), stream->GetStream());

        cudaEventCreate(&ts.start);
        cudaEventCreate(&ts.end);

        cudaEventRecord(ts.start, ts.stream);
		//execute the following function using the stream and device ID
        Execute(p, deviceId, stream);

        cudaEventRecord(ts.end, ts.stream);

        cudaEventSynchronize(ts.end);
        cudaStreamAddCallback(ts.stream, streamCallback, (void*)&(ts), 0);

        cudaEventElapsedTime(&milliseconds, ts.start,ts.end);

        std::cout << "cpu side is : " << milliseconds << std::endl;

}

calling the CUDA API from a host function/callback is illegal.

(also, cuda stream add callback has a suggested replacement: cudaLaunchHostFunc)

I think, at least based on what you have shown, there is no need for a host callback. Let’s declare some things about stream execution:

This is a CPU thread synchronizing operation:

  cudaEventSynchronize(ts.end);

That means that the CPU thread waits right there, in that function call, until that named event has been completed. This will occur after the device activity up through the Execute(...) call has completed, for the stream ts.stream. Likewise, since the CPU thread has reach the point of cudaEventSynchronize(...), we also know that CPU activity up to that point is complete.

So when you subsequently do this:

    cudaStreamAddCallback(ts.stream, streamCallback, (void*)&(ts), 0);

there is no prior work on the device, so that named function (streamCallback) should begin to execute immediately, barring the case where you have other callbacks registered. But whether you do or not, there is no particular need for a host callback - you can simply do the event elapsed time work right there.

Why are you doing this twice:

    cudaEventElapsedTime(&milliseconds, ts.start,ts.end);

(once after setting up the callback, and once in the callback)?

Hi @Robert_Crovella , thanks so much for the insight and for getting back to me.

So the actual goal that I have in mind is the following: Asynchronously perform stream operations, without blocking the CPU. But once a “task” has been executed and completed by a stream. I want to log the time that the stream execution took. Hence the cudaStreamAddCallback() function ( I have looked into the cudaLaunchHostFunc() as well thank you ! will maybe try and give this a go).

So I see now that using the:

 cudaEventSynchronize(ts.end);

will not achieve the desired outcome. Regarding the second call to →

 cudaEventElapsedTime();

I was just checking whether the elapsed time produced on in the code would be the same as the streamCallback function (which was not, asssuming this is becuase I am calling the CUDA API from a host function).

Do you have any suggestions on how I could achieve the asynchronous time logging using cuda events ? without blocking the CPU thread that is ?

You could use a stream callback / host function to pass the recorded events to a non-cuda thread which performs cudaEventElapsedTime()

assuming the callback does not wait on the non-cuda thread, you will lose synchronization that way. Which may be fine, for example if the events are only recorded once/not reused. But in that case, it seems like a simpler method would be just to wait until the elapsed time data is actually needed, then generate at that point.

Using a profiler is the only other idea that comes to mind. Conceptually the profiler is building a large table of everything that happened, and then doing post-processing on that table. You could do something conceptually similar by having a table of CUDA events, that are only each recorded once, and then do your post-processing later, i.e. when the data is needed. There is going to be some practical limit to this approach, as event creation presumably consumes resources and eventually you may run into a limit with a large number of created events.

I have done what you want to do.

My suggestions:

  1. Create all events in advance and maintain a pool (vector or queue) of events that are available for use. Pop events from the pool as you need them.

  2. Push your time stamps to a queue once their events are recorded to the stream.
    You may want a separate queue for each stream you use or each kind of time measurement you make.

  3. Poll the front of the time stamp queue(s), using cudaEventQuery() to test if ts.end is complete. If it is, call cudaEventElapsedTime() and save the result somewhere.

  4. When you have made your time measurement and no longer need the time stamp or the events it contains, return its events to the event pool for reuse.

FYI, my benchmarking of cudaEventQuery() showed it was very fast, but see caveat below.

Beware:

  1. If you are using Windows without HAGS enabled, cudaEventQuery() may interfere with concurrency. See Fewer concurrent kernels with Hardware Accelerated GPU Scheduling (HAGS).

  2. If you use more than one stream, the time measurements may be confusing or of limited value, depending on what you are trying to measure. See the explanation at cudaEventElapsedTime().