cudaLaunchHostFunc + cudaEventElapsedTime?

Hey all. I’d like to asynchronously calculate the duration of an async kernel launch or async copy.

I could use events, but I don’t want to synchronize on an event and stall anything. I could capture CPU time before the launch, and again in cudaLaunchHostFunc, but that time will also include any queuing time.

It seems like the ideal combination might be to do something like this:

void recordScope(void *datav) {
 ...
  float millis;
  cudaEventElapsedTime(&millis, data.start, data.stop);
 ...
}

cudaEventRecord(data.start, stream);
launch();
cudaEventRecord(data.stop, stream);
cudaLaunchHostFunc(stream, recordScope, &data);

Unfortunately, the documentation for cudaLaunchHostFunc states that no cuda calls should be made within the callback. Does that apply to cudaEventElapsedTime as well? Are there other ways to do what I want?

Thanks!

Yes, that restriction applies to cudaEventElapsedTime as well.

I haven’t noticed much difference between proper CPU-based timing of a kernel launch and proper use of a cudaEvent-based method along the lines of what you have shown.

Of course you could wait until some other synchronization point that you need anyway. But you’re specifically asking how to avoid that.
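Just to spell that out, the deferred version would look something like this (the names are illustrative, not from your code):

// record events around the launch, but do not synchronize here
cudaEventRecord(start, stream);
launch_work(stream);            // placeholder for your kernel launch
cudaEventRecord(stop, stream);

// ... later, at a synchronization point your application needs anyway:
cudaStreamSynchronize(stream);  // or cudaDeviceSynchronize()
float ms;
cudaEventElapsedTime(&ms, start, stop);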

The only other idea I have would be to spin up additional thread(s), and use interthread signaling using e.g. a semaphore, so that the callback signals your waiting thread to go ahead and compute the cudaEventElapsedTime. That seems like an awful lot of machinery.
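Roughly, I’m imagining something like this (the TimingJob struct and signalReady name are just for illustration, and a condition variable stands in for the semaphore):

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <iostream>

struct TimingJob {
  cudaEvent_t start, stop;
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
};

// Host callback: no CUDA API calls here, just signal the waiting thread.
void signalReady(void *p) {
  TimingJob *job = static_cast<TimingJob *>(p);
  {
    std::lock_guard<std::mutex> lk(job->m);
    job->ready = true;
  }
  job->cv.notify_one();
}

__global__ void work() { }   // stand-in for the real kernel

int main() {
  TimingJob job;
  cudaEventCreate(&job.start);
  cudaEventCreate(&job.stop);
  cudaStream_t s;
  cudaStreamCreate(&s);

  // Worker thread: waits for the callback's signal, then makes the CUDA
  // timing call outside of the callback.
  std::thread worker([&job] {
    std::unique_lock<std::mutex> lk(job.m);
    job.cv.wait(lk, [&job] { return job.ready; });
    float ms;
    cudaEventElapsedTime(&ms, job.start, job.stop);
    std::cout << "Kernel duration: " << ms << " ms" << std::endl;
  });

  cudaEventRecord(job.start, s);
  work<<<1, 1, 0, s>>>();
  cudaEventRecord(job.stop, s);
  cudaLaunchHostFunc(s, signalReady, &job);

  worker.join();               // only here so the sketch exits cleanly
  cudaDeviceSynchronize();
}

In a real application the main thread would keep enqueueing work; the join at the end is just to keep the sketch self-contained. As I said, an awful lot of machinery.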

Perhaps others will have better ideas.


Thanks. What you said makes sense, though I’d expect some difference between host timing taken before the launch and after it, since I’m running many things on many streams. It strikes me that I should be able to use cudaLaunchHostFunc both before and after the launch, e.g.

cudaLaunchHostFunc(stream, recordScope, &dataStart);
launch();
cudaLaunchHostFunc(stream, recordScope, &dataEnd);

It does mean maintaining a fair amount of bookkeeping, and I’m not sure what the overhead of asking CUDA to invoke my callbacks is, but otherwise this seems like it should work. Opinions?
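For concreteness, here’s a rough sketch of what I mean (TimePair, markStart, and markEnd are names I made up for illustration; the callbacks only use std::chrono, so nothing in them touches the CUDA API):

#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

struct TimePair {
  std::chrono::steady_clock::time_point start;
};

// First callback: stamp host time when the stream reaches this point.
void markStart(void *p) {
  static_cast<TimePair *>(p)->start = std::chrono::steady_clock::now();
}

// Second callback: stamp again and report the difference.
void markEnd(void *p) {
  auto *tp = static_cast<TimePair *>(p);
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - tp->start).count();
  std::cout << "callback-to-callback duration: " << us << " us" << std::endl;
}

__global__ void work() { }   // stand-in for the real kernel

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  TimePair tp;
  cudaLaunchHostFunc(stream, markStart, &tp);  // fires just before the kernel starts
  work<<<1, 1, 0, stream>>>();
  cudaLaunchHostFunc(stream, markEnd, &tp);    // fires after the kernel completes
  cudaDeviceSynchronize();
}

The TimePair object has to stay alive until the second callback fires, which is the bookkeeping I was referring to.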

If it were me, I would not use a callback for timing:

$ cat t2084.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

const unsigned long long my_delay = 1000000ULL;

// host wall-clock time in microseconds, measured relative to `start`
unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// kernel that busy-waits for approximately `delay` clock ticks
__global__ void k(unsigned long long delay){

  unsigned long long start = clock64();
  unsigned long long end = start+delay;
  while (clock64() < end);
}

struct d{
  unsigned long long start;
} my_parms;

// host callback: prints host-measured elapsed time since my_parms.start was set
void dt_cpu(void *parms){
  d *p = (d *) parms;
  std::cout << "CB Duration: " << dtime_usec(p->start) << "us" << std::endl;
}

int main(){
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaStream_t s;
  cudaStreamCreate(&s);
  k<<<1,1, 0, s>>>(my_delay); // warm-up
  cudaDeviceSynchronize();
#if defined USE_EVENT
  cudaEventRecord(start, s);                 // event-based timing: record start event in the stream
#elif defined USE_CB
  my_parms.start = dtime_usec(0);            // callback-based timing: host timestamp at enqueue time
#else
  unsigned long long scpu = dtime_usec(0);   // CPU-based timing: host timestamp before launch
#endif
  k<<<1,1, 0, s>>>(my_delay);
#if defined USE_EVENT
  cudaEventRecord(stop, s);
  cudaEventSynchronize(stop);
  float ms;
  cudaEventElapsedTime(&ms, start, stop);
  std::cout << "EV Duration: " << (ms*1000) << "us" << std::endl;
#elif defined USE_CB
  cudaLaunchHostFunc(s, dt_cpu, &my_parms);
#else
  cudaDeviceSynchronize();
  std::cout << "CPU Duration: " << dtime_usec(scpu) << "us" << std::endl;
#endif
  cudaDeviceSynchronize();
}
$ nvcc -o t2084 t2084.cu
$ ./t2084
CPU Duration: 880us
$ nvcc -o t2084 t2084.cu -DUSE_EVENT
$ ./t2084
EV Duration: 823.84us
$ nvcc -o t2084 t2084.cu -DUSE_CB
$ ./t2084
CB Duration: 1096us
$

CUDA 11.4, V100, CentOS7

About 99% of the time I use timing measurements like this for comparative purposes, e.g. to test the benefit of a presumed optimization. For that purpose, the first two methods (CPU and event, but not the callback) seem able to “resolve” a ~1% change in kernel duration for this test case. Modifying the code to:

const unsigned long long my_delay = 1010000ULL;

I get the following output:

$ nvcc -o t2084 t2084.cu
$ ./t2084
CPU Duration: 896us
$ nvcc -o t2084 t2084.cu -DUSE_EVENT
$ ./t2084
EV Duration: 827.52us
$ nvcc -o t2084 t2084.cu -DUSE_CB
$ ./t2084
CB Duration: 1083us
$

For completeness, nvprof reports for the original duration:

$ nvprof ./t2084
==7309== NVPROF is profiling process 7309, command: ./t2084
CPU Duration: 874us
==7309== Profiling application: ./t2084
==7309== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.6225ms         2  811.24us  811.16us  811.32us  k(__int64)

and for the 1% increased duration:

$ nvprof ./t2084
==6697== NVPROF is profiling process 6697, command: ./t2084
CPU Duration: 879us
==6697== Profiling application: ./t2084
==6697== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.6386ms         2  819.32us  819.22us  819.41us  k(__int64)

YMMV


Thanks for the thorough investigation. This gives me a good place to start.