Large % of time in cuBLAS calls spent in clock_gettime

I’ve run some cuBLAS calls on two different Ubuntu 22.04 machines, one with an A100 and one with a 3070 Ti, and on both machines I’ve seen a high percentage of the time in the cuBLAS calls being spent in clock_gettime. You can see it in the following flame graph, where cublasDgetrfBatched and cublasDgetriBatched each spend about 66% of their time there (FYI, I’m using CUDA_LAUNCH_BLOCKING=1 to make the flame graph clearer, but I’ve seen the same behavior without it as well). Is there some way to get this time back by not calling clock_gettime so much?

When CUDA_LAUNCH_BLOCKING is 1, it is akin to calling cudaStreamSynchronize() right after each kernel submission. The short example below demonstrates roughly the same behavior with no dependency on cuBLAS – it seems clock_gettime() comes from the CUDA runtime or the user-mode driver.

You can probably collect an Nsight Systems profile to confirm that the time spent in the GPU kernel is about the same as what you see in cublasDgetrfBatched (with CUDA_LAUNCH_BLOCKING=1). In other words, I don’t think clock_gettime() itself adds any overhead that we need to take back, unless I missed something.
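For example, something along these lines should print per-kernel GPU time summaries (the executable name is just a placeholder for your application):

$ nsys profile --trace=cuda,cublas --stats=true ./my_app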

A simple test:

$ cat x.cu
__global__ void ker() { for (;;); }  // kernel that never returns, so the host keeps waiting
int main(void) {
  ker<<<1, 1>>>();
  cudaStreamSynchronize(0);  // CPU spin-waits here until the kernel completes
  return 0;
}

$ nvcc x.cu -g --generate-code arch=compute_80,code=sm_80
$ perf record -g ./a.out
<wait a little>
^C

[image: perf report output]

OK then. It would be nice not to spin a CPU thread on this clock_gettime call, and not to have this noise in the flame graph, but I understand now. Thanks.
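For anyone who finds this later: a minimal sketch of one way to avoid the spin-wait (at some latency cost) is to ask the runtime for blocking synchronization before the CUDA context is created. I haven’t verified this changes the flame graph in my case; the file name is arbitrary.

$ cat y.cu
__global__ void ker() { for (;;); }
int main(void) {
  // Request that synchronization blocks (yields the CPU) instead of spin-waiting.
  // Must be called before the CUDA context is created by any other runtime call.
  cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
  ker<<<1, 1>>>();
  cudaStreamSynchronize(0);
  return 0;
}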
