CUDA runtime call after driver API call: excessive overhead

Hi:

I was trying to add an external CUDA function (with many runtime API calls) to a program that only uses the CUDA driver API, using the separate compilation and linking described in https://developer.nvidia.com/blog/separate-compilation-linking-cuda-device-code/: basically, compile the external code into a static library and link it into the program.

However, when I profile the external library, I find that the first runtime API call, i.e. cudaMalloc, takes excessive time. Since this function is called many, many times, the extra overhead hurts its performance.

I also checked the context: the runtime context is the same as the driver API context, so no new context creation is involved. I am using std::chrono to measure the time.

A simple illustration is shown below:

cuCtxGetCurrent(&current_ctx);           // driver current_ctx=0x5641b9aac1a0
checkCudaErrors(cuMemAlloc(&d_C1, 4));   // driver api call time: 6.724us, normal
cuCtxGetCurrent(&current_ctx);           // driver current_ctx=0x5641b9aac1a0, same as above
cudaMalloc(&d_D, 4);                     // runtime api call time: 5896.57us, great overhead
cuCtxGetCurrent(&current_ctx);           // runtime current_ctx1=0x5641b9aac1a0, same as above
cudaMalloc(&d_E, 4);                     // runtime call2 time: 10.655us, this time the overhead is normal

The first cudaMalloc takes excessive time. If this is in a loop, the overhead is significant.

Any advice would be appreciated.

This looks like CUDA runtime initialization overhead to me, probably the memory allocator inside the CUDA runtime getting an initial chunk of memory from the allocator in the software layer below.

This should not happen all the time, only occasionally when the runtime allocator runs out of allocated memory and needs to retrieve the next chunk from the next lower layer. This is similar to the behavior of malloc() in the C/C++ runtime library: when it runs out of memory, it needs to go back to the operating system for the next large chunk, which is an expensive operation time-wise.

Have you actually tried a loop? What timing did you observe for the cudaMalloc() in it? Note that loops that continuously allocate and free memory are not advised from a performance perspective. Ideally you would set up the necessary allocations prior to the loop, re-use those allocations throughout the loop, and then free the allocated memory at the end (see the sketch below). I know that is not always possible, but allocating and freeing GPU memory is fairly expensive, so it should happen infrequently.
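A minimal sketch of that pattern (illustrative only), with a hypothetical process() standing in for the per-iteration work and max_n an assumed upper bound on the buffer size:

#include <cuda_runtime.h>
#include <cstddef>

void process(float *d_buf, size_t n);            // hypothetical per-iteration work (kernel launches etc.)

void run(size_t max_n, int iterations)
{
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, max_n * sizeof(float));   // allocate once, before the loop
    for (int i = 0; i < iterations; i++)
        process(d_buf, max_n);                   // re-use the same allocation every iteration
    cudaFree(d_buf);                             // free once, after the loop
}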

FWIW, I would not call a 6 millisecond delay “great overhead”.

Hi, Njuffa:

Thanks for your reply.

Here I am only allocating 4 bytes; it seems that should not exhaust the allocator and force it to fetch from another layer.

Anyway, that was just a demonstration of the overhead. In the actual program, I built a separate function, compiled it into a static library, and linked it to the main program. This separate function is called from another function in the main program, which runs in a loop. The function originally starts with a Thrust call, which I found was taking excessive time. I then tried adding a cudaMalloc and cudaFree at the very beginning of the function and found that this malloc/free pair alone could take 20-40 ms (after which the Thrust time went back to normal), while the whole function takes only 50-60 ms. Since the first cudaMalloc costs ~20 ms on every call to this function, the function becomes much less competitive.

I hope I have described my problem clearly now. Any advice would be appreciated.

BTW, the illustration above is what I modified from /sample/0_Simple/vectorAddDrv, where I added a cudaMalloc after cuMemAlloc and measured its time:

start = std::chrono::steady_clock::now();
checkCudaErrors(cuMemAlloc(&d_C1, 4));
end= std::chrono::steady_clock::now();
elapsed_seconds = end-start;
std::cout << "driver api call time: " << elapsed_seconds.count()*1e6 << "us\n";
cuCtxGetCurrent(&current_ctx);
std::cout << "driver current_ctx=" << current_ctx << "\n";

start = std::chrono::steady_clock::now();
cudaMalloc(&d_D, 4);
end= std::chrono::steady_clock::now();
elapsed_seconds = end-start;
std::cout << "runtime call1 time: " << elapsed_seconds.count()*1e6 << "us\n";
cuCtxGetCurrent(&current_ctx);
std::cout << "runtime current_ctx1=" << current_ctx << "\n";

My guess would be that the call to cudaMalloc triggers additional lazy context initialization that is needed for a proper runtime API context at that point. This additional context initialization might never be needed (and never encountered) during driver API context usage.
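As an untested sketch (not part of the test app below), front-loading that one-time initialization might look like this, assuming a no-op cudaFree(0) is enough to trigger it:

// Sketch only: force the CUDA runtime's lazy initialization once, outside any
// timed region, so later runtime API calls do not pay that one-time cost.
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
    CUdevice dev;
    CUcontext dctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&dctx, 0, dev);

    cudaFree(0);   // assumption: this no-op binds/initializes the runtime to the current context

    // ... timed cuMemAlloc / cudaMalloc calls would follow here ...
    cuCtxDestroy(dctx);
    return 0;
}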

As you point out, subsequent runtime API usage at that point seems roughly “normal”. You indicate “if this is in a loop…” but I see no evidence of that. I placed the code you have shown in a loop, and see approximately normal behavior after the first cudaMalloc call:

$ cat t1937.cu
#include <cuda.h>

#include <iostream>
#include <chrono>


int main() {
    CUresult status;
    CUdevice dev;
    CUdeviceptr d1;
    int *d2;
    int device_id = 0;
    status = cuInit(0);
    status = cuDeviceGet(&dev, device_id);
    CUcontext dctx, current_ctx;
    //status = cuDevicePrimaryCtxRetain(&dctx, dev);
    status = cuCtxCreate(&dctx, 0, dev);
    status = cuCtxSetCurrent(dctx);

    for (int i = 0; i < 3; i++) {
      auto start = std::chrono::steady_clock::now();
      cuMemAlloc(&d1, 4);
      auto end= std::chrono::steady_clock::now();
      std::chrono::duration<double> elapsed_seconds = end-start;
      std::cout << "driver api call time: " << elapsed_seconds.count()*1e6 << "us\n";
      cuCtxGetCurrent(&current_ctx);
      std::cout << "driver current_ctx=" << current_ctx << "\n";

      start = std::chrono::steady_clock::now();
      cudaMalloc(&d2, 4);
      end= std::chrono::steady_clock::now();
      elapsed_seconds = end-start;
      std::cout << "runtime call time: " << elapsed_seconds.count()*1e6 << "us\n";
      cuCtxGetCurrent(&current_ctx);
      std::cout << "runtime current_ctx1=" << current_ctx << "\n";
      }
}
$ nvcc -o t1937 t1937.cu -std=c++14 -lcuda
t1937.cu(8): warning: variable "status" was set but never used

$ ./t1937
driver api call time: 235.973us
driver current_ctx=0xe58300
runtime call time: 5284.57us
runtime current_ctx1=0xe58300
driver api call time: 6.795us
driver current_ctx=0xe58300
runtime call time: 6.539us
runtime current_ctx1=0xe58300
driver api call time: 5.042us
driver current_ctx=0xe58300
runtime call time: 4.89us
runtime current_ctx1=0xe58300
$

If you’re seeing something different, my guess is something about the context has changed from one loop iteration to the next. I would need to see a proper example of that to have any chance to comment further.

Note the description on cuCtxCreate():

In most cases it is recommended to use cuDevicePrimaryCtxRetain.

If that is an option, it may be worth a try.
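A minimal sketch of that change to the test app above (untested; only the context setup differs):

// Sketch: same skeleton as t1937.cu, but retaining the device's primary context
// (the one the runtime API uses) instead of creating a separate context.
#include <cuda.h>
#include <cuda_runtime.h>

int main() {
    CUdevice dev;
    CUcontext dctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&dctx, dev);   // primary context, shared with the runtime API
    cuCtxSetCurrent(dctx);

    // ... the cuMemAlloc / cudaMalloc timing loop from t1937.cu would go here ...

    cuDevicePrimaryCtxRelease(dev);         // release the primary context when done
    return 0;
}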

The amount of memory the CUDA runtime allocator has available before being initialized is zero bytes. If four bytes are requested, it has to call the lower-level allocator to get its first chunk of memory that it can then parcel out.

Hi:

Thanks for the reply.

The original program is a bit too complicated to reproduce here, but runtime API calls after some driver API calls did slow down in my program, even though the context pointer did not change. Regretfully, I cannot reproduce that in a small program. If it were only the first runtime API call in the first loop iteration slowing down, that would not bother me at all, but it slows down on every call to this function inside the loop.

I can try cuDevicePrimaryCtxRetain to see if it helps.

Thx for your advice

A basic fact of programming life is that dynamic memory allocators (host or device) are usually layered, and that allocation times will vary by factors when the top-most layer allocator cannot handle a call by itself and needs to call into lower-layer allocators. The way this is handled in many embedded applications that require a reasonable upper bound on allocation time is for the application to create a memory pool at application startup time and provide all allocations from that pool.
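Purely as an illustration of that pool idea (not a production design, and not code from this thread), a minimal device-side bump allocator sized once at startup could look like this:

#include <cuda_runtime.h>
#include <cstddef>

// Illustration only: one big cudaMalloc at startup, cheap sub-allocations afterwards.
// Assumptions: single-threaded use, no individual frees, and pool_bytes is a known
// upper bound on total demand.
struct DevicePool {
    char  *base     = nullptr;
    size_t capacity = 0;
    size_t offset   = 0;

    void init(size_t pool_bytes) {
        cudaMalloc(&base, pool_bytes);                    // the only expensive allocation
        capacity = pool_bytes;
    }
    void *alloc(size_t bytes) {
        size_t aligned = (bytes + 255) & ~size_t(255);    // keep 256-byte alignment
        if (offset + aligned > capacity) return nullptr;  // pool exhausted
        void *p = base + offset;
        offset += aligned;
        return p;                                         // just pointer arithmetic, no API call
    }
    void reset()   { offset = 0; }                        // recycle the whole pool, e.g. per iteration
    void destroy() { cudaFree(base); base = nullptr; }
};

The application would call init() once at startup, alloc() wherever device memory is needed, and reset() at a convenient boundary, so allocation cost stays bounded and predictable.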

In terms of absolute performance, I see only around 1.5 ms (rather than 5 ms) spent in the initial cudaMalloc() call in Robert Crovella’s little test app, which is an indication that a faster host system (in particular, one with high single-thread performance) will bring the time down. Which, of course, is not surprising. I regularly recommend CPUs with >= 3.5 GHz base frequency for GPU-accelerated systems.

Yes, it is possible that:

  • the underlying allocators for cuMemAlloc and cudaMalloc are “disjoint”

and

  • the first cudaMalloc call is triggering a more costly allocation path

That would be an alternative possibility/theory to my hand-wavy "context lazy initialization" comment (or maybe just a better description of it, if it were truly a one-time event). As a test case, I increased the allocation size to 1 GB and also increased the loop count to 8:

$ cat t1937.cu
#include <cuda.h>

#include <iostream>
#include <chrono>


int main() {
    CUresult status;
    CUdevice dev;
    CUdeviceptr d1;
    int *d2;
    int device_id = 0;
    status = cuInit(0);
    status = cuDeviceGet(&dev, device_id);
    CUcontext dctx, current_ctx;
    //status = cuDevicePrimaryCtxRetain(&dctx, dev);
    status = cuCtxCreate(&dctx, 0, dev);
    status = cuCtxSetCurrent(dctx);
    const size_t ds = 1048576ULL*1024;
    for (int i = 0; i < 8; i++) {
      auto start = std::chrono::steady_clock::now();
      cuMemAlloc(&d1, ds);
      auto end= std::chrono::steady_clock::now();
      std::chrono::duration<double> elapsed_seconds = end-start;
      std::cout << "driver api call time: " << elapsed_seconds.count()*1e6 << "us\n";
      cuCtxGetCurrent(&current_ctx);
      std::cout << "driver current_ctx=" << current_ctx << "\n";

      start = std::chrono::steady_clock::now();
      cudaMalloc(&d2, ds);
      end= std::chrono::steady_clock::now();
      elapsed_seconds = end-start;
      std::cout << "runtime call time: " << elapsed_seconds.count()*1e6 << "us\n";
      cuCtxGetCurrent(&current_ctx);
      std::cout << "runtime current_ctx=" << current_ctx << "\n";
      }
}
[user2@dc10 misc]$ nvcc -o t1937 t1937.cu -std=c++14 -lcuda
t1937.cu(8): warning: variable "status" was set but never used

$ ./t1937
driver api call time: 2036.38us
driver current_ctx=0x18d0300
runtime call time: 7015.93us
runtime current_ctx=0x18d0300
driver api call time: 1939.08us
driver current_ctx=0x18d0300
runtime call time: 1956.98us
runtime current_ctx=0x18d0300
driver api call time: 1927.05us
driver current_ctx=0x18d0300
runtime call time: 1945.73us
runtime current_ctx=0x18d0300
driver api call time: 1852.46us
driver current_ctx=0x18d0300
runtime call time: 1877.57us
runtime current_ctx=0x18d0300
driver api call time: 1855.99us
driver current_ctx=0x18d0300
runtime call time: 1849.87us
runtime current_ctx=0x18d0300
driver api call time: 1867us
driver current_ctx=0x18d0300
runtime call time: 1807.41us
runtime current_ctx=0x18d0300
driver api call time: 1814.15us
driver current_ctx=0x18d0300
runtime call time: 1814.83us
runtime current_ctx=0x18d0300
driver api call time: 1795.24us
driver current_ctx=0x18d0300
runtime call time: 1793.53us
runtime current_ctx=0x18d0300
$

The first call to cudaMalloc still appears to be ~5ms longer than the others. I would think that a 1GB allocation would be enough for a pool allocator to “go back to the well” each time, or for a multi-level allocator to incur the slow path, but perhaps not.

As an aside, I'm fairly confident that cudaMalloc uses some form of "mild", undocumented pool allocator scheme under the hood, in the form of allocating memory in certain "small", reasonably sized chunks and then fulfilling allocations from those chunks first, before going to the well. (I was thinking that 4 bytes might fall within those chunks; 1 GB probably does not.) Also note that CUDA has started to offer some limited semantics for pool allocation usage since CUDA 11.2.
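For reference, a minimal sketch of those CUDA 11.2+ pool semantics (stream-ordered allocation); this is illustrative and was not part of the experiments above:

#include <cuda_runtime.h>
#include <cstdint>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Raise the release threshold so memory freed back to the pool stays cached
    // instead of being returned to the driver between iterations.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);
    uint64_t threshold = 1ULL << 30;                     // cache up to 1 GB (arbitrary choice)
    cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

    void *d = nullptr;
    for (int i = 0; i < 8; i++) {
        cudaMallocAsync(&d, 1048576ULL * 1024, stream);  // served from the pool after the first pass
        cudaFreeAsync(d, stream);                        // returned to the pool, not to the driver
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}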

Historically, the memory allocators used by CUDA and the driver stack below CUDA have changed multiple times including the details of layering. This suggests that one should not assume any particular design and that allocator design continues to evolve over time just as it does in operating systems and C/C++ runtime libraries for CPUs.

While use of a memory pool does not seem to make sense by my understanding of the term (software engineers do not always agree on terminology), I believe that at least at one point there was a slab allocator involved. Regardless of such implementation details, CUDA programmers are advised to start with the basic premise that allocation times can vary quite a bit, just as they can for malloc() in host code.

Excessive overhead for dynamic memory allocation certainly can be (and historically, has been) reported as a bug, but this does not seem applicable to the case at hand. As a general principle of high-performance programming, dynamic memory allocation and deallocation should be minimized.

Yes, a slab allocator, not a pool allocator, for cudaMalloc. I had intended the experiment to exceed the slab size or chunk size; I used sloppy language. However, the new APIs I referenced are for a (mostly hidden) pool allocator: deallocations are returned to a pool.

Does this explain why, after some driver API calls (such as cuMemAlloc or cuLaunch), a runtime API call such as cudaMalloc sometimes takes longer than expected again, even though cudaMalloc has already been initialized by an earlier call before those driver API calls? I cannot reproduce this in a simple program, but it is happening in my program…

I don’t believe there is any explanation here.

OK, I just found that not only the first runtime API call but also the first driver API call takes excessive time in the function I appended to the program, and I don't know why the first malloc is expensive on every call to this function. I can move some device memory allocations out of my function, but some dynamic allocations cannot be avoided, and some runtime libraries such as Thrust will internally call cudaMalloc, I believe.

Below is how I profile the cuMemAlloc (these calls are made repeatedly in the program):

  struct timeval stop_cpu, start_cpu;

  gettimeofday(&start_cpu, NULL);
  devMalloc(igpu, &d_a[igpu], 1000);  // will call devMemAlloc
  devFree(igpu, d_a[igpu]);
  gettimeofday(&stop_cpu, NULL);
  printf("the cpu %d takes %lu us to do the devmalloc1\n", igpu,
         (stop_cpu.tv_sec - start_cpu.tv_sec) * 1000000 + stop_cpu.tv_usec - start_cpu.tv_usec);

  gettimeofday(&start_cpu, NULL);
  devMalloc(igpu, &d_a[igpu], 1000);  // will call devMemAlloc
  devFree(igpu, d_a[igpu]);
  gettimeofday(&stop_cpu, NULL);
  printf("the cpu %d takes %lu us to do the devmalloc2\n", igpu,
         (stop_cpu.tv_sec - start_cpu.tv_sec) * 1000000 + stop_cpu.tv_usec - start_cpu.tv_usec);

(other driver API calls follow)

Output on screen:

[1,1]:the cpu 0 takes 56060 us to do the devmalloc1
[1,1]:the cpu 0 takes 27 us to do the devmalloc2

[1,1]:the cpu 0 takes 28866 us to do the devmalloc1
[1,1]:the cpu 0 takes 22 us to do the devmalloc2

[1,1]:the cpu 0 takes 29291 us to do the devmalloc1
[1,1]:the cpu 0 takes 25 us to do the devmalloc2


Any advice would be greatly appreciated.

OK, my current situation is that the first runtime call (or driver API call) in my function takes excessive time on every invocation. Even though I can move most device memory allocations outside the function, some device memory sizes are determined on the fly, and some runtime calls such as Thrust allocate device memory internally.

Below are my abridged code and profiling output:

void my_function() {
  start = std::chrono::steady_clock::now();
  thrust_call();
  end = std::chrono::steady_clock::now();
  elapsed_seconds = end - start;
  std::cout << "thrust call1: " << elapsed_seconds.count()*1e6 << " us\n";

  start = std::chrono::steady_clock::now();
  thrust_call();
  end = std::chrono::steady_clock::now();
  elapsed_seconds = end - start;
  std::cout << "thrust call2: " << elapsed_seconds.count()*1e6 << " us\n";

  // many runtime and driver api calls follow, but they do not change any context or set the device
}

and stdout shows
[1,1]:thrust call1: 39827.7 us
[1,1]:thrust call2: 850.589 us

[1,1]:thrust call1: 29869.1 us
[1,1]:thrust call2: 824.341 us

[1,1]:thrust call1: 30099.5 us
[1,1]:thrust call2: 856.375 us

You can see that a normal Thrust call takes less than 1 ms to do the sorting, but the first call always takes 30-40 ms. The rest of my function only costs about 60-70 ms excluding the Thrust calls, so if the first Thrust call itself takes 30 ms, that makes my_function much less competitive. Any suggestions?

I consider the benchmarking methodology used here flawed. In benchmarking one typically does not time any code on its first invocation because of the cold start effect. Instead, one uses a couple of passes to “warm up” hardware and software and then measures steady-state throughput. Is there a specific reason not to follow this established benchmarking practice?
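For illustration, the usual pattern looks roughly like this, with a hypothetical work() standing in for whatever is being measured:

#include <cuda_runtime.h>
#include <chrono>
#include <iostream>

void work() { int *d = nullptr; cudaMalloc(&d, 4); cudaFree(d); }   // hypothetical workload

int main() {
    for (int i = 0; i < 3; i++) work();          // warm-up passes: not timed

    const int iters = 100;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) work();      // steady-state passes: timed
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> s = end - start;
    std::cout << "avg per iteration: " << s.count() * 1e6 / iters << " us\n";
    return 0;
}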

I did, but even after many iterations the first Thrust call always takes about 30 ms, while the following ones go back to normal. My initial guess is that other driver API calls between two my_function calls may corrupt the context or the memory allocator, for example:

for (...) {
  my_function();       // first thrust call or cudaMalloc costs 30ms, second costs 1ms

  driver_api_call();   // e.g. cufft, culaunch

  my_function();       // first thrust call or cudaMalloc still costs 10ms, second costs 1ms

  // other driver api calls
}

However, if there is no driver_api_call in between, the second my_function behaves normally:

for (...) {
  my_function();   // first thrust call or cudaMalloc costs 30ms, second costs 1ms

  my_function();   // first thrust call or cudaMalloc costs 1ms, second costs 1ms

  // other driver api calls
}

I am not sure whether driver calls such as culaunch or cufft are responsible for this. Any advice would be appreciated. Thx

I am afraid I have no insight into that observation. My usual recommendation is to either use the CUDA driver API or use the CUDA runtime API, and to avoid mixing them. Because various libraries are built on top of the CUDA runtime, I have used the runtime API exclusively ever since it came into existence (prior to the initial CUDA release) and have no experience with side effects of mixing the two APIs.

To my knowledge there is no public documentation on how driver API and runtime API interact under the hood. If one assumes that the CUDA runtime sits on top of the driver layer in the software stack, operating on the driver layer directly (thus “bypassing” the runtime layer) may force (partial) invalidation followed by refresh of state in the runtime because state shadowed / cached by the runtime has become stale.

In other words, your hypothesis seems plausible except this is a case of data invalidation rather than corruption. Corruption would lead to functional rather than performance deficiencies. This would reinforce the recommendation above: For best performance, do not mix the two APIs.

Thx for your recommendation

The fact is that the program is a big CUDA driver API code base, and the function I am trying to add comes from a big open-source project that uses the CUDA runtime, so converting it to driver API code is not straightforward.

Anyway, thanks for your attention. Merry Christmas!