Hi, I have an application that needs to call cudaMalloc many times. At runtime, I found that a few of these calls are extremely slow (about 200 ms), which increases the total run time. As far as I know, this is expected the first time cudaMalloc is called, but why does it happen in the middle of the run?
NVIDIA L20
Driver Version: 550.54.15
CUDA Version: 12.4
Hi bjfujsy,
Calling cudaMalloc is generally not advised once your program reaches its performance-critical part. Try to allocate all the memory up front. However, 200 ms is quite a lot. Is your memory nearly full, or are you allocating a huge amount of memory?
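To illustrate the "allocate up front" advice, here is a minimal sketch. The buffer size, the iteration count, and the `scale` kernel are assumptions made up for the example (the size is only roughly the 3.5M mentioned in this thread); the point is that the hot loop reuses one buffer and never calls cudaMalloc or cudaFree.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err_)); \
        std::exit(EXIT_FAILURE); \
    } \
} while (0)

// Hypothetical kernel standing in for the real per-iteration work.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t kBytes = 3500000;   // assumed buffer size in bytes
    const size_t n = kBytes / sizeof(float);
    const int kIterations = 1000;    // assumed number of hot-loop iterations

    // Allocate once, before the performance-critical part.
    float* d_buf = nullptr;
    CUDA_CHECK(cudaMalloc(&d_buf, kBytes));
    CUDA_CHECK(cudaMemset(d_buf, 0, kBytes));

    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);

    // Hot loop: reuses d_buf, so no allocation or free happens here.
    for (int it = 0; it < kIterations; ++it) {
        scale<<<blocks, threads>>>(d_buf, n, 1.0f);
        CUDA_CHECK(cudaGetLastError());
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    // Release the buffer once, after the hot loop.
    CUDA_CHECK(cudaFree(d_buf));
    return 0;
}
```

If the buffer sizes are not known in advance, the stream-ordered allocator (cudaMallocAsync / cudaFreeAsync, available since CUDA 11.2) may be worth trying as an alternative, since it can reuse memory from a pool instead of going back to the driver each time. Whether it helps in this particular case is untested.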
Hi, there is a lot of free memory, and each allocation is about 3.5M. I don’t think that is huge. Is there any other possible reason?
Speed also depends heavily on the type of memory allocated: device-only, pinned, managed, and so on.
Speed also depends on the operating system and, on Windows, on whether or not the card is running in TCC mode.
And finally, when cudaMalloc is called for the first time, a CUDA context may not exist yet. This triggers the creation of the CUDA context, which is a very slow operation (hundreds of ms).
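One way to separate these effects is to create the context explicitly and then time every allocation individually. The sketch below is only an assumption about how one might measure this, not a diagnosis: the size, the call count, and the 10 ms "slow" threshold are arbitrary, and `time_ms` is a helper defined here. It uses `cudaFree(0)`, a common idiom for forcing context creation, and times one pinned and one managed allocation for comparison with device-only memory.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err_)); \
        std::exit(EXIT_FAILURE); \
    } \
} while (0)

// Wall-clock duration of a callable, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t kBytes = 3500000;  // assumed allocation size in bytes
    const int kCalls = 200;         // assumed number of allocations to time
    const double kSlowMs = 10.0;    // arbitrary threshold for reporting a call

    // Force context creation so its cost is not charged to the first cudaMalloc.
    double init_ms = time_ms([] { CUDA_CHECK(cudaFree(0)); });
    std::printf("context creation: %.3f ms\n", init_ms);

    // Time each device-only allocation and report the outliers.
    void* ptrs[kCalls] = {};
    for (int i = 0; i < kCalls; ++i) {
        double ms = time_ms([&] { CUDA_CHECK(cudaMalloc(&ptrs[i], kBytes)); });
        if (ms > kSlowMs) std::printf("cudaMalloc #%d took %.3f ms\n", i, ms);
    }

    // For comparison: one pinned and one managed allocation of the same size.
    void* h_pinned = nullptr;
    void* d_managed = nullptr;
    double pinned_ms = time_ms([&] { CUDA_CHECK(cudaMallocHost(&h_pinned, kBytes)); });
    double managed_ms = time_ms([&] { CUDA_CHECK(cudaMallocManaged(&d_managed, kBytes)); });
    std::printf("cudaMallocHost: %.3f ms, cudaMallocManaged: %.3f ms\n",
                pinned_ms, managed_ms);

    // Clean up everything that was allocated.
    for (int i = 0; i < kCalls; ++i) CUDA_CHECK(cudaFree(ptrs[i]));
    CUDA_CHECK(cudaFreeHost(h_pinned));
    CUDA_CHECK(cudaFree(d_managed));
    return 0;
}
```

If slow calls still show up after the context has been created, the first-call explanation can be ruled out, and the per-call timings show which allocations are affected.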