Why are cudaMalloc and cudaFree so expensive?

I measured how long calls to cudaMalloc and cudaFree take, and to be honest, I am surprised by the cost. The following is my code.

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaFree(0); // force context initialization so it is not included in the timing
    auto start0 = std::chrono::steady_clock::now();

    const int num = 10000;
    int* hptr[num];
    for(int i = 0; i < num; i++){
        cudaMalloc((void**)&hptr[i], sizeof(int));
    }
    for(int i = 0; i < num; i++){
        cudaFree(hptr[i]);
    }

    auto end0 = std::chrono::steady_clock::now();
    auto nano0 = std::chrono::duration_cast<std::chrono::nanoseconds>(end0 - start0).count();
    printf("time of alloc + free (ms) .............................. %f ............................ \n", (float)nano0 * 1e-6);
    return 0;
}

The above program takes about 30 ms to allocate the memory and about 25 ms to free it, roughly 55 ms in total.
If I simply replace cudaMalloc with malloc and cudaFree with free (i.e., just test memory allocation time on the CPU), it takes about 0.47 ms in total.
I know, of course, that allocating memory on the device is more involved than allocating memory on the host, but this is 100 times slower. Is this normal performance for cudaMalloc and cudaFree?
I use a GTX 1050 Ti, and the CPU is an 8th-gen Intel i7 with vPro.

It’s normal.

The usual advice is to get memory allocations out of performance-sensitive loops, e.g. reuse allocations.

For more complex use cases, people sometimes implement pool allocators. TensorFlow, RAPIDS, and CuPy come to mind as examples of well-known libraries that implement pool allocators. I’m sure there are others.

Thanks for the reply.
The reason I raised this question is that I need to solve the following kind of problem. I have two structs, where one holds a pointer to the other (like a linked list):

struct B{
  int something;
};

struct A{
  B* b;
};

First I need to allocate memory for an array of A:

A* a;
int length_a = 10000;
cudaMallocManaged(&a, length_a*sizeof(A));

Then for every a[0], a[1], a[2]… I need to do something like

for(int i = 0; i < length_a; i++){
    // length_b varies according to some condition
    cudaMallocManaged(&(a[i].b), length_b*sizeof(int));
}

This kind of program costs 60 ms before I can do anything else, which makes my task impossible to complete.
Would a pool allocator be good for solving this kind of problem?

First, your code does not make sense given your structure definitions.

Second, if you’re using a GPU to try to make something run faster that runs on the order of 60ms exactly once, you are wasting your time. If you are doing it more than once, figure out a way to reuse the allocations. (The discussion below may help show a way.)

However, even if we leave those things aside, there are several reasons you should not do what you are showing. Instead, I suggest that you follow the method indicated here. It would look something like this, starting after the allocation for a:

size_t total_size = 0;
for (int i = 0; i < length_a; i++) total_size += length_b[i];          // sum of all sub-allocation sizes
cudaMallocManaged(&(a[0].b), total_size*sizeof(int));                  // one allocation backs everything
for (int i = 1; i < length_a; i++) a[i].b = a[i-1].b + length_b[i-1];  // carve out consecutive sub-ranges

Effectively, you have done your own one-off “pool allocation”.

Once again, your code does not make sense for your structure definitions, and I’m not suggesting the above code is in any way correct. It’s designed to demonstrate an idea. (effectively, the above code assumes that b is a pointer to int, not B)

Also note that this approach assumes the only thing you ever need is int alignment. If you actually want some higher level of alignment for each allocation that each b points to, you would need to modify the above approach to round-up to that alignment at each loop iteration, as you compute the total size, and also as you assign each pointer after the first.

Your idea really helps. I am going to try it.
The reason I took this approach is this link:
allocate memory for data member of a class
The question in that link is similar to mine, except that the size of dev_P[i]._w is not fixed at 300 but varies. The first answer suggests using loops, as you can see, so I tried that method until I found that calling cudaMalloc or cudaMallocManaged costs too much time.
I am still new to this area, so I may come up with many naive ideas. Thanks for the help.

@zhch5450 Is your performance data from a Linux system, by any chance? On my system with Windows 10, CUDA 11.1, Xeon W-2133 (Skylake) @ 3.8 GHz I measure 4.15 µs per cudaMalloc() call, and 8.87 µs per cudaFree() call. Or is the performance difference just down to an i7 with a very high clock boost?

Yeah, I forgot to mention my system is Ubuntu 18.04.
It seems they have a similar cost, because cudaMalloc() on my machine, as I have shown, takes 30 ms for 10000 iterations, so 3 µs per call on average.

OK, that makes sense then. What does not make sense to me is that cudaFree() has about twice the cost of cudaMalloc() on Windows. I am not just seeing this ratio on this Windows 10 workstation but also on my old Windows 7 workstation with CUDA 9.2.