Why are cudaMalloc and cudaFree so expensive?

I measured how long calls to cudaMalloc and cudaFree take, and to be honest, I am surprised by the cost. The following is my code.

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaFree(0); // force context initialization so it is not included in the timing
    auto start0 = std::chrono::steady_clock::now();

    const int num = 10000;
    int* hptr[num];
    for(int i = 0; i < num; i++){
        cudaMalloc((void**)&hptr[i], sizeof(int));
    }
    for(int i = 0; i < num; i++){
        cudaFree(hptr[i]);
    }

    auto end0 = std::chrono::steady_clock::now();
    auto nano0 = std::chrono::duration_cast<std::chrono::nanoseconds>(end0 - start0).count();
    printf("time of alloc + free (ms) .............................. %f ............................ \n", (float)nano0 * 1e-6);
    return 0;
}

The above program takes about 30 ms to allocate the memory and about 25 ms to free it, roughly 55 ms in total.
If I simply replace cudaMalloc with malloc and cudaFree with free (i.e., just test memory allocation time on the CPU), it takes about 0.47 ms in total.
I know, of course, that allocating memory on the device is more involved than allocating memory on the host, but this is 100 times slower. Is this normal performance for cudaMalloc and cudaFree?
I use a GTX 1050 Ti, and the CPU is an 8th-gen Intel i7 with vPro.

It’s normal.

The usual advice is to get memory allocations out of performance-sensitive loops, e.g. reuse allocations.

For more complex use cases, people sometimes implement pool allocators. TensorFlow, RAPIDS, and CuPy come to mind as examples of well-known libraries that implement pool allocators. I’m sure there are others.

Thanks for the reply.
The reason I raised this question is that I need to solve the following kind of problem. I have two structs, where one holds a pointer to the other (like a linked list):

struct B{
  int something;
};

struct A{
  B* b;
};

First I need to allocate memory for an array of A:

A* a;
int length_a = 10000;
cudaMallocManaged(&a, length_a*sizeof(A));

Then for every a[0], a[1], a[2]… I need to do something like

for(int i = 0; i < length_a; i++){
    // length_b varies according to some condition
    cudaMallocManaged(&(a[i].b), length_b*sizeof(int));
}

This kind of program costs 60 ms before I can do anything else, which makes my task impossible to complete.
Would a pool allocator be good for solving this kind of problem?

First, your code does not make sense given your structure definitions.

Second, if you’re using a GPU to try to make something run faster that runs on the order of 60ms exactly once, you are wasting your time. If you are doing it more than once, figure out a way to reuse the allocations. (The discussion below may help show a way.)

However, even if we leave those things aside, there are several reasons you should not do what you are showing. Instead, I suggest that you follow the method indicated here. It would look something like this, starting after the allocation for a:

size_t total_size = 0;
for (int i = 0; i < length_a; i++) total_size += length_b[i];          // sum of all sub-allocation sizes
cudaMallocManaged(&(a[0].b), total_size*sizeof(int));                  // one allocation backs everything
for (int i = 1; i < length_a; i++) a[i].b = a[i-1].b + length_b[i-1];  // carve out consecutive sub-ranges

Effectively, you have done your own one-off “pool allocation”.

Once again, your code does not make sense for your structure definitions, and I’m not suggesting the above code is in any way correct. It’s designed to demonstrate an idea. (effectively, the above code assumes that b is a pointer to int, not B)

Also note that this approach assumes the only thing you ever need is int alignment. If you actually want some higher level of alignment for each allocation that each b points to, you would need to modify the above approach to round-up to that alignment at each loop iteration, as you compute the total size, and also as you assign each pointer after the first.

Your idea really helps. I am going to try it.
The reason I took this approach is this link:
allocate memory for data member of a class
The question in that link is similar to mine, except that the size of dev_P[i]._w is not fixed at 300 but varies. The first answer suggests using loops, as you can see, so I tried that method until I found that calling cudaMalloc or cudaMallocManaged costs too much time.
I am still new to this area, so I may come up with many naive ideas. Thanks for the help.

@zhch5450 Is your performance data from a Linux system, by any chance? On my system with Windows 10, CUDA 11.1, Xeon W-2133 (Skylake) @ 3.8 GHz I measure 4.15 µs per cudaMalloc() call, and 8.87 µs per cudaFree() call. Or is the performance difference just down to an i7 with a very high clock boost?

Yeah, I forgot to mention my system is Ubuntu 18.04.
It seems they have a similar cost, because cudaMalloc() on my machine, as I have shown, takes 30 ms for 10000 iterations, so 3 µs per call on average.

OK, that makes sense then. What does not make sense to me is that cudaFree() has about twice the cost of cudaMalloc() on Windows. I am not just seeing this ratio on this Windows 10 workstation but also on my old Windows 7 workstation with CUDA 9.2.