I want to measure computing time, and I have run into a problem.
My experiment results are as follows.
Before executing the kernel function, I call cudaMalloc() several times.
When I call cutStartTimer(timer) before the first cudaMalloc(), the measured time is extremely long (120 ms).
When I move cutStartTimer(timer) to after the first cudaMalloc(), the time is very short (1.33 ms).
I don't understand why the first cudaMalloc() takes so long (120 ms vs. 1.33 ms)?
Thanks for any info :blink:
Dynamic memory allocation is very expensive
Is cudaMalloc() the first cuda* call in your program? The first such call also initializes the CUDA runtime/driver and the GPU to prepare them for CUDA calculations. That initialization takes a significant amount of time.
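A common way to check this is to "warm up" the context with a dummy call before starting the timer, so initialization is not charged to the allocation. Here is a minimal sketch using CUDA events instead of the cutil timer (the buffer size is arbitrary):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    // Dummy call: forces CUDA context creation here, so the
    // ~100 ms startup cost is paid before any timing begins.
    cudaFree(0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the first "real" cudaMalloc() on its own.
    float *d_buf;
    cudaEventRecord(start, 0);
    cudaMalloc((void**)&d_buf, 1000 * sizeof(float));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc took %f ms\n", ms);

    cudaFree(d_buf);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

With the warm-up call in place, the first cudaMalloc() should time close to the later ones.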
So, is there any way to avoid the dynamic memory allocation overhead?
Thanks for reply :rolleyes:
Are there static memory allocation methods?
Thank you for reply.
Only allocate once at the beginning of the program.
Of course. Just declare a device array. You will need to use cudaMemcpyToSymbol to copy to it.
Sure. Here’s one:
float myStaticMemory[1000];
Extend it to CUDA as per what MrAnderson said.
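Putting those two replies together, a minimal sketch might look like this (the array name and size are just placeholders):

```cuda
#include <cuda_runtime.h>

// Statically declared device array -- no cudaMalloc() needed.
__device__ float myStaticMemory[1000];

int main(void) {
    float host[1000];
    for (int i = 0; i < 1000; ++i) host[i] = (float)i;

    // Copy host data into the static device array by symbol.
    cudaMemcpyToSymbol(myStaticMemory, host, sizeof(host));

    // ... launch kernels that read/write myStaticMemory ...

    // Copy results back the same way.
    cudaMemcpyFromSymbol(host, myStaticMemory, sizeof(host));
    return 0;
}
```

The trade-off is that the size must be known at compile time, unlike with cudaMalloc().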
There’s also lightweight dynamic allocation. E.g., you can implement your own stack allocator:
float stackmemory[250000];   // backing store, sized in floats
int stackpointer = 0;

float* myMalloc(int size) {  // size is in floats, not bytes
    float* pointer = &stackmemory[stackpointer];
    stackpointer += size;
    return pointer;
}

void myFree(int size) {      // frees must come in reverse (LIFO) order
    stackpointer -= size;
}

// EXAMPLE USE
for (int i = 0; i < 100; i += 1) {
    float* a = myMalloc(i);
    float* b = myMalloc(10 * i);
    // use a and b
    myFree(10 * i + i);      // pops b and a together
}
The above code will be much faster than calling malloc() repeatedly. The same concept extends to cudaMalloc() as well.
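For the cudaMalloc() case, that extension might look like the sketch below: one big device allocation up front, then cheap pointer-bump sub-allocations from it. The pool* names are hypothetical, and frees must be LIFO just like the stack above:

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

static char  *d_pool = 0;     // base of the one-time device allocation
static size_t poolOffset = 0; // current bump-pointer position in bytes

void poolInit(size_t bytes) {
    cudaMalloc((void**)&d_pool, bytes);  // the only real cudaMalloc()
    poolOffset = 0;
}

// Round requests up to 256 bytes to keep device pointers aligned.
static size_t roundUp(size_t bytes) {
    return (bytes + 255) & ~(size_t)255;
}

void *poolMalloc(size_t bytes) {
    void *p = d_pool + poolOffset;
    poolOffset += roundUp(bytes);
    return p;
}

void poolFree(size_t bytes) {            // LIFO only
    poolOffset -= roundUp(bytes);
}

void poolDestroy(void) {
    cudaFree(d_pool);
    d_pool = 0;
}
```

Each poolMalloc() is just pointer arithmetic on the host, so per-iteration allocation cost drops to essentially nothing.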