NVIDIA Developer Forums

cudaFree is slow

pssuperZoro November 7, 2010, 2:17am 1

My CUDA program uses OpenMP: there are 6 CPU threads, and each one controls one Tesla C870. The kernel call is wrapped so that the kernel on each Tesla works on its own non-overlapping chunk of host memory.

The behavior of my cuda program is like:

cudaMalloc(150KB data)
cudaMemcpy(HostToDevice)
“a kernel program”
cudaThreadSynchronize();
cudaMemcpy(…cudaMemcpyDeviceToHost);
cudaFree(“device data pointer, size is 150KB”);

The interesting thing is that when I time it, I find that almost every cudaFree() call is quite slow, around 20 milliseconds, except for one that runs fast.

Any suggestions about this problem?

Device 0 (milliseconds)
mem alloc: 0.304928
memcpy h2d: 0.136832
Computing time: 1.08838
memcpy d2h: 0.0504
memfree: 0.217472

Device 2
mem alloc: 0.858656
memcpy h2d: 0.137088
Computing time: 1.09094
memcpy d2h: 0.050464
memfree: 21.4218

Device 3
mem alloc: 2.95382
memcpy h2d: 0.145376
Computing time: 1.08544
memcpy d2h: 0.050048
memfree: 20.1498

Device 4
mem alloc: 2.82544
memcpy h2d: 0.138272
Computing time: 1.10192
memcpy d2h: 0.055904
memfree: 20.1657

pssuperZoro November 7, 2010, 2:22am 3

I inserted cudaThreadSynchronize() because I saw other topics that discussed this problem, and I also inserted a dummy cudaFree(0) at the beginning. Neither method helps reduce the execution time of cudaFree() :(.
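
For reference, the two measures described above (a dummy cudaFree(0) to create the context up front, and a synchronize before the timed call) might be combined as in the sketch below. This is only an illustration, not the poster's code: the host-side std::chrono timer is an assumption, and cudaDeviceSynchronize() is used in place of cudaThreadSynchronize(), which newer toolkits deprecate.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 150 * 1024;  // buffer size taken from the post

    // Dummy call so that context creation is not charged to later calls.
    cudaFree(0);

    void* d_buf = nullptr;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Drain any pending GPU work so it is not counted as part of cudaFree().
    cudaDeviceSynchronize();

    // Time only the cudaFree() call, using a host-side clock (an assumption;
    // the post does not say which timer was used).
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(d_buf);
    auto t1 = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("memfree: %f ms\n", ms);
    return 0;
}
```

If the free is still slow when measured this way, the cost is probably not pending kernel work or lazy context setup.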

wlangdon November 13, 2010, 4:19pm 5

I have seen something similar.
Can you recode so that cudaMalloc() is called only once per device,
when your program starts?
I.e., never call cudaFree() in the loop, but instead reuse the buffer
each time you run the kernel on that GPU device?

Bill
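
A minimal sketch of this suggestion, assuming one OpenMP thread per GPU as in the original post. The kernel `scale`, the iteration count, and the buffer contents are hypothetical stand-ins, and error checking is mostly omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int num_devices = 0;
    if (cudaGetDeviceCount(&num_devices) != cudaSuccess || num_devices == 0) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    const int n = 150 * 1024 / sizeof(float);  // roughly 150 KB, as in the post
    const size_t bytes = n * sizeof(float);
    const int iterations = 10;                 // hypothetical number of kernel runs

    // One CPU thread per GPU.
    #pragma omp parallel num_threads(num_devices)
    {
        cudaSetDevice(omp_get_thread_num());

        std::vector<float> host(n, 1.0f);      // this thread's own host chunk

        // Allocate once per device, before the work loop.
        float* d_buf = nullptr;
        cudaMalloc(&d_buf, bytes);

        for (int it = 0; it < iterations; ++it) {
            // Reuse the same device buffer on every iteration.
            cudaMemcpy(d_buf, host.data(), bytes, cudaMemcpyHostToDevice);
            scale<<<(n + 255) / 256, 256>>>(d_buf, n);
            // The blocking copy back also waits for the kernel to finish.
            cudaMemcpy(host.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
        }

        // Free once, at shutdown, instead of once per kernel run.
        cudaFree(d_buf);
    }
    return 0;
}
```

With nvcc this would typically be built with the host compiler's OpenMP flag passed through, for example `nvcc -Xcompiler -fopenmp`. The slow cudaFree() is then paid at most once per device rather than on every run.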
