cudaFree is slow

My CUDA program uses the OpenMP library: there are 6 CPU threads, and each one controls one Tesla C870. The kernel call is wrapped so that the kernels on different Teslas each work on their own non-overlapping chunk of host memory.

Each thread of my CUDA program does roughly the following:

cudaMalloc(150KB data)
cudaMemcpy(HostToDevice)
“a kernel program”
cudaThreadSynchronize();
cudaMemcpy(…cudaMemcpyDeviceToHost);
cudaFree(“device data pointer, size is 150KB”);

The interesting thing is that when I time it, all of the cudaFree() calls run quite slowly, about 20 milliseconds each, except for one of them, which runs fast.

Any suggestions about this problem?

Device 0 (milliseconds)
mem alloc: 0.304928
memcpy h2d: 0.136832
Computing time: 1.08838
memcpy d2h: 0.0504
memfree: 0.217472

Device 2
mem alloc: 0.858656
memcpy h2d: 0.137088
Computing time: 1.09094
memcpy d2h: 0.050464
memfree: 21.4218

Device 3
mem alloc: 2.95382
memcpy h2d: 0.145376
Computing time: 1.08544
memcpy d2h: 0.050048
memfree: 20.1498

Device 4
mem alloc: 2.82544
memcpy h2d: 0.138272
Computing time: 1.10192
memcpy d2h: 0.055904
memfree: 20.1657

I inserted cudaThreadSynchronize() because I saw other topics that discussed this problem, and I also inserted a dummy cudaFree(0) at the beginning. Neither method helped reduce the execution time of cudaFree() :(.
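For anyone who wants to reproduce the measurement, here is a minimal sketch of one way to time cudaMalloc() and cudaFree() in isolation on a single device. It is not the poster's code: the loop count and the host-side std::chrono timer are assumptions, and it uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later toolkits. The dummy cudaFree(0) forces context creation before the timed region, and the synchronize call makes sure no earlier GPU work is still pending when the cudaFree() timer starts.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

typedef std::chrono::steady_clock Clock;

// Milliseconds elapsed between two host time points.
static double elapsedMs(Clock::time_point a, Clock::time_point b)
{
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main()
{
    const size_t bytes = 150 * 1024;  // 150 KB, as in the post
    const int repeats = 5;            // assumed number of measurements

    // Dummy call to force context creation outside the timed region.
    if (cudaFree(0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 0;
    }

    for (int i = 0; i < repeats; ++i) {
        void *d_buf = NULL;

        Clock::time_point t0 = Clock::now();
        cudaError_t err = cudaMalloc(&d_buf, bytes);
        Clock::time_point t1 = Clock::now();
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // Drain any pending GPU work so it is not charged to cudaFree().
        cudaDeviceSynchronize();

        Clock::time_point t2 = Clock::now();
        cudaFree(d_buf);
        Clock::time_point t3 = Clock::now();

        printf("run %d  mem alloc: %f ms  memfree: %f ms\n",
               i, elapsedMs(t0, t1), elapsedMs(t2, t3));
    }
    return 0;
}
```

Running this once per device (with cudaSetDevice() added before the dummy cudaFree(0)) would show whether the slow free also appears without OpenMP and without the kernel in between.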

I have seen something similar.
Can you recode so that cudaMalloc() is called only once per device,
when your program starts?
I.e., never call cudaFree() in the loop, but instead reuse the buffer
each time you run the kernel on that GPU device?

Bill
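A minimal sketch of what this suggestion might look like, assuming one OpenMP thread per GPU as in the original post. The kernel scaleKernel, the iteration count and the host data are hypothetical stand-ins, since the poster's actual kernel is not shown. Each thread allocates its device buffer once, reuses it for every kernel run, and frees it only once at shutdown, so the cost of cudaFree() is no longer paid per iteration.

```cuda
#include <cstdio>
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel: doubles each element.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 150 * 1024 / (int)sizeof(float);  // about 150 KB per buffer
    const size_t bytes = n * sizeof(float);
    const int iterations = 10;                      // assumed number of kernel runs

    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        printf("no CUDA device found\n");
        return 0;
    }

    // One CPU thread per GPU, as in the original setup.
    #pragma omp parallel num_threads(deviceCount)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        std::vector<float> host(n, 1.0f);  // this thread's own host chunk
        float *d_buf = NULL;

        // Allocate ONCE per device, before the work loop.
        cudaError_t err = cudaMalloc((void **)&d_buf, bytes);

        if (err == cudaSuccess) {
            for (int it = 0; it < iterations; ++it) {
                cudaMemcpy(d_buf, host.data(), bytes, cudaMemcpyHostToDevice);
                scaleKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
                // The blocking copy waits for the kernel to finish.
                cudaMemcpy(host.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
            }
            // Free ONCE, at shutdown, instead of after every kernel run.
            cudaFree(d_buf);
        }

        #pragma omp critical
        printf("device %d: %s\n", dev, cudaGetErrorString(err));
    }
    return 0;
}
```

With nvcc this would typically be built with OpenMP enabled on the host compiler, for example `nvcc -Xcompiler -fopenmp reuse.cu` (the file name is arbitrary). If the buffer size varies between runs, allocating the largest size needed up front keeps the same structure.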
