Is there a time delay when using cuMemGetInfo or cudaFree?

I am trying to run a 1 million+ particle problem, but it appears the memory on the devices is not being fully deallocated; I am using cuMemGetInfo to check this.

I call cuMemGetInfo in the host code immediately after memory has been allocated on the device, and it shows 80% used.

When that kernel finishes, I call cuMemGetInfo again before launching the next kernel; it reports either still 80% used or only halved, to 40% used.
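Roughly what I'm doing, boiled down to a standalone sketch (driver API throughout here; the buffer size and labels are placeholders, not my real code):

```cuda
// Minimal sketch of the measurement pattern: query free/total memory
// before an allocation, after it, and after the matching free.
#include <cuda.h>
#include <stdio.h>

static void print_mem_usage(const char *where)
{
    size_t free_b, total_b;   /* older toolkits declare these unsigned int */
    if (cuMemGetInfo(&free_b, &total_b) == CUDA_SUCCESS)
        printf("%s: %.0f%% of device memory used\n", where,
               100.0 * (double)(total_b - free_b) / (double)total_b);
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr buf;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    print_mem_usage("before alloc");
    cuMemAlloc(&buf, 256 * 1024 * 1024);  /* placeholder size */
    print_mem_usage("after alloc");
    cuMemFree(buf);
    print_mem_usage("after free");        /* is the free reflected immediately? */

    cuCtxDestroy(ctx);
    return 0;
}
```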

Similar behaviour occurs after some other kernels in the iteration, but after other kernels the memory is fully deallocated.

Needless to say, on the next iteration the device runs out of memory, as reported by cudaGetLastError.

And yes, I cudaFree all arrays allocated via cudaMalloc.
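Every allocation is paired like this (a simplified sketch, not my actual code; the array names and count are made up, and the error-check macro is just for illustration):

```cuda
// Simplified alloc/free pairing with every CUDA call's return value checked.
#include <cuda_runtime.h>
#include <stdio.h>

#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess)                                        \
            printf("%s failed: %s\n", #call, cudaGetErrorString(err)); \
    } while (0)

int main(void)
{
    float *d_pos = NULL, *d_vel = NULL;
    size_t n = 1000000;  /* 1M+ particles */

    CHECK(cudaMalloc((void **)&d_pos, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&d_vel, n * sizeof(float)));

    /* ... kernel launches ... */

    CHECK(cudaFree(d_pos));          /* every cudaMalloc has a matching cudaFree */
    CHECK(cudaFree(d_vel));
    CHECK(cudaThreadSynchronize());  /* make sure nothing is still pending
                                        before querying memory again */
    return 0;
}
```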

What is occurring? Do I need to pair cudaMalloc and cudaFree with some other call?

I might have hit this yesterday. If I can figure out what was causing it exactly, I’ll let you know.

What card, driver, and OS were you using?

Tesla C870, don't know the driver, Linux SUSE Enterprise 10.1

In a sequence of 4 kernels, the used memory after leaving each kernel is:

2% (with an internal peak of 11%), i.e. all memory deallocated

74% (with an internal peak of 74%)

82% (with an internal peak of 82%)

82% (with an internal peak of 89%)