I am trying to run a 1 million+ particle problem but it appears the memory on the devices is not being fully deallocated. I am using cuMemGetInfo for this.
I call cuMemGetInfo in a kernel immediately after memory has been allocated on the device, and it shows 80% used.
When I leave that kernel, I call cuMemGetInfo again before I execute the next kernel, and usage is either still at 80% or has halved to 40%.
Similar behaviour occurs for some other kernels in the iteration, but in other kernels the memory is fully deallocated.
Needless to say, in the next iteration the device runs out of memory, as reported by cudaGetLastError.
And yes, I cudaFree all arrays allocated via cudaMalloc.
What is occurring? Do I need to use cudaMalloc and cudaFree with some other call?
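
For reference, here is a minimal sketch of the measurement pattern described above. It is not my actual code. It uses the runtime-API call cudaMemGetInfo in place of cuMemGetInfo, and the array size, the kernel scale, and the helpers CHECK and reportUsage are placeholders made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: print the fraction of device memory in use.
static void reportUsage(const char *label) {
    size_t freeBytes = 0, totalBytes = 0;
    CHECK(cudaMemGetInfo(&freeBytes, &totalBytes));
    double usedPct =
        100.0 * (double)(totalBytes - freeBytes) / (double)totalBytes;
    std::printf("%-20s %5.1f%% used\n", label, usedPct);
}

// Placeholder kernel standing in for one step of the particle update.
__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 20;  // assumed size, roughly 1 million values
    float *d_x = nullptr;

    reportUsage("before cudaMalloc");
    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(float)));
    reportUsage("after cudaMalloc");

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    scale<<<blocks, threads>>>(d_x, n, 2.0f);
    CHECK(cudaGetLastError());       // catches launch errors
    CHECK(cudaDeviceSynchronize());  // catches errors during execution

    CHECK(cudaFree(d_x));            // the return value of cudaFree is checked too
    reportUsage("after cudaFree");
    return 0;
}
```

The sketch checks the return value of every call, including cudaFree, so a free that fails would be reported at the point it happens instead of surfacing later as an out-of-memory error.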