I’m seeing some strange stalls in my application at the point where I synchronize my stream in order to flush asynchronous memory frees (prompted by a failed asynchronous allocation):
I’m not sure why cuStreamSynchronize takes so much time here, as the GPU became idle much earlier. There are no other streams active; the screenshot shows all that matters. Most of the stall (i.e. after the last kernel on this stream has finished, but before cuStreamSynchronize has returned) is spent doing some ioctl. Is this the async memory manager compacting memory, or anything like that? But then I would have expected the samples to point to libcuda, and not to the kernel.
What you have described is consistent with the stream synchronize releasing GPU memory back to the OS on behalf of the stream-ordered allocator. You should be able to confirm this by checking that your stream-ordered allocation pool’s reserved memory dropped during this synchronize.
Thanks. Is there a way to reduce the cost of these? That is, I guess I don’t really require that memory is given back to the OS as my application is still very much running and going to be allocating again right away. Does that require me to set CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, or what’s the recommendation here?
Yes. If you set CU_MEMPOOL_ATTR_RELEASE_THRESHOLD above the memory used in the pool, the pool will not release its memory.
You can set CU_MEMPOOL_ATTR_RELEASE_THRESHOLD to UINT64_MAX. This will avoid releasing any pages to the OS during the synchronization call. If you do so, you may want to look at the cuMemPoolTrimTo API to release memory explicitly. The CUDA driver may still release memory from the pool to avoid allocation failures of other CUDA APIs in the same process, so cuMemPoolTrimTo would only be needed if multiple processes are trying to use the vidmem resources.