GPU stalls due to stream synchronization -- even when idle?

tbesard · October 11, 2021, 2:35pm

I’m seeing some strange stalls in my application at the point where I synchronize my stream in order to flush asynchronous memory frees (prompted by a failed asynchronous allocation):

I’m not sure why cuStreamSynchronize takes so much time here, as the GPU became idle much earlier. There are no other streams active; the screenshot shows everything that matters. Most of the stall (i.e. after the last kernel on this stream has finished, but before cuStreamSynchronize has returned) is spent in an ioctl. Is this the async memory manager compacting memory, or something like that? But then I would have expected the samples to point to libcuda, and not to the kernel.

hhoffman · November 18, 2021, 4:48pm

What you have described is consistent with the stream synchronize releasing GPU memory back to the OS on behalf of the stream-ordered allocator. You should be able to confirm this by checking that the reserved memory of your stream-ordered allocation pool dropped during this synchronize.
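
As a rough illustration of that check, the sketch below reads CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT from a device's default pool with cuMemPoolGetAttribute. It is written in Python with ctypes purely for brevity; the helper names (load_driver, default_pool, reserved_bytes, report_reserved) are hypothetical, the enum value is hardcoded from cuda.h, and it assumes the application allocates from the device's default pool. The query only makes sense inside the process that owns the pool, so the same calls would go immediately before and after the cuStreamSynchronize in question.

```python
import ctypes

# Value of CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT in cuda.h's CUmemPool_attribute enum.
CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT = 5


def load_driver():
    """Return a ctypes handle to the CUDA driver, or None if it is not installed."""
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


def default_pool(cuda, ordinal=0):
    """Return the default memory pool of device `ordinal`, or None on any error."""
    dev = ctypes.c_int()
    ctx = ctypes.c_void_p()
    pool = ctypes.c_void_p()
    try:
        ok = (
            cuda.cuInit(0) == 0
            and cuda.cuDeviceGet(ctypes.byref(dev), ordinal) == 0
            and cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev) == 0
            and cuda.cuCtxSetCurrent(ctx) == 0
            and cuda.cuDeviceGetDefaultMemPool(ctypes.byref(pool), dev) == 0
        )
    except AttributeError:
        # The driver is too old to export the memory pool entry points.
        return None
    return pool if ok else None


def reserved_bytes(cuda, pool):
    """Return the bytes currently reserved by `pool`, or None if the query fails."""
    value = ctypes.c_uint64()
    status = cuda.cuMemPoolGetAttribute(
        pool, CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT, ctypes.byref(value)
    )
    return value.value if status == 0 else None


def report_reserved(ordinal=0):
    """Return the pool's reserved bytes, or None without a usable driver and GPU."""
    cuda = load_driver()
    if cuda is None:
        return None
    pool = default_pool(cuda, ordinal)
    if pool is None:
        return None
    return reserved_bytes(cuda, pool)


if __name__ == "__main__":
    print("reserved bytes:", report_reserved())
```

If the value read after the synchronize is lower than the value read before it, the pool handed memory back to the OS during that call.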

tbesard · November 19, 2021, 6:56am

Thanks. Is there a way to reduce the cost of these? That is, I guess I don’t really require that memory is given back to the OS as my application is still very much running and going to be allocating again right away. Does that require me to set CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, or what’s the recommendation here?

hhoffman · November 19, 2021, 8:15pm

Yes. If you set CU_MEMPOOL_ATTR_RELEASE_THRESHOLD above the memory used in the pool, the pool will not release its memory.

You can set CU_MEMPOOL_ATTR_RELEASE_THRESHOLD to UINT64_MAX. This will avoid releasing any pages to the OS during the synchronization call. If you do so, you may want to look at the cuMemPoolTrimTo API to release memory. The CUDA driver may still release the memory from the pool to avoid allocation failures of other CUDA APIs in the same process, so cuMemPoolTrimTo would only be needed if multiple processes are trying to use the vidmem resources.
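
A minimal sketch of both calls, again in Python with ctypes and under the same assumptions as any standalone example: the helper names (open_default_pool, set_release_threshold, trim_pool, configure) are hypothetical, the enum value is hardcoded from cuda.h, and the application is assumed to allocate from the device's default pool. In a real application the equivalent calls would be made once, on the pool it actually uses, before the allocation-heavy phase starts.

```python
import ctypes

# Value of CU_MEMPOOL_ATTR_RELEASE_THRESHOLD in cuda.h's CUmemPool_attribute enum.
CU_MEMPOOL_ATTR_RELEASE_THRESHOLD = 4
UINT64_MAX = 2**64 - 1


def open_default_pool(ordinal=0):
    """Return (driver, pool) for device `ordinal`, or None if CUDA is unusable."""
    cuda = None
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            cuda = ctypes.CDLL(name)
            break
        except OSError:
            continue
    if cuda is None:
        return None
    dev = ctypes.c_int()
    ctx = ctypes.c_void_p()
    pool = ctypes.c_void_p()
    try:
        ok = (
            cuda.cuInit(0) == 0
            and cuda.cuDeviceGet(ctypes.byref(dev), ordinal) == 0
            and cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev) == 0
            and cuda.cuCtxSetCurrent(ctx) == 0
            and cuda.cuDeviceGetDefaultMemPool(ctypes.byref(pool), dev) == 0
        )
    except AttributeError:
        # The driver is too old to export the memory pool entry points.
        return None
    return (cuda, pool) if ok else None


def set_release_threshold(cuda, pool, nbytes=UINT64_MAX):
    """Ask the pool to hold on to up to `nbytes` of reserved memory across syncs."""
    threshold = ctypes.c_uint64(nbytes)
    status = cuda.cuMemPoolSetAttribute(
        pool, CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, ctypes.byref(threshold)
    )
    return status == 0


def trim_pool(cuda, pool, min_bytes_to_keep=0):
    """Explicitly release unused pool memory, keeping at least `min_bytes_to_keep`."""
    return cuda.cuMemPoolTrimTo(pool, ctypes.c_size_t(min_bytes_to_keep)) == 0


def configure(ordinal=0):
    """Raise the threshold on the default pool; None means no usable driver or GPU."""
    handle = open_default_pool(ordinal)
    if handle is None:
        return None
    return set_release_threshold(*handle)


if __name__ == "__main__":
    print("threshold raised:", configure())
```

With the threshold raised, trim_pool is the manual counterpart: it would be called at a quiet point, and only when other processes need the memory back.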
