I’m seeing some strange stalls in my application at the point where I synchronize my stream to flush asynchronous memory frees (prompted by a failed asynchronous allocation):
I’m not sure why cuStreamSynchronize takes so long here, since the GPU went idle much earlier. There are no other streams active; the screenshot shows everything relevant. Most of the stall (i.e. after the last kernel on this stream has finished, but before cuStreamSynchronize has returned) is spent in some ioctl. Is this the async memory manager compacting memory, or something like that? But then I would have expected the samples to point to libcuda, and not to the kernel.