GPU stalls due to stream synchronization -- even when idle?

I’m seeing some strange stalls in my application at the point where I synchronize my stream in order to flush asynchronous memory frees (prompted by a failed asynchronous allocation):

I’m not sure why cuStreamSynchronize takes so much time here, as the GPU became idle much earlier. There are no other streams active; the screenshot shows everything relevant. Most of the stall (i.e. after the last kernel on this stream has finished, but before cuStreamSynchronize has returned) is spent in some ioctl. Is this the async memory manager compacting memory, or something like that? But then I would have expected the samples to point to libcuda, and not to the OS kernel.
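
For context, the pattern described above (attempt an asynchronous allocation and, if it fails, synchronize the stream so that pending asynchronous frees are flushed, then retry) can be sketched roughly as follows. This is a simplified illustration, not the actual application code: `alloc_with_retry` is a hypothetical name, and the driver calls are passed in as callables so the logic can be read (and run) on its own. In a real program, `try_alloc` would presumably wrap cuMemAllocAsync and `sync_stream` would wrap cuStreamSynchronize.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper illustrating the retry pattern.
// try_alloc:   assumed wrapper around an asynchronous allocation
//              (e.g. cuMemAllocAsync); returns true on success.
// sync_stream: assumed wrapper around cuStreamSynchronize; returns
//              true on success. This is the call that stalls.
bool alloc_with_retry(const std::function<bool(std::size_t)>& try_alloc,
                      const std::function<bool()>& sync_stream,
                      std::size_t bytes) {
    // Fast path: the allocation succeeds right away.
    if (try_alloc(bytes)) {
        return true;
    }
    // The allocation failed: wait for the stream so that pending
    // asynchronous frees have completed before trying again.
    if (!sync_stream()) {
        return false;
    }
    // Retry once now that the stream has drained.
    return try_alloc(bytes);
}
```

The stall in question happens inside the `sync_stream` step, after the last kernel on the stream has already finished.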