I’m seeing some strange stalls in my application at the point where I’m memcpyAsync.
During these 7ms, cuda is only executing the kernel and does not perform memory copies simultaneously.I’m not sure why cuda stream cannot overlap memory copy and execute kernel.