There are many CUDA memcpy API calls in our project, and the profiling results show some interesting behavior.
Some async DtoH memcpy calls behave like synchronous calls: the API call lasts longer than the device-side copy and returns only after the copy has actually finished. (Profiling screenshot: https://ibb.co/tC46Hpt) We do not see this for HtoD calls.
Some memcpy operations are not associated with any driver API call. (Profiling screenshot: https://ibb.co/XbV9YjW)
How can these two observations be explained?