I am trying to copy about 1.3 GB of data from host memory to device memory. I use a non-blocking stream with cuMemcpyHtoDAsync.
I allocated the device memory with cuMemAlloc and the host memory with malloc. For some tests I have also tried cuMemAllocHost and cuMemHostAlloc.
I measured how long the cuMemcpyHtoDAsync call takes and expected it to be very fast, since I was hoping it would return without waiting for the memcpy to finish.
The results are these (all copy 1.375 GB):
- 1050Ti: 384 ms
- 3060: 363 ms
- 1070: 1972 ms
So my guess is that there is something wrong. I would expect the 1050Ti to be the slowest, but the 1070 is the slowest by far. Is there something I’m doing wrong here?
I already read: CUDA Runtime API :: CUDA Toolkit Documentation (nvidia.com)
Unfortunately, it only explains when an async memcopy will NOT execute asynchronously. It would be very helpful if it listed the conditions required to make sure that a cuMemcpyHtoDAsync is executed asynchronously.
For async behavior, the host memory must be pinned. This is stated in the programming guide.
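To make the pinned-memory requirement concrete, here is a minimal sketch using the driver API. It is not the original poster's code: the 256 MiB buffer size, device ordinal 0, and the `CHECK` helper macro are assumptions chosen for illustration. It shows two ways to end up with pinned host memory: allocating it with cuMemAllocHost, or pinning an existing malloc buffer in place with cuMemHostRegister.

```cuda
// Sketch: two ways to get pinned host memory for cuMemcpyHtoDAsync.
// Buffer size and device ordinal 0 are illustrative assumptions.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: abort with a message if a driver API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult res_ = (call);                                       \
        if (res_ != CUDA_SUCCESS) {                                   \
            const char* msg_ = nullptr;                               \
            cuGetErrorString(res_, &msg_);                            \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         msg_ ? msg_ : "unknown error");              \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = size_t(256) << 20;  // 256 MiB, arbitrary

    CHECK(cuInit(0));
    CUdevice dev;
    CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    CUstream stream;
    CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));
    CUdeviceptr d_buf;
    CHECK(cuMemAlloc(&d_buf, bytes));

    // Option A: allocate the host buffer as pinned memory from the start.
    void* h_pinned = nullptr;
    CHECK(cuMemAllocHost(&h_pinned, bytes));
    std::memset(h_pinned, 1, bytes);
    CHECK(cuMemcpyHtoDAsync(d_buf, h_pinned, bytes, stream));

    // Option B: keep malloc, but pin the existing buffer before copying.
    void* h_malloc = std::malloc(bytes);
    if (h_malloc == nullptr) {
        std::fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    std::memset(h_malloc, 2, bytes);
    CHECK(cuMemHostRegister(h_malloc, bytes, 0));
    CHECK(cuMemcpyHtoDAsync(d_buf, h_malloc, bytes, stream));

    // Both copies were only enqueued above; wait for them here.
    CHECK(cuStreamSynchronize(stream));

    CHECK(cuMemHostUnregister(h_malloc));
    std::free(h_malloc);
    CHECK(cuMemFreeHost(h_pinned));
    CHECK(cuMemFree(d_buf));
    CHECK(cuStreamDestroy(stream));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

Something like `nvcc pinned.cu -lcuda` should build it (the file name is arbitrary). Note that pinning a large buffer has a cost of its own, so it is usually done once up front rather than around every copy.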
Thanks a lot. Any idea why the 1070 performs so badly compared to the 1050Ti?
Maybe it is a question of the PCI slot I use for the 1070. I will check this. I have all 3 GPUs installed: 3060, 1050Ti and 1070.
Yes, PCI slot config is one possibility. For example if the 1070 is plugged into a x4 slot and the others are plugged into a x16 slot, then I would expect the activity duration to be about 4x longer.
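One way to check the slot configuration without opening the case is to ask NVML for the PCIe link width of each GPU. The sketch below is an assumption about how one might do that, not something from this thread; it uses only host code and links against the NVML library that ships with the driver. Keep in mind that NVML device indices do not necessarily match CUDA device ordinals, and that the reported current link generation can drop while a GPU is idle, so the width is the more telling number here.

```cuda
// Sketch: print current and maximum PCIe link width/generation per GPU.
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    rc = nvmlDeviceGetCount(&count);
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlDeviceGetCount failed: %s\n",
                     nvmlErrorString(rc));
        nvmlShutdown();
        return 1;
    }

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = "unknown";
        nvmlDeviceGetName(dev, name, sizeof(name));

        // Values stay 0 if a query is not supported on this device.
        unsigned int curWidth = 0, maxWidth = 0, curGen = 0, maxGen = 0;
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);

        std::printf("GPU %u (%s): link x%u (max x%u), gen %u (max gen %u)\n",
                    i, name, curWidth, maxWidth, curGen, maxGen);
    }

    nvmlShutdown();
    return 0;
}
```

Something like `nvcc pcie.cu -lnvidia-ml` should build it. If the 1070 reports a current width of x4 while the other cards report x16, that alone would account for roughly a 4x longer transfer.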
Not having an actual test case here leaves a bunch of questions. When measuring such an activity, we can look at the activity duration as well as the API duration. The API duration is the time from the beginning of the call to the time when the call returns, i.e. the behavior from the host thread point of view. For a properly issued async op (such as a kernel call or async memcopy) I would always expect the API duration to be short, on the order of a few tens of microseconds. For an improperly issued async op that becomes a blocking call as a result (e.g. an async memcopy where the host memory is not pinned), the API duration could be all over the map, depending on other activity already issued to the GPU: from as little as the actual activity duration (see below) to as long as nearly infinity.
The activity duration is a separate concept, however. The activity duration is asking the question, “once the activity started, how long did it last?”. We can have proper expectations about such things given a few pieces of data, without having to know the entire history of work issued to the GPU.
The profiler can report both types of information.
It’s not clear which you are reporting or measuring. But I’ll say it again, with few exceptions (e.g. a full work queue) I would expect a properly issued async op to have a short API duration, on the order of 50 microseconds or less.
Clearly, whatever you are reporting is not that case. Either you are measuring something else (the activity duration, not the API duration), or your work issuance is not proper to achieve correct async behavior.
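To illustrate the difference between the two measurements, here is a rough sketch of how both could be taken without a profiler. It is an assumption about a reasonable test setup, not the original poster's code: `time_copy` and `CHECK` are hypothetical helpers, and the 512 MiB size and device ordinal 0 are arbitrary. The host clock around the call gives the API duration; a pair of events recorded in the stream approximates the activity duration, provided nothing else is queued in that stream.

```cuda
// Sketch: API duration (host clock) vs. activity duration (stream events)
// for cuMemcpyHtoDAsync from pageable and from pinned host memory.
#include <cuda.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: abort with a message if a driver API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult res_ = (call);                                       \
        if (res_ != CUDA_SUCCESS) {                                   \
            const char* msg_ = nullptr;                               \
            cuGetErrorString(res_, &msg_);                            \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         msg_ ? msg_ : "unknown error");              \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Returns the API duration in microseconds; writes the event-based
// duration of the copy (in milliseconds) to *activity_ms.
static double time_copy(CUdeviceptr dst, const void* src, size_t bytes,
                        CUstream stream, float* activity_ms) {
    CUevent start, stop;
    CHECK(cuEventCreate(&start, CU_EVENT_DEFAULT));
    CHECK(cuEventCreate(&stop, CU_EVENT_DEFAULT));

    CHECK(cuEventRecord(start, stream));
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cuMemcpyHtoDAsync(dst, src, bytes, stream));
    auto t1 = std::chrono::steady_clock::now();  // the call has returned
    CHECK(cuEventRecord(stop, stream));

    CHECK(cuEventSynchronize(stop));             // the copy has finished
    CHECK(cuEventElapsedTime(activity_ms, start, stop));
    CHECK(cuEventDestroy(start));
    CHECK(cuEventDestroy(stop));
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main() {
    const size_t bytes = size_t(512) << 20;  // 512 MiB, arbitrary

    CHECK(cuInit(0));
    CUdevice dev;
    CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));
    CUstream stream;
    CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));
    CUdeviceptr d_buf;
    CHECK(cuMemAlloc(&d_buf, bytes));

    void* h_pageable = std::malloc(bytes);   // ordinary pageable memory
    if (h_pageable == nullptr) return EXIT_FAILURE;
    void* h_pinned = nullptr;
    CHECK(cuMemAllocHost(&h_pinned, bytes)); // page-locked memory
    std::memset(h_pageable, 1, bytes);
    std::memset(h_pinned, 1, bytes);

    float ms = 0.0f;
    time_copy(d_buf, h_pinned, bytes, stream, &ms);  // warm-up, discarded

    double api_us = time_copy(d_buf, h_pageable, bytes, stream, &ms);
    std::printf("pageable: API %.1f us, activity %.2f ms, %.2f GB/s\n",
                api_us, ms, bytes / (ms * 1e6));

    api_us = time_copy(d_buf, h_pinned, bytes, stream, &ms);
    std::printf("pinned:   API %.1f us, activity %.2f ms, %.2f GB/s\n",
                api_us, ms, bytes / (ms * 1e6));

    std::free(h_pageable);
    CHECK(cuMemFreeHost(h_pinned));
    CHECK(cuMemFree(d_buf));
    CHECK(cuStreamDestroy(stream));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

If the work issuance is proper, I would expect the pinned case to show an API duration in the microsecond range while the activity duration stays in the millisecond range. If the API duration for the pinned case is also in the hundreds of milliseconds, something else in the setup is forcing the call to block.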
Having a test case is usually preferable to all this discourse, IMO.