The trtexec shipped with TensorRT versions
7.2.3-1
8.3.0-1 (Orin)
8.4.1-1 (TensorRT docker 22.07-py3)
all seem to lack an option to disable the dedicated CUDA streams used for the host-device copies. We don't want to rely on the Latency
figure, which adds up the separately measured H2D, compute, and D2H times; we want an end-to-end time measured with everything on a single stream.
Also, the --streams flag is a little misleading: it led me to think that everything ran on the same stream, until I looked at the JSON file, realized that the host-device copies were happening concurrently with GPU computation, and pulled out nsys.
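
For anyone who wants to check their own timing file for the same thing, here is a rough Python sketch of the overlap check. The field names (startInMs, endInMs, startComputeMs, endComputeMs) are an assumption based on what I see in the per-query records of my exported JSON, so please verify them against your own file. The sample records are made-up numbers, not real measurements.

```python
def find_copy_compute_overlaps(records):
    """Return (i, j) pairs where the H2D copy of records[j] overlaps
    in time with the GPU compute of records[i].

    Assumes each record is a dict with the keys startInMs, endInMs,
    startComputeMs and endComputeMs (assumed field names, see above).
    """
    overlaps = []
    for i, a in enumerate(records):
        for j, b in enumerate(records):
            if i == j:
                continue
            # Two intervals overlap if each one starts before the other ends.
            copy_starts_before_compute_ends = b["startInMs"] < a["endComputeMs"]
            compute_starts_before_copy_ends = a["startComputeMs"] < b["endInMs"]
            if copy_starts_before_compute_ends and compute_starts_before_copy_ends:
                overlaps.append((i, j))
    return overlaps


# Made-up example: the copy for query 1 runs while query 0 is still computing.
overlapped = [
    {"startInMs": 0.0, "endInMs": 1.0, "startComputeMs": 1.0, "endComputeMs": 4.0},
    {"startInMs": 2.0, "endInMs": 3.0, "startComputeMs": 4.0, "endComputeMs": 7.0},
]

# Made-up example: what a fully serialized, single-stream trace would look like.
serialized = [
    {"startInMs": 0.0, "endInMs": 1.0, "startComputeMs": 1.0, "endComputeMs": 4.0},
    {"startInMs": 4.0, "endInMs": 5.0, "startComputeMs": 5.0, "endComputeMs": 8.0},
]

overlapped_pairs = find_copy_compute_overlaps(overlapped)
serialized_pairs = find_copy_compute_overlaps(serialized)
print(overlapped_pairs)
print(serialized_pairs)
```

To run it on a real file, load the records with json.load and pass the resulting list to find_copy_compute_overlaps. A non-empty result means the copies and the compute are not serialized on one stream. The pairwise loop is quadratic, which is fine for a few thousand iterations but not tuned for very long runs.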