While running inference on a Jetson Xavier AGX with PyTorch, I see that about 70% of the loop time is spent copying data from CPU to GPU.
I use torch.Tensor.to().
Out of a total loop time of 90 ms, 60 ms is spent on the copy.
However, when I run jtop in a terminal and clear the cache while inference is running, the loop time suddenly drops to 30 ms; essentially, the memory copy time drops by a lot.
Why does this happen?
Also, what is the fastest way to copy data back and forth between numpy arrays and torch CUDA tensors for inference?
I am using TensorRT for the inference itself.
Could you please share the model, script, profiler settings, and performance output (if not already shared) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
While measuring model performance, make sure you consider the latency and throughput of the network inference alone, excluding the data pre- and post-processing overhead.
Maybe my question was not clear: the problem is NOT with TensorRT.
The inference part runs as expected.
The problem is the data loading from a numpy array on the CPU to a torch CUDA tensor using torch.from_numpy().float().to("cuda:0").unsqueeze_(0).permute(0, 3, 1, 2).
This copy operation takes a lot of time (nearly 70% of the time spent in one cycle of pre-processing + inference + post-processing).
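Spelled out as a minimal runnable snippet (shapes and the call chain are from the question; the random input and the CPU fallback are just so the sketch runs anywhere, the original targets "cuda:0"):

```python
import numpy as np
import torch

# Fallback so the sketch also runs on a machine without CUDA;
# the original post targets "cuda:0" directly.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Simulated input frame: H x W x C, as in the question (1024 x 1024 x 5).
frame = np.random.rand(1024, 1024, 5).astype(np.float32)

# The copy/layout chain from the question:
# numpy -> CPU tensor -> float -> device -> add batch dim -> NCHW
tensor = (
    torch.from_numpy(frame)
    .float()
    .to(device)             # host-to-device copy happens here
    .unsqueeze_(0)          # 1 x H x W x C
    .permute(0, 3, 1, 2)    # 1 x C x H x W
)
```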
I used Python's timeit module to measure the time spent in each of pre-processing, inference, and post-processing.
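One caveat worth keeping in mind with this kind of measurement (a general point, not claimed in the original post): CUDA operations are asynchronous, so a wall-clock timer can misattribute time between stages unless the GPU is synchronized before reading the clock. A sketch of a synchronized timeit measurement:

```python
import timeit

import numpy as np
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
frame = np.random.rand(1024, 1024, 5).astype(np.float32)

def copy_step():
    # The copy/layout chain from the question.
    t = torch.from_numpy(frame).float().to(device).unsqueeze_(0).permute(0, 3, 1, 2)
    if device != "cpu":
        # Without this, the timer may stop before the async copy finishes.
        torch.cuda.synchronize()
    return t

# Average over a few iterations to smooth out one-off allocation costs.
elapsed = timeit.timeit(copy_step, number=10) / 10
print(f"avg copy time: {elapsed * 1e3:.2f} ms")
```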
Surprisingly, when I clear the cache using the jtop tool (jetson-stats), the copy time drops drastically, and my full cycle time drops as a result.
I am trying to understand:
- Why is the memory copy from numpy to a CUDA tensor so slow? I am copying a 1024x1024x5 input.
- Why does clearing the cache boost speed?
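For context on the "fastest way to copy" part of the question: a pattern commonly suggested for speeding up host-to-device transfers in PyTorch is a pre-allocated pinned (page-locked) staging tensor combined with a non-blocking copy. This is a sketch of that general technique, not something proposed in the original thread:

```python
import numpy as np
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Allocate the pinned host buffer ONCE, outside the inference loop.
# Pinned memory allows the CUDA driver to DMA directly, avoiding an
# extra internal staging copy. (pin_memory requires CUDA to be present.)
pinned = torch.empty((1024, 1024, 5), dtype=torch.float32,
                     pin_memory=torch.cuda.is_available())

# Per-iteration work: reuse the buffer instead of reallocating.
frame = np.random.rand(1024, 1024, 5).astype(np.float32)
pinned.copy_(torch.from_numpy(frame))          # host-to-host into pinned buffer
gpu = (
    pinned.to(device, non_blocking=True)       # async host-to-device copy
    .unsqueeze(0)                              # 1 x H x W x C
    .permute(0, 3, 1, 2)                       # 1 x C x H x W
)
```

The key design point is reuse: allocating and pinning memory is expensive, so the buffer is created once and refilled every cycle.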
Unfortunately, the code needs ROS to set up and some rosbag files to be replayed, since it is integrated into a ROS node.
Therefore it's not easy to share.
Sorry, this issue doesn't look TensorRT-related. We recommend you reach out to a PyTorch forum to get better help.