Clearing cache on Xavier AGX improves speed of copy from CPU to GPU

While performing inference on Xavier AGX with PyTorch, I see that about 70% of the loop time is spent copying data from CPU to GPU.

I use torch.Tensor.to() for the copy.

Out of a total loop time of 90 ms, 60 ms is spent in the copy.

However, when I run jtop in a terminal and clear the cache while inference is running, the loop time suddenly drops to 30 ms. In other words, the memory copy time drops by a large amount.

Why does this happen?

Also, what is the fastest way to copy data back and forth between NumPy arrays and Torch CUDA tensors for inference?
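For example, would a pinned (page-locked) host buffer with a non-blocking copy be the right pattern? A rough sketch of what I mean (the random array is just a stand-in for my real input):

    import numpy as np
    import torch

    arr = np.random.rand(1024, 1024, 5).astype(np.float32)  # stand-in input

    # Copy into pinned host memory first; host-to-device copies from
    # pinned memory can run asynchronously and are typically faster
    # than copies from pageable memory.
    pinned = torch.from_numpy(arr).pin_memory()
    gpu = pinned.to("cuda:0", non_blocking=True)

    # Device-to-host direction, back to numpy
    back = gpu.cpu().numpy()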

I am using TensorRT for the inference itself.

Best Regards
Sambit

Hi,

Please share the model, script, profiler, and performance output (if not shared already) so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
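For example (an illustrative invocation; substitute your actual model file and precision):

    trtexec --onnx=model.onnx --fp16

This reports the latency and throughput of the network inference alone.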

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy

Thanks!

Hi,

Maybe my question was not clear: the problem is NOT with TensorRT.
The inference part runs as expected.

The problem is loading data from a NumPy array on the CPU into a Torch CUDA tensor using torch.from_numpy(arr).float().to("cuda:0").unsqueeze_(0).permute(0, 3, 1, 2)

This copy operation takes a lot of time: nearly 70% of the time spent in one cycle of pre-processing + inference + post-processing.

I used Python's timeit module to measure the time spent in each of pre-processing, inference, and post-processing.
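A sketch of the measurement (note: torch.cuda.synchronize() is needed so the asynchronous CUDA copy is actually included in the timed interval):

    import timeit

    import numpy as np
    import torch

    img = np.random.rand(1024, 1024, 5).astype(np.float32)  # stand-in input

    def copy_to_gpu():
        # The copy chain from above: HWC numpy -> NCHW float CUDA tensor
        t = torch.from_numpy(img).float().to("cuda:0").unsqueeze_(0).permute(0, 3, 1, 2)
        torch.cuda.synchronize()  # wait for the copy to finish before stopping the clock
        return t

    print(timeit.timeit(copy_to_gpu, number=100) / 100, "s per copy")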

Surprisingly, when I clear the cache using the jtop tool (from jetson-stats), the copy time drops drastically, and the full cycle time shrinks as a result.

I am trying to understand:

  1. Why is the memory copy from a NumPy array to a CUDA tensor so slow? I am copying a 1024x1024x5 input.
  2. Why does clearing the cache boost the speed?

Unfortunately, the code requires a ROS setup and some rosbag files to be replayed, since it is integrated into a ROS node.
Therefore it's not easy to share.

Best Regards
Sambit

Hi,

Sorry, this issue doesn't look TensorRT-related. We recommend that you reach out to a PyTorch forum to get better help.

Thank you.