I have a Jetson Orin Nano on which I'm running a model with TensorRT.
The most time-consuming operation I've observed is transferring a torch Tensor to the CUDA device with .to("cuda:0").
By profiling with torch.profiler, I've seen that about 97% of the overhead comes from the tensor copy. However, the Jetson Orin has memory physically shared between the CPU and the GPU, so in principle the copy could be avoided, since the GPU can access the same memory.
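For reference, this is roughly how I measured it (a minimal sketch; the input shape is a placeholder, not my real pipeline):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1, 3, 224, 224)  # placeholder host tensor, not my actual input

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x.to("cuda:0")         # the transfer under investigation
    torch.cuda.synchronize()   # make sure the copy has actually finished

# aten::copy_ / Memcpy HtoD dominates the table in my runs
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```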
Is it possible to change the tensor's device without copying the data? If so, how?
Update: I now have a better understanding of CUDA memory management.
It seems CuPy supports unified (managed) memory allocation, and also conversion to PyTorch tensors.
If the CuPy-to-PyTorch conversion involves no copy, I think this solution could work, but I haven't tested it yet.
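Here is an untested sketch of what I have in mind, assuming CuPy's managed-memory allocator and zero-copy DLPack conversion behave as documented (torch.from_dlpack needs PyTorch >= 1.10; older versions have torch.utils.dlpack.from_dlpack):

```python
import cupy as cp
import torch

# Route CuPy allocations through cudaMallocManaged, so buffers live in
# unified memory that both the CPU and the Orin's integrated GPU can access.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# Allocate the input buffer in unified memory.
cp_arr = cp.zeros((1, 3, 224, 224), dtype=cp.float32)

# DLPack should hand the same buffer to PyTorch without a copy;
# the resulting tensor reports device cuda:0.
t = torch.from_dlpack(cp_arr)

# If the conversion is really zero-copy, both point at the same memory.
print(t.device, t.data_ptr() == cp_arr.data.ptr)
```

If t.data_ptr() matches cp_arr.data.ptr, the tensor shares the managed buffer and the .to("cuda:0") copy should disappear entirely.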