Transfer rate from GPU to CPU with PyTorch on Xavier NX

I’m running a segmentation network whose output is an 8x3x224x224 tensor.
Copying that output to the CPU takes ~500 ms, which seems excessive (this is the time spent in the .cpu() call, as measured by cProfile).
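
To put the number in perspective, here is a rough back-of-the-envelope calculation of the transfer rate that 500 ms would imply. It assumes the output is float32 (4 bytes per element), which I have not stated above, so treat the result as an estimate only.

```python
# Rough estimate of the transfer rate implied by the measured time.
# Assumption: the tensor is float32, i.e. 4 bytes per element.
shape = (8, 3, 224, 224)
bytes_per_element = 4          # assumed dtype: float32
measured_seconds = 0.5         # ~500 ms reported by cProfile for .cpu()

num_elements = 1
for dim in shape:
    num_elements *= dim

size_bytes = num_elements * bytes_per_element
implied_mb_per_s = size_bytes / measured_seconds / 1e6

print(f"elements:     {num_elements}")
print(f"size:         {size_bytes / 1e6:.2f} MB")
print(f"implied rate: {implied_mb_per_s:.2f} MB/s")
```

Under that assumption the tensor is only a few megabytes, so the implied rate is on the order of 10 MB/s, which is why the 500 ms looks too slow to be the copy alone.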

Is there a way to reduce this?
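
One thing I am not sure about is whether cProfile is really measuring the copy. CUDA kernels are launched asynchronously, and .cpu() has to wait for any queued GPU work to finish, so part of the 500 ms could be the forward pass itself. Below is a minimal sketch of how the two could be timed separately. The helper name `time_cpu_copy` and the random dummy tensor are made up for illustration and stand in for the real network output.

```python
import time

import torch


def time_cpu_copy(output):
    """Return (seconds spent waiting for pending GPU work, seconds for the copy)."""
    t0 = time.perf_counter()
    if output.is_cuda:
        # Block until all queued kernels have finished, so that the
        # copy below is not charged for the forward pass.
        torch.cuda.synchronize()
    t1 = time.perf_counter()
    host = output.cpu()  # device-to-host copy (no-op if already on the CPU)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


# Hypothetical stand-in for the network output; falls back to the CPU
# when no CUDA device is available so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
dummy = torch.rand(8, 3, 224, 224, device=device)

wait_s, copy_s = time_cpu_copy(dummy)
print(f"waiting for GPU work: {wait_s * 1e3:.2f} ms")
print(f"copy to CPU:          {copy_s * 1e3:.2f} ms")
```

If most of the time ends up in the first number, the transfer itself is probably not the bottleneck.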
Thanks