How do you manage your memory buffers? Do you use the traditional approach of allocating host memory with malloc()/new and device memory with cudaMalloc(), then calling cudaMemcpy() to transfer data between the two?
For the Jetson series, it might be useful to look into zero-copy memory via cudaHostAlloc() or, alternatively, unified memory via cudaMallocManaged() (assuming the latter is supported on your Jetson Nano platform). Because the CPU and GPU on a Jetson share the same physical DRAM, either approach should let you drop the explicit cudaMemcpy() calls and the copy overhead that comes with them.
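To make the zero-copy idea concrete, here is a minimal sketch. The kernel scale() and the helper zero_copy_demo() are names I made up for illustration, and the buffer size is arbitrary; only the cudaHostAlloc(), cudaHostGetDevicePointer() and cudaFreeHost() calls are the actual CUDA runtime API. I have not profiled this on a Nano, so treat it as a starting point rather than a tuned implementation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns true if the GPU saw and modified the
// host-allocated buffer without any cudaMemcpy() call.
bool zero_copy_demo(int n) {
    float *h_ptr = nullptr;  // pointer used by the CPU
    float *d_ptr = nullptr;  // alias of the same buffer for the GPU

    // Pinned host memory that is also mapped into the device address space.
    if (cudaHostAlloc((void **)&h_ptr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    // Device-side pointer to the same allocation.
    if (cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0) != cudaSuccess) {
        cudaFreeHost(h_ptr);
        return false;
    }

    for (int i = 0; i < n; ++i) h_ptr[i] = static_cast<float>(i);

    // The kernel works directly on the mapped buffer.
    scale<<<(n + 255) / 256, 256>>>(d_ptr, n);
    bool ok = (cudaGetLastError() == cudaSuccess);
    // Wait for the kernel before the CPU reads the results.
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;

    for (int i = 0; ok && i < n; ++i) {
        ok = (h_ptr[i] == 2.0f * static_cast<float>(i));
    }

    cudaFreeHost(h_ptr);
    return ok;
}

int main() {
    bool ok = zero_copy_demo(1024);
    printf("zero-copy round trip %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

The unified memory variant would look much the same: a single cudaMallocManaged() call gives one pointer that both the CPU and the kernel use, and it is released with cudaFree(). In either case, call cudaDeviceSynchronize() before the CPU touches the buffer again.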
Here’s a related thread that I found. It has some links to useful resources.
https://forums.developer.nvidia.com/t/jetson-nano-device-local-memory-specifications/73524/6