Questions about efficient memory management for TensorRT on TX2

Hi Robert,

Thanks for your reply. This clarifies things a little and I believe it solves my problem.

I tried cudaHostAlloc with the cudaHostAllocMapped flag. I then obtain a device pointer using cudaHostGetDevicePointer, and both pointers turn out to be identical, which is pretty cool.

// Allocate mapped (zero-copy) host memory, then look up its device-side alias.
NV_CUDA_CHECK(cudaHostAlloc(&trt_output_cpu_ptr, output_size, cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostGetDevicePointer(&trt_output_gpu, trt_output_cpu_ptr, 0));
std::cout << "[" << std::this_thread::get_id() << "] Allocated output cpu_ptr "
          << trt_output_cpu_ptr << " and obtained gpu_ptr " << trt_output_gpu << std::endl;
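
For completeness, here is a self-contained version of that allocation flow I put together while testing (the buffer size and the CHECK macro are placeholders standing in for my real values and NV_CUDA_CHECK; the cudaDeviceGetAttribute call is just an optional sanity check that the device supports mapped host memory):

#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            return 1;                                              \
        }                                                          \
    } while (0)

int main() {
    // Optional sanity check: can device 0 map page-locked host memory?
    int can_map = 0;
    CHECK(cudaDeviceGetAttribute(&can_map, cudaDevAttrCanMapHostMemory, 0));
    if (!can_map) {
        fprintf(stderr, "Device cannot map host memory\n");
        return 1;
    }

    // Allocate pinned host memory that is mapped into the device address
    // space, then fetch the device-side alias of the same allocation.
    const size_t output_size = 1 << 20;  // placeholder size
    void* cpu_ptr = nullptr;
    void* gpu_ptr = nullptr;
    CHECK(cudaHostAlloc(&cpu_ptr, output_size, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer(&gpu_ptr, cpu_ptr, 0));

    // On a unified-memory system like the TX2 the two pointers coincide.
    printf("cpu_ptr = %p, gpu_ptr = %p, identical = %d\n",
           cpu_ptr, gpu_ptr, cpu_ptr == gpu_ptr);

    CHECK(cudaFreeHost(cpu_ptr));
    return 0;
}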

My application ensures that the GPU will only read the input memory after the CPU has finished writing it. Conversely, I use cudaStreamSynchronize to make sure the GPU has completed its work before I start reading the results on the CPU. Do I still need additional calls, e.g. cudaDeviceSynchronize, to deal with possible caching issues?
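
To make the ordering concrete, here is a minimal sketch of the scheme I have in mind, with an illustrative kernel standing in for the actual TensorRT enqueue call (all names here are made up for the example):

#include <cuda_runtime.h>

// Illustrative stand-in for the TensorRT inference call.
__global__ void infer_stub(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// in_cpu/in_gpu and out_cpu/out_gpu are the host/device aliases of two
// mapped allocations obtained as above.
void run_inference(cudaStream_t stream,
                   float* in_cpu, float* in_gpu,
                   float* out_cpu, float* out_gpu, int n) {
    // 1. The CPU finishes writing the input before any GPU work is enqueued.
    for (int i = 0; i < n; ++i)
        in_cpu[i] = static_cast<float>(i);

    // 2. Enqueue the GPU work on the stream; it reads the mapped input and
    //    writes the mapped output (in my application this is the TensorRT
    //    execution context enqueue).
    infer_stub<<<(n + 255) / 256, 256, 0, stream>>>(in_gpu, out_gpu, n);

    // 3. Block until the stream has drained; only after this returns does
    //    the CPU read the results.
    cudaStreamSynchronize(stream);

    float first_result = out_cpu[0];  // read back on the CPU
    (void)first_result;
}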