Hi Robert,
Thanks for your reply. This clarifies things a little, and I believe it solves my problem.
I tried cudaHostAlloc with the cudaHostAllocMapped flag. Next I obtained a device pointer using cudaHostGetDevicePointer, and both pointers appear to be the same, so this is pretty cool.
NV_CUDA_CHECK(cudaHostAlloc(&trt_output_cpu_ptr, output_size, cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostGetDevicePointer(&trt_output_gpu, trt_output_cpu_ptr, 0));
std::cout << "[" << std::this_thread::get_id() << "] Allocated output cpu_ptr " << trt_output_cpu_ptr << " and obtained gpu_ptr " << trt_output_gpu << std::endl;
My application ensures that the GPU will only read the input memory after the CPU has finished writing it. Conversely, I use cudaStreamSynchronize to make sure the GPU has completed its work before I start reading the results on the CPU. Do I still need additional calls, e.g. cudaDeviceSynchronize, to avoid caching issues?
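For reference, here is a minimal self-contained sketch of the ordering I mean. The kernel, buffer sizes, and the CUDA_CHECK macro are illustrative assumptions, not part of my real application; the point is only the write → launch-on-stream → cudaStreamSynchronize → read sequence over mapped memory.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Illustrative error-check macro (my real code uses NV_CUDA_CHECK).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical kernel standing in for the real GPU work.
__global__ void scale(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float *in_cpu, *out_cpu, *in_gpu, *out_gpu;
    CUDA_CHECK(cudaHostAlloc(&in_cpu, bytes, cudaHostAllocMapped));
    CUDA_CHECK(cudaHostAlloc(&out_cpu, bytes, cudaHostAllocMapped));
    CUDA_CHECK(cudaHostGetDevicePointer(&in_gpu, in_cpu, 0));
    CUDA_CHECK(cudaHostGetDevicePointer(&out_gpu, out_cpu, 0));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // 1. CPU finishes writing the input before any GPU work is enqueued.
    for (int i = 0; i < n; ++i) in_cpu[i] = static_cast<float>(i);

    // 2. GPU accesses the mapped buffers through the device pointers.
    scale<<<(n + 255) / 256, 256, 0, stream>>>(in_gpu, out_gpu, n);
    CUDA_CHECK(cudaGetLastError());

    // 3. Synchronize on the stream before the CPU reads the results.
    CUDA_CHECK(cudaStreamSynchronize(stream));
    std::printf("out_cpu[10] = %f\n", out_cpu[10]);

    CUDA_CHECK(cudaFreeHost(in_cpu));
    CUDA_CHECK(cudaFreeHost(out_cpu));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

So concretely: is step 3 above enough, or is a device-wide synchronization still required somewhere?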