Is there any other way to improve the performance of a CUDA program on the TX1?
When I use cudaMemcpy or cudaMemcpyManaged to copy data from host to device (or device to host), the performance is not very good.
When I use zero-copy, the copy time decreases, but the running time in CUDA increases, so the total is unchanged.
In addition, how many cameras can be supported on the TX1 at the same time (we use v4l2 to get the buffers)? Is there any limitation?
It’s recommended to use unified memory (cudaMallocManaged()).
You don’t need to copy the memory explicitly, since the CUDA driver handles the synchronization for you.
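As a rough illustration of the suggestion above, here is a minimal sketch of a managed allocation that is written by the CPU, processed by a kernel, and read back without any cudaMemcpy call. The kernel `scale` and the helper `scale_managed` are hypothetical names made up for this example, and the buffer size and launch configuration are arbitrary assumptions, not tuned values for the TX1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element by a constant factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: fills a managed buffer with 1.0f on the CPU,
// scales it on the GPU, and checks the result on the CPU.
// Returns true if every element equals `factor` afterwards.
bool scale_managed(int n, float factor) {
    float *data = nullptr;
    // One allocation that is visible to both the CPU and the GPU.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes directly, no copy

    int threads = 256;                           // assumed block size
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(data, n, factor);

    // Wait for the kernel before the CPU touches the buffer again.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (data[i] == factor);  // CPU reads directly

    cudaFree(data);
    return ok;
}
```

Calling `scale_managed(1024, 2.0f)` from `main` should be enough to try it. The cudaDeviceSynchronize() call matters here: the host should not access the managed buffer while the kernel may still be running.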
Thanks.
But how should I use CUDA streams if I don’t use cudaHostAlloc to allocate pinned memory? Or can I use CUDA streams (cudaStreamCreate, cudaMemcpyAsync, etc.) with memory allocated by cudaMallocManaged?