Is there any other way to improve the performance of a CUDA program on the TX1?
When I move data from host to device (or device to host) with cudaMemcpy or with managed memory (cudaMallocManaged), the transfer is not very fast.
When I use zero-copy memory, the copy time decreases, but the kernel running time in CUDA increases, so the total time is unchanged.
In addition, how many cameras can the TX1 support at the same time (we get the buffers through V4L2)? Is there any limitation?
Any answer is welcome, thanks.