About copy time for CUDA on TX1

Is there any other way to improve the performance of a CUDA program on TX1?
When I use cudaMemcpy or cudaMemcpyManaged to copy data from host to device (or device to host), the copy performance is not very good.
When I use zero-copy, the copy time decreases, but the kernel running time in CUDA increases, so the total time is unchanged.

In addition, how many cameras can TX1 support at the same time (we use v4l2 to get the buffers)? Is there any limitation?

Any answer is welcome, thanks.


It’s recommended to use unified memory (cudaMallocManaged()).
You don’t need to copy memory explicitly, since the CUDA driver handles the synchronization for you.
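
As a minimal sketch of the unified memory approach (the kernel name scale and the buffer size are hypothetical, chosen only for illustration), a single cudaMallocManaged() allocation can be written by the CPU and read by the GPU without any cudaMemcpy. One caveat, as far as I know: on GPUs below compute capability 6.0, which includes TX1, the CPU should not touch a managed buffer while a kernel is still running, so the sketch synchronizes before reading the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;  // assumed buffer size
    float *data = nullptr;

    // One allocation visible to both CPU and GPU; no cudaMemcpy needed.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    // The CPU writes directly into the managed buffer.
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);

    // Wait for the kernel to finish before the CPU reads the buffer again.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "kernel execution failed\n");
        cudaFree(data);
        return 1;
    }

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```

Whether this is faster than explicit copies or zero-copy for a given workload would still need to be measured on the board.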


But how should I use CUDA streams if I don’t use cudaHostAlloc to allocate pinned memory? Or can I use CUDA streams (cudaStreamCreate, cudaMemcpyAsync, etc.) with memory allocated by cudaMallocManaged?


You can create a CUDA stream like this:

cudaStream_t stream;
cudaStreamCreate(&stream);
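
For a slightly fuller picture, here is a rough sketch of launching a kernel in a stream on a managed buffer. The kernel addOne and the buffer size are hypothetical. The cudaStreamAttachMemAsync() call is optional; it is included on the assumption that you want CPU access to the buffer to depend only on work in this one stream rather than on the whole device. No cudaMemcpyAsync is needed, because the buffer is shared.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: increments each element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 16;  // assumed buffer size
    int *data = nullptr;
    cudaStream_t stream;

    if (cudaStreamCreate(&stream) != cudaSuccess ||
        cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "setup failed\n");
        return 1;
    }

    // Optional: associate the managed buffer with this stream only.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stream);  // make sure the attachment has taken effect

    // The CPU fills the buffer directly.
    for (int i = 0; i < n; ++i) data[i] = i;

    // The fourth launch parameter selects the stream.
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(data, n);

    // Wait for this stream before the CPU reads the result.
    cudaStreamSynchronize(stream);

    printf("data[10] = %d\n", data[10]);

    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}
```

If you do keep explicit copies, cudaMemcpyAsync generally needs pinned host memory (for example from cudaHostAlloc) to overlap with other work.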

CUDA streams work well with both memory types. Here is our tutorial for your reference:


That makes sense, thanks a lot!