How to achieve the best performance when feeding data to a VisionWorks vx_image?

I’m working with a VisionWorks vx_graph on a TX1, and profiling shows a lot of time spent transferring data between the CPU and GPU. This seems wasteful to me, because on the TX1 the physical memory is shared between the CPU and GPU.

To reduce these transfers, I tried to exploit the CUDA UVA/zero-copy feature by allocating pinned memory and creating the vx_image with the NVX_MEMORY_TYPE_CUDA flag, expecting the host-device memcpy calls to disappear, but the vx_graph still copies data internally.

I also looked at the NVXIO FrameSource, but it does not seem to exploit the CUDA UVA feature either.

What is the best way to feed data to VisionWorks while avoiding these (apparently) redundant copies?

Thank you


Thanks for your question.
Although the physical memory on the TX1 is shared, the CPU and GPU caches are not coherent, so there is always some penalty (cache maintenance) when switching between CPU and GPU access, even without an explicit copy.

For a sample that uses NVX_MEMORY_TYPE_CUDA, please refer to the opengl_interop example shipped with VisionWorks.
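As a rough illustration of the import pattern the opengl_interop example demonstrates, here is a hedged sketch of wrapping a CUDA device pointer in a vx_image via vxCreateImageFromHandle. It assumes a single-plane VX_DF_IMAGE_U8 image, the VisionWorks headers, and that NVX_MEMORY_TYPE_CUDA is the device-pointer import type in your VisionWorks release (some versions name it NVX_IMPORT_TYPE_CUDA); the helper name createZeroCopyImage is my own, and error checking is omitted for brevity. It is a sketch of the idea, not a verified drop-in, and it requires the Jetson VisionWorks SDK to compile.

```cpp
#include <VX/vx.h>
#include <NVX/nvx.h>          // VisionWorks extensions (NVX_MEMORY_TYPE_CUDA)
#include <cuda_runtime.h>

// Hypothetical helper: allocate mapped (zero-copy) pinned memory and
// import its device alias into OpenVX as a vx_image.
vx_image createZeroCopyImage(vx_context context, vx_uint32 width, vx_uint32 height)
{
    // Mapped pinned allocation: on TX1 the device pointer aliases the
    // same physical pages as the host pointer.
    void* hostPtr = NULL;
    cudaHostAlloc(&hostPtr, (size_t)width * height, cudaHostAllocMapped);

    void* devPtr = NULL;
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    // Describe the layout of the single 8-bit plane, tightly packed.
    vx_imagepatch_addressing_t addr;
    addr.dim_x    = width;
    addr.dim_y    = height;
    addr.stride_x = 1;                    // 1 byte per pixel (U8)
    addr.stride_y = (vx_int32)width;      // no row padding
    addr.scale_x  = VX_SCALE_UNITY;
    addr.scale_y  = VX_SCALE_UNITY;
    addr.step_x   = 1;
    addr.step_y   = 1;

    void* ptrs[] = { devPtr };

    // Import the device pointer; the graph can then read the buffer in
    // place, subject to the cache-maintenance cost mentioned above.
    return vxCreateImageFromHandle(context, VX_DF_IMAGE_U8,
                                   &addr, ptrs, NVX_MEMORY_TYPE_CUDA);
}
```

Even with an imported handle, some graph-internal copies can remain if a node's implementation requires a different memory layout, which may explain the copies you observed.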