The VisionWorks functions nvxFindHomographyNode and nvxuFindHomography both appear to use the default CUDA stream for device-to-host memory copies of what look like the two vx_array keypoint sets. The remaining processing for these calls is then done on the CPU. Is this expected?
I discovered this by running nvx_demo_video_stabilizer on a TX2 and profiling it with NVIDIA Tegra System Profiler 3.8 from an Ubuntu 16.04 desktop. After the sparseLK and pyrLkPostProcess CUDA kernels, a memcpy occurs on the default CUDA stream, followed by CPU processing, before the harris3x3 CUDA kernel executes.
I have two questions about this behavior:
- Is it expected that nvxFindHomographyNode and nvxuFindHomography run on the CPU?
- Most importantly: if so, can I prevent use of the default CUDA stream, either by forcing these calls onto the graph's CUDA stream or by building with nvcc's --default-stream=per-thread option?
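
To make the second question concrete, here is a minimal standalone sketch of the two copy patterns I am comparing. It is not VisionWorks code: the kernel fillKeypoints, the buffer names and the sizes are hypothetical stand-ins, and I am only assuming that the library does something like pattern A internally, based on what the profiler shows.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical stand-in for a tracking kernel; it only fills the buffer.
__global__ void fillKeypoints(float *pts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pts[i] = static_cast<float>(i);
}

int main()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);

    float *dPts = nullptr;
    float *hPts = nullptr;
    CHECK(cudaMalloc(&dPts, bytes));
    CHECK(cudaMallocHost(&hPts, bytes));  // pinned host memory for async copies

    // Stand-in for the graph's stream; non-blocking means it does not
    // implicitly synchronize with the legacy default stream.
    cudaStream_t stream;
    CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    fillKeypoints<<<(n + 255) / 256, 256, 0, stream>>>(dPts, n);
    CHECK(cudaGetLastError());

    // Pattern A (assumed to resemble what I see in the profile):
    // wait for the stream, then do a blocking copy on the default stream.
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaMemcpy(hPts, dPts, bytes, cudaMemcpyDeviceToHost));

    // Pattern B (what I would like to force): enqueue the copy on the
    // same stream as the kernels and synchronize only that stream.
    CHECK(cudaMemcpyAsync(hPts, dPts, bytes, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    std::printf("last keypoint value: %.0f\n", hPts[n - 1]);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFreeHost(hPts));
    CHECK(cudaFree(dPts));
    return 0;
}
```

In my own code I can pick pattern B directly. My understanding is that --default-stream=per-thread only changes the default stream for translation units compiled with that flag, so I am unsure whether it can affect copies issued inside a prebuilt library, which is why I am asking whether the homography calls expose any way to select the stream.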