VisionWorks nvxFindHomographyNode uses default CUDA stream for memcpy. How do I use non-default stream?

VisionWorks nvxFindHomographyNode and nvxuFindHomography both use the default CUDA stream for device-to-host memory copies of what appear to be the two sets of vx_array keypoints. Processing is then done on the CPU for these calls. Is this expected?

I’ve discovered this by running nvx_demo_video_stabilizer on the TX2 and profiling with Nvidia Tegra System Profiler 3.8 on an Ubuntu 16.04 desktop. After the sparseLK and pyrLkPostProcess CUDA kernels, the default CUDA stream memcpy occurs followed by CPU processing, before the harris3x3 CUDA kernel executes.

I have 2 questions regarding the above behavior:

  1. Is it expected that nvxFindHomographyNode and nvxuFindHomography run on the CPU?
  2. Most importantly: if so, can I somehow prevent use of the default CUDA stream and instead force use of the graph's CUDA stream or use of nvcc's --default-stream=per-thread option?

Hi,

1. The node you mentioned is implemented on the GPU.
Tegra System Profiler is designed primarily for CPU profiling.
It's recommended to use NVVP to monitor GPU status:
https://developer.nvidia.com/nvidia-visual-profiler

2. You can modify the Makefile to build with the CUDA_API_PER_THREAD_DEFAULT_STREAM option.
See our documentation for details:
http://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html
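For reference, the option can be enabled either as a preprocessor define or via nvcc's per-thread default stream flag. A minimal Makefile sketch (variable names are illustrative, not the actual VisionWorks Makefile):

```make
# Option 1: define the macro before any CUDA header is included
CXXFLAGS  += -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1

# Option 2: let nvcc define it for the files it compiles
NVCCFLAGS += --default-stream per-thread
```

Note that the macro only affects translation units compiled with it, so code inside a prebuilt library is unaffected.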

Thanks.

Hi AastaLLL,

Thank you for the response.

  1. I see the same thing in NVVP as I do in Tegra System Profiler. Tegra System Profiler can and does capture CUDA data, though NVVP does offer more analysis tools for GPU kernels, so thank you for that. The default-stream device-to-host memcpys still occur after the pyrLkPostProcess kernel and before the harris3x3 kernel, with a delay in between during which I believe the homography computation is happening on the CPU; I see no kernels whose names relate to homography computation. Computation on the CPU doesn't really bother me, though; the default-stream use does.
  2. The Makefile already has that option defined as shipped with VisionWorks.

Thanks,
Jon

Hi,

Could you share the profiling figure with us?
Thanks.

Is there any way to upload the image directly to this forum?

Or do I need to upload it elsewhere and provide a link to it?

Thanks,
Jon

You can click the ‘Add Attachment’ button on a posted comment to upload the file.
Thanks.

Ahh thank you, I didn’t see that before. I will attach the image to this post.

Thanks.

We are checking this issue and will update you with more information later.

Great, thank you for looking into it!

Hi,

In the VisionWorks library, CUDA streams are controlled by an internal NodeStream class that handles the GPU tasks.
Could you share more information about your requirement for applying memcpy on a non-default stream?

Thanks.

And there is no way to access/control this internal NodeStream class, correct?

I have a multi-threaded application that uses CUDA on several threads. Because the legacy default stream synchronizes with all other blocking streams, VisionWorks' internal use of it makes the threads contend for and wait on the GPU unnecessarily.
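For context, the pattern I'm after is each thread doing its GPU work on its own explicit stream. A minimal sketch using only standard CUDA runtime calls (nothing here is VisionWorks-specific; `worker` is a hypothetical per-thread function):

```cuda
#include <cuda_runtime.h>

// Each worker thread creates its own non-blocking stream so its copies
// and kernels never synchronize with the legacy default stream.
void worker(const float* h_src, float* d_dst, size_t n) {
    cudaStream_t stream;
    // cudaStreamNonBlocking: no implicit synchronization with stream 0
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Asynchronous copy on this thread's private stream
    cudaMemcpyAsync(d_dst, h_src, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... launch kernels on `stream` here ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```

This is exactly what I cannot do as long as the library issues its memcpys on the default stream internally.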

I have since discovered the nvxcu low-level kernel interface of VisionWorks, in which the application developer has control over streams and memory. I may be transitioning to this interface for both this reason and CUDA unified memory purposes.

Hi,

As you said, the NodeStream class is not available to users.
Using the low-level API is a good alternative to get more control over the GPU hardware.

You're welcome to update us on your future progress. : )
Thanks.