CUDA stream & VPI stream synchronization

Hi,

We have a CUDA stream created with the non-blocking flag, and we create a VPI stream (with the greedy flag) that wraps it:

cudaStreamCreateWithFlags(&rawCudaStream, cudaStreamNonBlocking);
vpiStreamCreateWrapperCUDA(rawCudaStream, VPI_BACKEND_CUDA | VPI_STREAM_GREEDY, &rawVpiStream);

After we submit some VPI tasks to the CUDA backend:

vpiSubmitRemap(rawVpiStream, VPI_BACKEND_CUDA, hwLdcHandle.get(), srcPitchLinearImage->vpiImage, dstPitchLinearImage->vpiImage, VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0);

and we wait only on the CUDA stream, the output image is broken:

cudaStreamSynchronize(rawCudaStream);

The image is intact if we manually call

vpiStreamSync(rawVpiStream);

Is this the intended behavior? Our VPI stream wraps a CUDA stream, and we submit the task to the CUDA backend. The task should be submitted immediately (because of the greedy flag), but even if we explicitly flush the VPI stream and then wait on the CUDA stream:

vpiStreamFlush(rawVpiStream);
cudaStreamSynchronize(rawCudaStream);

the image is still broken unless we call vpiStreamSync(rawVpiStream); manually. Is this the intended behavior, or is it a bug?

If it is intended, we don’t want to introduce any synchronization that involves the CPU. Is there a way to wait for a VPI stream / VPI event on a specific CUDA stream without a CPU sync?

Which Jetson board are you working with?

Hi @Fiona.Chen, so far we haven’t tested it on a Jetson board, only on x86_64/focal with JetPack 6.0.

Before we go deep into the investigation, it would be nice to know the intended behavior: if we wrap a non-blocking CUDA stream in a VPI stream and submit every VPI operation on this stream to the CUDA backend, is it enough to sync only the CUDA stream to ensure all the VPI processing has finished?

Where did you install the JetPack 6.0 packages?

@Fiona.Chen I created a Docker image based on Ubuntu Focal x86_64 and ran the container on my local PC. Everything works as expected: I get perfectly fine images if I call vpiStreamSync(rawVpiStream); before saving them.

Please refer to VPI - Vision Programming Interface: Stream. Wrapping a CUDA stream lets VPI-driven processing be inserted into an existing CUDA pipeline; it does not turn the VPI-driven processing into plain CUDA processing. A VPI stream has its own life cycle.

Please note the sentence “CUDA kernels can only be submitted directly to cudaStream_t if it’s guaranteed that all tasks submitted to VPIStream are finished.” in the document.

@Fiona.Chen thank you very much for your reply. I see.

Is there any way to synchronize the VPI stream and the CUDA stream without involving the CPU?

In CUDA, for example, we can record an event on one stream and wait for that event on another stream without blocking the CPU thread.

As far as I can see, VPI can also record events, but I don’t see any API that lets a CUDA stream wait on one :(

Could you advise on the best practice in our case?

Thank you in advance,
Adam

Hi @Fiona.Chen, I hope you are doing great. Have you had a chance to look into this issue, and do you have any recommendations?

Thanks, Adam

Hi,

Thanks for your patience.

We have checked this with our internal team. Unfortunately, you will need to call vpiStreamSync before submitting a CUDA task to the same stream.

CUDA stream handling inside VPI is complicated: VPI performs some bookkeeping that prevents users from manipulating the original cudaStream directly without breaking assumptions VPI makes internally.

We do have an internal plan to improve this in a future release.

Thanks.
