vpiStreamCreateWrapperCUDA with CUDA kernels

I want to write some CUDA kernels that work on the output of a VPI pipeline.

I was thinking of creating a cudaStream_t cuda_stream, wrapping it in a VPI stream with vpiStreamCreateWrapperCUDA, and then doing something like this:

```cpp
// ...VPI pipeline...
vpiSubmitStereoDisparityEstimator(vpi_stream, VPI_BACKEND_CUDA, /* ... */);
myCudaKernel<<<grid, block, 0, cuda_stream>>>(/* ... */);
```
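For context, here is roughly the complete setup I have in mind. This is just a minimal sketch: myCudaKernel is a stand-in for my post-processing, the actual stereo submission (payload, images, params) is left as a comment, flags are 0, error checking is omitted, and I'm assuming vpiStreamCreateWrapperCUDA is declared in vpi/Stream.h in my VPI version.

```cpp
#include <cuda_runtime.h>
#include <vpi/Stream.h>   // assuming this declares vpiStreamCreateWrapperCUDA in my VPI version

// Stand-in for the post-processing kernel I actually want to run.
__global__ void myCudaKernel()
{
    // ...consume the disparity output here...
}

int main()
{
    // Plain CUDA stream that I also want to use for my own kernels.
    cudaStream_t cuda_stream;
    cudaStreamCreate(&cuda_stream);

    // Wrap it so VPI's CUDA backend submits its work into this same stream.
    VPIStream vpi_stream = nullptr;
    vpiStreamCreateWrapperCUDA(cuda_stream, 0, &vpi_stream);

    // ...VPI pipeline, e.g.:
    // vpiSubmitStereoDisparityEstimator(vpi_stream, VPI_BACKEND_CUDA, /* ... */);

    // The part I'm unsure about: can I enqueue my own kernel on cuda_stream
    // right here, or do I need a vpiStreamSync(vpi_stream) first?
    myCudaKernel<<<1, 256, 0, cuda_stream>>>();

    // Wait for everything before tearing down.
    cudaStreamSynchronize(cuda_stream);

    vpiStreamDestroy(vpi_stream);
    cudaStreamDestroy(cuda_stream);
    return 0;
}
```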

However, the documentation says: “CUDA kernels can only be submitted directly to cudaStream_t if it’s guaranteed that all tasks submitted to VPIStream are finished.” Does that mean I have to call vpiStreamSync before launching my CUDA kernel?
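In code, that strict reading would mean something like this (reusing the hypothetical names from the sketch above):

```cpp
// Strict reading: drain everything VPI has queued on the stream first...
vpiStreamSync(vpi_stream);
// ...and only then submit my own work to the underlying cudaStream_t.
myCudaKernel<<<1, 256, 0, cuda_stream>>>();
```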

If so, what is the point of wrapping the CUDA stream in a VPI stream? If I have to call vpiStreamSync anyway, I might as well just create a new, non-wrapped VPI stream.

Or is “guaranteed that all tasks submitted to VPIStream are finished” already satisfied here, because the previous VPI operation uses the CUDA backend and therefore runs on the same cudaStream_t, so it is implicitly guaranteed to finish before my CUDA kernel starts executing?

Also, a follow-up question: if I call vpiStreamSync on a wrapped CUDA stream, does that also call cudaStreamSynchronize under the hood? And conversely, is calling cudaStreamSynchronize on the wrapped cudaStream_t equivalent to calling vpiStreamSync (given that the only backend used is CUDA)?
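Put differently, with the names from the sketch above, I'm asking whether these two calls end up being interchangeable when only the CUDA backend is in play:

```cpp
vpiStreamSync(vpi_stream);            // does this boil down to a cudaStreamSynchronize on cuda_stream?
cudaStreamSynchronize(cuda_stream);   // and is this enough to consider the wrapped VPIStream synced?
```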
