Hi,
We have a CUDA stream created with the non-blocking flag, and we then create a VPI stream (with the greedy flag) that wraps it:
cudaStreamCreateWithFlags(&rawCudaStream, cudaStreamNonBlocking);
vpiStreamCreateWrapperCUDA(rawCudaStream, VPI_BACKEND_CUDA | VPI_STREAM_GREEDY, &rawVpiStream);
After we submit some VPI tasks to the CUDA backend:
vpiSubmitRemap(rawVpiStream, VPI_BACKEND_CUDA, hwLdcHandle.get(), srcPitchLinearImage->vpiImage, dstPitchLinearImage->vpiImage, VPI_INTERP_LINEAR, VPI_BORDER_ZERO, 0);
and we wait only on the CUDA stream, the output image is broken:
cudaStreamSynchronize(rawCudaStream);
The image is intact if we explicitly sync the VPI stream instead:
vpiStreamSync(rawVpiStream);
Is this the intended behavior? Our VPI stream wraps a CUDA stream, and we submit the task to the CUDA backend. Because of the greedy flag, the task is submitted immediately, but even if we explicitly flush the VPI stream and then wait on the CUDA stream:
vpiStreamFlush(rawVpiStream);
cudaStreamSynchronize(rawCudaStream);
the image is still broken unless we also call vpiStreamSync(rawVpiStream). Is this the intended behavior, or is it a bug?
If it is intended, we would like to avoid any synchronization that involves the CPU. Is there a way to make a specific CUDA stream wait on a VPI stream or a VPI event without a CPU sync?
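To make clear what we mean by waiting without a CPU sync, here is a minimal sketch of the plain-CUDA pattern we are hoping to find an equivalent for. It does not use VPI at all: the two kernels and the stream names are hypothetical stand-ins (the producer for the remap, the consumer for our downstream CUDA work), and only standard CUDA runtime calls are used.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer kernel, standing in for the VPI remap.
__global__ void produce(int *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Hypothetical consumer kernel, standing in for our downstream CUDA work.
__global__ void consume(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));

    cudaStream_t producer, consumer;
    cudaStreamCreateWithFlags(&producer, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&consumer, cudaStreamNonBlocking);

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    produce<<<blocks, threads, 0, producer>>>(a, n);
    cudaEventRecord(done, producer);         // marks the end of the producer work
    cudaStreamWaitEvent(consumer, done, 0);  // GPU-side wait; the CPU does not block here
    consume<<<blocks, threads, 0, consumer>>>(a, b, n);

    // CPU sync below is only there to check the result in this sketch.
    int last = 0;
    cudaMemcpyAsync(&last, b + n - 1, sizeof(int), cudaMemcpyDeviceToHost, consumer);
    cudaStreamSynchronize(consumer);
    printf("last element: %d\n", last);

    cudaEventDestroy(done);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

What we are looking for is the same kind of dependency, where the "producer" side is the VPI stream (or a VPI event recorded on it) and the "consumer" side is our own CUDA stream.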