VPI 0.4 adds the ability to wrap a CUDA stream in a VPI stream via
vpiStreamCreateCudaStreamWrapper(). The documentation states:
"CUDA kernels can only be submitted directly to cudaStream_t if it's guaranteed that all tasks submitted to VPIStream are finished."
I read this as: I cannot asynchronously enqueue a VPI algorithm on a VPI-wrapped CUDA stream and then immediately launch a CUDA kernel on the underlying CUDA stream. Is that correct?
Guaranteeing that all tasks in the VPI stream are finished can be done synchronously via
vpiStreamSync(), but that also blocks the host thread. A better solution would be events; however, event interop between VPI and CUDA is not implemented yet:
vpiEventCreateCudaEventWrapper() will always return
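For reference, here is a minimal sketch of the synchronous workaround I have in mind. This is an illustration only, not a verified implementation: the exact signature of vpiStreamCreateCudaStreamWrapper() may differ between VPI versions, the submitted algorithm is left as a placeholder, and all error checking is elided.

```cpp
#include <cuda_runtime.h>
#include <vpi/Stream.h>

extern __global__ void myKernel();  // hypothetical user-defined CUDA kernel

void interopExample(cudaStream_t cudaStream)
{
    // Wrap the existing CUDA stream in a VPI stream
    // (signature assumed; check your VPI version's headers).
    VPIStream vpiStream = NULL;
    vpiStreamCreateCudaStreamWrapper(cudaStream, &vpiStream);

    // 1. Asynchronously submit VPI work to the wrapped stream, e.g.:
    //    vpiSubmitSomeAlgorithm(vpiStream, ...);

    // 2. Block the host until all VPI tasks have finished; per the
    //    documentation quoted above, only then is it safe to submit
    //    CUDA kernels directly to the underlying cudaStream_t.
    vpiStreamSync(vpiStream);

    // 3. Launch the CUDA kernel on the now-idle underlying stream.
    myKernel<<<1, 256, 0, cudaStream>>>();

    vpiStreamDestroy(vpiStream);
}
```

The downside is exactly the one described above: vpiStreamSync() stalls the host thread, whereas an event-based handshake (once vpiEventCreateCudaEventWrapper() is implemented) could keep the dependency entirely on the device.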