VPI 0.4 adds the ability to wrap a CUDA stream in a VPI stream via
vpiStreamCreateCudaStreamWrapper(). The documentation states:
"CUDA kernels can only be submitted directly to cudaStream_t if it's guaranteed that all tasks submitted to VPIStream are finished."
I read this as: I cannot asynchronously enqueue a VPI algorithm on a VPI-wrapped CUDA stream and then immediately launch a CUDA kernel on the underlying CUDA stream. Is that correct?
Guaranteeing that all tasks in the VPI stream are finished can be done synchronously via
vpiStreamSync(), but that also blocks the host thread. A better solution would be events; however, event interop between VPI and CUDA is not implemented yet:
vpiEventCreateCudaEventWrapper() will always return
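For reference, here is a minimal sketch of the synchronous workaround I have in mind. This is an illustration only, not a verified implementation: the exact signature of vpiStreamCreateCudaStreamWrapper() may differ between VPI versions, the submitted algorithm is left as a placeholder, and all error checking is elided.

```cpp
#include <cuda_runtime.h>
#include <vpi/Stream.h>

extern __global__ void myKernel();  // hypothetical user-defined CUDA kernel

void interopExample(cudaStream_t cudaStream)
{
    // Wrap the existing CUDA stream in a VPI stream
    // (signature assumed; check your VPI version's headers).
    VPIStream vpiStream = NULL;
    vpiStreamCreateCudaStreamWrapper(cudaStream, &vpiStream);

    // 1. Asynchronously submit VPI work to the wrapped stream, e.g.:
    //    vpiSubmitSomeAlgorithm(vpiStream, ...);

    // 2. Block the host until all VPI tasks have finished; per the
    //    documentation quoted above, only then is it safe to submit
    //    CUDA kernels directly to the underlying cudaStream_t.
    vpiStreamSync(vpiStream);

    // 3. Launch the CUDA kernel on the now-idle underlying stream.
    myKernel<<<1, 256, 0, cudaStream>>>();

    vpiStreamDestroy(vpiStream);
}
```

The downside is exactly the one described above: vpiStreamSync() stalls the host thread, whereas an event-based handshake (once vpiEventCreateCudaEventWrapper() is implemented) could keep the dependency entirely on the device.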