Hi everyone,
I’m working on a project using Holoscan (in C++), and to integrate MatX with it, I’m using the CUDA stream passed between operators (since v2.9) like this:
```cpp
auto cuda_stream = op_input.receive_cuda_stream("input", true, false);
```
In one of my operators (the one right after an InferenceOp), I notice that `cudaStreamSynchronize(cuda_stream);` takes around 30 ms to complete.
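For context, here is a simplified sketch of the operator in question (the class name, port name, and the MatX body are placeholders for what my real operator does):

```cpp
#include <cuda_runtime.h>
#include <holoscan/holoscan.hpp>

class PostInferenceOp : public holoscan::Operator {
 public:
  HOLOSCAN_OPERATOR_FORWARD_ARGS(PostInferenceOp)

  void setup(holoscan::OperatorSpec& spec) override {
    spec.input<holoscan::TensorMap>("input");
  }

  void compute(holoscan::InputContext& op_input, holoscan::OutputContext& op_output,
               holoscan::ExecutionContext& context) override {
    auto in_message = op_input.receive<holoscan::TensorMap>("input").value();

    // The operator's internal stream, synchronized against any streams found
    // on the "input" port (Holoscan >= v2.9).
    auto cuda_stream = op_input.receive_cuda_stream("input", true, false);

    // ... MatX work launched on cuda_stream ...

    cudaStreamSynchronize(cuda_stream);  // this is the call that takes ~30 ms
  }
};
```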
My current hypothesis is that `cudaStreamSynchronize` is waiting for the previous stream's operations to finish, but the InferenceOp itself only takes about 20 ms to compute, so I don't fully understand where the extra time is coming from.
Does anyone know what could cause this extra delay in my `cudaStreamSynchronize` call?
Thanks!
Hi Valentin,
For the CUDA stream handling feature, each operator by default has its own internal CUDA stream. Calling `receive_cuda_stream` finds any streams on a given input port and synchronizes them to the operator's dedicated internal stream before returning that internal stream to the user. It will also automatically emit the operator's internal CUDA stream on all output ports of the operator. This ensures that any tensors with pending upstream work are ready for use within the operator once `receive_cuda_stream` has returned, and it notifies downstream operators that they may need to synchronize on that provided stream.

This would explain why you are seeing synchronization occur, but I don't know what the source of the extra delay would be (30 ms vs. the expected 20 ms). How were you measuring the 20 ms time? If you profile the app with Nsight Systems, do you see unexpected dead time after the InferenceOp kernels have completed?
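If Nsight is not convenient, one quick host-side check is to record a CUDA event on the stream at the very start of your operator's compute() and time how long that event takes to complete. That isolates how much upstream work was still pending on the stream when your operator started. A rough sketch (the function name is just a placeholder):

```cpp
#include <chrono>
#include <cstdio>

#include <cuda_runtime.h>

// Measures how long the host waits for everything already enqueued on
// `cuda_stream` (e.g. the upstream InferenceOp work) to finish.
void report_pending_work_ms(cudaStream_t cuda_stream) {
  cudaEvent_t pending_done;
  cudaEventCreate(&pending_done);

  // The event completes once all work enqueued on the stream so far is done.
  cudaEventRecord(pending_done, cuda_stream);

  auto t0 = std::chrono::steady_clock::now();
  cudaEventSynchronize(pending_done);
  auto t1 = std::chrono::steady_clock::now();

  std::printf("pending upstream work on this stream: %.2f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());

  cudaEventDestroy(pending_done);
}
```

If that wait is close to 30 ms, the time really is being spent on work still pending on the stream; if it is much smaller, the delay is coming from somewhere else in your operator.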
The actual synchronization over the streams found on the input when calling `receive_cuda_stream` uses `cudaEventRecord` and `cudaStreamWaitEvent`, as in the code here.
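Conceptually, that per-stream synchronization follows the standard event pattern, something along these lines (a simplified illustration, not the actual SDK code):

```cpp
#include <cuda_runtime.h>

// Make `internal_stream` wait, on the device and without blocking the host,
// for all work currently enqueued on `found_stream`.
void order_after(cudaStream_t found_stream, cudaStream_t internal_stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

  cudaEventRecord(ev, found_stream);            // mark the current tail of the upstream work
  cudaStreamWaitEvent(internal_stream, ev, 0);  // internal stream waits for that point

  cudaEventDestroy(ev);  // safe: the dependency was captured by cudaStreamWaitEvent
}
```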
There is a separate `receive_cuda_streams` (with an s) that returns a `vector<cudaStream_t>` of the streams found on that input port. This version, by contrast, does NOT do any synchronization, and the user must handle it manually as needed. This latter version also does not automatically publish any stream ID on the output ports of the operator. The user will need to call `set_cuda_stream` as needed to send a stream ID to downstream operators to indicate that they may need to synchronize on it.
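In that manual mode, a compute() body would look roughly like this (a sketch only; check the SDK headers for the exact signatures and return types, and `my_stream_` stands for whatever stream your operator owns and launches its work on):

```cpp
void compute(holoscan::InputContext& op_input, holoscan::OutputContext& op_output,
             holoscan::ExecutionContext& context) override {
  // Streams found on the "input" port; unlike receive_cuda_stream, nothing is
  // synchronized for you and no stream is published on the output ports.
  auto found_streams = op_input.receive_cuda_streams("input");

  for (const auto& upstream : found_streams) {
    // Order my_stream_ after each found stream yourself, e.g. with the
    // cudaEventRecord / cudaStreamWaitEvent pattern sketched above.
    (void)upstream;  // placeholder for the actual synchronization
  }

  // ... launch this operator's work on my_stream_ ...

  // Explicitly tell downstream operators which stream they may need to
  // synchronize on before using the emitted data.
  op_output.set_cuda_stream(my_stream_, "output");
}
```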