I am trying to compute a vector-vector dot product inside a graph node with CUDA 12.4. I started with cuBLAS because its documentation seems to imply that this is possible using cublasDdot with CUBLAS_POINTER_MODE_DEVICE, but it's not working for me.
From the docs:
cuBLAS routines can be captured in CUDA Graph stream capture without restrictions in most situations. The exception are routines that output results into host buffers (e.g. cublas<t>dot() while pointer mode CUBLAS_POINTER_MODE_HOST is configured), as it enforces synchronization.
My code is below. The first cublasDdot call, made outside of stream capture, succeeds; I included it only to verify that the cuBLAS call itself works. The second cublasDdot call, made while the stream is capturing, fails with the error “operation not permitted when stream is capturing”. Note that all “d_ptr” names are device pointers.
I don’t really care which library I use, but I would very much like to avoid writing my own dot product kernel just to have it work within a graph. Do you have any recommendations? Perhaps I am missing something, or maybe there is another library function that will get the job done without overcomplicating things.
// Create the handle once and have results written to device memory.
if (!cublas_handle_) {
    THROW_ON_CUBLAS_ERROR(cublasCreate(&cublas_handle_));
    THROW_ON_CUBLAS_ERROR(cublasSetPointerMode(cublas_handle_, CUBLAS_POINTER_MODE_DEVICE));
}

// Sanity check outside of capture: this call succeeds.
cublasStatus_t err = cublasDdot(
    cublas_handle_, in1.len,
    (const double*)in1.d_ptr, 1,
    (const double*)in2.d_ptr, 1,
    (double*)out.d_ptr
);
THROW_ON_CUBLAS_ERROR(err);

cudaGraph_t subgraph;
cudaStream_t capture_stream;
THROW_ON_CUDA_ERROR(cudaStreamCreate(&capture_stream));
THROW_ON_CUDA_ERROR(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeGlobal));

// Same call while capture_stream is capturing: this call fails.
err = cublasDdot(
    cublas_handle_, in1.len,
    (const double*)in1.d_ptr, 1,
    (const double*)in2.d_ptr, 1,
    (double*)out.d_ptr
);
THROW_ON_CUBLAS_ERROR(err);

THROW_ON_CUDA_ERROR(cudaStreamEndCapture(capture_stream, &subgraph));
// ...
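
One possible cause, offered as an untested sketch rather than a confirmed diagnosis: the snippet above never calls cublasSetStream, so the handle presumably still issues its work on the default stream, and work on the legacy default stream is not permitted while another stream is capturing in cudaStreamCaptureModeGlobal. The standalone example below binds the handle to the capturing stream before capture begins. The CHECK_CUDA and CHECK_CUBLAS macros, the variable names, and the input values are all hypothetical stand-ins for the helpers in the original code, and the three-argument cudaGraphInstantiate assumes a CUDA 12.x toolkit.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical error-checking helpers (stand-ins for THROW_ON_* above).
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",            \
                         cudaGetErrorString(e_), __LINE__);               \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

#define CHECK_CUBLAS(call)                                                \
    do {                                                                  \
        cublasStatus_t s_ = (call);                                       \
        if (s_ != CUBLAS_STATUS_SUCCESS) {                                \
            std::fprintf(stderr, "cuBLAS error %d (line %d)\n",           \
                         (int)s_, __LINE__);                              \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

int main() {
    const int n = 4;
    const double h_x[n] = {1.0, 2.0, 3.0, 4.0};
    const double h_y[n] = {5.0, 6.0, 7.0, 8.0};

    double *d_x = nullptr, *d_y = nullptr, *d_out = nullptr;
    CHECK_CUDA(cudaMalloc(&d_x, n * sizeof(double)));
    CHECK_CUDA(cudaMalloc(&d_y, n * sizeof(double)));
    CHECK_CUDA(cudaMalloc(&d_out, sizeof(double)));
    CHECK_CUDA(cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_y, h_y, n * sizeof(double), cudaMemcpyHostToDevice));

    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));
    // The result is written to device memory, so no host synchronization is needed.
    CHECK_CUBLAS(cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE));

    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreate(&stream));
    // Assumed fix: bind the handle to the stream that will be captured,
    // so cuBLAS does not issue work on the default stream.
    CHECK_CUBLAS(cublasSetStream(handle, stream));

    cudaGraph_t graph;
    CHECK_CUDA(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    CHECK_CUBLAS(cublasDdot(handle, n, d_x, 1, d_y, 1, d_out));
    CHECK_CUDA(cudaStreamEndCapture(stream, &graph));

    // Instantiate and launch the captured graph (CUDA 12.x signature).
    cudaGraphExec_t graph_exec;
    CHECK_CUDA(cudaGraphInstantiate(&graph_exec, graph, 0));
    CHECK_CUDA(cudaGraphLaunch(graph_exec, stream));
    CHECK_CUDA(cudaStreamSynchronize(stream));

    // For reference, the dot product of these inputs is 1*5 + 2*6 + 3*7 + 4*8 = 70.
    double result = 0.0;
    CHECK_CUDA(cudaMemcpy(&result, d_out, sizeof(double), cudaMemcpyDeviceToHost));
    std::printf("dot = %f\n", result);

    CHECK_CUDA(cudaGraphExecDestroy(graph_exec));
    CHECK_CUDA(cudaGraphDestroy(graph));
    CHECK_CUDA(cudaStreamDestroy(stream));
    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDA(cudaFree(d_x));
    CHECK_CUDA(cudaFree(d_y));
    CHECK_CUDA(cudaFree(d_out));
    return 0;
}
```

If this is the cause, the equivalent change in the original code would be a cublasSetStream(cublas_handle_, capture_stream) call before cudaStreamBeginCapture. The cuBLAS documentation also notes that captured routines may create memory nodes through cudaMallocAsync and cudaFreeAsync unless a user-owned workspace is provided with cublasSetWorkspace, which may matter if the graph should not contain allocation nodes.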