Graph Capture of cublasDdot in Device Pointer Mode

I am trying to do a Vector-Vector dot product within a graph node in the latest CUDA 12.4. I figured I would start with cuBLAS since its documentation seemed to imply that this should be possible using cublasDdot with CUBLAS_POINTER_MODE_DEVICE but its not working for me.

From the docs:

cuBLAS routines can be captured in CUDA Graph stream capture without restrictions in most situations.

The exception are routines that output results into host buffers (e.g. cublasdot while pointer mode CUBLAS_POINTER_MODE_HOST is configured), as it enforces synchronization.

Here is my code. The first cublasDdot succeeds, and I put it in just to verify that the cublas call works. The second cublasDdot call fails with error “operation not permitted when stream is capturing”. Note all “d_ptr” names are device pointers.

I don’t really care what library I am using, I would like very much to avoid writing my own Vector-Vector dot product kernel just to have it work within a graph. Do you have any recommendations? Perhaps I am missing something, or maybe there is another library function that will get the job done without overcomplicating things.

if (!cublas_handle_) {
	THROW_ON_CUBLAS_ERROR(cublasCreate(&cublas_handle_));
	THROW_ON_CUBLAS_ERROR(cublasSetPointerMode(cublas_handle_, CUBLAS_POINTER_MODE_DEVICE));
}

cublasStatus_t err = cublasDdot(
	cublas_handle_, in1.len,
	(const double*)in1.d_ptr, 1,
	(const double*)in2.d_ptr, 1,
	(double*)out.d_ptr
);
THROW_ON_CUBLAS_ERROR(err);

cudaGraph_t subgraph;

cudaStream_t capture_stream;
THROW_ON_CUDA_ERROR(cudaStreamCreate(&capture_stream));
THROW_ON_CUDA_ERROR(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeGlobal));

err = cublasDdot(
	cublas_handle_, in1.len,
	(const double*)in1.d_ptr, 1,
	(const double*)in2.d_ptr, 1,
	(double*)out.d_ptr
);
THROW_ON_CUBLAS_ERROR(err);

THROW_ON_CUDA_ERROR(cudaStreamEndCapture(capture_stream, &subgraph));

// ...

When I run your code under compute-sanitizer, it gives me a useful clue as to the central problem, I believe:

# compute-sanitizer ./t163
========= COMPUTE-SANITIZER
========= Program hit cudaErrorStreamCaptureImplicit (error 906) due to "operation would make the legacy stream depend on a capturing blocking stream" on CUDA API call to cudaStreamGetCaptureInfo_v2.

Unless you make another selection, cublas calls get issued into the default stream (which would be called “legacy default” if you have not changed the behavior of the default stream – which you haven’t).

Graph capture doesn’t like operations issued into the legacy default stream:

Stream capture can be used on any CUDA stream except cudaStreamLegacy (the “NULL stream”).

So when using CUBLAS, be sure to issue cublas calls into a created stream. The stream-capture stream does not fill this role for you.

When I make changes to fix that, the code seems to run without error for me:

# cat t163.cu
#include <cublas_v2.h>
#include <iostream>



int main(){
  cublasHandle_t cublas_handle_  = NULL;
  double *in1_d_ptr, *in2_d_ptr, *out_d_ptr;
  int in1_len = 256;
  cudaMalloc(&in1_d_ptr, in1_len*sizeof(double));
  cudaMalloc(&in2_d_ptr, in1_len*sizeof(double));
  cudaMalloc(&out_d_ptr, sizeof(double));
  cublasStatus_t err = cublasCreate(&cublas_handle_);
  cudaStream_t cublas_stream;
  cudaStreamCreate(&cublas_stream);
  err = cublasSetStream(cublas_handle_, cublas_stream);
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error1: " << (int)err << std::endl;
  err = cublasSetPointerMode(cublas_handle_, CUBLAS_POINTER_MODE_DEVICE);
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error2: " << (int)err << std::endl;
  err = cublasDdot(
        cublas_handle_, in1_len,
        (const double*)in1_d_ptr, 1,
        (const double*)in2_d_ptr, 1,
        (double*)out_d_ptr
);
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error3: " << (int)err << std::endl;
  cudaGraph_t subgraph;
  cudaStream_t capture_stream;
  cudaStreamCreate(&capture_stream);
  cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeGlobal);
  err = cublasDdot(
        cublas_handle_, in1_len,
        (const double*)in1_d_ptr, 1,
        (const double*)in2_d_ptr, 1,
        (double*)out_d_ptr
  );
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error4: " << (int)err << std::endl;
  cudaStreamEndCapture(capture_stream, &subgraph);
}
# nvcc -o t163 t163.cu -lcublas
# compute-sanitizer ./t163
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
#

Thanks! That worked, the key is that I was missing cublasSetStream(cublas_handle_, cublas_stream);, since I am not used to using the cuda libraries with streams I didn’t think to add that function.

A word of warning though to anyone who tries doing this, my next task was to register this subgraph as a child node of my main graph and I encountered the issue described here which is that cuBLAS sometimes does allocations and allocations are not allowed in child graphs: cuBLAS 12 graphs cannot be used as child graphs because of stream ordered memory allocation

The painful part was that with cublasDdot the allocation issue was not being caught in Windows but was in Linux. So maybe the windows version of this function happens to not do allocations, while the Linux one does. So be wary of that - the fix as mentioned in the post I linked to is to set the cublas workspace so it doesn’t do allocations in the child graph.

This is covered in the CUBLAS documentation also.

1 Like