Graph Capture of cublasDdot in Device Pointer Mode

dhenshall · March 26, 2024, 1:45pm

I am trying to do a Vector-Vector dot product within a graph node in the latest CUDA 12.4. I figured I would start with cuBLAS since its documentation seemed to imply that this should be possible using cublasDdot with CUBLAS_POINTER_MODE_DEVICE but its not working for me.

From the docs:

cuBLAS routines can be captured in CUDA Graph stream capture without restrictions in most situations.

The exception are routines that output results into host buffers (e.g. cublasdot while pointer mode CUBLAS_POINTER_MODE_HOST is configured), as it enforces synchronization.

Here is my code. The first cublasDdot succeeds, and I put it in just to verify that the cublas call works. The second cublasDdot call fails with error “operation not permitted when stream is capturing”. Note all “d_ptr” names are device pointers.

I don’t really care what library I am using, I would like very much to avoid writing my own Vector-Vector dot product kernel just to have it work within a graph. Do you have any recommendations? Perhaps I am missing something, or maybe there is another library function that will get the job done without overcomplicating things.

if (!cublas_handle_) {
	THROW_ON_CUBLAS_ERROR(cublasCreate(&cublas_handle_));
	THROW_ON_CUBLAS_ERROR(cublasSetPointerMode(cublas_handle_, CUBLAS_POINTER_MODE_DEVICE));
}

cublasStatus_t err = cublasDdot(
	cublas_handle_, in1.len,
	(const double*)in1.d_ptr, 1,
	(const double*)in2.d_ptr, 1,
	(double*)out.d_ptr
);
THROW_ON_CUBLAS_ERROR(err);

cudaGraph_t subgraph;

cudaStream_t capture_stream;
THROW_ON_CUDA_ERROR(cudaStreamCreate(&capture_stream));
THROW_ON_CUDA_ERROR(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeGlobal));

err = cublasDdot(
	cublas_handle_, in1.len,
	(const double*)in1.d_ptr, 1,
	(const double*)in2.d_ptr, 1,
	(double*)out.d_ptr
);
THROW_ON_CUBLAS_ERROR(err);

THROW_ON_CUDA_ERROR(cudaStreamEndCapture(capture_stream, &subgraph));

// ...

Robert_Crovella · March 26, 2024, 3:22pm

When I run your code under compute-sanitizer, it gives me a useful clue as to the central problem, I believe:

# compute-sanitizer ./t163
========= COMPUTE-SANITIZER
========= Program hit cudaErrorStreamCaptureImplicit (error 906) due to "operation would make the legacy stream depend on a capturing blocking stream" on CUDA API call to cudaStreamGetCaptureInfo_v2.

Unless you make another selection, cublas calls get issued into the default stream (which would be called “legacy default” if you have not changed the behavior of the default stream – which you haven’t).

Graph capture doesn’t like operations issued into the legacy default stream:

Stream capture can be used on any CUDA stream except cudaStreamLegacy (the “NULL stream”).

So when using CUBLAS, be sure to issue cublas calls into a created stream. The stream-capture stream does not fill this role for you.

When I make changes to fix that, the code seems to run without error for me:

# cat t163.cu
#include <cublas_v2.h>
#include <iostream>



int main(){
  cublasHandle_t cublas_handle_  = NULL;
  double *in1_d_ptr, *in2_d_ptr, *out_d_ptr;
  int in1_len = 256;
  cudaMalloc(&in1_d_ptr, in1_len*sizeof(double));
  cudaMalloc(&in2_d_ptr, in1_len*sizeof(double));
  cudaMalloc(&out_d_ptr, sizeof(double));
  cublasStatus_t err = cublasCreate(&cublas_handle_);
  cudaStream_t cublas_stream;
  cudaStreamCreate(&cublas_stream);
  err = cublasSetStream(cublas_handle_, cublas_stream);
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error1: " << (int)err << std::endl;
  err = cublasSetPointerMode(cublas_handle_, CUBLAS_POINTER_MODE_DEVICE);
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error2: " << (int)err << std::endl;
  err = cublasDdot(
        cublas_handle_, in1_len,
        (const double*)in1_d_ptr, 1,
        (const double*)in2_d_ptr, 1,
        (double*)out_d_ptr
);
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error3: " << (int)err << std::endl;
  cudaGraph_t subgraph;
  cudaStream_t capture_stream;
  cudaStreamCreate(&capture_stream);
  cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeGlobal);
  err = cublasDdot(
        cublas_handle_, in1_len,
        (const double*)in1_d_ptr, 1,
        (const double*)in2_d_ptr, 1,
        (double*)out_d_ptr
  );
  if (err != CUBLAS_STATUS_SUCCESS) std::cout << "Error4: " << (int)err << std::endl;
  cudaStreamEndCapture(capture_stream, &subgraph);
}
# nvcc -o t163 t163.cu -lcublas
# compute-sanitizer ./t163
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
#

dhenshall · March 26, 2024, 6:16pm

Thanks! That worked, the key is that I was missing cublasSetStream(cublas_handle_, cublas_stream);, since I am not used to using the cuda libraries with streams I didn’t think to add that function.

A word of warning though to anyone who tries doing this, my next task was to register this subgraph as a child node of my main graph and I encountered the issue described here which is that cuBLAS sometimes does allocations and allocations are not allowed in child graphs: cuBLAS 12 graphs cannot be used as child graphs because of stream ordered memory allocation

The painful part was that with cublasDdot the allocation issue was not being caught in Windows but was in Linux. So maybe the windows version of this function happens to not do allocations, while the Linux one does. So be wary of that - the fix as mentioned in the post I linked to is to set the cublas workspace so it doesn’t do allocations in the child graph.

Robert_Crovella · March 26, 2024, 8:17pm

This is covered in the CUBLAS documentation also.

Topic		Replies	Views
cuBLAS 12 graphs cannot be used as child graphs because of stream ordered memory allocation GPU-Accelerated Libraries cublas	4	597	December 20, 2023
Stream capture of cublas gemm CUDA Programming and Performance	4	584	September 12, 2024
Can't capture stream when using cublasSgemmStridedBatched with batchCount=1 GPU-Accelerated Libraries	2	555	March 19, 2019
[CUDA Graph] cuBLAS routine produces incorrect result after calling cudaStreamBeginCapture CUDA Programming and Performance	2	653	March 2, 2022
[BUG][DEVICE LAUNCHED GRAPHS] CudaErrorInvalidValue with cuBLAS combined with the cudaGraphInstantiateFlagsDeviceLaunch starting with CUDA 12.1 CUDA Programming and Performance nvbugs	2	737	January 5, 2024
cublasSetVectorAsync Legacy PGI Compilers	2	3080	April 5, 2012
Issue when calling cublasDdot from within kernel GPU-Accelerated Libraries	7	918	March 21, 2018
Calling cuBLAS routine from CUDA Graph Node CUDA Programming and Performance	6	1345	February 11, 2022
cuBLAS call from kernel in CUDA 10.0 GPU-Accelerated Libraries	9	4777	April 7, 2021
Does cublasSgemm executes on default stream ? CUDA Programming and Performance	1	551	February 21, 2019

Graph Capture of cublasDdot in Device Pointer Mode

Related topics