cuBLAS 12 graphs cannot be used as child graphs because of stream-ordered memory allocation

cuBLAS 12 sometimes inserts stream-ordered allocation nodes into captured graphs, which makes the graph incompatible for use as a child graph (per the documentation for cudaGraphAddMemAllocNode).

I ran into this in CUDA 12, where a call to cudaGraphAddChildGraphNode returns error 801 (cudaErrorNotSupported, "operation not supported"). The same code works in CUDA 11 (I'm in the process of upgrading).

I gather this allocation has to do with cuBLAS workspace management, but the docs are not entirely clear on the semantics, or on how to control it when building graphs.
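For reference, one way to confirm whether a captured graph contains such nodes is to walk it and print the node types. A quick sketch, meant to be dropped in after cudaStreamEndCapture in the repro posted below, reusing its CUDA_CHECK macro and includes:

size_t num_nodes = 0;
CUDA_CHECK(cudaGraphGetNodes(cublas_graph, nullptr, &num_nodes));
std::vector<cudaGraphNode_t> nodes(num_nodes);
CUDA_CHECK(cudaGraphGetNodes(cublas_graph, nodes.data(), &num_nodes));
for (size_t i = 0; i < num_nodes; ++i) {
    cudaGraphNodeType type;
    CUDA_CHECK(cudaGraphNodeGetType(nodes[i], &type));
    // cudaGraphNodeTypeMemAlloc is what disqualifies the graph as a child graph
    std::printf("node %zu: type %d%s\n", i, static_cast<int>(type),
                type == cudaGraphNodeTypeMemAlloc ? " (mem alloc)" : "");
}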

I was hoping Nvidia could clarify:

  1. Precisely when might cuBLAS use stream-ordered allocations, and can we avoid them by preallocating workspaces?

  2. What is the best practice for cuBLAS workspaces when building graphs?

  3. Will memory allocation nodes be supported in child graphs eventually, or is this a fundamental limitation that is here to stay?

Cheers


Here is a full repro example, based loosely on the TRSM example in the documentation:

#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

#define CUDA_CHECK(err)                                                        \
    do {                                                                       \
        cudaError_t err_ = (err);                                              \
        if (err_ != cudaSuccess) {                                             \
            std::printf("CUDA error %d at %s:%d\n", err_, __FILE__, __LINE__); \
            std::abort();                                                      \
        }                                                                      \
    } while (0)
#define CUBLAS_CHECK(err)                                                        \
    do {                                                                         \
        cublasStatus_t err_ = (err);                                             \
        if (err_ != CUBLAS_STATUS_SUCCESS) {                                     \
            std::printf("cublas error %d at %s:%d\n", err_, __FILE__, __LINE__); \
            std::abort();                                                        \
        }                                                                        \
    } while (0)

int main(int argc, char *argv[]) {
    int m = 256;
    if (argc > 1) {
        m = atoi(argv[1]);
    }
    int n = 256;
    if (argc > 2) {
        n = atoi(argv[2]);
    }
    cublasHandle_t cublasH = NULL;
    const int lda = n;
    const int ldb = m;

    std::printf("m = %d n = %d lda = %d ldb = %d\n", m, n, lda, ldb);

    cublasSideMode_t side = CUBLAS_SIDE_RIGHT;
    cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
    cublasOperation_t transa = CUBLAS_OP_N;
    cublasDiagType_t diag = CUBLAS_DIAG_NON_UNIT;

    const double alpha = 1.0;
    double *d_A = nullptr;
    double *d_B = nullptr;
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_A), n * lda * sizeof(double)));
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_B), n * ldb * sizeof(double)));

    CUBLAS_CHECK(cublasCreate(&cublasH));

    cudaGraph_t main_graph;
    CUDA_CHECK(cudaGraphCreate(&main_graph, 0));

    cudaStream_t capture_stream = NULL;
    CUDA_CHECK(cudaStreamCreate(&capture_stream));
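    // Capture the cuBLAS call into its own graph so it can later be nested as a child node.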
    CUDA_CHECK(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeThreadLocal));

    CUBLAS_CHECK(cublasSetStream(cublasH, capture_stream));
    CUBLAS_CHECK(cublasSetPointerMode(cublasH, CUBLAS_POINTER_MODE_HOST));
    CUBLAS_CHECK(cublasDtrsm(cublasH, side, uplo, transa, diag, m, n, &alpha, d_A, lda, d_B, ldb));

    cudaGraph_t cublas_graph;
    CUDA_CHECK(cudaStreamEndCapture(capture_stream, &cublas_graph));

    cudaGraphNode_t node;
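    // This cudaGraphAddChildGraphNode call is what fails with 801 on CUDA 12
    // when the captured graph contains a mem-alloc node.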
    CUDA_CHECK(cudaGraphAddChildGraphNode(&node, main_graph, nullptr, 0, cublas_graph));

    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));

    CUBLAS_CHECK(cublasDestroy(cublasH));

    CUDA_CHECK(cudaDeviceReset());

    return EXIT_SUCCESS;
}

I build this with nvcc 12.3.52; when I run it with size 256 256, it errors:

$ ./cmd_repro_d 256 256
m = 256 n = 256 lda = 256 ldb = 256
CUDA error 801 at repro.cc:72

and with a smaller size, 32 32, it works:

$ ./cmd_repro_d 32 32
m = 32 n = 32 lda = 32 ldb = 32

If I recompile with nvcc 11.8.89, the program also works without error at the larger sizes.

Hi, this can be fixed by setting the cuBLAS workspace yourself (see cublasSetWorkspace in the cuBLAS documentation).

That way, cuBLAS won't do any allocations of its own, so you will be able to embed the graph.
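For example, something along these lines before starting the capture (a sketch reusing the macros from the repro above; the 32 MiB workspace size is just an assumption, check the cuBLAS docs for the size recommended for your GPU):

// Allocate a workspace once up front and hand it to cuBLAS, so the library
// has no reason to issue stream-ordered allocations during capture.
size_t ws_bytes = 32u * 1024 * 1024;  // assumed size; see cuBLAS docs for defaults
void *ws = nullptr;
CUDA_CHECK(cudaMalloc(&ws, ws_bytes));
CUBLAS_CHECK(cublasCreate(&cublasH));
CUBLAS_CHECK(cublasSetWorkspace(cublasH, ws, ws_bytes));
// ... then begin capture and call cublasDtrsm as in the repro above ...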


Thank you, that advice also helped me. However, shouldn't this be treated as a bug in cuBLAS? I don't see anywhere in the docs where it is stated that cuBLAS + CUDA graphs is not supported.

Thanks, I can confirm that providing a workspace does solve the problem in this repro.

Unfortunately, I can't seem to control the workspace used by cuBLAS when it is called via cuSOLVER. I tried creating a cuBLAS handle and setting a workspace on it, but cuSOLVER most likely uses its own internal cuBLAS handle.

Here is a repro calling potrf, which for size 1024 emits memory operations into the captured graph and fails with 801:

#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>

#include "cusolverDn.h"
#include "helper_cuda.h" // from CUDA samples repo

int main(int argc, char *argv[]) {
    int n = 1024;
    if (argc > 1) {
        n = atoi(argv[1]);
    }
    std::printf("n = %d\n", n);

    cublasHandle_t blas_handle = nullptr;
    checkCudaErrors(cublasCreate(&blas_handle));

    cusolverDnHandle_t solver_handle = nullptr;
    checkCudaErrors(cusolverDnCreate(&solver_handle));

    int lda = n;
    int bufferSize = 0;
    int *info = nullptr;
    double *buffer = nullptr;
    cublasFillMode_t uplo = CUBLAS_FILL_MODE_UPPER;
    double *d_A = nullptr;

    checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&d_A), n * lda * sizeof(double)));

    checkCudaErrors(cusolverDnDpotrf_bufferSize(solver_handle, uplo, n, d_A, lda, &bufferSize));

    checkCudaErrors(cudaMalloc(&info, sizeof(int)));
    checkCudaErrors(cudaMalloc(&buffer, sizeof(double) * bufferSize));

    cudaGraph_t main_graph;
    checkCudaErrors(cudaGraphCreate(&main_graph, 0));

    cudaStream_t capture_stream = NULL;
    checkCudaErrors(cudaStreamCreate(&capture_stream));
    checkCudaErrors(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeThreadLocal));
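    // Capture the potrf call; any workspace allocated by cuSOLVER's internal
    // cuBLAS handle presumably ends up as a mem-alloc node in the captured graph.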

    checkCudaErrors(cusolverDnSetStream(solver_handle, capture_stream));
    checkCudaErrors(cusolverDnDpotrf(solver_handle, uplo, n, d_A, lda, buffer, bufferSize, info));

    cudaGraph_t child_graph;
    checkCudaErrors(cudaStreamEndCapture(capture_stream, &child_graph));

    cudaGraphNode_t node;
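    // Adding the captured graph as a child node fails with 801 for n = 1024.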
    checkCudaErrors(cudaGraphAddChildGraphNode(&node, main_graph, nullptr, 0, child_graph));

    checkCudaErrors(cudaDeviceReset());

    return EXIT_SUCCESS;
}

From my point of view as a Graph API user, it would be easier to program against an API without global state or behind-the-scenes magic. I'd prefer to pass workspaces to BLAS calls explicitly; I like how cuSOLVER's _bufferSize queries work, for example, and it is a common pattern in the good old LAPACK APIs. cuSOLVER could of course include any cuBLAS workspace needs in its _bufferSize request and forward the relevant part of the buffer to any underlying cuBLAS calls. If the end user wants to use stream-ordered allocation, that is trivial to implement outside of the BLAS or LAPACK APIs.
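Roughly what I have in mind, as a purely hypothetical sketch (none of these entry points exist in cuBLAS today):

// Hypothetical explicit-workspace variant of TRSM, following the cuSOLVER
// _bufferSize pattern. These functions are made up for illustration only.
size_t ws_bytes = 0;
cublasDtrsm_bufferSize(handle, side, uplo, transa, diag, m, n, &ws_bytes);  // hypothetical
void *ws = nullptr;
cudaMalloc(&ws, ws_bytes);  // or cudaMallocAsync outside the graph, the caller's choice
cublasDtrsm_ws(handle, side, uplo, transa, diag, m, n, &alpha,
               d_A, lda, d_B, ldb, ws, ws_bytes);                           // hypothetical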

The other question is point 3 in my original post: are there fundamental reasons why child graphs are "special" in this way? Nested graphs would be a powerful feature if they were equivalent to top-level graphs, but right now they seem like a minefield. Personally, I think programming against the graph API is a much cleaner experience than juggling parallel streams and events to get parallelism and correctness.
