cuBLAS 12 graphs cannot be used as child graphs because of stream-ordered memory allocation

cuBLAS 12 sometimes inserts stream-ordered allocation nodes into captured graphs, which makes the graph incompatible for use as a child graph (per the documentation for cudaGraphAddMemAllocNode).

I ran into this in CUDA 12, where a call to cudaGraphAddChildGraphNode returns error 801 (cudaErrorNotSupported, "operation not supported"). The same code works in CUDA 11 (I'm in the process of upgrading).

I gather this allocation has to do with cuBLAS workspace management, but the docs are not entirely clear on the semantics, or on how to control it when building graphs.
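For reference, one way to confirm whether a captured graph contains such nodes is to walk it and print the node types. A quick sketch, meant to be dropped in after cudaStreamEndCapture in the repro posted below, reusing its CUDA_CHECK macro and includes:

size_t num_nodes = 0;
CUDA_CHECK(cudaGraphGetNodes(cublas_graph, nullptr, &num_nodes));
std::vector<cudaGraphNode_t> nodes(num_nodes);
CUDA_CHECK(cudaGraphGetNodes(cublas_graph, nodes.data(), &num_nodes));
for (size_t i = 0; i < num_nodes; ++i) {
    cudaGraphNodeType type;
    CUDA_CHECK(cudaGraphNodeGetType(nodes[i], &type));
    // cudaGraphNodeTypeMemAlloc is what disqualifies the graph as a child graph
    std::printf("node %zu: type %d%s\n", i, static_cast<int>(type),
                type == cudaGraphNodeTypeMemAlloc ? " (mem alloc)" : "");
}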

I was hoping Nvidia could clarify:

  1. Precisely when might cuBLAS use stream-ordered allocations, and can we avoid them by preallocating workspaces?

  2. What is the best practice for cuBLAS workspaces when building graphs?

  3. Will memory allocation nodes be supported in child graphs eventually, or is this a fundamental limitation that is here to stay?

Cheers


Here is a full repro example, based loosely on the TRSM example in the documentation:

#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

#define CUDA_CHECK(err)                                                        \
    do {                                                                       \
        cudaError_t err_ = (err);                                              \
        if (err_ != cudaSuccess) {                                             \
            std::printf("CUDA error %d at %s:%d\n", err_, __FILE__, __LINE__); \
            std::abort();                                                      \
        }                                                                      \
    } while (0)
#define CUBLAS_CHECK(err)                                                        \
    do {                                                                         \
        cublasStatus_t err_ = (err);                                             \
        if (err_ != CUBLAS_STATUS_SUCCESS) {                                     \
            std::printf("cublas error %d at %s:%d\n", err_, __FILE__, __LINE__); \
            std::abort();                                                        \
        }                                                                        \
    } while (0)

int main(int argc, char *argv[]) {
    int m = 256;
    if (argc > 1) {
        m = atoi(argv[1]);
    }
    int n = 256;
    if (argc > 2) {
        n = atoi(argv[2]);
    }
    cublasHandle_t cublasH = NULL;
    const int lda = n;
    const int ldb = m;

    std::printf("m = %d n = %d lda = %d ldb = %d\n", m, n, lda, ldb);

    cublasSideMode_t side = CUBLAS_SIDE_RIGHT;
    cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
    cublasOperation_t transa = CUBLAS_OP_N;
    cublasDiagType_t diag = CUBLAS_DIAG_NON_UNIT;

    const double alpha = 1.0;
    double *d_A = nullptr;
    double *d_B = nullptr;
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_A), n * lda * sizeof(double)));
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void **>(&d_B), n * ldb * sizeof(double)));

    CUBLAS_CHECK(cublasCreate(&cublasH));

    cudaGraph_t main_graph;
    CUDA_CHECK(cudaGraphCreate(&main_graph, 0));

    cudaStream_t capture_stream = NULL;
    CUDA_CHECK(cudaStreamCreate(&capture_stream));
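    // Capture the cuBLAS call into its own graph so it can later be nested as a child node.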
    CUDA_CHECK(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeThreadLocal));

    CUBLAS_CHECK(cublasSetStream(cublasH, capture_stream));
    CUBLAS_CHECK(cublasSetPointerMode(cublasH, CUBLAS_POINTER_MODE_HOST));
    CUBLAS_CHECK(cublasDtrsm(cublasH, side, uplo, transa, diag, m, n, &alpha, d_A, lda, d_B, ldb));

    cudaGraph_t cublas_graph;
    CUDA_CHECK(cudaStreamEndCapture(capture_stream, &cublas_graph));

    cudaGraphNode_t node;
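    // This cudaGraphAddChildGraphNode call is what fails with 801 on CUDA 12
    // when the captured graph contains a mem-alloc node.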
    CUDA_CHECK(cudaGraphAddChildGraphNode(&node, main_graph, nullptr, 0, cublas_graph));

    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));

    CUBLAS_CHECK(cublasDestroy(cublasH));

    CUDA_CHECK(cudaDeviceReset());

    return EXIT_SUCCESS;
}

I build this with nvcc 12.3.52; when I run it with size 256 256, it errors:

$ ./cmd_repro_d 256 256
m = 256 n = 256 lda = 256 ldb = 256
CUDA error 801 at repro.cc:72

and with a smaller size, 32 32, it works:

$ ./cmd_repro_d 32 32
m = 32 n = 32 lda = 32 ldb = 32

If I recompile with nvcc 11.8.89, the program also works without error at the larger sizes.

Hi, this can be fixed by setting the cuBLAS workspace yourself (see cublasSetWorkspace in the cuBLAS documentation).

That way, cuBLAS won't do any allocations of its own, so you will be able to embed the graph.
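For example, something along these lines before starting the capture (a sketch reusing the macros from the repro above; the 32 MiB workspace size is just an assumption, check the cuBLAS docs for the size recommended for your GPU):

// Allocate a workspace once up front and hand it to cuBLAS, so the library
// has no reason to issue stream-ordered allocations during capture.
size_t ws_bytes = 32u * 1024 * 1024;  // assumed size; see cuBLAS docs for defaults
void *ws = nullptr;
CUDA_CHECK(cudaMalloc(&ws, ws_bytes));
CUBLAS_CHECK(cublasCreate(&cublasH));
CUBLAS_CHECK(cublasSetWorkspace(cublasH, ws, ws_bytes));
// ... then begin capture and call cublasDtrsm as in the repro above ...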


Thank you, that advice also helped me. However, shouldn't this be treated as a bug in cuBLAS? I don't see anywhere in the docs where it is stated that cuBLAS + CUDA graphs is not supported.

Thanks, I can confirm that providing a workspace does solve the problem in this repro.

Unfortunately, I can't seem to control the workspace used by cuBLAS when it is called via cuSOLVER. I tried creating a cuBLAS handle and setting a workspace on it, but cuSOLVER most likely uses its own internal cuBLAS handle.

Here is a repro calling potrf, which for size 1024 emits memory operations into the captured graph and fails with 801:

#include <cstdio>
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>

#include "cusolverDn.h"
#include "helper_cuda.h" // from CUDA samples repo

int main(int argc, char *argv[]) {
    int n = 1024;
    if (argc > 1) {
        n = atoi(argv[1]);
    }
    std::printf("n = %d\n", n);

    cublasHandle_t blas_handle = nullptr;
    checkCudaErrors(cublasCreate(&blas_handle));

    cusolverDnHandle_t solver_handle = nullptr;
    checkCudaErrors(cusolverDnCreate(&solver_handle));

    int lda = n;
    int bufferSize = 0;
    int *info = nullptr;
    double *buffer = nullptr;
    cublasFillMode_t uplo = CUBLAS_FILL_MODE_UPPER;
    double *d_A = nullptr;

    checkCudaErrors(cudaMalloc(reinterpret_cast<void **>(&d_A), n * lda * sizeof(double)));

    checkCudaErrors(cusolverDnDpotrf_bufferSize(solver_handle, uplo, n, d_A, lda, &bufferSize));

    checkCudaErrors(cudaMalloc(&info, sizeof(int)));
    checkCudaErrors(cudaMalloc(&buffer, sizeof(double) * bufferSize));

    cudaGraph_t main_graph;
    checkCudaErrors(cudaGraphCreate(&main_graph, 0));

    cudaStream_t capture_stream = NULL;
    checkCudaErrors(cudaStreamCreate(&capture_stream));
    checkCudaErrors(cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeThreadLocal));
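    // Capture the potrf call; any workspace allocated by cuSOLVER's internal
    // cuBLAS handle presumably ends up as a mem-alloc node in the captured graph.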

    checkCudaErrors(cusolverDnSetStream(solver_handle, capture_stream));
    checkCudaErrors(cusolverDnDpotrf(solver_handle, uplo, n, d_A, lda, buffer, bufferSize, info));

    cudaGraph_t child_graph;
    checkCudaErrors(cudaStreamEndCapture(capture_stream, &child_graph));

    cudaGraphNode_t node;
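    // Adding the captured graph as a child node fails with 801 for n = 1024.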
    checkCudaErrors(cudaGraphAddChildGraphNode(&node, main_graph, nullptr, 0, child_graph));

    checkCudaErrors(cudaDeviceReset());

    return EXIT_SUCCESS;
}

From my point of view as a Graph API user, it would be easier to program against an API without global state or behind-the-scenes magic. I'd prefer to pass workspaces to BLAS calls explicitly; I like how cuSOLVER's _bufferSize queries work, for example, and it is a common pattern in the good old LAPACK APIs. cuSOLVER could of course include any cuBLAS workspace needs in its _bufferSize request and forward the relevant part of the buffer to any underlying cuBLAS calls. If the end user wants to use stream-ordered allocation, that is trivial to implement outside of the BLAS or LAPACK APIs.
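Roughly what I have in mind, as a purely hypothetical sketch (none of these entry points exist in cuBLAS today):

// Hypothetical explicit-workspace variant of TRSM, following the cuSOLVER
// _bufferSize pattern. These functions are made up for illustration only.
size_t ws_bytes = 0;
cublasDtrsm_bufferSize(handle, side, uplo, transa, diag, m, n, &ws_bytes);  // hypothetical
void *ws = nullptr;
cudaMalloc(&ws, ws_bytes);  // or cudaMallocAsync outside the graph, the caller's choice
cublasDtrsm_ws(handle, side, uplo, transa, diag, m, n, &alpha,
               d_A, lda, d_B, ldb, ws, ws_bytes);                           // hypothetical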

The other question is point 3 in my original post: are there fundamental reasons why child graphs are "special" in this way? Nested graphs would be a powerful feature if they were equivalent to top-level graphs, but right now they seem like a minefield. Personally, I think programming against the graph API is a much cleaner experience than juggling parallel streams and events to get parallelism and correctness.
