cudaGraphKernelNodeGetParams returns NULL kernelParams

I am using cudaGraphKernelNodeGetParams to retrieve the parameters of kernel nodes in a CUDA graph. However, for nodes that launch cuDNN kernels, kernelParams is NULL and gridDim and blockDim are both (1,1,1), whereas for nodes that launch at::native kernels, kernelParams, gridDim, and blockDim are retrieved correctly.

Here is the code snippet I am using:

cudaKernelNodeParams params = {};
cudaGraphKernelNodeGetParams(nodes[i], &params);
std::cout << "  GridDim:  (" << params.gridDim.x << ", " << params.gridDim.y << ", " << params.gridDim.z << ")\n";
std::cout << "  BlockDim: (" << params.blockDim.x << ", " << params.blockDim.y << ", " << params.blockDim.z << ")\n";

if (params.kernelParams) {
    std::cout << "  kernelParams:\n";
    void** kernelParams = (void**)params.kernelParams;
    for (int j = 0; kernelParams[j] != nullptr; j++) {
        std::cout << "    Param " << j << ": " << kernelParams[j] << std::endl;
    }
} else {
    std::cout << "  kernelParams is NULL.\n";
}

Can I access the parameters of cuDNN kernels?

Any insights or suggestions would be greatly appreciated!

Some information about cuDNN kernels in CUDA graphs is accessible, particularly if you are using cuDNN 9.0. The suggestions below work at the level of the cuDNN API rather than through cudaGraphKernelNodeGetParams:

  1. cuDNN Graph API: With the cuDNN Graph API, you can create an operation graph that includes tensor descriptors for inputs and outputs. This allows cuDNN to manage data flow and execution plans effectively.
  2. Engine Configurations: Inspect engine configurations and performance tuning knobs associated with each kernel call. You can query the architecture-specific engine configurations to understand which kernels are optimal for your specific computational graph.
  3. Dynamic Shape Optimization: Utilize the dynamic shape features in cuDNN version 9.0. This enables efficient plan finalization for dynamic shape operation graphs by binding previously compiled kernels to execution plans.
  4. Engine Count and Custom Heuristics: You can enumerate the available engine configurations by querying the engine count, and each engine exposes its tuning knobs. This can be used for auto-tuning alongside cuDNN’s heuristics.
  5. cuBLASLt Engine: For matrix multiplication, consider the cuBLASLt engine, especially if your use case includes the FP8 data types E4M3 and E5M2, which support various fusion operations for improved performance.
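
On the original symptom itself, one explanation that fits what you are seeing (an assumption, not something I have confirmed against cuDNN): the snippet never checks the return value of cudaGraphKernelNodeGetParams. Because dim3 default-constructs to (1,1,1), a `params = {}` struct that the call left untouched prints exactly gridDim (1,1,1), blockDim (1,1,1), and a NULL kernelParams. The runtime call can fail with cudaErrorInvalidDeviceFunction for kernels that were not registered through the runtime, which may be the case for library kernels. The sketch below checks the error, falls back to the driver API's cuGraphKernelNodeGetParams, and also looks at the `extra` field, since a node holds its arguments in either kernelParams or extra. `printKernelNode` is a hypothetical helper name, and `node` is assumed to be a kernel node from cudaGraphGetNodes.

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <iostream>

// Sketch: print launch geometry and argument storage for one kernel node.
// printKernelNode is a hypothetical helper, not part of CUDA.
void printKernelNode(cudaGraphNode_t node) {
    unsigned grid[3] = {0, 0, 0};
    unsigned block[3] = {0, 0, 0};
    void** kernelParams = nullptr;
    void** extra = nullptr;

    cudaKernelNodeParams rt = {};
    cudaError_t err = cudaGraphKernelNodeGetParams(node, &rt);
    if (err == cudaSuccess) {
        grid[0] = rt.gridDim.x;   grid[1] = rt.gridDim.y;   grid[2] = rt.gridDim.z;
        block[0] = rt.blockDim.x; block[1] = rt.blockDim.y; block[2] = rt.blockDim.z;
        kernelParams = rt.kernelParams;
        extra = rt.extra;
    } else {
        std::cout << "  runtime query failed: " << cudaGetErrorString(err)
                  << "; trying the driver API\n";
        cudaGetLastError();  // reset the runtime's last-error state

        // cudaGraphNode_t and CUgraphNode are the same underlying handle type.
        CUDA_KERNEL_NODE_PARAMS drv = {};
        CUresult res = cuGraphKernelNodeGetParams(node, &drv);
        if (res != CUDA_SUCCESS) {
            const char* name = nullptr;
            cuGetErrorName(res, &name);
            std::cout << "  driver query failed: " << (name ? name : "unknown") << "\n";
            return;
        }
        grid[0] = drv.gridDimX;   grid[1] = drv.gridDimY;   grid[2] = drv.gridDimZ;
        block[0] = drv.blockDimX; block[1] = drv.blockDimY; block[2] = drv.blockDimZ;
        kernelParams = drv.kernelParams;
        extra = drv.extra;
    }

    std::cout << "  GridDim:  (" << grid[0] << ", " << grid[1] << ", " << grid[2] << ")\n";
    std::cout << "  BlockDim: (" << block[0] << ", " << block[1] << ", " << block[2] << ")\n";

    if (kernelParams) {
        // One pointer per kernel argument. The array is NOT null-terminated and
        // the node does not record its length, so it must not be walked blindly.
        std::cout << "  kernelParams array at " << static_cast<void*>(kernelParams) << "\n";
    } else if (extra) {
        // `extra` is a list of name/value pairs ending with CU_LAUNCH_PARAM_END.
        void* buffer = nullptr;
        std::size_t bytes = 0;
        for (std::size_t i = 0; extra[i] != CU_LAUNCH_PARAM_END; i += 2) {
            if (extra[i] == CU_LAUNCH_PARAM_BUFFER_POINTER) {
                buffer = extra[i + 1];
            } else if (extra[i] == CU_LAUNCH_PARAM_BUFFER_SIZE) {
                bytes = *static_cast<std::size_t*>(extra[i + 1]);
            }
        }
        std::cout << "  packed argument buffer at " << buffer
                  << " (" << bytes << " bytes)\n";
    } else {
        std::cout << "  no argument storage reported for this node\n";
    }
}
```

Call it as `printKernelNode(nodes[i])` in place of the original loop body, and link against the driver library (for example with -lcuda) in addition to the runtime. Note that the loop in the question, which walks kernelParams until it hits nullptr, reads past the end of the array for the same reason given in the comment above: the array has no terminator. Even when the packed buffer or the pointer array is available, interpreting the bytes still requires knowing the kernel's signature, which cuDNN does not publish.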