Discrepancy in Kernel Behavior Between SM_86 and SM_89 GPUs When Modifying CUDA Graph Nodes with Functions from Decomposed Cubins

I have been experimenting with CUDA Graphs and ran into an interesting issue while modifying kernel nodes in an executable graph. My process (sketched in code after the list) involves:

  1. Extracting the function handle from a kernel node.
  2. Retrieving the function name.
  3. Replacing the function handle with a new one by:
  • Decomposing libcublasLt.so into individual cubins.
  • Loading each cubin as a module.
  • Iterating through the modules to identify which cubin contains the function.
  • Loading the function from the new module and replacing the original function handle in the graph node.
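
Concretely, the replacement step looks roughly like the sketch below. This is a minimal driver-API sketch, not my exact code: CU_CHECK and swapNodeFunction are hypothetical names, error handling is abbreviated, cuFuncGetName requires CUDA 12.3 or newer, and the cubin images are assumed to have been extracted beforehand (e.g. with cuobjdump -xelf all libcublasLt.so).

```cpp
#include <cuda.h>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical error-check macro: print the failing call and error name.
#define CU_CHECK(call)                                          \
    do {                                                        \
        CUresult err_ = (call);                                 \
        if (err_ != CUDA_SUCCESS) {                             \
            const char *name_ = nullptr;                        \
            cuGetErrorName(err_, &name_);                       \
            std::fprintf(stderr, "%s: %s\n", #call, name_);     \
        }                                                       \
    } while (0)

// Replace the function handle of one kernel node in an instantiated graph.
// cubinImages holds the raw images extracted from libcublasLt.so.
void swapNodeFunction(CUgraphExec graphExec, CUgraphNode node,
                      const std::vector<std::string> &cubinImages) {
    // 1. Extract the node's current function handle and its name.
    CUDA_KERNEL_NODE_PARAMS params;
    CU_CHECK(cuGraphKernelNodeGetParams(node, &params));

    const char *name = nullptr;
    CU_CHECK(cuFuncGetName(&name, params.func));  // CUDA >= 12.3

    // 2. Find a cubin that both loads on this device and contains the name.
    for (const std::string &image : cubinImages) {
        CUmodule mod;
        // Loading fails (e.g. CUDA_ERROR_INVALID_SOURCE) if the cubin's
        // SM architecture is incompatible with the current device.
        if (cuModuleLoadData(&mod, image.data()) != CUDA_SUCCESS)
            continue;

        CUfunction replacement;
        if (cuModuleGetFunction(&replacement, mod, name) != CUDA_SUCCESS) {
            cuModuleUnload(mod);  // this cubin does not contain the kernel
            continue;
        }

        // 3. Swap the handle; grid/block dims and arguments are untouched.
        params.func = replacement;
        CU_CHECK(cuGraphExecKernelNodeSetParams(graphExec, node, &params));
        return;
    }
}
```

Since only params.func changes, the launch configuration and kernel arguments stay identical, so any divergence should come from the code backing the new handle.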

The function in question, cutlass_80_wmma_tensorop_f16_s161616gemm_f16_16x16_128x2_tn_align8, exists in two cubins:

  • One tagged with sm_90.
  • One tagged with sm_80.

I used the function from the sm_80 cubin, since attempting to load the sm_90 cubin produced an invalid source error (CUDA_ERROR_INVALID_SOURCE) on the SM_86 GPU. That is consistent with SASS binary-compatibility rules: a cubin loads only on devices with the same major compute capability and an equal or higher minor revision, so the sm_80 image works on both SM_86 and SM_89, while the sm_90 image works on neither.
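
For reference, this is roughly how the failure surfaces (a sketch; loadForDevice is a hypothetical helper):

```cpp
#include <cuda.h>
#include <cstdio>

// Report the device's compute capability and attempt to load one cubin.
// SASS is binary-compatible only within a major architecture and only
// upward within it: an sm_80 cubin loads on SM_86/SM_89, an sm_90 cubin
// does not. A mismatch shows up here as CUDA_ERROR_INVALID_SOURCE.
bool loadForDevice(CUdevice dev, const void *cubinImage, CUmodule *mod) {
    int major = 0, minor = 0;
    cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
    std::printf("device is sm_%d%d\n", major, minor);

    CUresult err = cuModuleLoadData(mod, cubinImage);
    if (err != CUDA_SUCCESS) {
        const char *name = nullptr;
        cuGetErrorName(err, &name);
        std::fprintf(stderr, "cuModuleLoadData failed: %s\n", name);
        return false;
    }
    return true;
}
```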

This approach worked correctly on an SM_89 GPU, with the modified graph producing the same results as the original. However, on an SM_86 GPU, the graph output diverged.

Key observations:

  • The same function behaves as expected when loaded from libcublasLt.so on SM_89.
  • On SM_86, discrepancies arise only after modifying the graph.
  • The original module contains 88 kernels, and the names match those in the extracted cubins.
  • Function attributes are identical between the original and modified function handles.
  • I initially suspected runtime compilation for SM_86, but I couldn’t locate any PTX file containing a kernel with the same name (a check for this is sketched after the list).
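
Below is a minimal sketch of the kind of attribute comparison I mean (compareHandles is a hypothetical name). CU_FUNC_ATTRIBUTE_BINARY_VERSION is the most telling one: it reports the SM version of the SASS actually backing each handle (e.g. 80 vs. 86), so it would expose a driver-side JIT or an arch-specific binary on SM_86:

```cpp
#include <cuda.h>
#include <cstdio>

// Compare a few attributes of the original and replacement handles.
void compareHandles(CUfunction original, CUfunction replacement) {
    const CUfunction_attribute attrs[] = {
        CU_FUNC_ATTRIBUTE_NUM_REGS,
        CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES,
        CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,
        CU_FUNC_ATTRIBUTE_PTX_VERSION,
        CU_FUNC_ATTRIBUTE_BINARY_VERSION,  // SM version of the backing SASS
    };
    const char *names[] = {
        "num_regs", "shared_bytes", "local_bytes",
        "ptx_version", "binary_version",
    };
    for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); ++i) {
        int a = 0, b = 0;
        cuFuncGetAttribute(&a, attrs[i], original);
        cuFuncGetAttribute(&b, attrs[i], replacement);
        std::printf("%-14s original=%d replacement=%d%s\n",
                    names[i], a, b, a == b ? "" : "  <-- differs");
    }
}
```
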
Given these observations, I am trying to understand the root cause of this mismatch. Am I simply getting lucky on SM_89, and is the approach inherently flawed?