Discrepancy in Kernel Behavior Between SM_86 and SM_89 GPUs When Modifying CUDA Graph Nodes with Functions from Decomposed Cubins

I have been experimenting with CUDA Graphs and ran into an interesting issue while modifying kernel nodes in an executable graph. My process (sketched in code after the list) involves:

  1. Extracting the function handle from a kernel node.
  2. Retrieving the function name.
  3. Replacing the function handle with a new one by:
  • Decomposing libcublasLt.so into individual cubins.
  • Loading each cubin as a module.
  • Iterating through the modules to identify which cubin contains the function.
  • Loading the function from the new module and replacing the original function handle in the graph node.
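
Concretely, the replacement step looks roughly like the sketch below. This is a minimal driver-API sketch, not my exact code: CU_CHECK and swapNodeFunction are hypothetical names, error handling is abbreviated, cuFuncGetName requires CUDA 12.3 or newer, and the cubin images are assumed to have been extracted beforehand (e.g. with cuobjdump -xelf all libcublasLt.so).

```cpp
#include <cuda.h>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical error-check macro: print the failing call and error name.
#define CU_CHECK(call)                                          \
    do {                                                        \
        CUresult err_ = (call);                                 \
        if (err_ != CUDA_SUCCESS) {                             \
            const char *name_ = nullptr;                        \
            cuGetErrorName(err_, &name_);                       \
            std::fprintf(stderr, "%s: %s\n", #call, name_);     \
        }                                                       \
    } while (0)

// Replace the function handle of one kernel node in an instantiated graph.
// cubinImages holds the raw images extracted from libcublasLt.so.
void swapNodeFunction(CUgraphExec graphExec, CUgraphNode node,
                      const std::vector<std::string> &cubinImages) {
    // 1. Extract the node's current function handle and its name.
    CUDA_KERNEL_NODE_PARAMS params;
    CU_CHECK(cuGraphKernelNodeGetParams(node, &params));

    const char *name = nullptr;
    CU_CHECK(cuFuncGetName(&name, params.func));  // CUDA >= 12.3

    // 2. Find a cubin that both loads on this device and contains the name.
    for (const std::string &image : cubinImages) {
        CUmodule mod;
        // Loading fails (e.g. CUDA_ERROR_INVALID_SOURCE) if the cubin's
        // SM architecture is incompatible with the current device.
        if (cuModuleLoadData(&mod, image.data()) != CUDA_SUCCESS)
            continue;

        CUfunction replacement;
        if (cuModuleGetFunction(&replacement, mod, name) != CUDA_SUCCESS) {
            cuModuleUnload(mod);  // this cubin does not contain the kernel
            continue;
        }

        // 3. Swap the handle; grid/block dims and arguments are untouched.
        params.func = replacement;
        CU_CHECK(cuGraphExecKernelNodeSetParams(graphExec, node, &params));
        return;
    }
}
```

Since only params.func changes, the launch configuration and kernel arguments stay identical, so any divergence should come from the code backing the new handle.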

The function in question, cutlass_80_wmma_tensorop_f16_s161616gemm_f16_16x16_128x2_tn_align8, exists in two cubins:

  • One tagged with sm_90.
  • One tagged with sm_80.

I used the function from the sm_80 cubin, since attempting to load the sm_90 cubin produced an invalid source error (CUDA_ERROR_INVALID_SOURCE) on the SM_86 GPU. That is consistent with SASS binary-compatibility rules: a cubin loads only on devices with the same major compute capability and an equal or higher minor revision, so the sm_80 image works on both SM_86 and SM_89, while the sm_90 image works on neither.
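
For reference, this is roughly how the failure surfaces (a sketch; loadForDevice is a hypothetical helper):

```cpp
#include <cuda.h>
#include <cstdio>

// Report the device's compute capability and attempt to load one cubin.
// SASS is binary-compatible only within a major architecture and only
// upward within it: an sm_80 cubin loads on SM_86/SM_89, an sm_90 cubin
// does not. A mismatch shows up here as CUDA_ERROR_INVALID_SOURCE.
bool loadForDevice(CUdevice dev, const void *cubinImage, CUmodule *mod) {
    int major = 0, minor = 0;
    cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
    std::printf("device is sm_%d%d\n", major, minor);

    CUresult err = cuModuleLoadData(mod, cubinImage);
    if (err != CUDA_SUCCESS) {
        const char *name = nullptr;
        cuGetErrorName(err, &name);
        std::fprintf(stderr, "cuModuleLoadData failed: %s\n", name);
        return false;
    }
    return true;
}
```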

This approach worked correctly on an SM_89 GPU, with the modified graph producing the same results as the original. However, on an SM_86 GPU, the graph output diverged.

Key observations:

  • The same function behaves as expected when loaded from libcublasLt.so on SM_89.
  • On SM_86, discrepancies arise only after modifying the graph.
  • The original module contains 88 kernels, and the names match those in the extracted cubins.
  • Function attributes are identical between the original and modified function handles.
  • I initially suspected runtime compilation for SM_86, but I couldn’t locate any PTX file containing a kernel with the same name (a check for this is sketched after the list).
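
Below is a minimal sketch of the kind of attribute comparison I mean (compareHandles is a hypothetical name). CU_FUNC_ATTRIBUTE_BINARY_VERSION is the most telling one: it reports the SM version of the SASS actually backing each handle (e.g. 80 vs. 86), so it would expose a driver-side JIT or an arch-specific binary on SM_86:

```cpp
#include <cuda.h>
#include <cstdio>

// Compare a few attributes of the original and replacement handles.
void compareHandles(CUfunction original, CUfunction replacement) {
    const CUfunction_attribute attrs[] = {
        CU_FUNC_ATTRIBUTE_NUM_REGS,
        CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES,
        CU_FUNC_ATTRIBUTE_LOCAL_SIZE_BYTES,
        CU_FUNC_ATTRIBUTE_PTX_VERSION,
        CU_FUNC_ATTRIBUTE_BINARY_VERSION,  // SM version of the backing SASS
    };
    const char *names[] = {
        "num_regs", "shared_bytes", "local_bytes",
        "ptx_version", "binary_version",
    };
    for (size_t i = 0; i < sizeof(attrs) / sizeof(attrs[0]); ++i) {
        int a = 0, b = 0;
        cuFuncGetAttribute(&a, attrs[i], original);
        cuFuncGetAttribute(&b, attrs[i], replacement);
        std::printf("%-14s original=%d replacement=%d%s\n",
                    names[i], a, b, a == b ? "" : "  <-- differs");
    }
}
```
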
Given these observations, I am trying to understand the root cause of this mismatch. Am I simply getting lucky on SM_89, and is the approach inherently flawed?