I have been experimenting with CUDA Graphs and ran into an interesting issue while modifying kernel nodes in an executable graph. My process involves:
- Extracting the function handle from a kernel node.
- Retrieving the function name.
- Replacing the function handle with a new one by:
  - Decomposing `libcublasLt.so` into individual cubins.
  - Loading each cubin as a module.
  - Iterating through the modules to identify which cubin contains the function.
  - Loading the function from the new module and replacing the original function handle in the graph node (sketched below).
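For concreteness, here is a minimal sketch of that flow with the driver API (error handling omitted; `hGraph`, `hGraphExec`, and `newFunc` stand in for the real handles, and `cuFuncGetName` requires CUDA 12+):

```cpp
#include <cuda.h>
#include <cstring>
#include <vector>

// Sketch: find the kernel node running `targetName` and point it at `newFunc`.
void patchKernelNode(CUgraph hGraph, CUgraphExec hGraphExec,
                     const char* targetName, CUfunction newFunc) {
    size_t numNodes = 0;
    cuGraphGetNodes(hGraph, nullptr, &numNodes);
    std::vector<CUgraphNode> nodes(numNodes);
    cuGraphGetNodes(hGraph, nodes.data(), &numNodes);

    for (CUgraphNode node : nodes) {
        CUgraphNodeType type;
        cuGraphNodeGetType(node, &type);
        if (type != CU_GRAPH_NODE_TYPE_KERNEL) continue;

        // Extract the current function handle from the kernel node.
        CUDA_KERNEL_NODE_PARAMS params;
        cuGraphKernelNodeGetParams(node, &params);

        // Retrieve the function name (CUDA 12+).
        const char* name = nullptr;
        cuFuncGetName(&name, params.func);
        if (!name || std::strcmp(name, targetName) != 0) continue;

        // Swap the handle and push the update into the executable graph.
        params.func = newFunc;
        cuGraphExecKernelNodeSetParams(hGraphExec, node, &params);
    }
}
```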
The function in question, `cutlass_80_wmma_tensorop_f16_s161616gemm_f16_16x16_128x2_tn_align8`, exists in two cubins:
- One tagged with `sm_90`.
- One tagged with `sm_80`.
I used the function from the `sm_80` cubin, since attempting to load the `sm_90` one resulted in an "invalid source" error on GPUs with the SM_86 architecture.
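For reference, the lookup over the extracted cubins is roughly the following (a sketch; `cubinPaths` is the file list produced by `cuobjdump -xelf all libcublasLt.so`, and error handling is reduced to skipping whatever fails to load):

```cpp
#include <cuda.h>
#include <string>
#include <vector>

// Sketch: load each extracted cubin and return the first handle for `name`.
CUfunction findFunction(const std::vector<std::string>& cubinPaths,
                        const char* name) {
    for (const std::string& path : cubinPaths) {
        CUmodule mod = nullptr;
        // The sm_90 cubin fails to load on an SM_86 device (the "invalid
        // source" error mentioned above), so load failures are skipped.
        if (cuModuleLoad(&mod, path.c_str()) != CUDA_SUCCESS) continue;

        CUfunction func = nullptr;
        if (cuModuleGetFunction(&func, mod, name) == CUDA_SUCCESS)
            return func;  // the module must stay loaded while func is in use

        cuModuleUnload(mod);  // function not in this cubin; release the module
    }
    return nullptr;
}
```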
This approach worked correctly on an SM_89 GPU, with the modified graph producing the same results as the original. However, on an SM_86 GPU, the graph output diverged.
Key observations:
- The same function behaves as expected when loaded from `libcublasLt.so` on SM_89.
- On SM_86, discrepancies arise only after modifying the graph.
- The original module contains 88 kernels, and the names match those in the extracted cubins.
- Function attributes are identical between the original and modified function handles (comparison sketched after this list).
- I initially suspected JIT compilation from PTX for SM_86, but I couldn't locate any PTX image containing a kernel with that name.
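The attribute comparison amounts to querying both handles with `cuFuncGetAttribute` (a minimal sketch; the attribute list below is illustrative, not exhaustive, with `CU_FUNC_ATTRIBUTE_BINARY_VERSION` the most telling one, since it reports the architecture the loaded code was actually compiled for):

```cpp
#include <cuda.h>
#include <cstdio>

// Sketch: print selected attributes of both handles side by side.
void compareAttributes(CUfunction original, CUfunction replacement) {
    const CUfunction_attribute attrs[] = {
        CU_FUNC_ATTRIBUTE_BINARY_VERSION,     // SM version the binary targets
        CU_FUNC_ATTRIBUTE_PTX_VERSION,        // PTX version, if JIT-compiled
        CU_FUNC_ATTRIBUTE_NUM_REGS,           // registers per thread
        CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES,  // static shared memory
        CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
    };
    for (CUfunction_attribute a : attrs) {
        int origVal = 0, replVal = 0;
        cuFuncGetAttribute(&origVal, a, original);
        cuFuncGetAttribute(&replVal, a, replacement);
        printf("attr %d: original=%d replacement=%d%s\n", (int)a,
               origVal, replVal, origVal != replVal ? "  <-- differs" : "");
    }
}
```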
Given these observations, I am trying to understand the root cause of this mismatch. Am I simply getting lucky on SM_89, and the approach is inherently flawed?