Kernel output is correct, but results are NaN when profiling with Nsight Compute

Hi Forum,
I wrote some matrix multiplication kernels and want to profile them with Nsight Compute. To do that I use a script, ./test_matrix_mul.py, which calls the kernels through pybind11 bindings and compares the results against np.matmul. When I run the kernels without Nsight Compute, the results are all correct (up to some numerical error, since some of the kernels use __half):

$ python3 ./test_matrix_mul.py 
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 45.856735 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 39.917503 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.993728 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 4.558400 ms.
naive version max abs err:      0.003662109375 (0.0001946256816154346%)
half version max abs err:       32.0 (1.2364760041236877%)
SIMT version max abs err:       34.0 (1.3137557543814182%)
SIMT with pipeline version max abs err: 34.0 (1.3137557543814182%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 35.100735 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.672640 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.346176 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.777856 ms.
naive version max abs err:      0.004150390625 (0.00019612816686276346%)
half version max abs err:       32.0 (1.4414414763450623%)
SIMT version max abs err:       34.0 (1.5315315686166286%)
SIMT with pipeline version max abs err: 34.0 (1.5315315686166286%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 30.667007 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.640640 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.346528 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.779584 ms.
naive version max abs err:      0.00390625 (0.00015077694115461782%)
half version max abs err:       34.0 (1.290812436491251%)
SIMT version max abs err:       34.0 (1.290812436491251%)
SIMT with pipeline version max abs err: 34.0 (1.290812436491251%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 30.676449 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.645311 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.346144 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.770592 ms.
naive version max abs err:      0.00390625 (0.00014346429679790162%)
half version max abs err:       38.0 (1.9348269328474998%)
SIMT version max abs err:       38.0 (1.9348269328474998%)
SIMT with pipeline version max abs err: 38.0 (1.9348269328474998%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 30.666464 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.646112 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.345984 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.782240 ms.
naive version max abs err:      0.00390625 (0.0001432469275641779%)
half version max abs err:       34.0 (1.3798701576888561%)
SIMT version max abs err:       36.0 (1.461038924753666%)
SIMT with pipeline version max abs err: 36.0 (1.461038924753666%)

But when profiling those kernels with:

sudo ncu -o matmul_simt_pipeline -f -k regex:matrix_mul_smit_pipeline_kernel_128x128 --set=full python3 ./test_matrix_mul.py 

the results are all NaN:

......
=PROF== Profiling "cast_kernel_float2half" - 28: 0%....50%....100% - 35 passes
==PROF== Profiling "cast_kernel_float2half" - 29: 0%....50%....100% - 35 passes
==PROF== Profiling "matrix_mul_smit_pipeline_kern..." - 30: 0%....50%....100% - 35 passes
==PROF== Profiling "cast_kernel_half2float" - 31: 0%....50%....100% - 35 passes
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 1438.382324 ms.
naive version max abs err:      nan (0.0%)
half version max abs err:       nan (0.0%)
SIMT version max abs err:       nan (0.0%)
SIMT with pipeline version max abs err: nan (0.0%)
==PROF== Profiling "matrix_mul_naive_kernel_32x32" - 32: 0%....50%....100% - 35 passes
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 5
.......

I am confused by this. Is this expected, or does it mean there is something wrong with my ncu setup?
Thank you!
Chengzhe

I am using ncu on an Orin developer kit. The ncu version is:

$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2022 NVIDIA Corporation
Version 2022.2.1.0 (build 32234930) (public-release)

The CUDA version is:

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Sun_Oct_23_22:16:07_PDT_2022
Cuda compilation tools, release 11.4, V11.4.315
Build cuda_11.4.r11.4/compiler.31964100_0

The default replay mode saves and restores kernel state and replays the kernels. Can you try using application replay to see if the profiler is incorrectly saving and restoring state or if that save/restore process is somehow affecting the data? You can use --replay-mode application on the CLI to enable this.
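
For reference, here is a minimal sketch of what that could look like for the command in the original post. It is a small Python helper that only assembles the ncu command line, so that the same kernel can be profiled once with kernel replay and once with application replay and the two runs compared. The helper name build_ncu_command and the per-mode suffix on the report name are assumptions made for illustration; they are not part of the original script.

```python
import shlex


def build_ncu_command(replay_mode="application"):
    """Assemble the ncu command from the original post with an explicit replay mode.

    The report name gets a per-mode suffix (an assumption made here) so that
    the kernel-replay and application-replay reports do not overwrite each other.
    """
    return [
        "sudo", "ncu",
        "--replay-mode", replay_mode,
        "-o", f"matmul_simt_pipeline_{replay_mode}", "-f",
        "-k", "regex:matrix_mul_smit_pipeline_kernel_128x128",
        "--set=full",
        "python3", "./test_matrix_mul.py",
    ]


if __name__ == "__main__":
    # Print one command line per replay mode; run each by hand and compare
    # the "max abs err" lines that the script prints.
    for mode in ("kernel", "application"):
        print(shlex.join(build_ncu_command(mode)))
```

If the NaNs only show up with kernel replay and not with application replay, that would point toward the save/restore step rather than the kernels themselves.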

Hi, czxu

Have you had a chance to try this?