Hi Forum,
I wrote some matrix multiplication kernels and want to profile them with Nsight Compute. To do that I use a script, ./test_matrix_mul.py, which calls the kernels through pybind11 bindings and compares their results against np.matmul. When I run the kernels without Nsight Compute, the results are all correct (up to some numerical error, since some of the kernels use __half):
$ python3 ./test_matrix_mul.py
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 45.856735 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 39.917503 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.993728 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 4.558400 ms.
naive version max abs err: 0.003662109375 (0.0001946256816154346%)
half version max abs err: 32.0 (1.2364760041236877%)
SIMT version max abs err: 34.0 (1.3137557543814182%)
SIMT with pipeline version max abs err: 34.0 (1.3137557543814182%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 35.100735 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.672640 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.346176 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.777856 ms.
naive version max abs err: 0.004150390625 (0.00019612816686276346%)
half version max abs err: 32.0 (1.4414414763450623%)
SIMT version max abs err: 34.0 (1.5315315686166286%)
SIMT with pipeline version max abs err: 34.0 (1.5315315686166286%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 30.667007 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.640640 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.346528 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.779584 ms.
naive version max abs err: 0.00390625 (0.00015077694115461782%)
half version max abs err: 34.0 (1.290812436491251%)
SIMT version max abs err: 34.0 (1.290812436491251%)
SIMT with pipeline version max abs err: 34.0 (1.290812436491251%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 30.676449 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.645311 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.346144 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.770592 ms.
naive version max abs err: 0.00390625 (0.00014346429679790162%)
half version max abs err: 38.0 (1.9348269328474998%)
SIMT version max abs err: 38.0 (1.9348269328474998%)
SIMT with pipeline version max abs err: 38.0 (1.9348269328474998%)
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 30.666464 ms.
cuda kernel <matrix_mul_half_kernel_32x32> runtime 29.646112 ms.
cuda kernel <matrix_mul_smit_kernel_128x128> runtime 3.345984 ms.
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 3.782240 ms.
naive version max abs err: 0.00390625 (0.0001432469275641779%)
half version max abs err: 34.0 (1.3798701576888561%)
SIMT version max abs err: 36.0 (1.461038924753666%)
SIMT with pipeline version max abs err: 36.0 (1.461038924753666%)
But when profiling those kernels with:
sudo ncu -o matmul_simt_pipeline -f -k regex:matrix_mul_smit_pipeline_kernel_128x128 --set=full python3 ./test_matrix_mul.py
all of the results come out as NaN:
......
==PROF== Profiling "cast_kernel_float2half" - 28: 0%....50%....100% - 35 passes
==PROF== Profiling "cast_kernel_float2half" - 29: 0%....50%....100% - 35 passes
==PROF== Profiling "matrix_mul_smit_pipeline_kern..." - 30: 0%....50%....100% - 35 passes
==PROF== Profiling "cast_kernel_half2float" - 31: 0%....50%....100% - 35 passes
cuda kernel <matrix_mul_smit_pipeline_kernel_128x128> runtime 1438.382324 ms.
naive version max abs err: nan (0.0%)
half version max abs err: nan (0.0%)
SIMT version max abs err: nan (0.0%)
SIMT with pipeline version max abs err: nan (0.0%)
==PROF== Profiling "matrix_mul_naive_kernel_32x32" - 32: 0%....50%....100% - 35 passes
cuda kernel <matrix_mul_naive_kernel_32x32> runtime 5
.......
I am confused by this. Is it expected, or does it mean that something is wrong with my ncu setup?
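For context, here is a minimal sketch of the kind of check behind the "max abs err" lines above. It is not the actual code from test_matrix_mul.py: the function name compare_to_reference, the relative-error formula, and the NaN count are assumptions made for illustration. It shows that a single NaN in a kernel's output is enough to turn the reported maximum error into nan, and that counting NaNs in the output helps tell a bad kernel result apart from a problem in the comparison itself.

```python
import numpy as np


def compare_to_reference(result, reference):
    """Return (max_abs_err, rel_err_percent, nan_count) for result vs reference.

    Hypothetical helper for illustration only; the percentage here is the
    max absolute error relative to the largest reference magnitude, which
    may differ from what the real script prints.
    """
    result = np.asarray(result, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Count NaNs in the kernel output before reducing anything.
    nan_count = int(np.isnan(result).sum())

    # np.max propagates NaN, so one NaN element makes the whole error NaN.
    max_abs_err = float(np.max(np.abs(result - reference)))

    denom = float(np.max(np.abs(reference)))
    rel_err_percent = 100.0 * max_abs_err / denom if denom > 0 else float("nan")
    return max_abs_err, rel_err_percent, nan_count


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
reference = np.matmul(a, b)

# Stand-ins for kernel outputs: one exact copy, one with a single NaN injected.
good = reference.copy()
bad = reference.copy()
bad[0, 0] = np.nan

print("good:", compare_to_reference(good, reference))
print("bad: ", compare_to_reference(bad, reference))
```

Printing the NaN count for each kernel's output under ncu would show whether the NaNs are already present in the buffers coming back from the GPU.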
Thank you!
Chengzhe