I’ve run CUDA samples such as matrixMulCUBLAS on the Orin NX platform, but the log doesn’t show it reaching the maximum TOPS listed in the Orin datasheet.
Per that topic, I think there is a way to calculate the real-time TOPS from an Nsight log or something similar.
Hi Aastalll,
Thanks for your help. After reading the topic, I ran CUTLASS as below:
$ git clone https://github.com/NVIDIA/cutlass.git
$ cd cutlass/
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
$ make cutlass_profiler -j12
The result is below:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 704643072 bytes
FLOPs: 8796629893120 flops
FLOPs/Byte: 12483
Runtime: 473.512 ms
Memory: 1.38592 GiB/s
Math: 18577.4 GFLOP/s
=============================
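For anyone who wants to check how the Math line follows from the other fields, here is a minimal Python sketch. It assumes the operation count is 2*m*n*k for the multiply-accumulates plus 2*m*n for the alpha/beta epilogue; that formula is inferred from the numbers in this log (it reproduces the reported FLOPs exactly), not taken from the profiler source. The function names are only illustrative. The profiler labels the result GFLOP/s even for this INT8 kernel, so 1000 GFLOP/s is read here as 1 TOPS.

```python
# Values copied from the profiler output above.
m = n = k = 16384
runtime_ms = 473.512


def gemm_ops(m: int, n: int, k: int) -> int:
    """Operation count for one GEMM.

    Assumption: 2*m*n*k (multiply + add) plus 2*m*n for the epilogue.
    This matches the 'FLOPs' field in the log above.
    """
    return 2 * m * n * k + 2 * m * n


def throughput_gops(ops: int, runtime_ms: float) -> float:
    """Convert an operation count and a runtime in ms to giga-ops per second."""
    return ops / (runtime_ms * 1e-3) / 1e9


flops = gemm_ops(m, n, k)
gops = throughput_gops(flops, runtime_ms)
tops = gops / 1000.0

print(f"ops        : {flops}")
print(f"throughput : {gops:.1f} GFLOP/s")
print(f"           = {tops:.2f} TOPS")
```

Since the count uses the full m, n, k and not the compressed sparse operand, the figure appears to be a dense-equivalent rate, which is the number to hold against the "Sparse" datasheet value.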
CSV Results:
The hardware I’m using is a Jetson Orin NX 8GB with JetPack 6.0. Per the datasheet, “Jetson Orin NX 8GB: Up to 70 (Sparse) INT8 TOPs and 35 (Dense) INT8 TOPs”. After subtracting the DLA’s 20 TOPS, the GPU should reach 50 sparse INT8 TOPS. I tried other m, n, k values, but 18577.4 GFLOP/s is the maximum I can get.
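To put the gap in numbers, the arithmetic above can be sketched as follows. The 70 TOPS figure is the datasheet quote and the 20 TOPS DLA share is the value used in this post; both are taken as given here, and the helper name is only illustrative.

```python
# Figures from this post; the DLA share is the poster's number, taken as given.
datasheet_sparse_tops = 70.0
dla_tops = 20.0
measured_gops = 18577.4  # 'Math' line from cutlass_profiler


def utilization(measured_gops: float, peak_tops: float) -> float:
    """Fraction of the peak reached, with the measurement given in giga-ops/s."""
    return (measured_gops / 1000.0) / peak_tops


gpu_peak_tops = datasheet_sparse_tops - dla_tops
util = utilization(measured_gops, gpu_peak_tops)

print(f"GPU-only sparse INT8 peak: {gpu_peak_tops:.0f} TOPS")
print(f"measured: {measured_gops / 1000.0:.2f} TOPS ({util:.1%} of peak)")
```

So this run reaches a bit over a third of the expected GPU-only peak.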
Hi Aastalll,
Sorry, I forgot to run sudo jetson_clocks.
After running sudo jetson_clocks, the maximum is higher: 20456.4 GFLOP/s.
I also tried Identity8, Identity4, Identity2, and Identity1, rebuilding with make cutlass_profiler -j12 after each change.
Now 21060.5 GFLOP/s is the maximum I can get.
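As a quick summary of how much each change helped, here is a small Python sketch comparing the three measurements from this thread. The run labels are my own shorthand, and the numbers are the Math values reported above.

```python
# (label, GFLOP/s) for each measurement reported in this thread.
runs = [
    ("default clocks", 18577.4),
    ("after sudo jetson_clocks", 20456.4),
    ("after Identity change and rebuild", 21060.5),
]


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old` (0.10 means +10%)."""
    return new / old - 1.0


baseline = runs[0][1]
previous = baseline
for label, gops in runs:
    print(
        f"{label:36s} {gops / 1000.0:6.2f} TOPS  "
        f"{relative_gain(gops, previous):+6.1%} vs previous  "
        f"{relative_gain(gops, baseline):+6.1%} vs first run"
    )
    previous = gops
```

Locking the clocks accounts for most of the improvement; the Identity change adds a few percent on top.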