ztm
August 31, 2024, 9:15am
1
Continuing the discussion from The performance of the Jetson Orin Nano module does not match the data provided on the official website:
Hi
I read the topic The performance of the Jetson Orin Nano module does not match the data provided on the official website
But I'm still confused by the last reply: "get the #operations per cycle and the #cycles per nsecond from the profiler".
Could someone explain how to calculate this from the profiler or Nsight?
I've run CUDA samples like matrixMulCUBLAS on the Orin NX platform, but the log didn't show it hitting the max TOPS listed in the Orin datasheet.
Per that topic, I think there is a method to calculate the real-time TOPS from the Nsight log or something else.
I basically know how to calculate the TOPS from the topic The tensor core performance detail of Jetson AGX Orin 32GB
But I'd like to see the real results.
Many thanks for answering the question.
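As far as I understand it, combining those two metrics would look something like the sketch below. All numbers here are placeholders for illustration, not real profiler output; the actual values would come from the profiler.

```python
# Sketch: deriving TOPS from "#operations per cycle" and "#cycles per nsecond".
# Both values below are hypothetical placeholders, not measured numbers.
ops_per_cycle = 32768      # tensor-core INT8 ops retired per GPU cycle (hypothetical)
cycles_per_ns = 0.918      # GPU clock: 918 MHz = 0.918 cycles per nanosecond (hypothetical)

gops = ops_per_cycle * cycles_per_ns   # ops per nanosecond = giga-ops per second
tops = gops / 1000.0                   # 1 TOPS = 1000 GOPS
print(f"achieved throughput: {tops:.2f} TOPS")
```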
Hi,
We recommend trying our CUTLASS library to benchmark peak performance.
You can find more details in the topic below:
Hi,
Thanks for your patience.
Sorry, there was some incorrect information in the previous suggestion.
The kernel you used before is for 1-bit dense GEMM.
To benchmark INT8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.
Since Orin only has 16 SMs, we also recommend testing this with smaller problem sizes.
Changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L…
Thanks.
ztm
September 3, 2024, 3:19am
4
Hi AastaLLL,
Thanks for your help. After reading the topic, I ran CUTLASS as below:
$ git clone https://github.com/NVIDIA/cutlass.git
$ cd cutlass/
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
$ make cutlass_profiler -j12
The result is below:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 704643072 bytes
FLOPs: 8796629893120 flops
FLOPs/Byte: 12483
Runtime: 473.512 ms
Memory: 1.38592 GiB/s
Math: 18577.4 GFLOP/s
=============================
CSV Results:
The hardware I'm using is a Jetson Orin NX 8GB with JetPack 6.0. Per the datasheet, "Jetson Orin NX 8GB: Up to 70 (Sparse) INT8 TOPs and 35 (Dense) INT8 TOPs", so after subtracting the DLA's 20 TOPs, the GPU should reach 50 sparse INT8 TOPs. I tried other m, n, k values; 18577.4 GFLOP/s is the maximum I can get.
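For reference, the efficiency relative to that 50 TOPs budget can be worked out directly from the profiler's Math line; a quick sketch using the numbers above:

```python
# Efficiency vs. the datasheet peak, using the numbers from the run above.
math_gflops = 18577.4    # "Math" line from the cutlass_profiler output
gpu_peak_tops = 50.0     # 70 sparse INT8 TOPs total minus 20 TOPs for the DLA

achieved_tops = math_gflops / 1000.0          # 1 TOPS = 1000 GFLOP/s
sol = achieved_tops / gpu_peak_tops * 100.0   # % speed-of-light
print(f"{achieved_tops:.1f} TOPS -> {sol:.1f}% of peak")
```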
Hi,
Just want to double-confirm, have you maximized the device performance first?
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
Since the Orin NX has less memory, could you try Identity4, Identity2, and Identity1 in the line below to see if it helps?
def product(X, identity = 1):
  result = identity
  for item in X:
    result *= item
  return result

  elements_per_thread = product(tile.threadblock_shape[:-1]) // product(tile.warp_count) // 32 // epilogue_steps
  return min(max_alignment, elements_per_thread)

def DefaultSwizzlingFunctor():
  return SwizzlingFunctor.Identity8
  # To use StreamK decomposition for basic GEMMs, set `swizzling_functor = SwizzlingFunctor.StreamK`
#
def CreateGemmOperator(manifest, layouts, tile_descriptions, data_type, \
  alignment_constraints, complex_transforms = None, epilogue_functor = EpilogueFunctor.LinearCombination, \
  swizzling_functor = DefaultSwizzlingFunctor()):
  if complex_transforms is None:
    complex_transforms = [(ComplexTransform.none, ComplexTransform.none),]
Thanks.
ztm
September 4, 2024, 8:22am
6
Hi AastaLLL,
Sorry, I forgot to run sudo jetson_clocks.
After running sudo jetson_clocks, the max TOPS is higher: 20456.4 GFLOP/s.
I also tried Identity8, Identity4, Identity2, and Identity1, running make cutlass_profiler -j12 after each change.
Now 21060.5 GFLOP/s is the maximum I can get.
Hi,
Thanks for the info.
We will give it a check and provide more info to you later.
Thanks.
Hi
We tested this with an Orin NX 16GB, which is expected to reach 60 TOPs on the GPU
(100 TOPs total, with the 2x DLA accounting for 40 TOPs).
The peak sparse INT8 performance we can get is 34.122 TOPs, around 56% SOL, using:
m=512, n=512, k=8192 with Identity2
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 7077888 bytes
FLOPs: 4295491584 flops
FLOPs/Byte: 606
Runtime: 0.125884 ms
Memory: 52.364 GiB/s
Math: 34122.6 GFLOP/s
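That Math figure is just the FLOP count divided by the runtime, and the FLOP count is close to the textbook 2·m·n·k for a GEMM (it is slightly larger, presumably because the epilogue is counted too). A quick cross-check with the numbers from the log above:

```python
# Cross-check the profiler's Math line: GFLOP/s = FLOPs / runtime.
flops = 4295491584       # "FLOPs" line from the run above
runtime_ms = 0.125884    # "Runtime" line from the run above

gflops = flops / (runtime_ms * 1e-3) / 1e9
print(f"{gflops:.1f} GFLOP/s")   # matches the reported 34122.6 GFLOP/s

# The count is close to the textbook 2*m*n*k for this problem size:
m = n = 512
k = 8192
print(2 * m * n * k)             # 4294967296
```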
Thanks.
ztm
September 10, 2024, 12:27am
9
Thanks AastaLLL.
I tested it on an Orin NX 8GB, which is expected to reach 50 TOPs. With m=512, n=512, k=8192 and Identity2, the result is 28.5 TOPs, about 57% SOL.
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 7077888 bytes
FLOPs: 4295491584 flops
FLOPs/Byte: 606
Runtime: 0.150457 ms
Memory: 43.8118 GiB/s
Math: 28549.6 GFLOP/s
May I know how to test the DLA performance, which is also rated in TOPs, so I can add the DLA and GPU numbers together?
Hi,
Since you have less memory, a smaller matrix size (e.g. m=256, n=256) may help.
The DLA doesn't support CUTLASS; please use the TensorRT API instead.
Below is some DLA sample for your reference:
NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.
The Orin NX 8GB has only one DLA, but it runs at the same frequency as on the Orin NX 16GB.
So the DLA's sparse INT8 peak performance should be 20 TOPs.
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html#supported-modes-and-power-efficiency
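The per-module budget works out consistently from the datasheet figures quoted in this thread; a quick sanity check:

```python
# Sanity check of the TOPs budget from the datasheet figures quoted above.
orin_nx_16gb_total = 100   # sparse INT8 TOPs, GPU + 2x DLA
orin_nx_16gb_dla = 40      # 2x DLA combined
per_dla = orin_nx_16gb_dla // 2                    # one DLA
gpu_16gb = orin_nx_16gb_total - orin_nx_16gb_dla   # GPU share on 16GB

orin_nx_8gb_total = 70     # sparse INT8 TOPs, GPU + 1x DLA
gpu_8gb = orin_nx_8gb_total - per_dla              # GPU share on 8GB
print(per_dla, gpu_16gb, gpu_8gb)                  # 20 60 50
```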
Thanks.
ztm
September 10, 2024, 4:19am
11
Thanks so much!
On my board, m=256, n=256 didn't help:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=256 --n=256 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 3473408 bytes
FLOPs: 1073872896 flops
FLOPs/Byte: 309
Runtime: 0.126499 ms
Memory: 25.5723 GiB/s
Math: 8489.21 GFLOP/s
=============================
system
Closed
October 9, 2024, 5:35am
13
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.