As far as I know, Volta tensor cores don't support INT8, INT4, or INT1, so the figure can only be counted in FP16 precision. And I doubt that 48 tensor cores can reach this level of performance.
Xavier NX supports INT8 operations, and it also has 2 DLA cores for inference.
The 21 TOPS is the overall performance for the GPU + 2x DLAs:
21 TOPS ≈ 12.3 (GPU) + 2 × 4.5 (each DLA)
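As a quick check of the arithmetic (numbers taken from the breakdown above; the marketed 21 TOPS figure is the rounded-down sum):

```python
# Sanity check of the quoted Xavier NX TOPS breakdown.
gpu_tops = 12.3       # GPU contribution (INT8)
dla_tops_each = 4.5   # per-DLA contribution (INT8)

total = gpu_tops + 2 * dla_tops_each
print(total)  # 21.3 — marketed (rounded) as 21 TOPS
```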
I'm quite sure that 384 Volta CUDA cores can't reach 12.3 TOPS. What generation are the NX's tensor cores? Judging by the INT8 performance, they look like Turing tensor cores.
BTW, can the DLAs and the GPU work concurrently? I mean, can a layer be implicitly split between the DLAs and the GPU cores, or do I have to manually partition the layer's problem size and dispatch the parts to the DLAs and GPU cores myself?
Please note that the performance is measured in maximum-performance mode, which is set with the following commands:
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
We have a sample that demonstrates the Jetson benchmarks. You can find an example below for running the GPU and DLAs together:
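One common way to run the GPU and both DLAs concurrently is to build a separate TensorRT engine per accelerator and launch them in parallel processes. A minimal sketch with `trtexec` (the model filename and output paths are placeholders; `--allowGPUFallback` lets layers the DLA cannot run fall back to the GPU):

```shell
# Build one INT8 engine per DLA core, plus one for the GPU (model.onnx is a placeholder).
trtexec --onnx=model.onnx --int8 --useDLACore=0 --allowGPUFallback --saveEngine=dla0.engine
trtexec --onnx=model.onnx --int8 --useDLACore=1 --allowGPUFallback --saveEngine=dla1.engine
trtexec --onnx=model.onnx --int8 --saveEngine=gpu.engine

# Run the three engines concurrently; each process owns one accelerator.
trtexec --loadEngine=dla0.engine --useDLACore=0 &
trtexec --loadEngine=dla1.engine --useDLACore=1 &
trtexec --loadEngine=gpu.engine &
wait
```

Note that TensorRT does not implicitly split a single layer across the DLAs and the GPU; concurrency comes from running independent engines (or independent streams) side by side.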
It would be better to list the models' FLOPs or MACs.
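For reference, a common way to count MACs by hand for a convolution layer (the layer sizes below are illustrative, not from any model in this thread):

```python
# Rough MAC count for a 2D convolution layer.
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    # Each output element needs k * k * c_in multiply-accumulates.
    return c_out * h_out * w_out * k * k * c_in

# Example: 3x3 conv, 64 -> 128 channels, 56x56 output feature map.
macs = conv2d_macs(64, 128, 3, 56, 56)
flops = 2 * macs  # one multiply + one add per MAC
print(macs, flops)  # 231211008 462422016
```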