Env: docker image = nvcr.io/nvidia/l4t-jetpack:r35.4.1
CUDA = 11.4, TensorRT = 8.5.2
Device: Jetson AGX Orin 64G
The test model and test input (type fp32) are here: dla_test.zip (3.7 MB). Since we only compare the results between GPU and DLA, there is no reference output (or, put another way, the GPU or DLA result can be taken as the reference).
The model consists of 10 conv layers (weights initialized with xavier_uniform, biases initialized to 0, all with kernel size 3x3, padding=dilation=1, 16 channels except the model input, which has 4; the model input is 1x4x512x512 and every layer's output is 1x16x512x512). The calibration data is 20 tensors generated by np.random.randn(), and the pytorch_quantization library is used to calibrate the model with the "entropy" method.
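For reference, the model and calibration can be reproduced roughly as in the sketch below (simplified, not our exact script; keeping the library default per-channel max calibrator for the weights is an assumption here):

```python
import numpy as np
import torch
import torch.nn as nn
from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Histogram collectors on the conv inputs so that "entropy" amax can be loaded;
# the weight quantizers keep the library default (per-channel max).
quant_nn.QuantConv2d.set_default_quant_desc_input(QuantDescriptor(calib_method="histogram"))

class ConvStack(nn.Module):
    """10 conv layers, 3x3, padding=dilation=1, xavier_uniform weights, zero bias."""
    def __init__(self, num_layers=10):
        super().__init__()
        layers, in_ch = [], 4
        for _ in range(num_layers):
            conv = quant_nn.QuantConv2d(in_ch, 16, kernel_size=3, padding=1, dilation=1)
            nn.init.xavier_uniform_(conv.weight)
            nn.init.zeros_(conv.bias)
            layers.append(conv)
            in_ch = 16
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = ConvStack().eval()

# Collect statistics on 20 random tensors, then load the calibrated amax values.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()
with torch.no_grad():
    for _ in range(20):
        model(torch.from_numpy(np.random.randn(1, 4, 512, 512).astype(np.float32)))
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        if isinstance(m._calibrator, calib.HistogramCalibrator):
            m.load_calib_amax(method="entropy")   # activation amax via entropy
        else:
            m.load_calib_amax()                   # weight amax via max
        m.enable_quant()
        m.disable_calib()
```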
We have tested other models, and this model is a typical case.
We used the TensorRT C++ API to build the FP16 engine, and setDynamicRange() to build the INT8 quantized engine.
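The build flow is roughly equivalent to the following TensorRT 8.x Python sketch (our real code uses the C++ API and is not shown here; amax_of() is a placeholder for looking up the calibrated amax of a tensor by name):

```python
import tensorrt as trt

def build_int8_dla_engine(onnx_path, amax_of, use_dla=True):
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        assert parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 fallback for the INT8 build
    if use_dla:
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = 0
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

    # setDynamicRange() equivalent: symmetric range [-amax, amax] for every tensor.
    for i in range(network.num_inputs):
        t = network.get_input(i)
        t.dynamic_range = (-amax_of(t.name), amax_of(t.name))
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            t = layer.get_output(j)
            t.dynamic_range = (-amax_of(t.name), amax_of(t.name))

    return builder.build_serialized_network(network, config)
```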
For the FP16 cases, the max error for a single output pixel sometimes seems relatively large (this occurred in another test, e.g. the GPU result is 66 but the error is 6.125, which I think is extremely large for half precision), but the total error (sum of errors / sum of GPU results) stays below 1%, so it still seems acceptable.
But for the INT8 cases, the max error usually grows quite large (it looks like error accumulation), and the total error reaches 10% or even more (sometimes 30%), which should be unacceptable. Strangely, sometimes the error is acceptable (about 1% or less); it depends on the conv configuration, the weight settings, the input range, and the output ranges of all the layers. (I mean, if the amax and the output distribution of every layer are reasonable, the error may come out as a small number and a small percentage.)
Below is the result log; "err histro x: y" means there are y pixels of the DLA result whose difference from the corresponding GPU pixels is x. We also took the DLA output of intermediate layers, fed it as input to the following layers, rebuilt an engine containing only the remaining layers, and saw a similar pattern: the first layer's result is always consistent (less than 0.1%), but after several layers the error becomes large. log_u16.txt (14.5 KB)
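For clarity, the comparison metrics in the log are computed along these lines (a sketch with my own function name, not the exact test code):

```python
import numpy as np

def compare_outputs(gpu_out, dla_out):
    """Total error = sum(|gpu - dla|) / sum(|gpu|); also a histogram of
    per-pixel differences, as reported in the log ("err histro x: y")."""
    diff = np.abs(gpu_out.astype(np.float64) - dla_out.astype(np.float64))
    total_error = diff.sum() / np.abs(gpu_out).sum()
    values, counts = np.unique(diff, return_counts=True)
    return total_error, diff.max(), dict(zip(values.tolist(), counts.tolist()))
```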
I have no idea whether it is normal.
But such a difference makes the DLA unreliable if we want it to replace the GPU in some workloads.
Does anyone have experience with this?
Best regards
We will try with nvcr.io/nvidia/l4t-tensorrt:r10.3.0-devel, which seems to be the latest version (modified on June 1, 2025).
Since we use the C++ API rather than trtexec to build the network, the code is coupled with other code that is not open source, so the test log, the ONNX file, and the test input are all we can share here for now.
If reproduction is difficult for you, we just want to know whether such an error (see https://forums.developer.nvidia.com/uploads/short-url/4n0ni384sotkhSLdCczoLvRNg0W.txt) is normal, or whether you have any experience or advice on how to solve it.
Sincerely
Hi SivaRamaKrishnaNV,
Just an update.
I have written CPU logic to compute the CNN mentioned above and found:
If the accumulator for weight x input is an "int" type and all computation involving the scales and the final output (multiplying or dividing by the scale) is done in "float32", then the result is pixel-level accurate against the GPU result.
But if the computation involving the scales is done in "float16", the error pattern is quite similar to the error between DLA and GPU, and it also looks like error accumulation. In this case, an int16 accumulator leads to huge errors, which I think must be due to overflow.
The log is shared here: log_cpu.txt (14.7 KB)
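Roughly, the CPU logic looks like the sketch below (simplified, not the exact code behind the log; bias is omitted since it is zero in this model, and casting the int32 accumulator together with the scales to `scale_dtype` is exactly the float32-vs-float16 assumption being tested):

```python
import numpy as np

def int8_conv_ref(x_q, w_q, input_scale, weight_scale, output_scale,
                  scale_dtype=np.float32):
    """x_q: int8 NCHW input, w_q: int8 OIHW 3x3 weights (padding=1, zero bias),
    weight_scale: per-output-channel array; the rescale is done in scale_dtype."""
    n, c, h, w = x_q.shape
    o = w_q.shape[0]
    x_pad = np.pad(x_q.astype(np.int32), ((0, 0), (0, 0), (1, 1), (1, 1)))
    acc = np.zeros((n, o, h, w), dtype=np.int32)        # integer accumulator
    for oc in range(o):
        for ic in range(c):
            for ky in range(3):
                for kx in range(3):
                    acc[:, oc] += (np.int32(w_q[oc, ic, ky, kx])
                                   * x_pad[:, ic, ky:ky + h, kx:kx + w])
    # Rescale step (scale convention assumed here: q = x / scale):
    # out = acc * input_scale * weight_scale[oc] / output_scale,
    # with the accumulator and all scales cast to scale_dtype (float32 or float16).
    rescale = (scale_dtype(input_scale) * weight_scale.astype(scale_dtype)
               / scale_dtype(output_scale))
    out = acc.astype(scale_dtype) * rescale[None, :, None, None]
    # Requantize the intermediate result to int8, as the next layer's input.
    return np.clip(np.rint(out.astype(np.float32)), -128, 127).astype(np.int8)
```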
So does the DLA internally use float16 precision for the computation that involves the scales? I mean the weight scale, input scale, and output scale in convolution.
For example, the TensorRT pseudocode says: rescaled_gemm_out[:, i, :, :] = gemm_out[:, i, :, :] * [ output_scale / (input_scale * weights_scale[i]) ]. As far as I know, in INT8 mode the GPU uses float32 precision for this calculation, so does the DLA use float16?
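As a tiny illustration of the question (with hypothetical scale values), applying a per-channel rescale to an int32 accumulator in float32 versus float16 already shows the kind of divergence we observe:

```python
import numpy as np

rng = np.random.default_rng(0)
acc = rng.integers(-30000, 30000, size=(1, 16, 64, 64)).astype(np.int32)  # int accumulator
rescale = rng.uniform(1e-3, 5e-3, size=16)   # hypothetical combined per-channel scale factor

out_fp32 = acc.astype(np.float32) * rescale.astype(np.float32)[None, :, None, None]
out_fp16 = acc.astype(np.float16) * rescale.astype(np.float16)[None, :, None, None]

diff = np.abs(out_fp32 - out_fp16.astype(np.float32))
print("max abs diff:", diff.max(), " total relative diff:", diff.sum() / np.abs(out_fp32).sum())
```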
Hi SivaRamaKrishnaNV,
I have checked on JetPack 6.2, with CUDA 12.6 and TensorRT 10.3, and the behavior is almost the same. The DLA result is identical to the result from JetPack 5.1; the GPU result has changed slightly with the different CUDA and TensorRT versions.
I will try to decouple the related code and share it with you.
Sincerely
A friend from another team has confirmed that the behavior mentioned above is a common issue when comparing DLA results with GPU results in INT8 mode. We think it is the behavior called error accumulation. Thus the DLA must be used with caution: it may cause significant error, especially after a large number of layers, and the results must be checked in real usage scenarios.
So this topic will be treated as solved.