YOLOv4 inference on DLA

Take YOLOv4 inference on DLA as an example. Is the DLA fallback to GPU very time-consuming?

Hi,

Here is a quick test on a Xavier NX.

Using the YOLOv4 320x540 model from the repository below:
https://github.com/Tianxiaomo/pytorch-YOLOv4#4-pytorch2onnx

1. Pure GPU performance:
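
For reference, these numbers come from trtexec. A command along these lines should reproduce the baseline (the ONNX filename is a placeholder, and the --fp16 flag is an assumption to match the FP16-only DLA run below):

trtexec --onnx=yolov4.onnx --fp16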

[01/06/2021-15:36:38] [I] Host Latency
[01/06/2021-15:36:38] [I] min: 18.1946 ms (end to end 18.2075 ms)
[01/06/2021-15:36:38] [I] max: 19.0474 ms (end to end 19.0564 ms)
[01/06/2021-15:36:38] [I] mean: 18.5169 ms (end to end 18.5275 ms)
[01/06/2021-15:36:38] [I] median: 18.5977 ms (end to end 18.6089 ms)
[01/06/2021-15:36:38] [I] percentile: 19.0245 ms at 99% (end to end 19.0388 ms at 99%)
[01/06/2021-15:36:38] [I] throughput: 53.9717 qps
[01/06/2021-15:36:38] [I] walltime: 3.03863 s
[01/06/2021-15:36:38] [I] Enqueue Time
[01/06/2021-15:36:38] [I] min: 8.42456 ms
[01/06/2021-15:36:38] [I] max: 11.4478 ms
[01/06/2021-15:36:38] [I] median: 10.5695 ms
[01/06/2021-15:36:38] [I] GPU Compute
[01/06/2021-15:36:38] [I] min: 17.9294 ms
[01/06/2021-15:36:38] [I] max: 18.7791 ms
[01/06/2021-15:36:38] [I] mean: 18.2513 ms
[01/06/2021-15:36:38] [I] median: 18.3324 ms
[01/06/2021-15:36:38] [I] percentile: 18.7535 ms at 99%
[01/06/2021-15:36:38] [I] total compute time: 2.99321 s

2. DLA+fallback performance:
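
The DLA run adds the core selection and fallback flags (same placeholder filename; --fp16 is required here since DLA only supports FP16 and INT8 precision):

trtexec --onnx=yolov4.onnx --fp16 --useDLACore=0 --allowGPUFallback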

[01/06/2021-15:50:47] [I] Host Latency
[01/06/2021-15:50:47] [I] min: 38.6486 ms (end to end 38.6528 ms)
[01/06/2021-15:50:47] [I] max: 39.8794 ms (end to end 39.8906 ms)
[01/06/2021-15:50:47] [I] mean: 39.0705 ms (end to end 39.0809 ms)
[01/06/2021-15:50:47] [I] median: 39.054 ms (end to end 39.0665 ms)
[01/06/2021-15:50:47] [I] percentile: 39.8794 ms at 99% (end to end 39.8906 ms at 99%)
[01/06/2021-15:50:47] [I] throughput: 25.5874 qps
[01/06/2021-15:50:47] [I] walltime: 3.08745 s
[01/06/2021-15:50:47] [I] Enqueue Time
[01/06/2021-15:50:47] [I] min: 18.6276 ms
[01/06/2021-15:50:47] [I] max: 40.4688 ms
[01/06/2021-15:50:47] [I] median: 38.8071 ms
[01/06/2021-15:50:47] [I] GPU Compute
[01/06/2021-15:50:47] [I] min: 38.3989 ms
[01/06/2021-15:50:47] [I] max: 39.6216 ms
[01/06/2021-15:50:47] [I] mean: 38.8175 ms
[01/06/2021-15:50:47] [I] median: 38.8005 ms
[01/06/2021-15:50:47] [I] percentile: 39.6216 ms at 99%
[01/06/2021-15:50:47] [I] total compute time: 3.06658 s

You can find the device placement for DLA+fallback below; TensorRT prints this summary at engine-build time. Since DLA has limited capacity, most of the layers run on the GPU, and the data transfer between the two processors causes some performance drop. But please also note that DLA tends to be slower than the GPU, since it is designed for energy efficiency rather than peak performance.

[01/06/2021-15:43:43] [I] [TRT] --------------- Layers running on DLA:
[01/06/2021-15:43:43] [I] [TRT] {Conv_4,scale_operand_of_Sub_666,scale_operand_of_Sub_1011,scale_operand_of_Sub_1356}, {Conv_8,Conv_12}, {Add_24,Conv_25}, {Conv_38,Conv_42}, {Add_54,Conv_55}, {Add_676,Add_683,Add_704,Add_711}, {Add_1021,Add_1028,Add_1049,Add_1056}, {Add_1366,Add_1373,Add_1394,Add_1401},
[01/06/2021-15:43:43] [I] [TRT] --------------- Layers running on GPU:
[01/06/2021-15:43:43] [I] [TRT] Conv_0, 2267[Constant], 2262[Constant], 1902[Constant], 1897[Constant], 1537[Constant], 1532[Constant], (Unnamed Layer* 649) [Constant] + (Unnamed Layer* 650) [Shuffle], (Unnamed Layer* 656) [Constant], (Unnamed Layer* 659) [Constant], (Unnamed Layer* 670) [Constant], (Unnamed Layer* 673) [Constant], (Unnamed Layer* 787) [Constant] + (Unnamed Layer* 788) [Shuffle], (Unnamed Layer* 791) [Constant] + (Unnamed Layer* 792) [Shuffle], (Unnamed Layer* 904) [Constant] + (Unnamed Layer* 905) [Shuffle], (Unnamed Layer* 911) [Constant], (Unnamed Layer* 914) [Constant], (Unnamed Layer* 925) [Constant], (Unnamed Layer* 928) [Constant], (Unnamed Layer* 1042) [Constant] + (Unnamed Layer* 1043) [Shuffle], (Unnamed Layer* 1046) [Constant] + (Unnamed Layer* 1047) [Shuffle], (Unnamed Layer* 1159) [Constant] + (Unnamed Layer* 1160) [Shuffle], (Unnamed Layer* 1166) [Constant], (Unnamed Layer* 1169) [Constant], (Unnamed Layer* 1180) [Constant], (Unnamed Layer* 1183) [Constant], (Unnamed Layer* 1297) [Constant] + (Unnamed Layer* 1298) [Shuffle], (Unnamed Layer* 1301) [Constant] + (Unnamed Layer* 1302) [Shuffle], Softplus_1, Cast_763, Cast_768, (Unnamed Layer* 707) [Shuffle], (Unnamed Layer* 713) [Shuffle], Cast_1108, Cast_1113, (Unnamed Layer* 962) [Shuffle], (Unnamed Layer* 968) [Shuffle], Cast_1453, Cast_1458, (Unnamed Layer* 1217) [Shuffle], (Unnamed Layer* 1223) [Shuffle], Tanh_2, Mul_3, Softplus_5, Tanh_6, Mul_7, Softplus_9, Softplus_13, Tanh_10, Tanh_14, Mul_11, Mul_15, Conv_16, Softplus_17, Tanh_18, Mul_19, Conv_20, Softplus_21, Tanh_22, Mul_23, Softplus_26, Tanh_27, Mul_28, Conv_30, Softplus_31, Tanh_32, Mul_33, Conv_34, Softplus_35, Tanh_36, Mul_37, Softplus_39, Softplus_43, Tanh_40, Tanh_44, Mul_41, Mul_45, Conv_46, Softplus_47, Tanh_48, Mul_49, Conv_50, Softplus_51, Tanh_52, Mul_53, Softplus_56, Tanh_57, Mul_58, Conv_59, Softplus_60, Tanh_61, Mul_62, Add_63, Conv_64, Softplus_65, Tanh_66, Mul_67, Conv_69, Softplus_70, Tanh_71, Mul_72, Conv_73, Softplus_74, Tanh_75, Mul_76, Conv_77, Conv_81, Softplus_78, Softplus_82, Tanh_79, Tanh_83, Mul_80, Mul_84, Conv_85, Softplus_86, Tanh_87, Mul_88, Conv_89, Softplus_90, Tanh_91, Mul_92, Add_93, Conv_94, Softplus_95, Tanh_96, Mul_97, Conv_98, Softplus_99, Tanh_100, Mul_101, Add_102, Conv_103, Softplus_104, Tanh_105, Mul_106, Conv_107, Softplus_108, Tanh_109, Mul_110, Add_111, Conv_112, Softplus_113, Tanh_114, Mul_115, Conv_116, Softplus_117, Tanh_118, Mul_119, Add_120, Conv_121, Softplus_122, Tanh_123, Mul_124, Conv_125, Softplus_126, Tanh_127, Mul_128, Add_129, Conv_130, Softplus_131, Tanh_132, Mul_133, Conv_134, Softplus_135, Tanh_136, Mul_137, Add_138, Conv_139, Softplus_140, Tanh_141, Mul_142, Conv_143, Softplus_144, Tanh_145, Mul_146, Add_147, Conv_148, Softplus_149, Tanh_150, Mul_151, Conv_152, Softplus_153, Tanh_154, Mul_155, Add_156, Conv_157, Softplus_158, Tanh_159, Mul_160, Conv_162, Softplus_163, Tanh_164, Mul_165, Conv_166, Conv_503, Softplus_167, LeakyRelu_504, Tanh_168, Mul_169, Conv_170, Conv_174, Softplus_171, Softplus_175, Tanh_172, Tanh_176, Mul_173, Mul_177, Conv_178, Softplus_179, Tanh_180, Mul_181, Conv_182, Softplus_183, Tanh_184, Mul_185, Add_186, Conv_187, Softplus_188, Tanh_189, Mul_190, Conv_191, Softplus_192, Tanh_193, Mul_194, Add_195, Conv_196, Softplus_197, Tanh_198, Mul_199, Conv_200, Softplus_201, Tanh_202, Mul_203, Add_204, Conv_205, Softplus_206, Tanh_207, Mul_208, Conv_209, Softplus_210, Tanh_211, Mul_212, Add_213, Conv_214, Softplus_215, Tanh_216, Mul_217, Conv_218, Softplus_219, Tanh_220, 
Mul_221, Add_222, Conv_223, Softplus_224, Tanh_225, Mul_226, Conv_227, Softplus_228, Tanh_229, Mul_230, Add_231, Conv_232, Softplus_233, Tanh_234, Mul_235, Conv_236, Softplus_237, Tanh_238, Mul_239, Add_240, Conv_241, Softplus_242, Tanh_243, Mul_244, Conv_245, Softplus_246, Tanh_247, Mul_248, Add_249, Conv_250, Softplus_251, Tanh_252, Mul_253, Conv_255, Softplus_256, Tanh_257, Mul_258, Conv_259, Conv_411, Softplus_260, LeakyRelu_412, Tanh_261, Mul_262, Conv_263, Conv_267, Softplus_264, Softplus_268, Tanh_265, Tanh_269, Mul_266, Mul_270, Conv_271, Softplus_272, Tanh_273, Mul_274, Conv_275, Softplus_276, Tanh_277, Mul_278, Add_279, Conv_280, Softplus_281, Tanh_282, Mul_283, Conv_284, Softplus_285, Tanh_286, Mul_287, Add_288, Conv_289, Softplus_290, Tanh_291, Mul_292, Conv_293, Softplus_294, Tanh_295, Mul_296, Add_297, Conv_298, Softplus_299, Tanh_300, Mul_301, Conv_302, Softplus_303, Tanh_304, Mul_305, Add_306, Conv_307, Softplus_308, Tanh_309, Mul_310, Conv_312, Softplus_313, Tanh_314, Mul_315, Conv_316, LeakyRelu_317, Conv_318, LeakyRelu_319, Conv_320, LeakyRelu_321, MaxPool_323, MaxPool_324, MaxPool_322, 1045 copy, Conv_326, LeakyRelu_327, Conv_328, LeakyRelu_329, Conv_330, LeakyRelu_331, Conv_332, LeakyRelu_333, Reshape_357, Expand_398, Reshape_410, Conv_414, LeakyRelu_415, Conv_416, LeakyRelu_417, Conv_418, LeakyRelu_419, Conv_420, LeakyRelu_421, Conv_422, LeakyRelu_423, Conv_424, LeakyRelu_425, Reshape_449, Expand_490, Reshape_502, Conv_506, LeakyRelu_507, Conv_508, LeakyRelu_509, Conv_510, LeakyRelu_511, Conv_512, LeakyRelu_513, Conv_514, LeakyRelu_515, Conv_516, Conv_519, LeakyRelu_517, LeakyRelu_520, Conv_518, Conv_522, Slice_555, Slice_560, Slice_565, Slice_570, Slice_575, Slice_580, Slice_585, Slice_590, Slice_595, Slice_600, Slice_605, Slice_610, LeakyRelu_523, Sigmoid_662, Conv_524, (Unnamed Layer* 646) [Constant] + (Unnamed Layer* 647) [Shuffle] + Mul_664, Sub_666, Exp_667, Slice_688, Slice_695, Slice_716, Slice_723, Slice_744, Slice_751, LeakyRelu_525, Slice_674, Slice_681, Slice_702, Slice_709, Slice_730, Slice_737, Reshape_644 + Transpose_645, Reshape_629, Sigmoid_668, Reshape_661, Reshape_894, Sigmoid_669, Mul_895, (Unnamed Layer* 662) [Constant] + (Unnamed Layer* 663) [Shuffle] + Mul_690, (Unnamed Layer* 666) [Constant] + (Unnamed Layer* 667) [Shuffle] + Mul_697, (Unnamed Layer* 676) [Constant] + (Unnamed Layer* 677) [Shuffle] + Mul_718, (Unnamed Layer* 680) [Constant] + (Unnamed Layer* 681) [Shuffle] + Mul_725, (Unnamed Layer* 690) [Constant] + (Unnamed Layer* 691) [Shuffle] + Mul_746, (Unnamed Layer* 694) [Constant] + (Unnamed Layer* 695) [Shuffle] + Mul_753, Conv_526, (Unnamed Layer* 684) [Constant] + Add_732, (Unnamed Layer* 687) [Constant] + Add_739, LeakyRelu_527, Conv_528, LeakyRelu_529, Conv_530, LeakyRelu_531, Conv_532, Conv_535, LeakyRelu_533, LeakyRelu_536, Conv_534, Conv_538, Slice_900, Slice_905, Slice_910, Slice_915, Slice_920, Slice_925, Slice_930, Slice_935, Slice_940, Slice_945, Slice_950, Slice_955, LeakyRelu_539, Sigmoid_1007, Conv_540, (Unnamed Layer* 901) [Constant] + (Unnamed Layer* 902) [Shuffle] + Mul_1009, Sub_1011, Exp_1012, Slice_1033, Slice_1040, Slice_1061, Slice_1068, Slice_1089, Slice_1096, LeakyRelu_541, Slice_1019, Slice_1026, Slice_1047, Slice_1054, Slice_1075, Slice_1082, Reshape_989 + Transpose_990, Reshape_974, Sigmoid_1013, Reshape_1006, Reshape_1239, Sigmoid_1014, Mul_1240, (Unnamed Layer* 917) [Constant] + (Unnamed Layer* 918) [Shuffle] + Mul_1035, (Unnamed Layer* 921) [Constant] + (Unnamed Layer* 922) [Shuffle] + Mul_1042, (Unnamed 
Layer* 931) [Constant] + (Unnamed Layer* 932) [Shuffle] + Mul_1063, (Unnamed Layer* 935) [Constant] + (Unnamed Layer* 936) [Shuffle] + Mul_1070, (Unnamed Layer* 945) [Constant] + (Unnamed Layer* 946) [Shuffle] + Mul_1091, (Unnamed Layer* 949) [Constant] + (Unnamed Layer* 950) [Shuffle] + Mul_1098, Conv_542, (Unnamed Layer* 939) [Constant] + Add_1077, (Unnamed Layer* 942) [Constant] + Add_1084, LeakyRelu_543, Conv_544, LeakyRelu_545, Conv_546, LeakyRelu_547, Conv_548, LeakyRelu_549, Conv_550, Slice_1245, Slice_1250, Slice_1255, Slice_1260, Slice_1265, Slice_1270, Slice_1275, Slice_1280, Slice_1285, Slice_1290, Slice_1295, Slice_1300, Sigmoid_1352, (Unnamed Layer* 1156) [Constant] + (Unnamed Layer* 1157) [Shuffle] + Mul_1354, Sub_1356, Exp_1357, Slice_1378, Slice_1385, Slice_1406, Slice_1413, Slice_1434, Slice_1441, Slice_1364, Slice_1371, Slice_1392, Slice_1399, Slice_1420, Slice_1427, Reshape_1334 + Transpose_1335, Reshape_1319, Sigmoid_1358, Reshape_1351, Reshape_1584, Sigmoid_1359, Mul_1585, 1679 copy, 2044 copy, 2409 copy, (Unnamed Layer* 1172) [Constant] + (Unnamed Layer* 1173) [Shuffle] + Mul_1380, (Unnamed Layer* 1176) [Constant] + (Unnamed Layer* 1177) [Shuffle] + Mul_1387, (Unnamed Layer* 1186) [Constant] + (Unnamed Layer* 1187) [Shuffle] + Mul_1408, (Unnamed Layer* 1190) [Constant] + (Unnamed Layer* 1191) [Shuffle] + Mul_1415, (Unnamed Layer* 1200) [Constant] + (Unnamed Layer* 1201) [Shuffle] + Mul_1436, (Unnamed Layer* 1204) [Constant] + (Unnamed Layer* 1205) [Shuffle] + Mul_1443, 2190 copy, 2218 copy, 2246 copy, 2197 copy, 2225 copy, 2253 copy, (Unnamed Layer* 1194) [Constant] + Add_1422, (Unnamed Layer* 1197) [Constant] + Add_1429, 1446 copy, 1474 copy, 1502 copy, 1453 copy, 1481 copy, 1509 copy, 1524 copy, 1526 copy, 1525 copy, 1527 copy, Div_764, Div_769, Slice_774, Slice_816, Slice_795, Slice_837, Reshape_790, Reshape_832, Reshape_811, Reshape_853, Mul_855, Mul_858, Sub_856, Sub_859, Add_860, Add_861, 1634 copy, 1637 copy, 1638 copy, 1639 copy, Reshape_878, 1811 copy, 1839 copy, 1867 copy, 1818 copy, 1846 copy, 1874 copy, 1889 copy, 1891 copy, 1890 copy, 1892 copy, Div_1109, Div_1114, Slice_1119, Slice_1161, Slice_1140, Slice_1182, Reshape_1135, Reshape_1177, Reshape_1156, Reshape_1198, Mul_1200, Mul_1203, Sub_1201, Sub_1204, Add_1205, Add_1206, 1999 copy, 2002 copy, 2003 copy, 2004 copy, Reshape_1223, 2176 copy, 2204 copy, 2232 copy, 2183 copy, 2211 copy, 2239 copy, 2254 copy, 2256 copy, 2255 copy, 2257 copy, Div_1454, Div_1459, Slice_1464, Slice_1506, Slice_1485, Slice_1527, Reshape_1480, Reshape_1522, Reshape_1501, Reshape_1543, Mul_1545, Mul_1548, Sub_1546, Sub_1549, Add_1550, Add_1551, 2364 copy, 2367 copy, 2368 copy, 2369 copy, Reshape_1568, 1660 copy, 2025 copy, 2390 copy,

Thanks.

Thanks for your test result. My follow-up question is: can DLA inference reduce GPU usage more effectively than normal TensorRT inference?

I am trying to reduce the GPU usage of YOLOv4 inference.

Thanks.

Please try it. If you still need suggestions, please open a new topic. Thanks
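
For reference, here is a minimal sketch of how to build such a DLA engine with the TensorRT Python API so you can measure GPU load yourself; the model and engine paths are placeholders:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

# Parse the ONNX model into an explicit-batch TensorRT network.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("yolov4.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parsing failed")

# Build configuration: prefer DLA for every layer, fall back to the GPU
# for anything DLA cannot run, and enable FP16 (required by DLA).
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30              # 1 GiB build workspace
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0                              # Xavier NX has DLA cores 0 and 1

engine = builder.build_engine(network, config)
with open("yolov4_dla.engine", "wb") as f:       # placeholder path
    f.write(engine.serialize())

While the engine is running, you can watch GPU utilization with tegrastats to see how much work actually moved off the GPU. Layers placed on DLA do not consume GPU compute, but as shown in the placement log above, most YOLOv4 layers fall back to the GPU, so the saving may be limited.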