We are using an object-detection (O.D.) network, SSD512, and we are trying to decide whether it will run faster for us with batch=2 than with the classic computation of a single image at a time.
So we ran a test on a desktop machine with a 1080Ti, and we saw that the execution time for batch=2 is 160% of the time a single image takes. The input tensor was [N=2, C=3, H=512, W=512].
Assuming that two images executed one by one take 200% of the single-image time, batch=2 consumes less time per image on average: to be precise, 80% of the original time, thus saving 20% of the execution time. This looks promising…
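The arithmetic above can be sketched as follows (the times are normalized; only the measured 160% figure comes from our test):

```python
# Per-image cost: batch=1 vs batch=2.
t_single = 1.0               # normalized time for one image, batch=1
t_batch2 = 1.6 * t_single    # measured: two images at batch=2 take 160%

per_image_sequential = t_single     # 100% per image, one by one
per_image_batched = t_batch2 / 2    # 80% per image at batch=2

saving = 1 - per_image_batched / per_image_sequential
print(f"per-image time with batch=2: {per_image_batched:.0%}")  # 80%
print(f"saving: {saving:.0%}")                                  # 20%
```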
We tried to run the same test on Xavier, but when converting the ".uff" file from the previous test with "trtexec", it crashed after several minutes of building the computational graph with the error:
trtexec: trtexec.cpp:360: void createMemory(const nvinfer1::ICudaEngine&, std::vector<void*>&, const string&): Assertion `(bindingIndex < (int) buffers.size()) && "Input/output name not found in network"' failed.
So I took the original SSD512 ".uff" file that takes a single image as input [N=???, C=3, H=512, W=512] and ran "trtexec" on it with batch=2. The execution time was almost 200% of the single-image execution time, which means that on Xavier, using batch=2 gives no advantage.
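For reference, the command we would expect to reproduce this is something like the following sketch (the file paths and the output blob name are placeholders, not our actual values; the flags are those of the UFF-era "trtexec" from TensorRT 5/6):

```shell
# Build and time an engine from a UFF file at batch=2 (sketch; names are placeholders).
trtexec --uff=ssd512.uff \
        --uffInput=Input,3,512,512 \
        --output=NMS \
        --batch=2 \
        --saveEngine=ssd512_b2.plan
```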
We aren't using any profiling tool because we are investigating the "net" execution time of the TensorRT engine, which is a "black box" from our point of view as users of TensorRT.
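Our measurement itself is plain wall-clock timing around the engine's execute call; a minimal sketch of that harness is below ("run_inference" is a stand-in for the TensorRT execution call, which we treat as a black box, and the warm-up/iteration counts are illustrative):

```python
import time

def run_inference(batch):
    """Placeholder for the TensorRT engine's execute() call."""
    time.sleep(0.001 * len(batch))  # stand-in workload proportional to batch size

def net_time(batch, warmup=5, iters=50):
    # Warm-up iterations so one-time initialization is excluded from the average.
    for _ in range(warmup):
        run_inference(batch)
    start = time.perf_counter()
    for _ in range(iters):
        run_inference(batch)
    return (time.perf_counter() - start) / iters

t1 = net_time([0])       # batch=1
t2 = net_time([0, 1])    # batch=2
print(f"batch=2 time is {t2 / t1:.0%} of batch=1")
```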
Our questions are:
Did we do something wrong in this execution-time test? And is there a way to overcome the crash when converting the ".uff" file to a "plan" file?