Hi,
thanks for the reply. I was using nvpmodel 0 already, but not jetson_clocks. Using the latter actually improved performance, but did not change the weird pre- and postprocessing times.
The implementations are 100% identical: the exact same code runs no matter which engine I load.
These are the CPU stats and timings (averaged over 1000 measurements) I got:
FP16:
AVG Latency: 33.89ms [PRE: 1.02 CUDA: 18.17 PST: 2.85]
RAM 2692/31929MB (lfb 6800x4MB) SWAP 0/15964MB (cached 0MB) CPU [7%@2265,4%@2265,7%@2265,7%@2265,6%@2265,2%@2265,34%@2265,3%@2265] EMC_FREQ 39%@2133 GR3D_FREQ 82%@1377 APE 150 MTS fg 0% bg 4% AO@35.5C GPU@40.5C Tdiode@38.25C PMIC@50C AUX@35.5C CPU@37C thermal@37.6C Tboard@35C GPU 16107/16348 CPU 1688/1766 SOC 4143/4146 CV 0/0 VDDRQ 1994/2016 SYS5V 3920/3919
RAM 2692/31929MB (lfb 6800x4MB) SWAP 0/15964MB (cached 0MB) CPU [5%@2265,4%@2265,5%@2265,5%@2265,6%@2265,6%@2265,33%@2265,10%@2265] EMC_FREQ 39%@2133 GR3D_FREQ 73%@1377 APE 150 MTS fg 0% bg 4% AO@35.5C GPU@41C Tdiode@38.5C PMIC@50C AUX@35.5C CPU@37C thermal@37.45C Tboard@35C GPU 16261/16346 CPU 1688/1764 SOC 4143/4146 CV 0/0 VDDRQ 1994/2016 SYS5V 3920/3919
RAM 2692/31929MB (lfb 6800x4MB) SWAP 0/15964MB (cached 0MB) CPU [6%@2265,7%@2265,3%@2265,2%@2265,4%@2265,3%@2265,33%@2265,5%@2265] EMC_FREQ 39%@2133 GR3D_FREQ 74%@1377 APE 150 MTS fg 0% bg 4% AO@35.5C GPU@41C Tdiode@38.75C PMIC@50C AUX@35.5C CPU@37C thermal@37.6C Tboard@35C GPU 16261/16344 CPU 1841/1766 SOC 4143/4146 CV 0/0 VDDRQ 1994/2015 SYS5V 3920/3919
Engine Bindings:
[1 3 640 640] ("images") Datatype: 0
[1 3 80 80 85] ("528") Datatype: 1
[1 3 40 40 85] ("596") Datatype: 1
[1 3 20 20 85] ("664") Datatype: 1
[1 25200 85] ("output") Datatype: 0
FP32:
AVG Latency: 59.73ms [PRE: 1.85 CUDA: 45.14 PST: 0.68]
RAM 2761/31929MB (lfb 6802x4MB) SWAP 0/15964MB (cached 0MB) CPU [6%@2265,6%@2265,9%@2265,8%@2265,8%@2265,6%@2265,8%@2265,6%@2265] EMC_FREQ 44%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 3% AO@37C GPU@43.5C Tdiode@40.25C PMIC@50C AUX@37C CPU@38C thermal@39.25C Tboard@36C GPU 20658/20480 CPU 1224/1261 SOC 4745/4614 CV 0/0 VDDRQ 2600/2600 SYS5V 4160/4130
RAM 2762/31929MB (lfb 6802x4MB) SWAP 0/15964MB (cached 0MB) CPU [6%@2265,6%@2265,5%@2265,6%@2265,4%@2265,6%@2265,10%@2265,9%@2265] EMC_FREQ 44%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 3% AO@37C GPU@43.5C Tdiode@40.25C PMIC@50C AUX@37C CPU@38C thermal@39.4C Tboard@36C GPU 20505/20480 CPU 1378/1263 SOC 4592/4613 CV 0/0 VDDRQ 2601/2600 SYS5V 4120/4130
RAM 2761/31929MB (lfb 6802x4MB) SWAP 0/15964MB (cached 0MB) CPU [5%@2265,5%@2265,4%@2265,4%@2265,5%@2265,9%@2265,12%@2265,4%@2265] EMC_FREQ 44%@2133 GR3D_FREQ 99%@1377 APE 150 MTS fg 0% bg 2% AO@37C GPU@43.5C Tdiode@40.25C PMIC@50C AUX@37C CPU@38.5C thermal@39.4C Tboard@36C GPU 20505/20481 CPU 1225/1262 SOC 4745/4616 CV 0/0 VDDRQ 2601/2600 SYS5V 4120/4130
Engine Bindings:
[1 3 640 640] ("images") Datatype: 0
[1 3 80 80 85] ("528") Datatype: 0
[1 3 40 40 85] ("594") Datatype: 0
[1 3 20 20 85] ("660") Datatype: 0
[1 25200 85] ("output") Datatype: 0
I find it very odd that for preprocessing, FP16 is faster, while for postprocessing, FP32 is faster.
The engine bindings are identical except for the datatype of the intermediate bindings (which I do not access).
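For reference, the binding listings above can be dumped with something along these lines (a sketch against the TensorRT 7/8 binding API; printBindings is just an illustrative helper):

#include <NvInfer.h>
#include <iostream>

void printBindings(const nvinfer1::ICudaEngine *engine) {
  // Datatype: 0 = kFLOAT (FP32), 1 = kHALF (FP16)
  for (int i = 0; i < engine->getNbBindings(); ++i) {
    const nvinfer1::Dims dims = engine->getBindingDimensions(i);
    std::cout << "[";
    for (int d = 0; d < dims.nbDims; ++d)
      std::cout << dims.d[d] << (d + 1 < dims.nbDims ? " " : "");
    std::cout << "] (\"" << engine->getBindingName(i) << "\") Datatype: "
              << static_cast<int>(engine->getBindingDataType(i)) << "\n";
  }
}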
Sadly, I cannot share the full code, but this is the preprocessing I do (img being a cv::Mat):
float *pFloat = static_cast<float *>(buffers[inputIndex]);
// forEach is significantly faster than all other methods to traverse the cv::Mat
img.forEach<cv::Vec3b>([&](cv::Vec3b &p, const int *position) -> void {
  // p[0-2] holds the BGR pixel, position[0-1] the row/column location.
  // Incoming data is BGR, so swap to RGB while writing the planar layout.
  // Note: the row stride is really the image width; using model_height
  // here only works because the input is square (640x640).
  int index = model_height * position[0] + position[1];
  pFloat[index] = p[2] / input_float_divisor;                  // R plane
  pFloat[model_size + index] = p[1] / input_float_divisor;     // G plane
  pFloat[2 * model_size + index] = p[0] / input_float_divisor; // B plane
});
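For comparison, the same BGR-uint8-HWC to RGB-float-CHW conversion can also be done with OpenCV's dnn helper plus one memcpy (a sketch, assuming input_float_divisor is 255, model_width mirrors model_height, and the binding buffer is CPU-accessible; I have not benchmarked this against forEach):

#include <opencv2/dnn.hpp>
#include <cstring>

// blobFromImage does resize + BGR->RGB swap + 1/255 scaling + HWC->CHW in one call
cv::Mat blob = cv::dnn::blobFromImage(img, 1.0 / 255.0,
                                      cv::Size(model_width, model_height),
                                      cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
// blob is a contiguous [1 x 3 x H x W] float Mat; copy it into the input binding
std::memcpy(buffers[inputIndex], blob.ptr<float>(), blob.total() * sizeof(float));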
Engine execution:
// Invoke asynchronous inference on the default stream
context->enqueueV2(buffers, 0, nullptr);
float *model_output = static_cast<float *>(buffers[outputIndex]);
// Wait for inference to finish before reading the output buffer
cudaStreamSynchronize(0);
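To rule out attribution problems between the CUDA and CPU stages, the enqueue can also be bracketed with CUDA events, which time the GPU work independently of where the host happens to block (a sketch; cuda_ms is illustrative):

#include <cuda_runtime.h>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0); // record on the default stream
context->enqueueV2(buffers, 0, nullptr);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // block until the inference work is done

float cuda_ms = 0.0f;
cudaEventElapsedTime(&cuda_ms, start, stop); // pure GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);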
Postprocessing:
// Layout per candidate: 0-3 -> box, 4 -> objectness, 5-84 -> COCO class confidences
const unsigned long dimensions = 5 + num_classes;
const unsigned long confidenceIndex = 4;
const unsigned long labelStartIndex = 5;
int highest_conf_index = 0;
int highest_conf_label = 0;
float highest_conf = 0.4f; // confidence threshold is 40%
for (int index = 0; index < output_size; index += dimensions) {
  float confidence = model_output[index + confidenceIndex];
  // For multi-class models, combine the objectness with the class confidences;
  // for single-class models, this step can be skipped.
  if (num_classes > 1) {
    if (confidence <= highest_conf) {
      continue;
    }
    for (unsigned long j = labelStartIndex; j < dimensions; ++j) {
      float combined_conf = model_output[index + j] * confidence;
      if (combined_conf > highest_conf) {
        highest_conf = combined_conf;
        highest_conf_index = index;
        highest_conf_label = j - labelStartIndex;
      }
    }
  } else {
    if (confidence > highest_conf) {
      highest_conf = confidence;
      highest_conf_index = index;
      highest_conf_label = 1;
    }
  }
}
// Evaluate results
if (highest_conf > 0.4f) {
  if (TRACE_LOGGING)
    ROS_INFO("Detected class %d with confidence %f", highest_conf_label, highest_conf);
  detection->propability = highest_conf;
  detection->classID = highest_conf_label;
  detection->centerX = model_output[highest_conf_index];
  detection->centerY = model_output[highest_conf_index + 1];
  detection->width = model_output[highest_conf_index + 2];
  detection->height = model_output[highest_conf_index + 3];
} else {
  detection->propability = 0.0;
}
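In case the measurement itself is in question: the PRE/CUDA/PST numbers are taken as wall-clock deltas around each stage, roughly along these lines (simplified sketch; preprocess and postprocess stand in for the code above):

#include <chrono>

using Clock = std::chrono::steady_clock;

auto t0 = Clock::now();
preprocess(img); // the forEach conversion above
auto t1 = Clock::now();
context->enqueueV2(buffers, 0, nullptr);
cudaStreamSynchronize(0); // sync before the timestamp, so GPU time lands in CUDA
auto t2 = Clock::now();
postprocess(model_output, detection); // the loop above
auto t3 = Clock::now();

auto ms = [](Clock::time_point a, Clock::time_point b) {
  return std::chrono::duration<double, std::milli>(b - a).count();
};
double pre_ms = ms(t0, t1);  // PRE
double cuda_ms = ms(t1, t2); // CUDA
double pst_ms = ms(t2, t3);  // PST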
Given that the input and output sizes are identical across both engines, I really do not see where the difference is coming from…