@Fiona.Chen
Thank you for the quick reply!
I don’t think the issue is with my model. I have verified the engine file with trtexec, as the following log shows:
/usr/src/tensorrt/bin/trtexec --loadEngine=/media/usb/models/reswapper_dynamic.onnx_b16_gpu0_fp32.engine --shapes=target:16x3x128x128,source:16x512
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=/media/usb/models/reswapper_dynamic.onnx_b16_gpu0_fp32.engine --shapes=target:16x3x128x128,source:16x512
[06/04/2025-19:13:30] [I] === Model Options ===
[06/04/2025-19:13:30] [I] Format: *
[06/04/2025-19:13:30] [I] Model:
[06/04/2025-19:13:30] [I] Output:
[06/04/2025-19:13:30] [I]
[06/04/2025-19:13:30] [I] === System Options ===
[06/04/2025-19:13:30] [I] Device: 0
[06/04/2025-19:13:30] [I] DLACore:
[06/04/2025-19:13:30] [I] Plugins:
[06/04/2025-19:13:30] [I] setPluginsToSerialize:
[06/04/2025-19:13:30] [I] dynamicPlugins:
[06/04/2025-19:13:30] [I] ignoreParsedPluginLibs: 0
[06/04/2025-19:13:30] [I]
[06/04/2025-19:13:30] [I] === Inference Options ===
[06/04/2025-19:13:30] [I] Batch: Explicit
[06/04/2025-19:13:30] [I] Input inference shape : source=16x512
[06/04/2025-19:13:30] [I] Input inference shape : target=16x3x128x128
[06/04/2025-19:13:30] [I] Iterations: 10
[06/04/2025-19:13:30] [I] Duration: 3s (+ 200ms warm up)
[06/04/2025-19:13:30] [I] Sleep time: 0ms
[06/04/2025-19:13:30] [I] Idle time: 0ms
[06/04/2025-19:13:30] [I] Inference Streams: 1
[06/04/2025-19:13:30] [I] ExposeDMA: Disabled
[06/04/2025-19:13:30] [I] Data transfers: Enabled
[06/04/2025-19:13:30] [I] Spin-wait: Disabled
[06/04/2025-19:13:30] [I] Multithreading: Disabled
[06/04/2025-19:13:30] [I] CUDA Graph: Disabled
[06/04/2025-19:13:30] [I] Separate profiling: Disabled
[06/04/2025-19:13:30] [I] Time Deserialize: Disabled
[06/04/2025-19:13:30] [I] Time Refit: Disabled
[06/04/2025-19:13:30] [I] NVTX verbosity: 0
[06/04/2025-19:13:30] [I] Persistent Cache Ratio: 0
[06/04/2025-19:13:30] [I] Optimization Profile Index: 0
[06/04/2025-19:13:30] [I] Weight Streaming Budget: 100.000000%
[06/04/2025-19:13:30] [I] Inputs:
[06/04/2025-19:13:30] [I] Debug Tensor Save Destinations:
[06/04/2025-19:13:30] [I] === Reporting Options ===
[06/04/2025-19:13:30] [I] Verbose: Disabled
[06/04/2025-19:13:30] [I] Averages: 10 inferences
[06/04/2025-19:13:30] [I] Percentiles: 90,95,99
[06/04/2025-19:13:30] [I] Dump refittable layers:Disabled
[06/04/2025-19:13:30] [I] Dump output: Disabled
[06/04/2025-19:13:30] [I] Profile: Disabled
[06/04/2025-19:13:30] [I] Export timing to JSON file:
[06/04/2025-19:13:30] [I] Export output to JSON file:
[06/04/2025-19:13:30] [I] Export profile to JSON file:
[06/04/2025-19:13:30] [I]
[06/04/2025-19:13:30] [I] === Device Information ===
[06/04/2025-19:13:30] [I] Available Devices:
[06/04/2025-19:13:30] [I] Device 0: "Orin" UUID: GPU-8d2a93dd-b960-5cb3-86c0-c70c99cd0a0e
[06/04/2025-19:13:30] [I] Selected Device: Orin
[06/04/2025-19:13:30] [I] Selected Device ID: 0
[06/04/2025-19:13:30] [I] Selected Device UUID: GPU-8d2a93dd-b960-5cb3-86c0-c70c99cd0a0e
[06/04/2025-19:13:30] [I] Compute Capability: 8.7
[06/04/2025-19:13:30] [I] SMs: 16
[06/04/2025-19:13:30] [I] Device Global Memory: 62840 MiB
[06/04/2025-19:13:30] [I] Shared Memory per SM: 164 KiB
[06/04/2025-19:13:30] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/04/2025-19:13:30] [I] Application Compute Clock Rate: 1.3 GHz
[06/04/2025-19:13:30] [I] Application Memory Clock Rate: 1.3 GHz
[06/04/2025-19:13:30] [I]
[06/04/2025-19:13:30] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[06/04/2025-19:13:30] [I]
[06/04/2025-19:13:30] [I] TensorRT version: 10.3.0
[06/04/2025-19:13:30] [I] Loading standard plugins
[06/04/2025-19:13:31] [I] [TRT] Loaded engine size: 78 MiB
[06/04/2025-19:13:31] [I] Engine deserialized in 0.0739509 sec.
[06/04/2025-19:13:31] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +289, now: CPU 0, GPU 366 (MiB)
[06/04/2025-19:13:31] [I] Setting persistentCacheLimit to 0 bytes.
[06/04/2025-19:13:31] [I] Set shape of input tensor target to: 16x3x128x128
[06/04/2025-19:13:31] [I] Set shape of input tensor source to: 16x512
[06/04/2025-19:13:31] [I] Created execution context with device memory size: 288.406 MiB
[06/04/2025-19:13:31] [I] Using random values for input target
[06/04/2025-19:13:31] [I] Input binding for target with dimensions 16x3x128x128 is created.
[06/04/2025-19:13:31] [I] Using random values for input source
[06/04/2025-19:13:31] [I] Input binding for source with dimensions 16x512 is created.
[06/04/2025-19:13:31] [I] Output binding for output with dimensions 16x3x128x128 is created.
[06/04/2025-19:13:31] [I] Starting inference
[06/04/2025-19:13:34] [I] Warmup completed 4 queries over 200 ms
[06/04/2025-19:13:34] [I] Timing trace has 54 queries over 3.17112 s
[06/04/2025-19:13:34] [I]
[06/04/2025-19:13:34] [I] === Trace details ===
[06/04/2025-19:13:34] [I] Trace averages of 10 runs:
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.5567 ms - Host latency: 57.8528 ms (enqueue 0.491925 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.5644 ms - Host latency: 57.8616 ms (enqueue 0.457556 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.7458 ms - Host latency: 58.0384 ms (enqueue 0.419116 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.7255 ms - Host latency: 58.0242 ms (enqueue 0.458887 ms)
[06/04/2025-19:13:34] [I] Average on 10 runs - GPU latency: 57.7195 ms - Host latency: 58.0135 ms (enqueue 0.395703 ms)
[06/04/2025-19:13:34] [I]
[06/04/2025-19:13:34] [I] === Performance summary ===
[06/04/2025-19:13:34] [I] Throughput: 17.0287 qps
[06/04/2025-19:13:34] [I] Latency: min = 57.1222 ms, max = 58.554 ms, mean = 57.9498 ms, median = 57.9761 ms, percentile(90%) = 58.3058 ms, percentile(95%) = 58.437 ms, percentile(99%) = 58.554 ms
[06/04/2025-19:13:34] [I] Enqueue Time: min = 0.386963 ms, max = 0.645996 ms, mean = 0.442129 ms, median = 0.415405 ms, percentile(90%) = 0.527618 ms, percentile(95%) = 0.573486 ms, percentile(99%) = 0.645996 ms
[06/04/2025-19:13:34] [I] H2D Latency: min = 0.131836 ms, max = 0.157959 ms, mean = 0.139919 ms, median = 0.13739 ms, percentile(90%) = 0.151489 ms, percentile(95%) = 0.154602 ms, percentile(99%) = 0.157959 ms
[06/04/2025-19:13:34] [I] GPU Compute Time: min = 56.8344 ms, max = 58.262 ms, mean = 57.6553 ms, median = 57.6754 ms, percentile(90%) = 58.0114 ms, percentile(95%) = 58.1472 ms, percentile(99%) = 58.262 ms
[06/04/2025-19:13:34] [I] D2H Latency: min = 0.0932617 ms, max = 0.158691 ms, mean = 0.154602 ms, median = 0.155518 ms, percentile(90%) = 0.157593 ms, percentile(95%) = 0.158081 ms, percentile(99%) = 0.158691 ms
[06/04/2025-19:13:34] [I] Total Host Walltime: 3.17112 s
[06/04/2025-19:13:34] [I] Total GPU Compute Time: 3.11338 s
[06/04/2025-19:13:34] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/04/2025-19:13:34] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --loadEngine=/media/usb/models/reswapper_dynamic.onnx_b16_gpu0_fp32.engine --shapes=target:16x3x128x128,source:16x512
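To put these trtexec numbers in per-face terms: each query in this run is a full batch of 16, so the reported throughput and latency can be converted as in the small sketch below. The struct and function names are made up for illustration and are not part of trtexec or TensorRT.

```cpp
#include <cassert>

// Hypothetical helper for illustration only; not a TensorRT or trtexec API.
struct PerItemStats {
    double items_per_second;  // faces processed per second
    double ms_per_item;       // mean GPU compute time amortized over one face
};

// Convert trtexec's per-query numbers into per-face numbers for a fixed batch size.
PerItemStats perItemStats(double queries_per_second,
                          double mean_gpu_ms_per_query,
                          int batch_size) {
    assert(batch_size > 0);
    return {queries_per_second * batch_size,
            mean_gpu_ms_per_query / batch_size};
}
```

With the values from the log (17.0287 qps, 57.6553 ms mean GPU compute time, batch size 16), this works out to roughly 272 faces per second and about 3.6 ms per face, which is the figure I would compare against a batch-size-1 engine.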
However, when I load this model in DeepStream, it reports the following engine info:
Opening in BLOCKING MODE
Setting min object dimensions as 16x16 instead of 1x1 to support VIC compute mode.
INFO: [FullDims Engine Info]: layers num: 3
0 INPUT kFLOAT target 3x128x128 min: 1x3x128x128 opt: 16x3x128x128 Max: 16x3x128x128
1 INPUT kFLOAT source 512 min: 1x512 opt: 16x512 Max: 16x512
2 OUTPUT kFLOAT output 3x128x128 min: 0 opt: 0 Max: 0
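The profile reported above allows batch sizes from 1 to 16 for both inputs. Since the preprocessing writes one face per batch slot, a guard along the following lines could cap the number of faces sent in one inference call. This is only a sketch: `ProfileRange` and `clampBatch` are hypothetical names, not DeepStream or TensorRT types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical description of the batch dimension of one optimization profile.
struct ProfileRange {
    std::size_t min_batch;
    std::size_t max_batch;
};

// Number of faces that can go into a single inference call.
// Returns 0 when there are fewer faces than the profile's minimum.
std::size_t clampBatch(std::size_t num_faces, const ProfileRange& range) {
    assert(range.min_batch <= range.max_batch);
    if (num_faces < range.min_batch) {
        return 0;  // nothing to run for this call
    }
    return std::min(num_faces, range.max_batch);
}
```

With the engine above, `clampBatch(20, {1, 16})` yields 16, so any remaining faces would have to go into a further batch.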
The main issue is that I cannot verify whether the model receives the face objects as a batch. It should take all face objects from a single frame at once. It may help to know that I am using custom preprocessing for this model. The relevant part of the code is below:
    // Collect all faces from the current frame(s) in this batch
    std::vector<FaceBatchData> current_frame_faces;
    for (l_frame = batch_meta->frame_meta_list; l_frame != nullptr; l_frame = l_frame->next) {
        NvDsFrameMeta *frame_meta = reinterpret_cast<NvDsFrameMeta *>(l_frame->data);
        NvDsMetaList *l_obj = nullptr;
        std::cout << "\n=== Processing Frame " << frame_meta->frame_num << " ===" << std::endl;
        std::cout << "Starting face collection for frame " << frame_meta->frame_num << std::endl;
        for (l_obj = frame_meta->obj_meta_list; l_obj != nullptr; l_obj = l_obj->next) {
            NvDsObjectMeta *obj_meta = reinterpret_cast<NvDsObjectMeta *>(l_obj->data);
            if (!obj_meta) continue;
            std::cout << " Found object_id: " << obj_meta->object_id << std::endl;
            if (obj_meta->base_meta.meta_type == NVDS_OBJ_META && obj_meta->unique_component_id == 1) {
                keypoints.clear();
                guint num_joints = obj_meta->mask_params.size / (sizeof(float) * 2);
                for (guint i = 0; i < num_joints; ++i) {
                    // Floating-point scale; `width / 640` would truncate if width is an integer
                    gfloat xc = obj_meta->mask_params.data[i * 2] * (width / 640.0f);
                    gfloat yc = obj_meta->mask_params.data[i * 2 + 1] * (width / 640.0f);
                    keypoints.push_back(cv::Point2f(xc, yc));
                }
                if (keypoints.size() == 5) {
                    cv::Mat M, warp_mat;
                    std::tie(M, warp_mat) = norm_crop2(rgb_image, keypoints, 128);
                    // Store transformation matrix in object meta
                    for (int i = 0; i < 6; ++i) {
                        obj_meta->misc_obj_info[i] = *reinterpret_cast<const gint64*>(&M.at<double>(i));
                    }
                    warp_mat /= 255.0f;
                    // Create batch data for this frame
                    FaceBatchData face_data;
                    face_data.face = warp_mat.clone();
                    face_data.transform_matrix = M.clone();
                    face_data.obj_meta = obj_meta;
                    face_data.frame_num = frame_meta->frame_num;
                    face_data.object_id = obj_meta->object_id;
                    face_data.batch_index = current_frame_faces.size();
                    current_frame_faces.push_back(face_data);
                    std::cout << " Added face to frame batch. Object ID: " << obj_meta->object_id
                              << ", Frame faces count: " << current_frame_faces.size() << std::endl;
                }
            }
        }
        std::cout << "Finished collecting faces for frame " << frame_meta->frame_num
                  << ". Total faces collected: " << current_frame_faces.size() << std::endl;
    }
    gst_buffer_unmap(inbuf, &in_map_info);
    // Process the batch if we have faces in the current frame
    if (current_frame_faces.size() > 0) {
        std::cout << "\n=== Starting Batch Processing ===" << std::endl;
        std::cout << "Total faces to process in batch: " << current_frame_faces.size() << std::endl;
        // Process all faces from this frame as a batch
        char* base_ptr = reinterpret_cast<char*>(buf->memory_ptr);
        size_t planar_size = 128 * 128 * 3 * sizeof(float);
        size_t frame_batch_size = current_frame_faces.size();
        std::cout << "Creating tensor batch with shape: [" << frame_batch_size << ", 3, 128, 128]" << std::endl;
        std::cout << "Planar size per face: " << planar_size << " bytes" << std::endl;
        for (size_t i = 0; i < frame_batch_size; ++i) {
            FaceBatchData& face_data = current_frame_faces[i];
            std::cout << " Processing face " << i << "/" << frame_batch_size
                      << " (Object ID: " << face_data.object_id
                      << " from frame " << face_data.frame_num << ")" << std::endl;
            // Destination slot i of the batched input tensor (device memory)
            float* pDst = reinterpret_cast<float*>(base_ptr) + i * (128 * 128 * 3);
            float* planar_memory = static_cast<float*>(malloc(planar_size));
            if (!planar_memory) {
                std::cerr << "Error: Failed to allocate planar_memory!" << std::endl;
                continue;
            }
            // Convert to planar format
            for (int j = 0; j < 128 * 128; j++) {
                planar_memory[j] = face_data.face.at<cv::Vec3f>(j)[0];                 // R
                planar_memory[j + 128 * 128] = face_data.face.at<cv::Vec3f>(j)[1];     // G
                planar_memory[j + 2 * 128 * 128] = face_data.face.at<cv::Vec3f>(j)[2]; // B
            }
            cudaError_t err = cudaMemcpy(pDst, planar_memory, planar_size, cudaMemcpyHostToDevice);
            if (err != cudaSuccess) {
                std::cerr << "Error: cudaMemcpy failed! " << cudaGetErrorString(err) << std::endl;
            } else {
                std::cout << " Successfully copied face " << i << " to GPU memory" << std::endl;
            }
            free(planar_memory);
        }
        // Store batch metadata for parser to use
        ctx->current_batch_size = frame_batch_size;
        ctx->batch_object_ids.clear();
        for (const auto& face_data : current_frame_faces) {
            ctx->batch_object_ids.push_back(face_data.object_id);
        }
        std::cout << "Stored " << ctx->batch_object_ids.size() << " object IDs for batch" << std::endl;
        // Update network input shape with actual batch size
        tensorParam.params.network_input_shape[0] = frame_batch_size;
        status = ctx->tensor_impl->syncStream();
        if (status != NVDSPREPROCESS_SUCCESS) {
            std::cerr << "Custom Lib: Cuda Stream Synchronization failed" << std::endl;
            acquirer->release(buf);
            return status;
        }
        std::cout << "Successfully processed batch of " << frame_batch_size << " faces" << std::endl;
        return NVDSPREPROCESS_SUCCESS;
    } else {
        // No faces in this frame
        std::cout << "No faces found in current frame, skipping..." << std::endl;
        acquirer->release(buf);
        return NVDSPREPROCESS_TENSOR_NOT_READY;
    }
}
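The interleaved-to-planar packing in the loop above can be exercised on the host without OpenCV or CUDA, which may help rule out a layout problem. The sketch below assumes each face is a contiguous HWC float buffer (the layout of a continuous `CV_32FC3` `cv::Mat`) and writes all faces into one NCHW buffer. `packFacesNCHW` is a hypothetical name, and the real code would still need to copy the result into the device buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack N interleaved (HWC) float images into one contiguous planar (NCHW) buffer.
// Illustrative only: the function name and signature are assumptions.
std::vector<float> packFacesNCHW(const std::vector<std::vector<float>>& faces_hwc,
                                 std::size_t height,
                                 std::size_t width,
                                 std::size_t channels) {
    const std::size_t plane = height * width;        // elements per channel plane
    const std::size_t per_face = plane * channels;   // elements per face
    std::vector<float> out(faces_hwc.size() * per_face);
    for (std::size_t n = 0; n < faces_hwc.size(); ++n) {
        const std::vector<float>& src = faces_hwc[n];
        assert(src.size() == per_face);
        float* dst = out.data() + n * per_face;      // start of batch slot n
        for (std::size_t p = 0; p < plane; ++p) {
            for (std::size_t c = 0; c < channels; ++c) {
                // Pixel p, channel c moves from interleaved to planar position.
                dst[c * plane + p] = src[p * channels + c];
            }
        }
    }
    return out;
}
```

Packing the whole batch into one host buffer like this would also allow a single `cudaMemcpy` for all faces instead of one copy per face.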
Although the input is batched in preprocessing, inference still runs once per object. I verified this by adding logs to our custom inference function.
Also, there is no performance gain compared with the batch-size-1 model.
Basically, I am facing an implementation issue, since the model works as expected outside of DeepStream. Could you please help me with this? Please let me know if you need any other information.
Thank You!