Description
I tried DLA-based model inference on a Jetson AGX Orin Developer Kit, and it was much slower than non-DLA (GPU-only) inference. I tested both the PointPillars VFE model and the model_bn model from the jetson_dla_tutorial repository.
The steps I followed are shown below.
I also have an additional question: for GPU+DLA model inference, do I need to make any special modifications to the inference code?
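The only DLA-specific step I am aware of on the inference side is selecting the DLA core on the IRuntime before deserializing the engine; I am not sure whether anything else is required, so the snippet below (using the same variables as the inference code further down) is only my assumption:

// Assumption: the DLA core is selected on the runtime before deserialization;
// the core index should match the one the engine was built for (--useDLACore=0).
IRuntime* runtime = createInferRuntime(gLogger);
runtime->setDLACore(0);
ICudaEngine* engine = runtime->deserializeCudaEngine(engineData.data(), engineSize);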
Model conversion:
DLA
/usr/src/tensorrt/bin/trtexec --onnx=model_bn.onnx \
    --shapes=input:8x3x640x640 \
    --saveEngine=model_bn_.engine \
    --exportProfile=model_bn_.json \
    --int8 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun --verbose > model_bn_.log
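GPU (for comparison; this is my reconstruction of the baseline build command without the DLA flags, and the model_bn_gpu.* file names are placeholders):
/usr/src/tensorrt/bin/trtexec --onnx=model_bn.onnx \
    --shapes=input:8x3x640x640 \
    --saveEngine=model_bn_gpu.engine \
    --exportProfile=model_bn_gpu.json \
    --int8 --useSpinWait --separateProfileRun --verbose > model_bn_gpu.log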
Environment
System environment
Device: AGX Orin Developer Kit
JetPack version: 6.0
CUDA: 12.2
TensorRT: 8.6.2
Model inference code
#include <chrono>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>

#include <cuda_runtime_api.h>
#include <NvInfer.h>
#include <NvInferPlugin.h>

using namespace nvinfer1;

// Minimal logger required by the TensorRT runtime
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main(int argc, char** argv)
{
    // Parameters
    const char* enginePath = "/project/ansy_dla/model_bn_.engine"; // engine file path
    const int batchSize = 8;              // batch size
    const int inputSize = 3 * 640 * 640;  // input size per sample
    const int outputSize = 10;            // output size per sample

    // 1. Load the TensorRT engine
    initLibNvInferPlugins(&gLogger, "");
    std::ifstream engineFile(enginePath, std::ios::binary);
    if (!engineFile.good()) {
        std::cerr << "Failed to open engine file: " << enginePath << std::endl;
        return -1;
    }
    engineFile.seekg(0, std::ios::end);
    size_t engineSize = engineFile.tellg();
    engineFile.seekg(0, std::ios::beg);
    std::vector<char> engineData(engineSize);
    engineFile.read(engineData.data(), engineSize);
    engineFile.close();

    // Create the runtime and deserialize the engine
    IRuntime* runtime = createInferRuntime(gLogger);
    ICudaEngine* engine = runtime->deserializeCudaEngine(engineData.data(), engineSize);
    IExecutionContext* context = engine->createExecutionContext();

    // 2. Set up memory bindings
    const int inputIndex = engine->getBindingIndex("input");
    const int outputIndex = engine->getBindingIndex("output");

    // Allocate GPU memory
    void* deviceBuffers[2];
    cudaMalloc(&deviceBuffers[inputIndex], batchSize * inputSize * sizeof(float));
    cudaMalloc(&deviceBuffers[outputIndex], batchSize * outputSize * sizeof(float));

    // 3. Generate random input data
    std::vector<float> hostInput(batchSize * inputSize);
    std::default_random_engine generator;
    std::normal_distribution<float> distribution(0.0f, 1.0f); // normal distribution
    for (auto& v : hostInput) {
        v = distribution(generator);
    }

    // Copy the input data to the GPU
    cudaMemcpy(deviceBuffers[inputIndex], hostInput.data(),
               batchSize * inputSize * sizeof(float), cudaMemcpyHostToDevice);

    // 4. Run inference
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    auto start = std::chrono::high_resolution_clock::now();
    // Asynchronous execution
    context->enqueueV2(deviceBuffers, stream, nullptr);
    cudaStreamSynchronize(stream);
    auto end = std::chrono::high_resolution_clock::now();
    float latency = std::chrono::duration<float, std::milli>(end - start).count();

    // 5. Fetch the output
    std::vector<float> hostOutput(batchSize * outputSize);
    cudaMemcpy(hostOutput.data(), deviceBuffers[outputIndex],
               batchSize * outputSize * sizeof(float), cudaMemcpyDeviceToHost);

    // Print results
    std::cout << "\nTotal inference time: " << latency << " ms" << std::endl;
    std::cout << "Sample output values: ";
    for (int i = 0; i < 10; ++i) {
        std::cout << hostOutput[i] << " ";
    }
    std::cout << std::endl;

    // 6. Clean up
    cudaStreamDestroy(stream);
    cudaFree(deviceBuffers[inputIndex]);
    cudaFree(deviceBuffers[outputIndex]);
    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}
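For reference, a steady-state measurement with warm-up iterations would look like the sketch below (the iteration counts of 10 and 100 are arbitrary, and the variables are the same ones as in the code above):

// Sketch: warm-up runs followed by an averaged timing loop (counts are arbitrary).
for (int i = 0; i < 10; ++i) {                 // warm-up, not timed
    context->enqueueV2(deviceBuffers, stream, nullptr);
}
cudaStreamSynchronize(stream);

const int iterations = 100;
auto t0 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i) {
    context->enqueueV2(deviceBuffers, stream, nullptr);
}
cudaStreamSynchronize(stream);
auto t1 = std::chrono::high_resolution_clock::now();
float avgLatency = std::chrono::duration<float, std::milli>(t1 - t0).count() / iterations;
std::cout << "Average latency over " << iterations << " runs: " << avgLatency << " ms" << std::endl;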