Abnormal AI model inference latency

We tested the model on AGX Orin 200 TOPS devices (about 20 units). The model runs at an average of 10 fps, the inference latency is 8 ± 0.5 ms, and the GPU power draw is stable at 12.3 ± 0.2 W.


The runtime statistics are shown in the figure above.

However, on 2 of the devices the GPU power draw is lower than expected and the inference latency fluctuates heavily, varying between 8 and 140 ms, with the GPU power fluctuating between 10 and 11 W.

On the devices with abnormal latency, the following has already been confirmed (a small device-query sketch follows this list):
(1) jetson_clocks has been applied with the maximum power mode.
(2) A GEMM benchmark was built; the Tensor Cores on both devices reach 22.5 TFLOPS of FP16 compute.
(3) Under a stress (burn-in) test, the SOM power reaches 60 W, of which the GPU SoC accounts for 30+ W.
(4) 7 TPCs are confirmed to be enabled.
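For completeness, here is a minimal sketch (not part of the original post) that prints the static GPU properties reported by the CUDA runtime, so the normal and abnormal devices can be compared side by side. Note that these are the peak values reported by the runtime, not the live clocks.

#include <cstdio>
#include <cuda_runtime.h>

// Dump static device properties so two Orin units can be diffed.
int main() {
    cudaDeviceProp prop{};
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device           : %s\n", prop.name);
    std::printf("SM count         : %d\n", prop.multiProcessorCount);
    std::printf("Peak SM clock    : %d kHz\n", prop.clockRate);
    std::printf("Memory clock     : %d kHz\n", prop.memoryClockRate);
    std::printf("Memory bus width : %d bit\n", prop.memoryBusWidth);
    return 0;
}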

Question:
The theoretical compute of the model is the same on all devices, yet the inference latency fluctuates widely. Why does this difference exist, what causes it, and how can it be resolved?

Additional information:
(1) By printing timestamps, the jitter was localized to the executeV2 call: the inference time is expected to be 8 ± 0.5 ms, but the actual range fluctuates between 7.5 and 150 ms (see the timing sketch after this list).
(2) On the device with abnormal latency, the model was profiled with trtexec; three runs gave 99th-percentile latency ranges of [7.5, 16] ms, [7.5, 8.4] ms, and [7.4, 8.4] ms.
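For reference, the per-call timing was gathered along these lines (a minimal sketch only; the engine/context/bindings setup is assumed to exist elsewhere, and the helper function name is illustrative):

#include <chrono>
#include <cstdio>
#include <NvInfer.h>

// Time each synchronous executeV2() call; `context` and `bindings`
// are assumed to be created by the surrounding TensorRT application.
void profileExecuteV2(nvinfer1::IExecutionContext* context,
                      void* const* bindings, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        bool ok = context->executeV2(bindings);   // blocking inference call
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("iter %d: %s, %.3f ms\n", i, ok ? "ok" : "failed", ms);
    }
}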

Hi,

Could you share more info about your test case?
Especially regarding how you separate tasks from the different devices.
Do you separate it based on data or based on model?

If possible, could you share the source/script you used for this measurement with us?
Thanks.

(1) A center-point model is used in our perception software. It is deployed with the C++ TensorRT APIs.
(2) To check the model performance on AGX Orin 200 TOPS, trtexec is used.
(3) To check the Tensor Core performance, a simple GEMM benchmark is used. Please refer to the following code.

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublasLt.h>
#include <cuda_fp16.h>

// Utility function to initialize matrix data
// Utility function to initialize matrix data with random FP16 values
void init_matrix(half *matrix, int rows, int cols) {
    for (int i = 0; i < rows * cols; i++) {
        matrix[i] = __float2half(rand() / float(RAND_MAX));
    }
}

int main() {
    const int M = 4096;
    const int N = 4096;
    const int K = 4096;

    half *A, *B, *C;
    cudaMallocManaged(&A, M * K * sizeof(half));
    cudaMallocManaged(&B, K * N * sizeof(half));
    cudaMallocManaged(&C, M * N * sizeof(half));

    // Initialize matrices A and B
    init_matrix(A, M, K);
    init_matrix(B, K, N);

    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    cublasLtMatmulDesc_t operationDesc;
    cublasLtMatrixLayout_t Adesc, Bdesc, Cdesc;
    cublasLtMatmulPreference_t preference;
    cublasLtMatmulAlgo_t algo;

    // FP16 compute requires an FP16 scale type (and half-precision alpha/beta)
    cublasLtMatmulDescCreate(&operationDesc, CUBLAS_COMPUTE_16F, CUDA_R_16F);
    cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_16F, M, K, M);
    cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_16F, K, N, K);
    cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16F, M, N, M);
    cublasLtMatmulPreferenceCreate(&preference);

    half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    void *alphaPtr = &alpha, *betaPtr = &beta;

    int returnedResults = 0;
    cublasLtMatmulHeuristicResult_t heuristicResult;
    cublasLtMatmulAlgoGetHeuristic(handle, operationDesc, Adesc, Bdesc, Cdesc, Cdesc,
                                   preference, 1, &heuristicResult, &returnedResults);

    if (returnedResults == 0) {
        std::cout << "No suitable algorithm found." << std::endl;
        return 1;
    }

    algo = heuristicResult.algo;

    // Pre-warming
    cublasLtMatmul(handle, operationDesc, alphaPtr, A, Adesc, B, Bdesc,
                   betaPtr, C, Cdesc, C, Cdesc, &algo, nullptr, 0, 0);

    // Measure performance with multiple repetitions
    int numRepeats = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float milliseconds = 0;

    cudaEventRecord(start);
    for (int i = 0; i < numRepeats; ++i) {
        cublasLtMatmul(handle, operationDesc, alphaPtr, A, Adesc, B, Bdesc,
                       betaPtr, C, Cdesc, C, Cdesc, &algo, nullptr, 0, 0);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&milliseconds, start, stop);

    // Calculate average performance
    double ops = 2.0 * M * N * K * numRepeats;
    double seconds = milliseconds / 1000.0;
    double tops = (ops / seconds) / 1e12; // Convert to tera operations per second

    std::cout << "Average Performance: " << tops << " TOPS" << std::endl;

    // Cleanup
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasLtMatmulPreferenceDestroy(preference);
    cublasLtMatrixLayoutDestroy(Adesc);
    cublasLtMatrixLayoutDestroy(Bdesc);
    cublasLtMatrixLayoutDestroy(Cdesc);
    cublasLtMatmulDescDestroy(operationDesc);
    cublasLtDestroy(handle);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);

    return 0;
}
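For reference, this benchmark should build with nvcc when linked against cublasLt, e.g. something along the lines of nvcc gemm_fp16.cu -o gemm_fp16 -lcublasLt (the file name here is only illustrative).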

I have a new discovery:

When --runtime=nvidia is used, some GPU and CUDA libraries are copied from the host system into our container, such as those under /usr/lib/aarch64-linux-gnu/tegra.

I found that these libraries inside the containers differ between devices. I copied the files from a device that runs correctly to the devices with the abnormal inference latency and updated the links to point to those files; after that, the inference latency meets our expectation.

It looks like a dependency error.
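As a quick cross-check (a minimal sketch, not from the original thread), the CUDA driver and runtime versions visible inside each container can be printed and compared between a normal device and an abnormal one:

#include <cstdio>
#include <cuda_runtime.h>

// Print the CUDA driver/runtime versions seen inside the container.
// Differing values across devices would indicate mismatched libraries.
int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    std::printf("CUDA driver version : %d\n", driverVersion);
    std::printf("CUDA runtime version: %d\n", runtimeVersion);
    return 0;
}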

Hi,

Generally, we expect the BSP version of the container and the Jetson native system to be identical, since a driver mismatch might cause unexpected issues.

Did copying the drivers under the /usr/lib/aarch64-linux-gnu/tegra folder fix the issue?

Thanks.
