Could you share more info about your test case?
Especially regarding how you separate the tasks across the different devices.
Do you separate them based on data or based on the model?
If possible, could you share the source/script you used for this measurement with us?
Thanks.
(1) A center-point model is used in our perception software. It is deployed with the C++ TensorRT API.
(2) To check the model performance on the AGX Orin (200 TOPS), trtexec is used.
(3) To check the Tensor Core performance, a simple GEMM benchmark is used. Please refer to the following code.
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublasLt.h>

// Utility function to initialize matrix data with random values in [0, 1]
void init_matrix(half *matrix, int rows, int cols) {
    for (int i = 0; i < rows * cols; i++) {
        matrix[i] = __float2half(rand() / float(RAND_MAX));
    }
}

int main() {
    const int M = 4096;
    const int N = 4096;
    const int K = 4096;

    half *A, *B, *C;
    cudaMallocManaged(&A, M * K * sizeof(half));
    cudaMallocManaged(&B, K * N * sizeof(half));
    cudaMallocManaged(&C, M * N * sizeof(half));

    // Initialize matrices A and B
    init_matrix(A, M, K);
    init_matrix(B, K, N);

    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    cublasLtMatmulDesc_t operationDesc;
    cublasLtMatrixLayout_t Adesc, Bdesc, Cdesc;
    cublasLtMatmulPreference_t preference;
    cublasLtMatmulAlgo_t algo;

    // FP16 compute requires an FP16 scale type (alpha/beta)
    cublasLtMatmulDescCreate(&operationDesc, CUBLAS_COMPUTE_16F, CUDA_R_16F);
    cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_16F, M, K, M);
    cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_16F, K, N, K);
    cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16F, M, N, M);
    cublasLtMatmulPreferenceCreate(&preference);

    half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    void *alphaPtr = &alpha, *betaPtr = &beta;

    int returnedResults = 0;
    cublasLtMatmulHeuristicResult_t heuristicResult;
    cublasLtMatmulAlgoGetHeuristic(handle, operationDesc, Adesc, Bdesc, Cdesc, Cdesc,
                                   preference, 1, &heuristicResult, &returnedResults);
    if (returnedResults == 0) {
        std::cout << "No suitable algorithm found." << std::endl;
        return 1;
    }
    algo = heuristicResult.algo;

    // Pre-warming
    cublasLtMatmul(handle, operationDesc, alphaPtr, A, Adesc, B, Bdesc,
                   betaPtr, C, Cdesc, C, Cdesc, &algo, nullptr, 0, 0);

    // Measure performance with multiple repetitions
    int numRepeats = 100;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float milliseconds = 0;

    cudaEventRecord(start);
    for (int i = 0; i < numRepeats; ++i) {
        cublasLtMatmul(handle, operationDesc, alphaPtr, A, Adesc, B, Bdesc,
                       betaPtr, C, Cdesc, C, Cdesc, &algo, nullptr, 0, 0);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&milliseconds, start, stop);

    // Calculate average performance
    double ops = 2.0 * M * N * K * numRepeats;
    double seconds = milliseconds / 1000.0;
    double tops = (ops / seconds) / 1e12; // Convert to tera operations per second
    std::cout << "Average Performance: " << tops << " TOPS" << std::endl;

    // Cleanup
    cublasLtMatmulPreferenceDestroy(preference);
    cublasLtMatrixLayoutDestroy(Adesc);
    cublasLtMatrixLayoutDestroy(Bdesc);
    cublasLtMatrixLayoutDestroy(Cdesc);
    cublasLtMatmulDescDestroy(operationDesc);
    cublasLtDestroy(handle);
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
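Note that the snippet above does not check any return codes, so a failing cublasLtMatmul would still be timed and would report a misleading TOPS figure. Below is a minimal sketch of the status checking we could wrap around the calls; the macro names CHECK_CUDA and CHECK_CUBLAS are our own, not from an NVIDIA sample.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublasLt.h>

// Hypothetical helper macros: abort on the first failing CUDA or cuBLASLt call
// so that a silent failure is not included in the timing loop.
#define CHECK_CUDA(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

#define CHECK_CUBLAS(call)                                                    \
    do {                                                                      \
        cublasStatus_t status_ = (call);                                      \
        if (status_ != CUBLAS_STATUS_SUCCESS) {                               \
            std::fprintf(stderr, "cuBLASLt error %d at %s:%d\n",              \
                         (int)status_, __FILE__, __LINE__);                   \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Usage examples:
//   CHECK_CUBLAS(cublasLtCreate(&handle));
//   CHECK_CUDA(cudaEventSynchronize(stop));

The benchmark links against cuBLASLt (e.g. -lcublasLt when building with nvcc).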
When --runtime=nvidia is used, some GPU and CUDA libraries are mounted from the host system into our container, such as /usr/lib/aarch64-linux-gnu/tegra.
I have found that the libraries inside the containers differ between devices. I therefore copied the library files from a device that runs correctly to the devices showing abnormal AI inference delay, and updated the links to point to those files; after that, the inference delay meets our expectation.