Orin CUDA graph launch latency is too high

When a captured CUDA graph contains 700+ kernels, launching the graph takes about 1 ms. Is this delay normal? Does it increase with the number of kernels? Is it related to the amount of input and output data?

Please run the commands below before benchmarking your use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

I tried it, but the latency is still about 1 ms.
CPU freq: 2.2 GHz
GPU freq: 1.3 GHz
Do you have any latency data I can use as a reference?

We don’t have reference data, since we are not sure how you run the test.
Please provide more details of your application or code, and then we can try to run the experiment.
Thanks

Reference code:

#include <iostream>
#include <cuda_runtime.h>

#define N 1024
#define NODE_NUM 350
#define RUN_NUM 20

__global__ void gemm_kernel(float *A, float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < n && col < n) {
        float value = 0;
        for (int k = 0; k < n; ++k) {
            value += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = value;
    }
}

void initialize_matrices(float *A, float *B, float *C, int n) {
    for (int i = 0; i < n * n; ++i) {
        A[i] = static_cast<float>(i % n);
        B[i] = static_cast<float>(i % n);
        C[i] = 0.0f;
    }
}

int main() {
    float *A, *B, *C;
    float *d_A, *d_B, *d_C;

    A = (float *)malloc(N * N * sizeof(float));
    B = (float *)malloc(N * N * sizeof(float));
    C = (float *)malloc(N * N * sizeof(float));

    initialize_matrices(A, B, C, N);

    cudaMalloc(&d_A, N * N * sizeof(float));
    cudaMalloc(&d_B, N * N * sizeof(float));
    cudaMalloc(&d_C, N * N * sizeof(float));

    cudaMemcpy(d_A, A, N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(32, 32);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // warm up
    for(int32_t warm_up_idx = 0; warm_up_idx < 50; warm_up_idx++) {
      cudaMemcpy(d_A, A, N * N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, B, N * N * sizeof(float), cudaMemcpyHostToDevice);
      for(int32_t node_idx = 0; node_idx < NODE_NUM; node_idx++) {
        gemm_kernel<<<grid, block, 0, stream>>>(d_A, d_B, d_C, N);
        gemm_kernel<<<grid, block, 0, stream>>>(d_C, d_B, d_A, N);
      }
      gemm_kernel<<<grid, block, 0, stream>>>(d_A, d_B, d_C, N);
    }

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Capture the same launch sequence as the warm-up into a CUDA graph
    // (2 * NODE_NUM + 1 = 701 kernel nodes). cudaStreamEndCapture allocates
    // the graph, so no explicit cudaGraphCreate() is needed beforehand.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    for(int32_t node_idx = 0; node_idx < NODE_NUM; node_idx++) {
      gemm_kernel<<<grid, block, 0, stream>>>(d_A, d_B, d_C, N);
      gemm_kernel<<<grid, block, 0, stream>>>(d_C, d_B, d_A, N);
    }
    gemm_kernel<<<grid, block, 0, stream>>>(d_A, d_B, d_C, N);

    cudaStreamEndCapture(stream, &graph);

    // Instantiate an executable graph; one cudaGraphLaunch() then replays all captured kernels.
    cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);

    for(int32_t run_idx = 0; run_idx < RUN_NUM; run_idx++) {
      cudaMemcpy(d_A, A, N * N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, B, N * N * sizeof(float), cudaMemcpyHostToDevice);
      cudaGraphLaunch(instance, stream);
      cudaStreamSynchronize(stream);
    }

    cudaMemcpy(C, d_C, N * N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaGraphExecDestroy(instance);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(A);
    free(B);
    free(C);

    std::cout << "GEMM completed." << std::endl;

    return 0;
}
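
The calls above are not error-checked, so a failed capture or launch would go unnoticed. A minimal helper macro (a generic sketch, not part of the original test) can be wrapped around each CUDA call while experimenting:

#define CHECK_CUDA(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::cerr << "CUDA error: " << cudaGetErrorString(err_)         \
                << " at line " << __LINE__ << std::endl;              \
    }                                                                 \
  } while (0)

// Usage example: CHECK_CUDA(cudaGraphLaunch(instance, stream));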

Compile command (the file contains CUDA kernels, so it has to be compiled as CUDA source, e.g. with -x cu or by renaming it to .cu):
nvcc -x cu cuda_graph_test.cpp -o cuda_graph_test

Then I used nsys to profile it.

Hi,

Thanks for sharing the source. We will test it and provide more info to you.

Have you also tested the sample on any desktop GPUs?
If yes, could you share the latency you measured with us as well?

Thanks.

I only tested on Orin.
Recently I found that profiling with nsys causes the CUDA graph launch time to increase. But even without nsys, the launch still takes tens of microseconds. Is this normal? The measurement code is as follows:

    auto start = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> duration = start - start;  // zero-length accumulator
    std::cout << "run time: " << duration.count() << " ms" << std::endl;
    for(int32_t run_idx = 0; run_idx < RUN_NUM; run_idx++) {
      cudaMemcpy(d_A, A, N * N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(d_B, B, N * N * sizeof(float), cudaMemcpyHostToDevice);
      start = std::chrono::high_resolution_clock::now();
      cudaGraphLaunch(instance, stream);                   // only the CPU-side launch call is timed
      auto end = std::chrono::high_resolution_clock::now();
      duration += (end - start);
      cudaStreamSynchronize(stream);
    }
    std::cout << "run time: " << duration.count() / RUN_NUM << " ms" << std::endl;

Hi

Could you move the memcpy and synchronize outside of the loop?
That way we can check whether the latency comes from the 700+ kernel CUDA graph launch itself.
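
For example, something along these lines (a sketch reusing the variables from your snippet, with only cudaGraphLaunch inside the timed loop):

    cudaMemcpy(d_A, A, N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, N * N * sizeof(float), cudaMemcpyHostToDevice);
    auto start = std::chrono::high_resolution_clock::now();
    for(int32_t run_idx = 0; run_idx < RUN_NUM; run_idx++) {
      cudaGraphLaunch(instance, stream);   // back-to-back graph launches, no memcpy or sync in between
    }
    auto end = std::chrono::high_resolution_clock::now();
    cudaStreamSynchronize(stream);         // wait once, after all launches have been queued
    std::chrono::duration<double, std::milli> launch_time = end - start;
    std::cout << "avg launch: " << launch_time.count() / RUN_NUM << " ms" << std::endl;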

Thanks.

Do you have any reference numbers for CUDA graph launch latency on Orin?

The latency from my local test is on the order of tens of microseconds (about 20 us). Is that normal?

Is this still an issue that needs support? Is there any result that can be shared?

I just want to ask whether you can provide a reference value for the launch latency of a CUDA graph. What is the approximate order of magnitude? Is it similar across chips?

Hi,

Sorry, we don’t have such data for Jetson.
But you can find the values measured on a desktop Ampere GPU below:

Thanks.

Thank you. Is the code I posted above for measuring the CUDA graph launch time OK?

Hi,

We recommend using our profiler (e.g., Nsight Systems) instead.
The profiler collects its data from the GPU trace and can show more low-level, detailed information.

Thanks.

At first I used nsys, but I found that the result depends on the nsys configuration. Even when the per-kernel details of the CUDA graph were not displayed, the graph launch time was still 100+ us (for the 700+ kernel graph). Do you have any recommended nsys configuration?

Hi,

Is the 100 us for the 700+ kernel graph or for just one kernel?
Based on the post above, the first CPU launch latency is around 67 us for a 4x len=100 CUDA graph.

Thanks.

The 100 us is for one CUDA graph launch (the graph contains 700+ kernels).
What does "4x len=100 cuda graph" mean?

Hi,

It means 4 parallel chains, each with a topology length of 100.
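
That is, the reference graph consists of 4 independent kernel chains, each 100 kernels long, launched as a single graph. As a rough illustration only (dummy_kernel is a placeholder, not the kernel used in the reference measurement), such a topology can be captured with fork/join streams:

__global__ void dummy_kernel() {}  // placeholder kernel, stands in for the real workload

void capture_parallel_chains(cudaGraph_t *graph_out) {
    const int kChains = 4;
    const int kChainLen = 100;

    cudaStream_t main_stream, side[kChains];
    cudaEvent_t fork, join[kChains];
    cudaStreamCreate(&main_stream);
    cudaEventCreate(&fork);
    for (int c = 0; c < kChains; ++c) {
        cudaStreamCreate(&side[c]);
        cudaEventCreate(&join[c]);
    }

    cudaStreamBeginCapture(main_stream, cudaStreamCaptureModeGlobal);
    cudaEventRecord(fork, main_stream);
    for (int c = 0; c < kChains; ++c) {
        cudaStreamWaitEvent(side[c], fork, 0);        // fork: each chain starts after the root node
        for (int i = 0; i < kChainLen; ++i) {
            dummy_kernel<<<1, 1, 0, side[c]>>>();     // 100 dependent kernels per chain
        }
        cudaEventRecord(join[c], side[c]);
        cudaStreamWaitEvent(main_stream, join[c], 0); // join: the main stream waits for every chain
    }
    cudaStreamEndCapture(main_stream, graph_out);     // 4 x 100 = 400 kernel nodes in one graph
}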

Thanks.
