Separate GPU for parallel inference on Jetson AGX Orin

Hi,
I want to ask about an issue related to running models in parallel on the AGX Orin. Does NVIDIA provide an option to partition the GPU so that 2 models can run inference at the same time? Thanks!!!

Hi,

Jetson only has one GPU, but it contains multiple CUDA cores.

We don’t support dividing the CUDA cores between different tasks.
But you can run the models on different CUDA streams to allow parallelism.

Thanks.

Thanks for your response. Please give me an example of using CUDA streams to run inference on 2 models in parallel.

I also use Triton for parallel inference, but I have an issue with the data transfer time. The POST and GET requests to Triton take 0.01 s, which affects the total time a lot.

Hi,

Please find the sample below:

Thanks.

I want to set a specific CUDA stream ID in my C++ YOLO code, like --streams=0 or --streams=1, … in trtexec. How can I do this? Thanks!!!

Hi,

Here is the inference call of TensorRT; you can feed a specific CUDA stream as an input directly:

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-843/api/c_api/classnvinfer1_1_1_i_execution_context.html#a2f4429652736e8ef6e19f433400108c7


bool nvinfer1::IExecutionContext::enqueueV2(void* const* bindings,
                                            cudaStream_t stream,
                                            cudaEvent_t* inputConsumed)
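As a rough usage sketch (assuming `context` is a pointer to an already-created `IExecutionContext` and `bindings` is the array of device buffer pointers for the engine's input/output tensors), you could pass your own stream like this:

```
// Create a dedicated CUDA stream and hand it to TensorRT for this inference.
cudaStream_t stream;
cudaStreamCreate(&stream);

// enqueueV2() only enqueues the work on the given stream;
// it returns before the GPU has finished.
context->enqueueV2(bindings, stream, nullptr);

// Wait for this stream (and only this stream) to finish, then clean up.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```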

In trtexec, you can find the corresponding source code below (under the common folder):

Thanks.

As far as I understand, multiple streams are only used to run multiple batches of data pushed into a single model in parallel, rather than running 2 different models at the same time. Am I right?

Hi,

You can modify it into the 2-model use case.
The workflow should be similar.
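A rough sketch of that 2-model case (assuming `engine1` and `engine2` are already deserialized `ICudaEngine*` objects, and `bindings1`/`bindings2` hold each engine's own device buffers) could look like:

```
// One execution context and one CUDA stream per model, so the two
// inferences can be enqueued independently and overlap on the GPU.
IExecutionContext* ctx1 = engine1->createExecutionContext();
IExecutionContext* ctx2 = engine2->createExecutionContext();

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Enqueue both models without synchronizing in between.
ctx1->enqueueV2(bindings1, s1, nullptr);
ctx2->enqueueV2(bindings2, s2, nullptr);

// Wait for both streams to finish.
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```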

Thanks.

Did you mean creating multiple contexts in TensorRT?
I also created multiple streams like this: a single inference takes 2 ms, but running 10 streams takes 18 ms. It doesn't seem to run in parallel there.
```
void multi_stream(IExecutionContext& context, void** buffers, float* input, float* output, int batchSize, int nStreams) {

    //cudaStream_t stream;
    //cudaStreamCreate(&stream);
    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; i++)
        cudaStreamCreate(&stream[i]);
    //GPU
    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < nStreams; i++) {
        cudaMemcpyAsync(buffers[0], input, kBatchSize * 3 * kInputH * kInputW * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
        context.enqueueV2(buffers, stream[i], nullptr);
        cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < nStreams; ++i)
        cudaStreamSynchronize(stream[i]);
}
```

Hi,

Could you help monitor the system status when running a single engine?

$ sudo tegrastats

If the GPU is already almost fully occupied when running a single engine,
running two engines in parallel will perform about the same as running the two models sequentially.

Thanks.

My Orin has 64 GB; it's impossible for that to be the limit.
```
08-13-2024 15:03:17 RAM 5116/62797MB (lfb 13787x4MB) SWAP 0/31398MB (cached 0MB) CPU [2%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,1%@729,1%@729,0%@729,0%@729,1%@729] EMC_FREQ 0%@204 GR3D_FREQ 0%@0 GR3D2_FREQ 0%@0 VIC_FREQ 115 APE 174 CV0@-256C CPU@42.781C Tboard@32C SOC2@39.593C Tdiode@33.25C SOC0@40.718C CV1@-256C GPU@-256C tj@42.781C SOC1@40.125C CV2@-256C VDD_GPU_SOC 1998mW/1998mW VDD_CPU_CV 799mW/759mW VIN_SYS_5V0 2629mW/2629mW NC 0mW/0mW VDDQ_VDD2_1V8AO 302mW/302mW NC 0mW/0mW
08-13-2024 15:03:18 RAM 5319/62797MB (lfb 13765x4MB) SWAP 0/31398MB (cached 0MB) CPU [4%@1231,2%@1897,0%@2201,0%@729,0%@729,0%@729,0%@729,0%@729,0%@2201,0%@2201,1%@2201,35%@2201] EMC_FREQ 0%@2133 GR3D_FREQ 37%@305 GR3D2_FREQ 37%@305 VIC_FREQ 115 APE 174 CV0@-256C CPU@43.218C Tboard@32C SOC2@39.75C Tdiode@33.25C SOC0@40.718C CV1@-256C GPU@39.812C tj@43.218C SOC1@40C CV2@-256C VDD_GPU_SOC 3597mW/2143mW VDD_CPU_CV 1599mW/835mW VIN_SYS_5V0 3438mW/2702mW NC 0mW/0mW VDDQ_VDD2_1V8AO 605mW/329mW NC 0mW/0mW
08-13-2024 15:03:19 RAM 5119/62797MB (lfb 13787x4MB) SWAP 0/31398MB (cached 0MB) CPU [3%@729,4%@729,0%@729,0%@729,0%@729,5%@729,0%@729,1%@729,0%@729,3%@729,8%@729,24%@729] EMC_FREQ 7%@204 GR3D_FREQ 0%@0 GR3D2_FREQ 0%@0 VIC_FREQ 115 APE 174 CV0@-256C CPU@42.906C Tboard@32C SOC2@39.656C Tdiode@33.25C SOC0@40.687C CV1@-256C GPU@-256C tj@42.906C SOC1@40.093C CV2@-256C VDD_GPU_SOC 3198mW/2231mW VDD_CPU_CV 1199mW/865mW VIN_SYS_5V0 3235mW/2746mW NC 0mW/0mW VDDQ_VDD2_1V8AO 403mW/335mW NC 0mW/0mW
```

Hi,

Could you update the profiling code like below:

for (int i = 0; i < nStreams; i++) {
    cudaMemcpyAsync(buffers[0], input, kBatchSize * 3 * kInputH * kInputW * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
}
cudaEventRecord(start, 0);
for (int i = 0; i < nStreams; i++) {
    context.enqueueV2(buffers, stream[i], nullptr);
}
cudaEventRecord(stop, 0);
for (int i = 0; i < nStreams; i++) {
    cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
}

This will help check whether the issue comes from the GPU cores or the copy engines.
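To actually read the measured time, the `stop` event still has to complete before it is queried; a minimal readout sketch (reusing the `start`/`stop` events from the snippet above) would be:

```
// Block on the host until the stop event has been reached on the GPU,
// then read the elapsed time between the two events in milliseconds.
cudaEventSynchronize(stop);
float enqueueMs = 0.0f;
cudaEventElapsedTime(&enqueueMs, start, stop);
std::cout << "Enqueue/inference time: " << enqueueMs << " ms" << std::endl;
```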

Thanks.

Here is my code

void prepare_buffer(ICudaEngine* engine, float** input_buffer_device, float** output_buffer_device, float** output_buffer_host) {
    const int inputIndex = engine->getBindingIndex(kInputTensorName);
    const int outputIndex = engine->getBindingIndex(kOutputTensorName);
    // Create GPU buffers on device
    CUDA_CHECK(cudaMalloc((void**)input_buffer_device, kBatchSize * 3 * kInputH * kInputW * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void**)output_buffer_device, kBatchSize * kOutputSize * sizeof(float)));
    *output_buffer_host = new float[kBatchSize * kOutputSize];
}

void infer(IExecutionContext& context, cudaStream_t& stream, void** buffers,float *input, float* output, int batchSize) {
  // infer on the batch asynchronously, and DMA output back to host
  CUDA_CHECK(cudaMemcpyAsync(buffers[0], input, kBatchSize * 3 * kInputH * kInputW * sizeof(float), cudaMemcpyHostToDevice, stream));
  context.enqueueV2(buffers, stream, nullptr);
  CUDA_CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream));
  CUDA_CHECK(cudaStreamSynchronize(stream));
}

void multi_stream(IExecutionContext& context, void** buffers,float *input, float* output, int batchSize, int nStreams){
    
    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; i++)
        cudaStreamCreate(&stream[i]);
    //GPU
    cudaEvent_t start, stop;
    float elapsedTime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < nStreams; i++) {
        cudaMemcpyAsync(buffers[0], input, kBatchSize * 3 * kInputH * kInputW * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
        context.enqueueV2(buffers, stream[i], nullptr);
        cudaMemcpyAsync(output, buffers[1], batchSize * kOutputSize * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
    }

    for (int i = 0; i < nStreams; ++i)
        cudaStreamSynchronize(stream[i]);

    cudaDeviceSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    std::cout << "Whole process took " << elapsedTime << "ms." << std::endl;
    // Release stream and buffers
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(stream[i]);
    
}

Hi,

Could you update the profiling code as above?

Currently, the profiling includes both the copies and the inference.
We can check further once we know whether the bottleneck is in the copy engine or the CUDA cores.

Thanks.

Here is the nsys profile output.

Here is the 1-stream inference:

[08/15/2024-16:05:01] [I] Starting inference
[08/15/2024-16:05:04] [I] Warmup completed 6 queries over 200 ms
[08/15/2024-16:05:04] [I] Timing trace has 122 queries over 3.05411 s
[08/15/2024-16:05:04] [I] 
[08/15/2024-16:05:04] [I] === Trace details ===
[08/15/2024-16:05:04] [I] Trace averages of 10 runs:
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 24.9667 ms - Host latency: 25.1563 ms (enqueue 24.8619 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 24.9343 ms - Host latency: 25.1093 ms (enqueue 24.8491 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 24.9179 ms - Host latency: 25.0934 ms (enqueue 24.8325 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 24.9165 ms - Host latency: 25.0878 ms (enqueue 24.8371 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 24.9302 ms - Host latency: 25.107 ms (enqueue 24.8424 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 25.0102 ms - Host latency: 25.1831 ms (enqueue 24.9284 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 24.9297 ms - Host latency: 25.1054 ms (enqueue 24.8451 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 25.1203 ms - Host latency: 25.2986 ms (enqueue 25.001 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 25.1213 ms - Host latency: 25.3045 ms (enqueue 25.0745 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 25.0949 ms - Host latency: 25.2796 ms (enqueue 25.1322 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 25.217 ms - Host latency: 25.3957 ms (enqueue 25.1214 ms)
[08/15/2024-16:05:04] [I] Average on 10 runs - GPU latency: 25.1207 ms - Host latency: 25.3087 ms (enqueue 25.0584 ms)
[08/15/2024-16:05:04] [I] 
[08/15/2024-16:05:04] [I] === Performance summary ===
[08/15/2024-16:05:04] [I] Throughput: 39.9462 qps
[08/15/2024-16:05:04] [I] Latency: min = 24.8834 ms, max = 25.7925 ms, mean = 25.1997 ms, median = 25.1017 ms, percentile(90%) = 25.5776 ms, percentile(95%) = 25.6309 ms, percentile(99%) = 25.7078 ms
[08/15/2024-16:05:04] [I] Enqueue Time: min = 24.5378 ms, max = 25.6399 ms, mean = 24.9516 ms, median = 24.8738 ms, percentile(90%) = 25.3293 ms, percentile(95%) = 25.4058 ms, percentile(99%) = 25.5352 ms
[08/15/2024-16:05:04] [I] H2D Latency: min = 0.101562 ms, max = 0.187653 ms, mean = 0.127065 ms, median = 0.123352 ms, percentile(90%) = 0.143311 ms, percentile(95%) = 0.155029 ms, percentile(99%) = 0.168549 ms
[08/15/2024-16:05:04] [I] GPU Compute Time: min = 24.7134 ms, max = 25.5967 ms, mean = 25.0206 ms, median = 24.9184 ms, percentile(90%) = 25.4131 ms, percentile(95%) = 25.446 ms, percentile(99%) = 25.543 ms
[08/15/2024-16:05:04] [I] D2H Latency: min = 0.0339355 ms, max = 0.053772 ms, mean = 0.0520365 ms, median = 0.0522461 ms, percentile(90%) = 0.0529785 ms, percentile(95%) = 0.0532227 ms, percentile(99%) = 0.0534668 ms
[08/15/2024-16:05:04] [I] Total Host Walltime: 3.05411 s
[08/15/2024-16:05:04] [I] Total GPU Compute Time: 3.05251 s
[08/15/2024-16:05:04] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[08/15/2024-16:05:04] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[08/15/2024-16:05:04] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/15/2024-16:05:04] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model.plan --streams=1

And the 10-stream inference:

[08/15/2024-16:05:57] [I] Starting inference
[08/15/2024-16:06:00] [I] Warmup completed 2 queries over 200 ms
[08/15/2024-16:06:00] [I] Timing trace has 128 queries over 3.33013 s
[08/15/2024-16:06:00] [I] 
[08/15/2024-16:06:00] [I] === Trace details ===
[08/15/2024-16:06:00] [I] Trace averages of 10 runs:
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 47.9082 ms - Host latency: 48.2507 ms (enqueue 40.1021 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 29.2733 ms - Host latency: 29.4871 ms (enqueue 24.6858 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 30.4185 ms - Host latency: 30.6263 ms (enqueue 24.7083 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 29.7797 ms - Host latency: 29.994 ms (enqueue 24.6185 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 31.0525 ms - Host latency: 31.2637 ms (enqueue 24.8122 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 29.7574 ms - Host latency: 29.9661 ms (enqueue 24.6791 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 31.3309 ms - Host latency: 31.5384 ms (enqueue 24.7597 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 31.0658 ms - Host latency: 31.2716 ms (enqueue 24.7665 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 31.5678 ms - Host latency: 31.7745 ms (enqueue 24.8989 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 30.9529 ms - Host latency: 31.1582 ms (enqueue 24.7578 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 31.2043 ms - Host latency: 31.4019 ms (enqueue 24.8225 ms)
[08/15/2024-16:06:00] [I] Average on 10 runs - GPU latency: 30.3152 ms - Host latency: 30.5136 ms (enqueue 24.6637 ms)
[08/15/2024-16:06:00] [I] 
[08/15/2024-16:06:00] [I] === Performance summary ===
[08/15/2024-16:06:00] [I] Throughput: 38.437 qps
[08/15/2024-16:06:00] [I] Latency: min = 26.51 ms, max = 127.413 ms, mean = 32.222 ms, median = 31.3437 ms, percentile(90%) = 32.677 ms, percentile(95%) = 33.3088 ms, percentile(99%) = 94.4686 ms
[08/15/2024-16:06:00] [I] Enqueue Time: min = 24.2891 ms, max = 100.153 ms, mean = 25.9526 ms, median = 24.757 ms, percentile(90%) = 25.0737 ms, percentile(95%) = 25.3054 ms, percentile(99%) = 84.879 ms
[08/15/2024-16:06:00] [I] H2D Latency: min = 0.102539 ms, max = 0.571503 ms, mean = 0.159011 ms, median = 0.151367 ms, percentile(90%) = 0.170959 ms, percentile(95%) = 0.18335 ms, percentile(99%) = 0.565536 ms
[08/15/2024-16:06:00] [I] GPU Compute Time: min = 26.3247 ms, max = 126.639 ms, mean = 32.0049 ms, median = 31.1168 ms, percentile(90%) = 32.4714 ms, percentile(95%) = 33.1497 ms, percentile(99%) = 93.8123 ms
[08/15/2024-16:06:00] [I] D2H Latency: min = 0.0339355 ms, max = 0.209015 ms, mean = 0.0581434 ms, median = 0.0563965 ms, percentile(90%) = 0.0610352 ms, percentile(95%) = 0.0632324 ms, percentile(99%) = 0.0848083 ms
[08/15/2024-16:06:00] [I] Total Host Walltime: 3.33013 s
[08/15/2024-16:06:00] [I] Total GPU Compute Time: 4.09662 s
[08/15/2024-16:06:00] [W] * GPU compute time is unstable, with coefficient of variance = 32.2288%.
[08/15/2024-16:06:00] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[08/15/2024-16:06:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/15/2024-16:06:00] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model.plan --streams=10

Here is the nsys profile command:

nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --cuda-graph-trace=node --gpu-metrics-frequency=200000 --gpu-metrics-device=all --cudabacktrace=all --cuda-memory-usage=true  /usr/src/tensorrt/bin/trtexec --loadEngine=model.plan --fp16 --streams=10

Here is its output.

Hi,

The single-stream inference takes around 25 ms to finish, while the 10-stream inference takes 29-31 ms.

This means most of the inferences are done in parallel.

Thanks.
