TL;DR:
Can we run the same model in the same process simultaneously on both DLAs? The model partially falls back to GPU.
I have a model that partially runs on DLA (and falls back to GPU for the rest). To make it run faster, I try to group two inputs into a pair, enqueue one on each DLA core, then sync and read the results.
I noticed something weird:
Here is the code:
Initialize:
ICudaEngine* engines[2];
for (int i = 0; i < 2; i++) {
    // Read the serialized DLA engine (built with GPU fallback enabled).
    IRuntime* runtime = createInferRuntime(gLogger);
    runtime->setDLACore(i);
    engines[i] = runtime->deserializeCudaEngine(src.get_address(), src.get_size());
}
cudaStream_t stream[2];
IExecutionContext* contexts[2];
for (int i = 0; i < 2; i++) {
    // allocate input_buffer[i]
    ...
    // allocate out_buffer[i]
    ...
    contexts[i] = engines[i]->createExecutionContext();
    cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);
}
Run:
for (int i = 0; i < 2; i++) {
    void* buffers[2];
    buffers[inputIndex] = input_buffer[i]->gpu_buffer;
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    // cudaStreamSynchronize(stream[i]);
}
for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(stream[i]);
}
// read the results.
...
With the cudaStreamSynchronize inside the loop left commented out, the total processing time is well under 2x a single run, but the results are simply wrong and unstable (the same input produces different outputs).
Only when I put the sync immediately after each enqueueV2 (which serializes the two inference calls, as in the sketch below) are the results correct, but then the processing time equals 2x a single run, which makes sense since the runs are effectively serialized.
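For reference, the serialized variant that gives correct results is just the same Run loop as above with the per-enqueue sync uncommented:

for (int i = 0; i < 2; i++) {
    void* buffers[2];
    buffers[inputIndex] = input_buffer[i]->gpu_buffer;
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    // Blocking here forces inference on DLA core 0 to finish before core 1 starts.
    cudaStreamSynchronize(stream[i]);
}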
Am I doing something wrong here? Is it possible that the GPU-fallback parts of the two engines share some context and overwrite each other's results?