Use both DLAs with NvInfer at the same time in the same process

TL;DR:
Can we run the same model in the same process simultaneously on both DLAs? The model partially falls back to GPU.

I have a model that partially runs on DLA (and falls back to GPU for the rest). To make it run faster, I try to group two inputs into a pair, enqueue one on each DLA, then sync and read the results.

I noticed something weird.
Here is the code:

Initialize:

ICudaEngine* engines[2];
for (int i = 0; i < 2; i++) {
    // read a serialized DLA engine (with GPU fallback enabled).
    IRuntime *runtime = createInferRuntime(gLogger);
    runtime->setDLACore(i);
    engines[i] = runtime->deserializeCudaEngine(src.get_address(), src.get_size());
}

cudaStream_t stream[2];
IExecutionContext* contexts[2];
for (int i = 0; i < 2; i++) {
    // allocate input_buffer[i]
    ...
    // allocate out_buffer[i]
    ...
    contexts[i] = engines[i]->createExecutionContext();
    cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);
}

Run:

for (int i = 0; i < 2; i++) {
    void *buffers[2];
    buffers[inputIndex] = input_buffer[i]->gpu_buffer;
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    // cudaStreamSynchronize(stream[i]);
}

for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(stream[i]);
}
// read the result.
...

Without the commented-out cudaStreamSynchronize, the processing time is significantly less than 2x a single run, but the result is simply wrong and unstable (same input → different output).
Only if I sync immediately after the enqueueV2 (thus serializing the two inference calls) is the result correct, but then the processing time equals 2x a single run (which makes sense, since the runs are actually serialized).
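For reference, here is a minimal timing sketch of the two variants being compared (my own illustration, not from the original post; "serialized" is just an illustrative flag, and the context/stream/buffer names follow the code above):

#include <chrono>
#include <iostream>

auto t0 = std::chrono::steady_clock::now();
for (int i = 0; i < 2; i++) {
    void *buffers[2];
    buffers[inputIndex]  = input_buffer[i]->gpu_buffer;
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    if (serialized)                        // variant A: sync per enqueue -> correct, but ~2x single-run time
        cudaStreamSynchronize(stream[i]);
}
if (!serialized) {                         // variant B: defer the syncs -> overlapped, < 2x, but unstable results here
    for (int i = 0; i < 2; i++)
        cudaStreamSynchronize(stream[i]);
}
double ms = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
std::cout << (serialized ? "serialized: " : "overlapped: ") << ms << " ms" << std::endl;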

Am I doing something wrong here? Is it possible that the GPU side is sharing some context and the two runs are overwriting each other?

Hi,

You will need to create two separate TensorRT engines, and run each one on its own DLA.
Thanks.

Hey @AastaLLL, thanks for your reply!

As you can see, the engines are deserialized as two separate engines here. Isn't that enough? What exactly do you mean by creating two separate engines?

I also tried building two engines from the network I parsed with nvonnxparser, each targeting one DLA, but I still see the same issue.

Hi,

Could you share a complete sample, including the model, with us?
We want to reproduce this internally first.

To run multi-stream inference, there is an implementation in our trtexec sample.
It can give you some idea:

/usr/src/tensorrt/samples/trtexec

Thanks.

Yes, how do I send the code?

The model is a standard VGG-19 model we use for testing.

All I do is:

  1. dump the model

    import torch
    model = torch.hub.load('pytorch/vision:v0.6.0', 'vgg19', pretrained=True)
    dummy_input = torch.randn(1, 3, 224, 224).cuda()
    model.eval()
    input_names = ["data"]
    output_names = ["prob"]
    torch.onnx.export(model.cuda(), dummy_input, "vgg19_cuda.onnx", verbose=True, input_names=input_names, output_names=output_names)

  2. import the model with the NvInfer API:

    IBuilder *builder = createInferBuilder(gLogger);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition *network = builder->createNetworkV2(explicitBatch);
    nvonnxparser::IParser *parser = nvonnxparser::createParser(*network, gLogger);

    const char *model_path = "vgg19_cuda.onnx";
    parser->parseFromFile(model_path, 0);

    for (int i = 0; i < 2; i++) {
        IBuilderConfig *cfg = builder->createBuilderConfig();
        cfg->setMaxWorkspaceSize(1 << 20);
        cfg->setFlag(BuilderFlag::kGPU_FALLBACK);
        cfg->setFlag(BuilderFlag::kFP16);
        cfg->setDefaultDeviceType(DeviceType::kDLA);
        cfg->setDLACore(i);

        ICudaEngine *engine = builder->buildEngineWithConfig(*network, *cfg);
        cfg->destroy();
        engines.push_back(engine);
        std::cout << "Engine built!" << std::endl;
    }

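(Not part of the original post, but in case it helps to connect this to the deserialization code in the first post: each built engine can be serialized to its own plan file roughly like this; the file names are illustrative.)

    // Rough sketch (my addition): serialize each engine so it can later be loaded
    // with IRuntime::deserializeCudaEngine after runtime->setDLACore(i), as in the
    // initialization code at the top of the thread. Needs <fstream> and <string>.
    for (int i = 0; i < 2; i++) {
        IHostMemory *plan = engines[i]->serialize();
        std::ofstream out("vgg19_dla" + std::to_string(i) + ".plan", std::ios::binary);
        out.write(static_cast<const char *>(plan->data()), plan->size());
        plan->destroy();
    }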
Then I use it as described in the post above:
convert an image into the 3x224x224 format, run it through the network, and print the result at position [688].
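(A rough sketch of that step, under the assumption that the image is already preprocessed into a 3x224x224 float buffer and the output is the usual 1000-class vector; the host-side vectors are mine, while the device buffer, context, and stream names follow the earlier snippets.)

    // Rough sketch (my addition): run one preprocessed image through a single
    // engine/context and print the score at index 688.
    std::vector<float> image(3 * 224 * 224);   // filled with the preprocessed image
    std::vector<float> prob(1000);
    void *bindings[2] = {input_buffer[0]->gpu_buffer, out_buffer[0]->gpu_buffer};  // assumes binding 0 = input, 1 = output
    cudaMemcpyAsync(input_buffer[0]->gpu_buffer, image.data(),
                    image.size() * sizeof(float), cudaMemcpyHostToDevice, stream[0]);
    contexts[0]->enqueueV2(bindings, stream[0], nullptr);
    cudaMemcpyAsync(prob.data(), out_buffer[0]->gpu_buffer,
                    prob.size() * sizeof(float), cudaMemcpyDeviceToHost, stream[0]);
    cudaStreamSynchronize(stream[0]);
    std::cout << "prob[688] = " << prob[688] << std::endl;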

We observed different results when running with one engine vs. running with two at the same time.

Can we get a direct contact? I can build an example project that demonstrates this issue.

Update: I think you're right; I believe it was how I set up the memory that caused the problem. A simple version I prepared as an example actually works well. I changed how memory is managed in my code and it works now, with both DLAs running at the same time, and it is both faster and accurate.

Thanks again.

TL;DR: running on both DLAs at the same time works.
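(The fixed code is not shown above; purely as an illustration of the kind of per-engine buffer separation that avoids two contexts writing into the same device memory, here is a rough sketch. Sizes assume the VGG-19 shapes above, and the binding order, 0 = input and 1 = output, is an assumption.)

    // Illustration only, not the original poster's actual fix: each DLA engine gets
    // its own device buffers, its own stream, and its own execution context, so the
    // two enqueues never touch the same memory.
    float *d_in[2], *d_out[2];
    cudaStream_t streams[2];
    IExecutionContext *ctx[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc(reinterpret_cast<void **>(&d_in[i]),  3 * 224 * 224 * sizeof(float));
        cudaMalloc(reinterpret_cast<void **>(&d_out[i]), 1000 * sizeof(float));
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
        ctx[i] = engines[i]->createExecutionContext();
    }
    for (int i = 0; i < 2; i++) {
        void *bindings[2] = {d_in[i], d_out[i]};   // assumes binding 0 = input, 1 = output
        ctx[i]->enqueueV2(bindings, streams[i], nullptr);
    }
    for (int i = 0; i < 2; i++)
        cudaStreamSynchronize(streams[i]);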

Good to know it works now.
Thanks for the update.


Hi @wsmlby - I'm having some problems with this issue as well.
Do you know how much of the model runs on the GPU and how much on the DLAs? Did you check that when running on the GPU and DLA together you still get better performance?
Which JetPack version are you using?

thanks in advance
Eyal

Around half and half. It is faster running on two DLAs than on one. JetPack 4.4.

Hi @wsmlby,
I am still facing some issues here, especially when I try to run DLA + GPU.
Could you please elaborate on what you meant by changing the memory management that solved this issue?
Any chance you can share the code?

thanks
Eyal

Sorry for the late reply. I cannot share the code, but please try to start from a fixed input and simple code, then add more on top of it.

I can now run it stably on the DLAs, but the result is simply wrong compared to the GPU, so I am not using the DLAs for now since the GPU is both faster and correct.