Concurrent DLA and GPU calls fail

Hi,
I’m trying to run two C++ std::threads, each with its own CUDA stream: one runs a network on the GPU and the other runs a network on the DLA.
trtexec runs each network separately just fine.

All the deserializing and the runtime, engine and context creation for the GPU network are done on one thread; the same steps for the DLA network are done on another thread.

The first call to both the GPU and DLA networks seems to run correctly (returns cudaErrorSuccess). The second call always fails for the DLA with the following error message:

NVMEDIA_DLA : 1928, ERROR: Submit failed.
…/rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
FAILED_EXECUTION: std::exception

And any call afterwards results in this:

NVMEDIA_DLA : 885, ERROR: runtime registerEvent failed. err: 0x4.
NVMEDIA_DLA : 1849, ERROR: RequestSubmitEvents failed. status: 0x7.
[Sun Jan 31 2021 07.09.28.982] [sw-inference-TRT-log] [INFO] …/rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
[Sun Jan 31 2021 07.09.28.982] [sw-inference-TRT-log] [INFO] FAILED_EXECUTION: std::exception

The code on the thread looks like this:
bool b = _obj->_context->enqueue(_obj->_batchSize,_allBindings, stream, nullptr);

I’m using JetPack 4.4.1:
R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, EABI: aarch64, DATE: Fri Oct 16 19:37:08 UTC 2020

Any assistance is greatly appreciated

thanks
Eyal

Hi,

Could you share your source with us so we can check it deeper?
Do you recreate the engine in the second call or just enqueue new data?

Thanks.

Hi @AastaLLL ,
I cannot share the code, but I did find the issue, and would be happy to hear what you guys think.
To get it to work, I had to make sure that all of the following happen on the same thread:

  • Creating the runtime, deserializing, and creating the engine and context
  • Setting the DLA ID
  • Using a different CUDA stream for the TRT execution
  • Executing the inference on the same thread as the creation, using the same dedicated CUDA stream

There was some test code, running on the main thread after the networks were built, that verified the networks were OK by running inference on each of them once.
When I commented out this code and adhered to the above restrictions, the GPU and DLA threads worked concurrently. As long as this test code was enabled, all subsequent calls to the GPU went fine and all calls to the DLA failed.
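One way to keep a verification run without breaking the same-thread rule is to route every job, including the one-off check, to the thread that owns the network. The following is only a minimal sketch of that idea using the C++ standard library: `ThreadConfinedWorker`, `submit` and `ownerId` are hypothetical names of mine, not part of TensorRT, and the TensorRT and CUDA steps appear only as comments marking where they would go.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper (not a TensorRT class): a single thread owns all
// per-network state, and every submitted job runs on that same thread.
class ThreadConfinedWorker {
public:
    ThreadConfinedWorker() : worker_([this] { run(); }) {}

    ~ThreadConfinedWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // pending jobs are drained before the thread exits
    }

    // Queue a job from any thread; it is executed on the owning thread.
    std::future<bool> submit(std::function<bool()> job) {
        std::packaged_task<bool()> task(std::move(job));
        std::future<bool> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        cv_.notify_one();
        return result;
    }

    std::thread::id ownerId() const { return worker_.get_id(); }

private:
    void run() {
        // Assumption: in a real application the runtime, DLA ID selection,
        // engine deserialization, execution context and the dedicated CUDA
        // stream would all be created here, on this thread.
        for (;;) {
            std::packaged_task<bool()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stop requested and queue drained
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task();  // e.g. the enqueue call plus a stream synchronize
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<bool()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist first
};
```

With one such worker for the GPU network and one for the DLA network, the main thread would call `submit(...)` and wait on the returned future instead of calling enqueue itself, so the test inference and all later inferences run on the thread that created the context.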

It seems to me that, besides the weird error message from the DLA, all operations had to be done on the same thread and CUDA stream (I am familiar with similar behavior in “regular” CUDA apps), but I would be happy to hear your thoughts on the matter :)

Furthermore, the error message is very unhelpful: it does not give any information as to what the problem might be.

thanks
Eyal

Hi,

May I first know the build-time batch size and the runtime batch size in your use case?
If they are not equal, could you set them to be the same and run it again?

There is a limitation in DLA (build-time batch == runtime batch), and violating it causes a similar error log.

Thanks.

Hi @AastaLLL,
The batch size is 1 for both build time and runtime.
The error log is shown in the message above. It's basically this:

NVMEDIA_DLA : 885, ERROR: runtime registerEvent failed. err: 0x4.
NVMEDIA_DLA : 1849, ERROR: RequestSubmitEvents failed. status: 0x7.
[Sun Jan 31 2021 07.09.28.982] [sw-inference-TRT-log] [INFO] …/rtExt/dla/native/dlaUtils.cpp (194) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
[Sun Jan 31 2021 07.09.28.982] [sw-inference-TRT-log] [INFO] FAILED_EXECUTION: std::exception