Accessing Jetson's DLA from Python


I am running an ONNX network from Python on a Jetson NX. It runs on the GPU, but I would like to try running it on the DLA. I see C++ examples on the subject but none in Python. Could someone provide some guidance? I had hoped that tensorrt.Builder.canRunOnDLA() would be available.

TensorRT Version:
Python Version (if applicable): 3.6.

Hi @m.bingham,
I am afraid we do not have any example available publicly.
But you can check the link below.
All these flags are available in IBuilderConfig.
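For reference, a rough sketch of how those IBuilderConfig flags might be set from Python when building from ONNX. It assumes a TensorRT 7.x-era Python API (newer releases changed or removed some of these calls), and the function name, onnx_path, plan_path and the workspace size are hypothetical choices, not anything from an official sample. The loop over layers uses can_run_on_DLA, which appears to be the Python counterpart of canRunOnDLA.

```python
def build_dla_engine(onnx_path, plan_path, dla_core=0):
    """Build an FP16 engine that targets one DLA core and save it to plan_path.

    Sketch only: assumes a TensorRT 7.x-era Python API. onnx_path and
    plan_path are hypothetical file names supplied by the caller.
    """
    # Imported lazily so this helper can be defined on a machine without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

    builder = trt.Builder(logger)
    network = builder.create_network(explicit_batch)
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [parser.get_error(i).desc() for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # The DLA settings live on a config *instance*, not on the class.
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 28            # arbitrary value for this sketch
    config.set_flag(trt.BuilderFlag.FP16)          # DLA runs FP16 or INT8, not FP32
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # unsupported layers fall back to the GPU
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core

    # Report which layers cannot be placed on the DLA.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if not config.can_run_on_DLA(layer):
            print("GPU fallback:", layer.name)

    engine = builder.build_engine(network, config)
    if engine is None:
        raise RuntimeError("engine build failed")

    with open(plan_path, "wb") as f:
        f.write(engine.serialize())
    return engine
```

Device placement is decided when the engine is built, so a plan that was built for the GPU stays on the GPU after it is deserialized. When loading a DLA plan, it may also be necessary to set DLA_core on the trt.Runtime object before calling deserialize_cuda_engine.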


I am struggling to get things to run on the DLA with Python. Do I need to export the TensorRT model with something specific for the DLA other than 16-bit floating point? There are no complaints from the Python interpreter, but I still see GPU use, and /sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status shows the DLA as inactive. Here is an excerpt of what I am doing.

import pycuda.autoinit  # creates and activates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

def allocate_buffers(engine, batch_size, data_type):
    # Allocate page-locked host memory for the input and the output.
    h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
    # Allocate device memory for inputs and outputs.
    d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input_1, d_input_1, h_output, d_output, stream

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):

    load_images_to_buffer(pics_1, h_input_1)

    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)
        # execute() is synchronous and does not run on `stream`,
        # so wait for the copy to finish before starting inference.
        stream.synchronize()

        # Run inference.
        context.profiler = trt.Profiler()
        context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])

        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream so the host buffer is complete before reading it.
        stream.synchronize()
        # Return the host output.
        out = h_output.reshape((batch_size, -1, height, width))
        return out

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

#Move to the DLA
trt.BuilderFlag.GPU_FALLBACK = False
trt.IBuilderConfig.default_device_type = trt.DeviceType.DLA
trt.IBuilderConfig.DLA_core = 0

with open(plan_path, 'rb') as f:
    engine_data = f.read()

engine = trt_runtime.deserialize_cuda_engine(engine_data)

h_input, d_input, h_output, d_output, stream = allocate_buffers(engine, 1, trt.float32)

raw_result = do_inference(engine, pil_img, h_input, d_input, h_output, d_output, stream, 1, 1000, 1)
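
One way to check whether the DLA actually wakes up during inference is to sample the runtime_status file mentioned above while the inference call is running, rather than looking at it before or after. Below is a minimal sketch using only the standard library. The helper names are hypothetical, the path is the one quoted in this post (it may differ on other boards or releases), and "active" is the value the Linux runtime power management interface reports for a device that is powered up.

```python
import threading
import time

# Path quoted in the post above; it may differ on other boards or releases.
DLA0_STATUS = "/sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status"

def read_runtime_status(path=DLA0_STATUS):
    """Return the stripped contents of a runtime_status file, or None if unreadable."""
    try:
        with open(path, "r") as f:
            return f.read().strip()
    except OSError:
        return None

def statuses_during(fn, path=DLA0_STATUS, interval=0.01):
    """Run fn() while sampling the status file; return the set of values seen."""
    seen = set()
    stop = threading.Event()

    def sampler():
        # Poll the file until the main thread signals that fn() has finished.
        while not stop.is_set():
            seen.add(read_runtime_status(path))
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        fn()
    finally:
        stop.set()
        thread.join()
    # Take one final sample so the result is never empty.
    seen.add(read_runtime_status(path))
    return seen
```

Wrapping the inference call, for example statuses_during(lambda: do_inference(...)), and checking whether "active" is in the returned set gives a quick indication of whether the DLA was powered up at any point during the run. If it never shows up, the engine is most likely still running entirely on the GPU.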