I defined my own network and ran the following pipeline:
TensorFlow model -> ONNX -> ONNX parser (C++ inference)
If I do not call
::enableDLA(builder.get(), config.get(), mParams.dlaCore);
inference is almost 100 ms faster.
When this call is enabled, the DLA core is used, but only some layers run on the DLA while the rest fall back to the GPU.
I don't understand why the difference is this large. My guess is that the memory copies between the DLA and the GPU are responsible.
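For reference, this is roughly what the `enableDLA` helper from the TensorRT sample code does (a sketch, assuming a TensorRT `IBuilderConfig`; `kGPU_FALLBACK` is what lets the unsupported layers run on the GPU, which is also what introduces the DLA<->GPU transfers at every subgraph boundary):

```cpp
#include "NvInfer.h"

// Sketch of the samples' enableDLA helper: route the network to a DLA core,
// allowing layers the DLA cannot run to fall back to the GPU.
void enableDLA(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config,
               int dlaCore)
{
    if (dlaCore >= 0)
    {
        // Without this flag, building fails if any layer is unsupported on DLA.
        config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
        // Place layers on the DLA by default instead of the GPU.
        config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
        config->setDLACore(dlaCore);
    }
}
```

Every switch between a DLA subgraph and a GPU subgraph costs a reformat/copy, so a network whose layers alternate between the two devices can easily end up slower than running entirely on the GPU.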