Why is the inference time of an engine with dynamic batch size poor?

I have an engine with a dynamic batch size.
Here is the command for profiling the engine with BS=1:

/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --exportLayerInfo=graphLO.json --exportProfile=profileL1.json --warmUp=0 --duration=0 --iterations=1000 --shapes=images:1x3x640x640

Inference time for BS=1 is 28.13 ms.

Here is the command for profiling the engine with BS=4:

/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --exportLayerInfo=graphLO.json --exportProfile=profileL1.json --warmUp=0 --duration=0 --iterations=1000 --shapes=images:4x3x640x640

Inference time for BS=4 is 95 ms, so the per-image inference time in this case is about 23.75 ms.
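The per-image arithmetic above can be sketched as follows, using the two timings measured in this thread:

```python
# Per-image latency and effective batching speedup, from the measurements above.
bs1_latency_ms = 28.13   # measured latency for batch size 1
bs4_latency_ms = 95.0    # measured latency for batch size 4

per_image_bs4 = bs4_latency_ms / 4          # ~23.75 ms per image at BS=4
speedup = bs1_latency_ms / per_image_bs4    # ~1.18x, far from an ideal 4x

print(f"per-image latency at BS=4: {per_image_bs4:.2f} ms")
print(f"effective speedup over BS=1: {speedup:.2f}x")
```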

I think TensorRT is not parallelizing this workload well, since 23.75 ms per image and 28.13 ms are so close. I expected the total inference time for BS=4 to be about the same as for BS=1, with the four images processed in parallel.
Could you give any suggestions about this? Thank you.
@spolisetty @junshengy

Can you try running your model with the trtexec command and share the `--verbose` log in case the issue persists?
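A minimal form of that command, reusing the engine and input names from the commands earlier in this thread, might look like:

```shell
# Re-run with verbose logging enabled and capture the output to a file.
# Engine file and input tensor name are taken from the commands above.
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine \
  --shapes=images:4x3x640x640 \
  --verbose > trtexec_verbose.log 2>&1
```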

You can refer to the link below for the full list of supported operators. If any operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script, if not shared already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:
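Separately, one thing worth checking for dynamic-shape engines: TensorRT tunes kernels for the optimization profile the engine was built with, so a profile whose `optShapes` is batch 1 can leave BS=4 running with suboptimal tactics. A hedged sketch of rebuilding with the profile centered on BS=4 (the ONNX filename and the max batch of 8 are assumptions, not from this thread):

```shell
# Sketch: rebuild the engine so the optimization profile is tuned for BS=4.
# "model.onnx" and the max batch of 8 are assumed values for illustration.
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx \
  --minShapes=images:1x3x640x640 \
  --optShapes=images:4x3x640x640 \
  --maxShapes=images:8x3x640x640 \
  --saveEngine=model.engine
```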