Inference time of an engine with dynamic batch size does not scale well with batch size?

I have an engine built with a dynamic batch size.
Here is the command I use to profile the engine with BS=1:

/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --exportLayerInfo=graphLO.json --exportProfile=profileL1.json --warmUp=0 --duration=0 --iterations=1000 --shapes=images:1x3x640x640

The inference time for BS=1 is 28.13 ms.

Here is the command I use to profile the same engine with BS=4:

/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --exportLayerInfo=graphLO.json --exportProfile=profileL1.json --warmUp=0 --duration=0 --iterations=1000 --shapes=images:4x3x640x640

The inference time for BS=4 is 95 ms for the whole batch, so the time per image in this case is about 23 ms.
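
To make the comparison concrete, here is a small sketch of the arithmetic behind the per-image figure. The helper name per_image_stats is only something I made up for illustration; the two latencies are the ones measured with trtexec in this post.

```python
def per_image_stats(batch_latency_ms, batch_size):
    """Return (latency per image in ms, throughput in images/s) for one batch."""
    per_image_ms = batch_latency_ms / batch_size
    images_per_s = 1000.0 / per_image_ms
    return per_image_ms, images_per_s


# Latencies reported in this post (ms per batch).
bs1_ms, bs1_ips = per_image_stats(28.13, 1)
bs4_ms, bs4_ips = per_image_stats(95.0, 4)

# Ratio of throughputs: how much batching helps per image.
speedup = bs4_ips / bs1_ips

print(f"BS=1: {bs1_ms:.2f} ms/image, {bs1_ips:.1f} images/s")
print(f"BS=4: {bs4_ms:.2f} ms/image, {bs4_ips:.1f} images/s")
print(f"Throughput gain from batching: {speedup:.2f}x")
```

With these numbers, BS=4 works out to 23.75 ms per image, which is a throughput gain of only about 1.18x over BS=1.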

It looks like TensorRT is not parallelizing well across the batch in this case: 23 ms and 28 ms per image are very close. I expected the total inference time for BS=4 to be about the same as the inference time for BS=1.
Could you give me any suggestions about this? Thank you.
