FP16 mode is not running faster than FP32 mode

Using Jetson Xavier with JetPack 4.1.1, TensorRT 5.0, and the Caffe model parser.
Hello,
I'm running inference on a face-detection network at batch size 1; a few custom plugin layers were written to implement the PriorBox layer. From the results below, inference with the FLOAT (FP32) format actually runs slightly faster than with HALF (FP16).
Is this expected, and what could cause it? If not, what detail might I have missed when enabling half mode?

The measured results are:

engine is FP16 mode
image inference consume  time:  15.5089ms
image preprocess consume  time:  67.05ms
engine is FP32 mode
image inference consume  time:  14.7815ms
image preprocess consume  time:  63.854ms
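
For reference, the inference time is measured around the TensorRT execution call roughly like this (a simplified sketch using CUDA events; context, buffers, stream and batchSize are placeholder names, not my exact code):

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, stream);
	context->enqueue(batchSize, buffers, stream, nullptr);   // asynchronous inference
	cudaEventRecord(stop, stream);
	cudaEventSynchronize(stop);                              // wait for inference to finish

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);                  // elapsed GPU time in milliseconds
	std::cout << "image inference consume time: " << ms << "ms" << std::endl;

	cudaEventDestroy(start);
	cudaEventDestroy(stop);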

My code to implement FP16:

bool useFp16 = builder->platformHasFastFp16();
DataType modelDataType = useFp16 ? DataType::kHALF : DataType::kFLOAT;

const IBlobNameToTensor* blobNameToTensor = parser->parse(locateFile(deployFile, directories).c_str(),
                                                          locateFile(modelFile, directories).c_str(),
                                                          *network, modelDataType);
// specify which tensors are outputs
for (auto& s : outputs)
{
	network->markOutput(*blobNameToTensor->find(s.c_str()));
}

// set max batch size and workspace
builder->setMaxBatchSize(maxBatchSize);
builder->setMaxWorkspaceSize(36 << 20);

// enable FP16 mode
builder->setFp16Mode(useFp16);
builder->setStrictTypeConstraints(useFp16);

ICudaEngine* engine = builder->buildCudaEngine(*network);
assert(engine);
……
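
To check which layers actually run in FP16 and which fall back to FP32 (for example the custom plugin), a per-layer profiler can be attached to the execution context. A minimal sketch, where the LayerProfiler class name is my own:

	// prints the time spent in each layer (requires NvInfer.h)
	struct LayerProfiler : public nvinfer1::IProfiler
	{
		void reportLayerTime(const char* layerName, float ms) override
		{
			std::cout << layerName << ": " << ms << "ms" << std::endl;
		}
	};

	LayerProfiler profiler;
	context->setProfiler(&profiler);
	// profiling forces synchronous execution, so execute() is used instead of enqueue()
	context->execute(batchSize, buffers);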

By the way, I maximized the CPU/GPU clocks first, and the plugin layer's output format has been changed from float to half.
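
For completeness, the plugin advertises half support roughly like this (a simplified excerpt, assuming an IPluginExt-based plugin; the PriorBoxPlugin name and the mDataType member are illustrative, not my exact code):

	// report support for FP32 and FP16 in linear NCHW layout
	bool PriorBoxPlugin::supportsFormat(nvinfer1::DataType type, nvinfer1::PluginFormat format) const
	{
		return (type == nvinfer1::DataType::kFLOAT || type == nvinfer1::DataType::kHALF)
		       && format == nvinfer1::PluginFormat::kNCHW;
	}

	// remember the chosen precision so enqueue() knows whether to read/write __half or float
	void PriorBoxPlugin::configureWithFormat(const nvinfer1::Dims*, int, const nvinfer1::Dims*, int,
	                                         nvinfer1::DataType type, nvinfer1::PluginFormat, int)
	{
		mDataType = type;
	}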