Why my inference time is so long when using trtexec - FP16?

Aizzaac · October 22, 2020, 10:14pm

This is part of my code. I am measuring the inference time, using the Inception_v1 model, optimised with “trtexec” + FP16. I used tf2onnx to parse the TensorFlow graph.

I have put 2 timers:

To measure the time when sending data from CPU to GPU
Just the inference

QUESTIONS:

Are the results given in seconds, milliseconds, etc? I am multiplying by 1000
Why do inference time varies so much?
Why is it taking too much time? In this GIT, the average seems to be 8ms
https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification

RUN 1

RUN2

RUN 3

SCRIPT

def allocate_buffers(engine, batch_size, data_type):

    """
    This is the function to allocate buffers for input and output in the device (GPU) and host (CPU)
    Args:
      engine : The path to the TensorRT engine. 
      batch_size : The batch size for execution time.
      data_type: The type of the data for input and output, for example trt.float32, np.float32. 
    
    Output:
      h_input: Input in the host (CPU).
      d_input: Input in the device (GPU). 
      h_output: Output in the host (CPU). 
      d_output: Output in the device (GPU). 
      stream: CUDA stream.

    """

    # Determine dimensions and create page-locked memory buffers (which won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
    
    # Allocate device memory for inputs and outputs (the same size as host' input and output).
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    
    # Create a stream in which to copy inputs/outputs between the allocated memory from device and host; and run inference.
    stream = cuda.Stream()
    return h_input, d_input, h_output, d_output, stream




def load_images_to_buffer(pics, pagelocked_buffer):
    
    """
    This is the function to load (preprocessed) images to buffers in the host
    
    """
    preprocessed = np.asarray(pics).ravel()
    np.copyto(pagelocked_buffer, preprocessed)


 

def load_labels(path, encoding='utf-8'):
    """Loads labels from file (with or without index numbers).
    Args:
        path: path to label file.
        encoding: label file encoding.
        
    Returns:
        Dictionary mapping indices to labels.
       
    """
    with open(path, 'r', encoding=encoding) as f:
        lines = f.readlines()
        if not lines:
            return {}

        if lines[0].split(' ', maxsplit=1)[0].isdigit():
            pairs = [line.split(' ', maxsplit=1) for line in lines]
            return {int(index): label.strip() for index, label in pairs}
        else:
            return {index: line.strip() for index, line in enumerate(lines)}
        



def load_engine(trt_runtime, plan_path):
    """
    This function reads the engine from the file .trt and deserializes it
    """
    with open(plan_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

# Modify x in [0:x] to display the numer of desired predictions
def postprocess_inception(output):
    predictions_top = np.argsort(output)[::-1][0:1]
    labels_top = [labels[p] for p in predictions_top]
    scores_top = output[predictions_top]
    return scores_top, labels_top


# Loads labels file
labels = load_labels(labels_path)


# Loads model file
engine = load_engine(trt_runtime, plan_path)


# Allocate buffers in CPU and GPU - calling TX2_classify.py (allocate_buffers) 
h_input, d_input, h_output, d_output, stream = allocate_buffers(engine, 1, trt.float32)

# Context for executing inference using ICudaEngine
context = engine.create_execution_context()


# Image preprocessing
# Open image, convert each pixel to the triple 8-bit value, change size and apply a high-quality
# downsampling filter. Then it create an array of data type FP32 and normalizes it.
image = np.asarray(Image.open(image_path).convert('RGB').resize(size, Image.ANTIALIAS), dtype=np.float32)
image /= 255


# Load the preprocessed image into the CPU's buffer - calling TX2_classify.py (load_images_to_buffer)
load_images_to_buffer(image, h_input)

# Transfer input data from CPU to GPU.
start_0 = time.perf_counter()
cuda.memcpy_htod_async(d_input, h_input, stream)
time_CPUtoGPU = time.perf_counter() - start_0
print("CPUtoGPU(ms):", (time_CPUtoGPU * 1000))
print("\n")


for i in range (count):
    # Run inference.
    #context.profiler = trt.Profiler() ##shows execution time(ms) of each layer
    start_1 = time.perf_counter()
    context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])   
    inference_time = time.perf_counter() - start_1
    print("Inference_time(ms)")
    print(inference_time * 1000)
    print("\n")
    
    # Transfer predictions back from the GPU to the CPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
        
    # Synchronize the stream.
    stream.synchronize()
        
    # Return the CPU output.
    scores = h_output

    pred = postprocess_inception(scores)
    print(pred)
    print("\n")

AastaLLL · October 23, 2020, 3:15am

Hi,

In case you don’t know, pleases maximize the device performance with the following commands first:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

1. Based on their document below, time.perf_counter() returns the value in seconds.
Since you multiply it with 1000, it should be ms.
https://docs.python.org/3/library/time.html#time.perf_counter

2. Please maximize the clocks rate.
Sometimes inference time will vary if using the dynamic clock mode (for power saving).

3. The time only measures the inference part and do it with fp16 precision.

Thanks.

Aizzaac · October 23, 2020, 2:30pm

Hi @AastaLLL

Thank you for the quick answer. Check my results now:

3. The time only measures the inference part and do it with fp16 precision.

I optimised the model with “trtexec” and FP16.
And put a timer only in the inference:

These are the results using: jetson_clocks and nvpmodel -m 0

RUN1

RUN2

RUN3

Why is the inference #1 slower than the others? It is not measuring the time to move data from CPU to GPU. Is there something else happening?

dusty_nv · October 23, 2020, 3:23pm

The very first time you run inference on a model after loading the application, it typically takes a bit longer as the GPU is initialized, GPU code is loaded, ect. So during benchmarks, there is typically a “warm-up” period where the inferencing is run but ignored. Also if your model is using plugins for TensorRT, those are configured on-demand which takes more time the first run.

Also, benchmarks typically run many inferencing iterations (like trtexec program does) and average the results or perform some mean/median/mode statistics on them.

Topic		Replies	Views
Inference Speed Jetson Xavier NX pytorch	6	885	April 12, 2023
Inference time becomes longer when doing non-continuous fp16 or int8 inference TensorRT tensorrt , jetson-inference	33	3272	March 30, 2023
Low Compute utilization of converted TensorFlow model during inference Jetson TX2	19	1695	October 18, 2021
TensorRt inference is taking 1.5 sec to inference a single frame.i want to speed up my inference.How can i do that TensorRT tensorrt , cuda , jetson-nano	3	762	March 13, 2023
Inference time on jetson nano Jetson AGX Xavier tensorrt , cuda , kernel , jetson-inference	2	940	May 30, 2022
Strange jumping results on FPS and inference time Jetson Nano	9	1175	October 18, 2021
Extremely slow inference in TensorRT for live semantic segmentation model Jetson AGX Xavier tensorrt , tensorflow , jetson-inference	11	4386	April 12, 2022
Inference is so slow with torch1.6 Jetson Xavier NX nvbugs , pytorch	12	3538	October 23, 2020
Inference time increases rapidly when set a high resolution input image TensorRT tensorrt , cuda , ubuntu	1	804	September 13, 2023
Tlt-infer is slow TAO Toolkit	13	830	October 12, 2021

Why my inference time is so long when using trtexec - FP16?

Related topics