Low utilization of Tensor Cores with TensorRT

I have a very simple autoencoder-like test model for 4x image super resolution, written in TensorFlow 2.5 and trained in FP32 mode (nothing special).

def upscale_block(x,scale,filters,kernel_size=(3,3)):

    def upscale(x,scale):
        x=layers.Conv2D(filters*4, kernel_size, padding="same", activation="relu")(x)
        x = layers.Lambda(lambda x:tf.nn.depth_to_space(x, scale))(x)
        return x

    if scale==2:
        x=upscale(x,2)
    elif scale==3:
        x=upscale(x,3)
    elif scale==4:
        x=upscale(x,2)
        x=upscale(x,2)
    return x


def autoencoder(scale=4,training=True):
    x_in = layers.Input(shape=(None, None, IMAGE_CHANNELS),name="input")
    x=x_in 
    #encoder
    x=layers.Conv2D(256, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(128, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(64, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(32, (3,3), activation='relu', padding='same',strides=1)(x)

    #decoder
    x=layers.Conv2DTranspose(32, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2DTranspose(64, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2DTranspose(128, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2DTranspose(256, (3,3), activation='relu', padding='same',strides=1)(x)

    x=upscale_block(x,scale=scale,filters=64,kernel_size=(3,3))

    x=layers.Conv2D(IMAGE_CHANNELS, kernel_size=(3,3), activation=None, padding='same',name="output")(x)
    return Model(x_in, x)
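
To make the shapes in the upscale path concrete, here is a minimal sketch that traces what tf.nn.depth_to_space does to an NHWC feature map. The helper name depth_to_space_shape is hypothetical, and the 256x256 input size is taken from the fixed shape used later in the thread. It relies only on the documented rule that the channel count must be divisible by block_size squared, which also suggests that the scale==3 branch, which still emits filters*4 channels, would be rejected.

```python
def depth_to_space_shape(height, width, channels, block_size):
    """Return (height, width, channels) after depth_to_space on an NHWC tensor."""
    if channels % (block_size * block_size) != 0:
        raise ValueError(
            f"channels={channels} is not divisible by block_size**2={block_size ** 2}"
        )
    return (
        height * block_size,
        width * block_size,
        channels // (block_size * block_size),
    )


FILTERS = 64  # value passed to upscale_block in the post

# scale=4 path: two (Conv2D with filters*4 channels -> depth_to_space(2)) stages.
shape_after_first = depth_to_space_shape(256, 256, FILTERS * 4, block_size=2)
shape_after_second = depth_to_space_shape(
    shape_after_first[0], shape_after_first[1], FILTERS * 4, block_size=2
)
print("after first stage: ", shape_after_first)
print("after second stage:", shape_after_second)

# scale=3 path as written: the Conv2D still emits filters*4 = 256 channels,
# which is not a multiple of 3**2 = 9.
try:
    depth_to_space_shape(256, 256, FILTERS * 4, block_size=3)
    scale3_ok = True
except ValueError as err:
    scale3_ok = False
    print("scale=3:", err)
```

If the scale==3 branch is needed, emitting filters * scale**2 channels from the Conv2D would satisfy the divisibility rule; that change is a suggestion, not something tested in this thread.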

Then I convert my model to an ONNX model:

python -m tf2onnx.convert  --opset 13 --saved-model /weights/saved_model_pb/ --output /weights/onnx/model.onnx

And then I convert the ONNX model to a TensorRT engine with fixed shape 256x256x3 and batch sizes 1, 8, and 16 (3 different engines):

trtexec --onnx=/weights/onnx/model.onnx --saveEngine=/weights/onnx/model-rt.trt --explicitBatch --fp16 --optShapes=\input\:8x256x256x3 --workspace=10000 --verbose

Then I load the TensorRT engine into Python and test it. First of all, I don’t see any speed-up from using batch 8 or 16 instead of batch 1; inference just takes 8 or 16 times longer. With a batch of 8 I get 50 ms raw inference time, without any pre- or post-processing. Then I profiled my script with NVIDIA Nsight Systems, and it shows that the Tensor Cores are active for only around 6 ms of the 50 ms total inference time per image (I warm up inference with 10 images before starting to record profile data).
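
The numbers above can be turned into a quick sanity check. This is a minimal sketch with hypothetical helper names. The batch-1 latency is not stated in the post, so it is inferred here from the "8 times longer" remark. The post is also ambiguous about whether the 6 ms and 50 ms figures are per image or per batch, but the ratio is the same either way.

```python
def batching_speedup(latency_batch1_ms, latency_batch_n_ms, batch_size):
    """Throughput gain of one batched call over batch_size sequential batch-1 calls."""
    return (latency_batch1_ms * batch_size) / latency_batch_n_ms


def active_fraction(active_ms, total_ms):
    """Share of the measured time during which a unit was busy."""
    return active_ms / total_ms


# Figures quoted in the post.
batch8_ms = 50.0
tensor_core_active_ms = 6.0

# Assumption: batch 1 is 8x faster than batch 8, as the post describes.
batch1_ms = batch8_ms / 8

speedup = batching_speedup(batch1_ms, batch8_ms, batch_size=8)
images_per_s = 8 * 1000.0 / batch8_ms
tc_fraction = active_fraction(tensor_core_active_ms, batch8_ms)

print(f"batching speedup:       {speedup:.2f}x")
print(f"throughput at batch 8:  {images_per_s:.0f} images/s")
print(f"Tensor Core active:     {tc_fraction:.0%} of inference time")
```

A speedup of 1.0 means batching buys nothing, which usually indicates the GPU is already saturated at batch 1 for this input size.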

def predict(batch): # result gets copied into output
    # Transfer input data to device
    start_time = time.time()
    # cuda.memcpy_htod_async(d_input, batch, stream)
    cuda.memcpy_htod(d_input, batch)
    print(Fore.YELLOW+"==>memcpy_htod_async time %.3fs"%(time.time() - start_time))

    # Execute model
    start_time = time.time()
    # context.execute_async_v2(bindings, stream.handle, None)
    context.execute_v2(bindings)
    print(Fore.YELLOW+"==>execute_async_v2 time %.3fs"%(time.time() - start_time))

    # Transfer predictions back
    start_time = time.time()
    # cuda.memcpy_dtoh_async(output, d_output,stream)
    cuda.memcpy_dtoh(output, d_output)
    print(Fore.YELLOW+"==>memcpy_dtoh_async time %.3fs"%(time.time() - start_time))

    # Synchronize threads
    start_time = time.time()
    # stream.synchronize()
    print(Fore.YELLOW+"==>synchronize time %.3fs"%(time.time() - start_time))
    
    return output
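
The function above times each stage once with time.time() and prints inside the timed path. As a hedged alternative, here is a minimal sketch of a warm-up-then-repeat measurement using time.perf_counter. The names benchmark and summarize are hypothetical. The timed callable must block until the GPU work has finished, which the synchronous execute_v2 and memcpy_dtoh calls do, whereas the commented-out async variants would need stream.synchronize() before the timer stops.

```python
import statistics
import time


def benchmark(run_once, warmup=10, iterations=50):
    """Call run_once() warmup times untimed, then return timed latencies in ms."""
    for _ in range(warmup):
        run_once()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()  # must not return before the work is complete
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Reduce a list of latencies to min, median, and max."""
    return {
        "min_ms": min(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


# Stand-in workload so the sketch runs without a GPU;
# with the code above this could be: lambda: predict(batch)
stats = summarize(benchmark(lambda: time.sleep(0.001), warmup=2, iterations=5))
print(stats)
```

Reporting the median over many iterations makes the result less sensitive to one-off stalls than a single timed call.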

Testing the same TensorRT engine via trtexec shows a similar inference time: 50 ms for batch 8.

So my question is: how do I keep the Tensor Cores busy and speed up inference?

Environment

TensorRT Version: 7.2.3.4
GPU Type: Quadro RTX A6000
Nvidia Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.2.0
Operating System + Version: ubuntu 18.04
Python Version (if applicable): 3.8.5
TensorFlow Version (if applicable): 2.5
PyTorch Version (if applicable): none
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:21.05-tf2-py3

Relevant Files

model.onnx (5.9 MB)
profile_report.qdrep (1.9 MB)
verbose-output-trtexec.txt (1.1 MB)



Hi,
Please share the ONNX model and the script, if you have not already, so that we can assist you better.
Meanwhile, you can try a few things:
https://docs.nvidia.com/deeplearning/tensorrt/quick-start-guide/index.html#onnx-export

1. Validate your model with the snippet below:

check_model.py

import sys
import onnx
filename = sys.argv[1]  # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)

2. Try running your model with the trtexec command.

If you are still facing issues, please share the trtexec --verbose log for further debugging.
Thanks!

I already attached the model and checked it.
model.onnx (5.9 MB)
In my previous message I showed how I run the model via trtexec, which gives a similarly slow result of 50 ms. I will now capture the trtexec verbose output and share it.

Verbose output of trtexec :

trtexec --onnx=/weights/onnx/model.onnx --shapes=input:8x256x256x3 --fp16 --workspace=40000 --verbose

verbose-output-trtexec.txt (1.1 MB)

For testing we can use trtexec, as it shows the same “slow” result of 50+ ms.

trtexec --loadEngine=/weights/onnx/model-rt.trt --fp16 --optShapes=\input\:8x256x256x3 --verbose

@kirpasaccessory,

Could you please confirm whether you are facing this issue on the latest TensorRT version as well?

Yes, v8 has the same problem.
Also, I found that the “upsample” or “deconvolution” layers do not use Tensor Cores.
Exec commands:


TRT usage:

Hi @kirpasaccessory,

The upsample layer will not use Tensor Cores because it is not a GEMM (matrix multiplication). For the deconv layer, could you please set a much larger workspace (like 1 GB)? The Tensor Core deconv tactics require a pretty large workspace.
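
For reference, here is a small sketch that converts the --workspace values already used in this thread into GiB. It assumes the flag is interpreted in MiB, as the trtexec help text for these releases describes ("workspace size in megabytes"). The helper name is hypothetical.

```python
MIB = 1 << 20
GIB = 1 << 30


def workspace_mib_to_gib(workspace_mib):
    """Convert a trtexec --workspace value (assumed to be MiB) to GiB."""
    return workspace_mib * MIB / GIB


# Values taken from the trtexec commands posted in this thread.
for value in (10000, 35000, 40000):
    print(f"--workspace={value} -> {workspace_mib_to_gib(value):.2f} GiB")
```

Under that assumption, every command posted here already requests well over 1 GB of workspace.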

Thank you.

What do you recommend for the upscaling operation to get better performance?

Sorry for the delayed response. Could you please give more details about your query?

I am testing a very simple autoencoder model in TensorFlow for 4x upsampling (super resolution). After optimizing the ONNX model via trtexec with the --dumpProfile option, I see that the model spends 60% of its time in the last upscale section (profiling inference of the model also confirms this). In the trtexec console during conversion I also constantly see “out of memory” messages, even if I increase the workspace size to 35-40 GB. So, how can I speed up the upscaling part and prevent the out-of-memory situation?
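
A rough multiply-accumulate (MAC) count can show whether that share is expected from the arithmetic alone. The sketch below is an estimate under stated assumptions, not a profile. It uses the first model posted in this thread (not the larger one that follows), a 256x256 input with IMAGE_CHANNELS = 3, 3x3 kernels, "same" padding with stride 1, and treats Conv2DTranspose at stride 1 as costing the same as Conv2D. The helper conv_macs is hypothetical.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """MACs per image for a stride-1, same-padded kernel x kernel convolution."""
    return height * width * kernel * kernel * c_in * c_out


IMAGE_CHANNELS = 3  # assumption, from the fixed shape 256x256x3
H = W = 256
FILTERS = 64        # value passed to upscale_block

layer_costs = []  # (section, description, macs)

# Encoder and decoder: every layer runs at the 256x256 input resolution.
c_in = IMAGE_CHANNELS
for c_out in (256, 128, 64, 32, 32, 64, 128, 256):
    layer_costs.append(
        ("body", f"{c_in}->{c_out} @ {H}x{W}", conv_macs(H, W, c_in, c_out))
    )
    c_in = c_out

# Upscale block for scale=4: two (Conv2D -> depth_to_space(2)) stages.
h, w = H, W
for _ in range(2):
    c_out = FILTERS * 4
    layer_costs.append(
        ("upscale", f"{c_in}->{c_out} @ {h}x{w}", conv_macs(h, w, c_in, c_out))
    )
    h, w, c_in = h * 2, w * 2, c_out // 4  # depth_to_space(2)

# Output convolution runs at the full 1024x1024 resolution.
layer_costs.append(
    ("upscale", f"{c_in}->{IMAGE_CHANNELS} @ {h}x{w}",
     conv_macs(h, w, c_in, IMAGE_CHANNELS))
)

total_macs = sum(macs for _, _, macs in layer_costs)
upscale_macs = sum(macs for section, _, macs in layer_costs if section == "upscale")
upscale_fraction = upscale_macs / total_macs

for section, description, macs in layer_costs:
    print(f"{section:8s} {description:24s} {macs / 1e9:7.2f} GMAC")
print(f"upscale share of MACs: {upscale_fraction:.1%}")
```

Under these assumptions the upscale section accounts for about 60% of the MACs, so a similar share of runtime would be consistent with the arithmetic rather than a sign that one layer is unusually slow. MAC counts ignore memory traffic and kernel selection, so this is only a first-order check.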

trtexec --onnx=/weights/onnx/model.onnx --saveEngine=/weights/onnx/model-rt.trt --explicitBatch --fp16 --optShapes=\input:0\:8x256x256x3 --workspace=35000 --dumpProfile --noBuilderCache --best

    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
    le2=x
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=2)(x)


    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)
    le3=x
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=2)(x)

    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)
    le4=x
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=2)(x)

    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)
    le5=x
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=2)(x)

    #latent dim
    x=layers.Conv2D(2048, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(2048, kernel_size, activation='relu', padding='same',strides=1)(x)

    #decoder
    x=layers.Conv2DTranspose(1024, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le5
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)   
    
    x=layers.Conv2DTranspose(512, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le4
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)

    x=layers.Conv2DTranspose(256, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le3
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)

    x=layers.Conv2DTranspose(128, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le2
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)

Hi @kirpasaccessory,

Maybe this is the best performance TRT can currently achieve. Please allow us some time to work on this issue.
Thank you for letting us know about this issue.