Low utilization of Tensor RT cores

kirpasaccessory · July 23, 2021, 11:03pm

I have a very simple autoencoder-like test model for 4x image super-resolution, written in TensorFlow 2.5 and trained in FP32 mode (nothing special).

def upscale_block(x,scale,filters,kernel_size=(3,3)):

    def upscale(x,scale):
        x=layers.Conv2D(filters*4, kernel_size, padding="same", activation="relu")(x)
        x = layers.Lambda(lambda x:tf.nn.depth_to_space(x, scale))(x)
        return x

    if scale==2:
        x=upscale(x,2)
    elif scale==3:
        x=upscale(x,3)
    elif scale==4:
        x=upscale(x,2)
        x=upscale(x,2)
    return x


def autoencoder(scale=4,training=True):
    x_in = layers.Input(shape=(None, None, IMAGE_CHANNELS),name="input")
    x=x_in 
    #encoder
    x=layers.Conv2D(256, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(128, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(64, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(32, (3,3), activation='relu', padding='same',strides=1)(x)

    #decoder
    x=layers.Conv2DTranspose(32, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2DTranspose(64, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2DTranspose(128, (3,3), activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2DTranspose(256, (3,3), activation='relu', padding='same',strides=1)(x)

    x=upscale_block(x,scale=scale,filters=64,kernel_size=(3,3))

    x=layers.Conv2D(IMAGE_CHANNELS, kernel_size=(3,3), activation=None, padding='same',name="output")(x)
    return Model(x_in, x)
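
As a rough, hedged sanity check on where the compute goes in this model, the multiply-accumulate (MAC) count of each convolution can be estimated from its output size and channel counts. The sketch below assumes a 256x256 input, IMAGE_CHANNELS = 3, and stride-1 "same" padding throughout (as in the code above); the helper names conv_macs and model_macs are made up for illustration. It counts arithmetic only and says nothing about which kernels TensorRT actually picks.

```python
# Rough MAC count per convolution for the autoencoder above.
# Assumptions: 256x256 input, IMAGE_CHANNELS = 3, stride 1, 'same' padding.

def conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a stride-1, 'same'-padded k x k convolution with h x w output."""
    return h * w * c_in * c_out * k * k


def model_macs(size=256, image_channels=3):
    """Return a list of (description, MACs) tuples, one per convolution."""
    h = w = size
    c = image_channels
    rows = []
    # Encoder Conv2D layers followed by decoder Conv2DTranspose layers.
    # With stride 1, a transposed convolution costs the same as a convolution.
    for c_out in (256, 128, 64, 32, 32, 64, 128, 256):
        rows.append((f"{h}x{w} {c}->{c_out}", conv_macs(h, w, c, c_out)))
        c = c_out
    # upscale_block(scale=4, filters=64): twice Conv2D(256) + depth_to_space(2).
    for _ in range(2):
        rows.append((f"{h}x{w} {c}->256 (upscale)", conv_macs(h, w, c, 256)))
        h, w, c = h * 2, w * 2, 256 // 4  # depth_to_space(2): 2x size, C / 4
    # Final output convolution at 4x resolution.
    rows.append((f"{h}x{w} {c}->{image_channels} (output)",
                 conv_macs(h, w, c, image_channels)))
    return rows


if __name__ == "__main__":
    rows = model_macs()
    total = sum(macs for _, macs in rows)
    for name, macs in rows:
        print(f"{name:32s} {macs / 1e9:7.2f} GMAC  {100 * macs / total:5.1f}%")
```

Under these assumptions the last three convolutions (the two upscale convolutions and the output layer) carry roughly 60% of all MACs, so a large share of the runtime in the upscaling tail is expected from the arithmetic alone.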

Then I convert my model to an ONNX model:

python -m tf2onnx.convert  --opset 13 --saved-model /weights/saved_model_pb/ --output /weights/onnx/model.onnx

Then I convert the ONNX model to a TensorRT engine with a fixed 256x256x3 input shape and batch sizes 1, 8, and 16 (3 different engines):

trtexec --onnx=/weights/onnx/model.onnx --saveEngine=/weights/onnx/model-rt.trt --explicitBatch --fp16 --optShapes=\input\:8x256x256x3 --workspace=10000 --verbose

Then I load the TRT engine into Python and test it. First of all, I don’t see any speed-up from using batch 8 or 16 instead of batch 1; inference is simply 8 or 16 times slower. With a batch of 8 I get 50 ms of raw inference time, without any pre- or post-processing. I then profiled my script with NVIDIA Nsight Systems, and it shows that the Tensor Cores are active for only around 6 ms of the 50 ms total inference time per image (I warm up with 10 inferences before I start recording profile data).

def predict(batch): # result gets copied into output
    # Transfer input data to device
    start_time = time.time()
    # cuda.memcpy_htod_async(d_input, batch, stream)
    cuda.memcpy_htod(d_input, batch)
    print(Fore.YELLOW+"==>memcpy_htod_async time %.3fs"%(time.time() - start_time))

    # Execute model
    start_time = time.time()
    # context.execute_async_v2(bindings, stream.handle, None)
    context.execute_v2(bindings)
    print(Fore.YELLOW+"==>execute_async_v2 time %.3fs"%(time.time() - start_time))

    # Transfer predictions back
    start_time = time.time()
    # cuda.memcpy_dtoh_async(output, d_output,stream)
    cuda.memcpy_dtoh(output, d_output)
    print(Fore.YELLOW+"==>memcpy_dtoh_async time %.3fs"%(time.time() - start_time))

    # Synchronize threads
    start_time = time.time()
    # stream.synchronize()
    print(Fore.YELLOW+"==>synchronize time %.3fs"%(time.time() - start_time))
    
    return output
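
For comparing batch sizes, a small timing harness with warm-up and several repetitions may give steadier numbers than a single time.time() pair. The following is only a sketch: benchmark and per_image_ms are hypothetical helper names, and run_once stands in for a blocking call such as predict(batch) above (it must not return before the GPU work has finished, otherwise the timings are meaningless).

```python
import statistics
import time


def benchmark(run_once, warmup=10, iters=50):
    """Run run_once() `warmup` times untimed, then time `iters` calls.

    Returns latency statistics in milliseconds. run_once is assumed to block
    until the work it launches is complete.
    """
    for _ in range(warmup):
        run_once()
    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        "n": len(samples_ms),
        "min_ms": min(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
    }


def per_image_ms(batch_ms, batch_size):
    """Latency per image when a whole batch takes batch_ms milliseconds."""
    return batch_ms / batch_size
```

With the function above this could be called as benchmark(lambda: predict(batch)). For reference, 50 ms for a batch of 8 works out to per_image_ms(50.0, 8) = 6.25 ms per image.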

Testing the same TRT engine with trtexec shows a similar inference time: 50 ms for batch 8.

So my question is: how can I keep the Tensor Cores busy and speed up inference?

Environment

TensorRT Version: 7.2.3.4
GPU Type: Quadro RTX A6000
Nvidia Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.2.0
Operating System + Version: ubuntu 18.04
Python Version (if applicable): 3.8.5
TensorFlow Version (if applicable): 2.5
PyTorch Version (if applicable): none
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:21.05-tf2-py3

Relevant Files

model.onnx (5.9 MB)
profile_report.qdrep (1.9 MB)
verbose-output-trtexec.txt (1.1 MB)

NVES · July 23, 2021, 11:07pm

Hi,
Please share the ONNX model and the script, if you have not already, so that we can assist you better.
Meanwhile, you can try a few things:

1) Validate your model with the snippet below.

check_model.py

import sys
import onnx
filename = yourONNXmodel  # placeholder: replace with the path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)

2) Try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
If you are still facing the issue, please share the trtexec "--verbose" log for further debugging.
Thanks!

kirpasaccessory · July 23, 2021, 11:12pm

I already attached the model and checked it.
model.onnx (5.9 MB)
In my previous message I showed how I run the model with the trtexec command, which gives a similarly slow result of 50 ms. I will now collect the trtexec verbose output and share it.

kirpasaccessory · July 23, 2021, 11:57pm

Verbose output of trtexec :

trtexec --onnx=/weights/onnx/model.onnx --shapes=input:8x256x256x3 --fp16 --workspace=40000 --verbose

verbose-output-trtexec.txt (1.1 MB)

kirpasaccessory · July 24, 2021, 1:00am

For testing we can use trtexec, since it shows the same “slow” result of 50+ ms.

trtexec --loadEngine=/weights/onnx/model-rt.trt --fp16 --optShapes=\input\:8x256x256x3 --verbose

spolisetty · July 26, 2021, 3:33pm

@kirpasaccessory,

Could you please confirm whether you are facing this issue on the latest TensorRT version as well?

kirpasaccessory · July 26, 2021, 4:00pm

Yes, v8 has the same problem.
Also, I found that neither the “upsample” nor the “deconvolution” layer uses Tensor Cores.
Exec commands:

TRT usage:

spolisetty · August 3, 2021, 5:58am

Hi @kirpasaccessory,

Upsample layer will not use TensorCores because it is not a Gemm (matrix multiplication). For deconv layer, could you please set a much larger workspace (like 1GB)? The TensorCore deconv tactics require pretty large workspace.

Thank you.
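
To illustrate why the upsample step involves no matrix multiplication: the depth_to_space operation used in upscale_block only rearranges existing values. A minimal pure-Python sketch for a single H x W x C image is shown below; it follows the documented NHWC behaviour of tf.nn.depth_to_space, but it is written for illustration and is not how TensorFlow or TensorRT implement the operation.

```python
def depth_to_space(x, block_size):
    """Rearrange an H x W x C nested list into (H*r) x (W*r) x (C / r^2).

    Every output element is a copy of exactly one input element; there are
    no multiplications or additions, so there is nothing for a GEMM unit to do.
    """
    r = block_size
    h, w, c = len(x), len(x[0]), len(x[0][0])
    if c % (r * r) != 0:
        raise ValueError("channel count must be divisible by block_size ** 2")
    c_out = c // (r * r)
    out = [[[None] * c_out for _ in range(w * r)] for _ in range(h * r)]
    for i in range(h):
        for j in range(w):
            for di in range(r):
                for dj in range(r):
                    for k in range(c_out):
                        src = (di * r + dj) * c_out + k
                        out[i * r + di][j * r + dj][k] = x[i][j][src]
    return out


if __name__ == "__main__":
    # One pixel with 4 channels becomes a 2x2 patch with 1 channel.
    print(depth_to_space([[[1, 2, 3, 4]]], 2))
```

Because the operation is pure data movement, its cost is governed by memory bandwidth rather than by Tensor Core throughput.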

kirpasaccessory · August 3, 2021, 1:16pm

What do you recommend for upscaling operation for better performance?

spolisetty · August 10, 2021, 6:25pm

Sorry for the delayed response. Could you please give more details about your question?

kirpasaccessory · August 11, 2021, 4:50pm

I am testing a very simple autoencoder model in TensorFlow for 4x upsampling (super-resolution). After optimizing the ONNX model via trtexec with the “--dumpProfile” option, I see that the model spends 60% of its time in the final upscale section (profiling the model’s inference confirms this). In the trtexec console during conversion I also constantly see “out of memory” messages, even if I increase the workspace size to 35-40 GB. So, how can I speed up the upscaling part and prevent the out-of-memory situation?

trtexec --onnx=/weights/onnx/model.onnx --saveEngine=/weights/onnx/model-rt.trt --explicitBatch --fp16 --optShapes=\input:0\:8x256x256x3 --workspace=35000 --dumpProfile --noBuilderCache --best

    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
    le2=x
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=2)(x)


    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)
    le3=x
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=2)(x)

    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)
    le4=x
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=2)(x)

    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)
    le5=x
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=2)(x)

    #latent dim
    x=layers.Conv2D(2048, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(2048, kernel_size, activation='relu', padding='same',strides=1)(x)

    #decoder
    x=layers.Conv2DTranspose(1024, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le5
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(1024, kernel_size, activation='relu', padding='same',strides=1)(x)   
    
    x=layers.Conv2DTranspose(512, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le4
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(512, kernel_size, activation='relu', padding='same',strides=1)(x)

    x=layers.Conv2DTranspose(256, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le3
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(256, kernel_size, activation='relu', padding='same',strides=1)(x)

    x=layers.Conv2DTranspose(128, kernel_size, activation='relu', padding='same',strides=2)(x)
    x=x+le2
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
    x=layers.Conv2D(128, kernel_size, activation='relu', padding='same',strides=1)(x)
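
To get a feel for the data volume in a 4x upscaling tail, the size of each intermediate tensor can be estimated directly. The sketch below uses the shapes of the simple autoencoder from the first post (batch 8, 256x256 input, FP16 at 2 bytes per element); tensor_mib is a hypothetical helper, and the numbers cover dense activations only. It does not model TensorRT's internal workspace, layout choices, or layer fusion.

```python
def tensor_mib(n, h, w, c, bytes_per_elem=2):
    """Size in MiB of a dense N x H x W x C tensor (2 bytes/element for FP16)."""
    return n * h * w * c * bytes_per_elem / 2**20


BATCH = 8  # assumed batch size, as in the trtexec commands in this thread

# (description, height, width, channels) for the upscaling tail
stages = [
    ("upscale conv 1 output", 256, 256, 256),
    ("after depth_to_space 1", 512, 512, 64),
    ("upscale conv 2 output", 512, 512, 256),
    ("after depth_to_space 2", 1024, 1024, 64),
    ("final output", 1024, 1024, 3),
]

if __name__ == "__main__":
    for name, h, w, c in stages:
        print(f"{name:26s} {tensor_mib(BATCH, h, w, c):8.1f} MiB")
```

Under these assumptions the second upscale convolution writes about 1 GiB per batch, so the tail moves far more data than the encoder and decoder layers, which work on 256x256 feature maps.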

spolisetty · August 17, 2021, 10:12am

Hi @kirpasaccessory,

This may be the best performance TRT can currently achieve. Please allow us some time to work on this issue.
Thank you for reporting it.

spolisetty · October 7, 2021, 6:33am

Hi @kirpasaccessory,

Could you please try the TensorRT 8.2 EA version? We hope the issue is fixed there.

Thank you.

kirpasaccessory · October 7, 2021, 12:37pm

Do you have docker container for this release?

spolisetty · October 7, 2021, 1:51pm

The 8.2 EA Docker container is not available yet; you may need to set it up locally.

spolisetty · October 12, 2021, 9:13am

Hi @kirpasaccessory,

Were you able to test with 8.2 EA?

kirpasaccessory · October 12, 2021, 11:51am

I have installed the new 8.2 EA and am now testing and profiling my models. I will know the final results today.

spolisetty · October 12, 2021, 1:27pm

Thank you.

kirpasaccessory · October 14, 2021, 11:43pm

Hi!
Basically, I don’t see any improvement in inference time with the new TRT version. And as you can see, the Tensor Cores are still not fully utilized.

Report 32.qdrep (4.1 MB)

kirpasaccessory · October 16, 2021, 1:42pm

Please take a look at my video about profiling inference.
