I have a very simple autoencoder-like test model for 4x image super resolution, written in TensorFlow 2.5 and trained in FP32 mode (nothing special).
import tensorflow as tf
from tensorflow.keras import layers, Model

def upscale_block(x, scale, filters, kernel_size=(3, 3)):
    def upscale(x, scale):
        # 4 = 2**2 extra channels for a 2x depth_to_space (pixel shuffle) step
        # (a scale==3 step would need filters * 9 channels here)
        x = layers.Conv2D(filters * 4, kernel_size, padding="same", activation="relu")(x)
        x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
        return x

    if scale == 2:
        x = upscale(x, 2)
    elif scale == 3:
        x = upscale(x, 3)
    elif scale == 4:
        x = upscale(x, 2)  # 4x = two 2x pixel-shuffle stages
        x = upscale(x, 2)
    return x
def autoencoder(scale=4, training=True):
    x_in = layers.Input(shape=(None, None, IMAGE_CHANNELS), name="input")
    x = x_in
    # encoder
    x = layers.Conv2D(256, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = layers.Conv2D(128, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = layers.Conv2D(64, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=1)(x)
    # decoder
    x = layers.Conv2DTranspose(32, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = layers.Conv2DTranspose(64, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = layers.Conv2DTranspose(128, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = layers.Conv2DTranspose(256, (3, 3), activation='relu', padding='same', strides=1)(x)
    x = upscale_block(x, scale=scale, filters=64, kernel_size=(3, 3))
    x = layers.Conv2D(IMAGE_CHANNELS, kernel_size=(3, 3), activation=None, padding='same', name="output")(x)
    return Model(x_in, x)
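The SavedModel consumed by the conversion step below is exported along these lines (a minimal sketch; the path is an assumption matching the command that follows, and IMAGE_CHANNELS is defined elsewhere):

model = autoencoder(scale=4)
# ... FP32 training loop ...
model.save("/weights/saved_model_pb/")  # TF2 SavedModel format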
Then I convert my model to an ONNX model:
python -m tf2onnx.convert --opset 13 --saved-model /weights/saved_model_pb/ --output /weights/onnx/model.onnx
And then I convert the ONNX model to TensorRT engines with a fixed 256x256x3 input shape and batch sizes 1, 8, and 16 (three different engines):
trtexec --onnx=/weights/onnx/model.onnx --saveEngine=/weights/onnx/model-rt.trt --explicitBatch --fp16 --optShapes=input:8x256x256x3 --workspace=10000 --verbose
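(For reference, the equivalent build through the TensorRT Python API would look roughly like this, as a single engine with a dynamic-batch optimization profile covering 1/8/16 instead of three fixed engines. This is only a sketch; the tensor name "input", paths, and workspace size are taken from the commands above.)

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("/weights/onnx/model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 10000 << 20          # 10000 MiB, as in the trtexec call
config.set_flag(trt.BuilderFlag.FP16)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 256, 256, 3), (8, 256, 256, 3), (16, 256, 256, 3))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
with open("/weights/onnx/model-rt.trt", "wb") as f:
    f.write(engine.serialize())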
Then I load the TRT engine into Python and test it. First of all, I don't see any speed-up from using batch 8 or 16 instead of batch 1; inference time simply scales 8x or 16x. With a batch of 8 I get 50 ms of raw inference time, without any pre- or post-processing. Then I profiled my script with Nvidia Nsight Systems, and it shows the Tensor Cores are active for only around 6 ms of the 50 ms total inference time per image (I warm up with 10 inference runs before recording profile data).
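The engine loading and buffer setup behind predict() below looks roughly like this (a sketch of my setup, not the exact code; the output shape assumes 4x upscaling of a 256x256x3 input with the batch-8 engine):

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
from colorama import Fore
import time

runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
with open("/weights/onnx/model-rt.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

batch_shape = (8, 256, 256, 3)
out_shape = (8, 1024, 1024, 3)                    # 4x upscaled output
context.set_binding_shape(0, batch_shape)         # input dims are dynamic in the ONNX
output = np.empty(out_shape, dtype=np.float32)
d_input = cuda.mem_alloc(int(np.prod(batch_shape)) * 4)   # float32 bytes
d_output = cuda.mem_alloc(output.nbytes)
bindings = [int(d_input), int(d_output)]
stream = cuda.Stream()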
def predict(batch):  # result gets copied into output
    # Transfer input data to device
    start_time = time.time()
    # cuda.memcpy_htod_async(d_input, batch, stream)
    cuda.memcpy_htod(d_input, batch)
    print(Fore.YELLOW + "==>memcpy_htod time %.3fs" % (time.time() - start_time))
    # Execute model
    start_time = time.time()
    # context.execute_async_v2(bindings, stream.handle, None)
    context.execute_v2(bindings)
    print(Fore.YELLOW + "==>execute_v2 time %.3fs" % (time.time() - start_time))
    # Transfer predictions back
    start_time = time.time()
    # cuda.memcpy_dtoh_async(output, d_output, stream)
    cuda.memcpy_dtoh(output, d_output)
    print(Fore.YELLOW + "==>memcpy_dtoh time %.3fs" % (time.time() - start_time))
    # Synchronize threads
    start_time = time.time()
    # stream.synchronize()
    print(Fore.YELLOW + "==>synchronize time %.3fs" % (time.time() - start_time))
    return output
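For comparison, the stream-based variant I had commented out would look roughly like this (a sketch; predict_async is just an illustrative name, and batch/output would need to be page-locked host buffers, e.g. from cuda.pagelocked_empty, for the copies to actually overlap):

def predict_async(batch):
    cuda.memcpy_htod_async(d_input, batch, stream)            # H2D copy on the stream
    context.execute_async_v2(bindings, stream.handle, None)   # enqueue inference
    cuda.memcpy_dtoh_async(output, d_output, stream)          # D2H copy on the stream
    stream.synchronize()                                      # wait for the whole pipeline
    return output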
Testing the same TRT engine via trtexec shows a similar inference time: about 50 ms for batch 8.
So my question is: how can I keep the Tensor Cores busy and speed up inference?
Environment
TensorRT Version: 7.2.3.4
GPU Type: Quadro RTX A6000
Nvidia Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.2.0
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.8.5
TensorFlow Version (if applicable): 2.5
PyTorch Version (if applicable): none
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorflow:21.05-tf2-py3
Relevant Files
model.onnx (5.9 MB)
profile_report.qdrep (1.9 MB)
verbose-output-trtexec.txt (1.1 MB)