Hi,
I intend to run inference on the pretrained ActionRecognitionNet (resnet18_3d_rgb_hmdb5_32.onnx) using onnxruntime, but I fail to get the right prediction. It seems like there is something wrong with my preprocessing step. For normalization, I tried both mean=[0.5, 0.5, 0.5] & std=[0.5, 0.5, 0.5] and mean=[0.485, 0.456, 0.406] & std=[0.229, 0.224, 0.225], but to no avail.
Below are the preprocessing steps I took, based on the SpatialDataset class in tao_pytorch_backend/nvidia_tao_pytorch/cv/action_recognition/dataloader/ar_dataset.py (main branch of NVIDIA/tao_pytorch_backend on GitHub):
import numpy as np
import onnxruntime as ort
import torch  # needed for torch.stack below
from PIL import Image
import torchvision.transforms as transforms
# Initialization
im_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop([224, 224]),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.5, 0.5, 0.5],  # [0.485, 0.456, 0.406],
        std=[0.5, 0.5, 0.5]  # [0.229, 0.224, 0.225]
    )
])
input_layer_name = 'input_rgb'
output_layer_name = ['fc_pred']
model = ort.InferenceSession(
    'resnet18_3d_rgb_hmdb5_32.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
labels = ['walk', 'ride_bike', 'run', 'fall_floor', 'push']
# Preprocess
im_processed = []
for im_ in im:  # im is List[np.ndarray] in BGR format with length of 32
    im_ = im_[..., ::-1]  # Convert BGR to RGB
    im_ = Image.fromarray(im_)
    im_processed_ = im_transforms(im_)  # was self.transforms, which is not defined in this snippet
    im_processed.append(im_processed_)
im_processed = torch.stack(im_processed, 1).numpy()[np.newaxis]  # shape (1,3,32,224,224)
# Predict
prediction = model.run(output_layer_name, {input_layer_name: im_processed})[0][0]  # shape (5,)
# Postprocess
decoded_prediction = labels[prediction.argmax()]
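To rule out the torchvision part of the pipeline as the source of the problem, the ToTensor + Normalize steps and the (1, 3, T, H, W) layout can be reproduced in plain NumPy and compared against im_processed. This is only a minimal sketch: normalize_clip is a hypothetical helper name, it assumes the frames are already resized and cropped to 224x224, and it does not tell which mean/std the model was actually trained with.

```python
import numpy as np

def normalize_clip(frames_bgr, mean, std):
    """Hypothetical helper: NumPy equivalent of ToTensor + Normalize for one clip.

    frames_bgr: list of HxWx3 uint8 arrays in BGR order (already resized/cropped).
    Returns a float32 array of shape (1, 3, T, H, W).
    """
    clip = np.stack(frames_bgr, axis=0)                 # (T, H, W, 3), BGR
    clip = clip[..., ::-1].astype(np.float32) / 255.0   # RGB, scaled to [0, 1]
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    clip = (clip - mean) / std                          # broadcasts over the channel axis
    clip = clip.transpose(3, 0, 1, 2)                   # (3, T, H, W)
    return np.ascontiguousarray(clip[np.newaxis])       # (1, 3, T, H, W)

# Dummy clip of 32 white frames, only to check shape, dtype and value range
frames = [np.full((224, 224, 3), 255, dtype=np.uint8) for _ in range(32)]
batch = normalize_clip(frames, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(batch.shape, batch.dtype, batch.min(), batch.max())
```

If np.allclose(batch, im_processed, atol=1e-5) holds for the same cropped frames, the channel order, scaling and axis layout are at least consistent between the two paths, and the remaining suspects are the mean/std values or the frame sampling.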