How to improve the performance of CUDA MPS?

Hi, I am a novice CUDA developer trying to improve the performance of NVIDIA's Tacotron 2 (https://github.com/NVIDIA/tacotron2) using MPS.

I am using this inference code to measure Tacotron 2 performance:

import time

import matplotlib
matplotlib.use("Agg")  # headless backend; set before importing pylab
import matplotlib.pylab as plt
import numpy as np
import torch
#import amp
#from apex import amp

from hparams import create_hparams
from model import Tacotron2
from train import load_model
from text import text_to_sequence

if __name__ == "__main__":
    torch.backends.cudnn.benchmark = True
    hparams = create_hparams()
    hparams.sampling_rate = 22050
    checkpoint_path = "tacotron2_statedict.pt"
    model = load_model(hparams)
    model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
    _ = model.eval()
    #fp16
    #model, _ = amp.initialize(model, , opt_level="O2")

    # Dummy batch tensors; they are not used by the inference loop below.
    text_inputs = torch.randn(64, 163, device='cuda')
    text_lengths = torch.randn(64, device='cuda')
    mels = torch.randn(64, 80, 851, device='cuda')
    max_len = 163.0
    output_lengths = torch.randn(64, device='cuda')

    text = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
    sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
    sequence = torch.from_numpy(sequence).cuda().long()

    total = 0.0  # mel frames generated in the timed iterations
    res = 0.0    # wall-clock seconds spent in the timed iterations
    for i in range(2):
        start_time = time.time()
        mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        elapsed = time.time() - start_time
        dims = mel_outputs.size()
        if i != 0:  # the first iteration is a warm-up and is not counted
            total = total + dims[-1]
            res = res + elapsed

    # Seconds of audio generated per second of wall-clock time.
    TPS = ((total * 256) / 22050) / res

    print("TPS = " + str(TPS))

Note: the code above enables torch.backends.cudnn.benchmark, but the results below were measured without it.
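The TPS value computed at the end of the script can be written as a small standalone helper, which makes the units explicit. This is only a sketch: the function name is made up for illustration, and it assumes the constant 256 in the script is the mel hop length in samples, so that frames * 256 / 22050 is the duration of the generated audio in seconds.

```python
def audio_seconds_per_second(total_frames, elapsed_seconds,
                             hop_length=256, sampling_rate=22050):
    """Seconds of audio generated per second of wall-clock time.

    hop_length=256 is an assumption about what the constant 256 in the
    benchmark script represents; sampling_rate matches the script.
    """
    audio_seconds = total_frames * hop_length / sampling_rate
    return audio_seconds / elapsed_seconds


# Hypothetical numbers for illustration only (not measured results).
example_tps = audio_seconds_per_second(total_frames=861, elapsed_seconds=2.0)
print(f"TPS = {example_tps:.3f}")
```

A value above 1.0 means audio is generated faster than real time.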

My goal is to increase the total TPS of this Tacotron 2 program using MPS. These are the results:

1 process, without MPS: 9.3 TPS
10 processes, without MPS: 1.938 TPS (average), total: 19.38 TPS
20 processes (max), without MPS: 0.73 TPS (average), total: 14.75 TPS

1 process, with MPS: 9.3 TPS
10 processes, with MPS: 4.859 TPS (average), total: 48.59 TPS
20 processes (max), with MPS: 2.87 TPS (average), total: 57.45 TPS
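To put the totals above in perspective, the short calculation below derives two ratios from them: the gain from enabling MPS at a fixed process count, and how far the total scales relative to a single process. The numbers are copied from the results; the dictionary layout and function names are only illustrative.

```python
# Total TPS figures copied from the results above, keyed by (mode, processes).
total_tps = {
    ("no_mps", 1): 9.3,
    ("no_mps", 10): 19.38,
    ("no_mps", 20): 14.75,
    ("mps", 1): 9.3,
    ("mps", 10): 48.59,
    ("mps", 20): 57.45,
}


def mps_gain(num_procs):
    """Total TPS with MPS divided by total TPS without MPS."""
    return total_tps[("mps", num_procs)] / total_tps[("no_mps", num_procs)]


def scaling_vs_single(mode, num_procs):
    """Total TPS at num_procs divided by the single-process TPS."""
    return total_tps[(mode, num_procs)] / total_tps[(mode, 1)]


for n in (10, 20):
    print(f"{n} processes: MPS gain {mps_gain(n):.2f}x, "
          f"scaling with MPS {scaling_vs_single('mps', n):.2f}x, "
          f"scaling without MPS {scaling_vs_single('no_mps', n):.2f}x")
```

With these figures, MPS gives roughly 2.5x the total throughput at 10 processes and roughly 3.9x at 20, while 20 processes under MPS deliver only about 6.2x the throughput of one process, so there is still a large gap to linear scaling.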

My goal is to optimize not only the throughput of an individual Tacotron 2 process, but also the total TPS across multiple processes.

I found two ways to improve performance under MPS: one is CPU pinning with taskset -c ~ (this worked well), and the other is Volta MPS Execution Resource Provisioning, described in the MPS documentation: https://docs.nvidia.com/deploy/mps/index.html
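The two methods can be combined in a small launcher. The sketch below is a rough illustration, not a tested recipe: the script name inference.py, the helper names, and the core layout (consecutive core IDs per worker) are all assumptions. It relies on the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable from the MPS documentation, which as far as I understand limits the share of SMs a client may use on Volta and later GPUs, and it assumes the MPS control daemon is already running.

```python
import os
import subprocess


def build_worker_specs(num_procs, cores_per_proc, active_thread_pct,
                       script="inference.py"):
    """Return one (command, env) pair per worker process.

    script is a hypothetical file name for the benchmark shown earlier.
    Worker k is pinned to cores [k * cores_per_proc, (k + 1) * cores_per_proc).
    """
    specs = []
    for rank in range(num_procs):
        first = rank * cores_per_proc
        last = first + cores_per_proc - 1
        cpu_list = str(first) if cores_per_proc == 1 else f"{first}-{last}"
        cmd = ["taskset", "-c", cpu_list, "python", script]
        env = dict(os.environ)
        # Execution resource provisioning: cap this client's share of SMs.
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(active_thread_pct)
        specs.append((cmd, env))
    return specs


def launch(specs):
    """Start all workers at once and return their exit codes."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in specs]
    return [p.wait() for p in procs]
```

For example, launch(build_worker_specs(10, 2, 20)) would start 10 workers on 2 cores each with a 20% thread limit, assuming the machine has at least 20 cores. Which percentage works best depends on the workload, so it has to be found by measurement.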

Is there any other way to improve CUDA MPS performance besides the two methods I mentioned?

Thank you,
Tae Young Yeon