Different slowdowns when executing models concurrently

Hi all,

I am a graduate student doing a research project on edge server inference. I noticed an interesting thing:

When I run ResNet50 alone, the p50 latency is 3559 microseconds and the throughput is 280.95. When I run two instances of ResNet50, the latency is 6048 and the throughput is 330.15. So far so good: I understand that deep down my GPU uses a time-sliced scheduler, which makes the latency nearly double, and maybe a single instance doesn't fully utilize the GPU, so there's a slight throughput increase.

When I run ResNet50 simultaneously with, say, VGG16, things are different: ResNet's latency becomes 11226 and its throughput becomes 89.2, which is far below half of the throughput when running ResNet alone.
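For reference, here is a quick back-of-the-envelope check of how I'm reading these numbers (assuming the throughput figures are aggregate requests per second and the latencies are per-request):

```python
# Quick check: throughput (req/s) x latency (s) ~ requests in flight (Little's law).
# The numbers are the measurements quoted above; treating throughput as aggregate
# requests/sec is my assumption about how the numbers are reported.
cases = {
    "ResNet50 alone":        (280.95, 3559e-6),
    "2x ResNet50":           (330.15, 6048e-6),
    "ResNet50 beside VGG16": (89.20, 11226e-6),
}
for name, (req_per_s, latency_s) in cases.items():
    print(f"{name}: ~{req_per_s * latency_s:.2f} requests in flight")
# Prints roughly 1, 2, 1 -- ResNet always has about one request in flight,
# but in the mixed case each request takes ~3x longer than when it runs alone.
```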

Why does running ResNet alongside VGG interfere with ResNet so much? Is it because VGG's GPU kernels are larger, so when the time-sliced scheduler lets each model run roughly the same number of kernels, ResNet occupies less time on the GPU?
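One way I'm thinking of checking this is to compare how much GPU time each model's kernels take for a single forward pass with torch.profiler. Just a rough sketch (batch size and input shape are placeholders, not my actual test setup):

```python
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

def profile_one_forward(name, model, x):
    model = model.eval().cuda()
    with torch.no_grad():
        model(x)  # warm-up pass so one-time setup cost is excluded
        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
            model(x)
    # Top ops by GPU time, to compare kernel durations between the two models
    print(f"=== {name} ===")
    print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))

x = torch.randn(1, 3, 224, 224, device="cuda")
profile_one_forward("ResNet50", models.resnet50(), x)
profile_one_forward("VGG16", models.vgg16(), x)
```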

Thank you very much!

You might wish to take advantage of Triton Inference Server, rather than running these processes separately.

Hi Robert, thanks for your reply.

Yes, actually the data was obtained from Triton's benchmark tool, perf_analyzer. I also ran the models myself with PyTorch and the results were similar.
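The PyTorch test was roughly along these lines (a simplified sketch, not my exact script): one process per model, each timing its own requests while the other loops.

```python
import sys
import time
import statistics
import torch
import torchvision.models as models

def run_model(name, model_fn, n_iters=200):
    """Loop one model on the GPU and report p50 latency and throughput.
    Start one copy of this script per model (in separate shells) to
    reproduce the concurrent case."""
    model = model_fn().eval().cuda()
    x = torch.randn(1, 3, 224, 224, device="cuda")
    latencies = []
    with torch.no_grad():
        for _ in range(20):               # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        for _ in range(n_iters):
            t0 = time.perf_counter()
            model(x)
            torch.cuda.synchronize()      # wait for the GPU before stopping the clock
            latencies.append(time.perf_counter() - t0)
    p50_us = statistics.median(latencies) * 1e6
    throughput = n_iters / sum(latencies)
    print(f"{name}: p50 {p50_us:.0f} us, throughput {throughput:.1f} infer/s")

if __name__ == "__main__":
    # e.g. `python bench.py resnet50` in one shell and `python bench.py vgg16` in another
    which = sys.argv[1] if len(sys.argv) > 1 else "resnet50"
    run_model(which, {"resnet50": models.resnet50, "vgg16": models.vgg16}[which])
```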

I don't know if it's something related to the CUDA driver that causes this.

Any help with this?

You might wish to post your question on one of the DL forums:

Thank you, will do!