Triton Server GPU memory copy?

Question regarding potential buffer copies when using nvinferserver plugin:

Our compute server has 4 GPUs (no NVLink). Most plugins in the pipeline (e.g. nvvideoconvert) accept a “gpu-id”, and the nvinferserver configuration does as well. But Triton Inference Server of course does its own independent scheduling across all 4 GPUs.

If we set our pipeline to use “gpu-id = 0”, but Triton, after receiving the gRPC call with the shared CUDA memory buffer, decides to load balance / schedule the inference on e.g. GPU 3, will this result in a memory copy from GPU 0 to GPU 3? Or will Triton understand that the buffer is already bound to GPU 0 and run the inference on that device?

Thanks for the help.

Do you mean: after setting up gpu-ids for nvinferserver, will there still be load balancing within Triton?

Yes, more or less. nvinferserver places the input tensors for the network in CUDA shared memory. That means one of two things must happen:

  • Triton will respect the placement of the input tensor and run the computation on the same GPU, avoiding a memory copy
  • Triton will load balance across all 4 GPUs, which involves an additional GPU-to-GPU memory copy to place the tensor on the scheduled device

I want to know which of the two is correct.

Also posted in triton github issue: Question: scheduling on multiple GPUs when using cuda shared memory · Issue #5687 · triton-inference-server/server · GitHub

If you set gpu_ids for nvinferserver, it will run only on that GPU. The gpu-id parameter only applies to the plugins that set it.
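For context, the gpu_ids field lives in the infer_config block of the nvinferserver configuration file (protobuf text format). A minimal sketch, where the model name and unique_id are placeholders for illustration:

```
infer_config {
  unique_id: 1          # arbitrary stream/plugin id (placeholder)
  gpu_ids: [0]          # pin this nvinferserver instance to GPU 0
  backend {
    triton {
      model_name: "my_model"   # placeholder model name
      version: -1              # latest version
    }
  }
}
```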

Makes sense! Thanks.

Follow-up question: for nvinferserver, gpu-ids is an array/list, but it currently only accepts a single value. Could you comment on that? Is this for future compatibility?

Yes, we will add a comment about this to our guide.

Yes, we’ll consider that.

According to the Triton GitHub issue, this is unfortunately not as simple as suggested. The Triton developers confirmed that Triton will load balance across all GPUs, and if the CUDA shared memory buffer is not on the scheduled GPU, a GPU-to-GPU memory copy will happen. This can degrade performance on a multi-GPU server when using CUDA shared memory, which is our case.

At this point this is no longer a deepstream issue, so this can be closed.

There is, however, a critical bug in the nvinferserver plugin when using gpu_ids: the plugin always allocates memory on GPU 0, regardless of what gpu_ids is set to. I will describe this in a new post.

Does this issue occur on the client or the server side? The nvinferserver plugin is a client-side plugin. If you want to change the GPU on the server side, you should set the gpus parameter in the model's config.pbtxt file.
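For reference, pinning a Triton model to specific GPUs on the server side is done via an instance_group entry in the model's config.pbtxt. A minimal sketch; the instance count and GPU index here are illustrative assumptions:

```
# config.pbtxt (Triton model configuration)
instance_group [
  {
    count: 1        # one model instance (assumed)
    kind: KIND_GPU
    gpus: [ 0 ]     # restrict this model to GPU 0 (example index)
  }
]
```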

It is client side, in the DeepStream plugin. I have provided detailed instructions to reproduce it.
The bug currently keeps us from using the plugin in production, because memory on GPU 0 fills up too fast.

Please have a look if possible:

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.