Question regarding potential buffer copies when using nvinferserver plugin:
Our compute server has 4 GPUs (no NVLink). Most plugins in the pipeline (e.g. nvvideoconvert) accept a “gpu-id” property, and the nvinferserver configuration does as well. But Triton Inference Server of course does its own independent scheduling across all 4 GPUs.
If we set our pipeline to use “gpu-id = 0”, but Triton, after receiving the gRPC call with the CUDA shared memory buffer, decides to load-balance and schedule the inference on e.g. GPU 3, will this result in a memory copy from GPU 0 to GPU 3? Or will Triton recognize that the buffer already resides on GPU 0 and run the inference on that device?
According to the Triton GitHub issue, this is unfortunately not as simple as suggested. The Triton developers confirmed that Triton load-balances across all GPUs, and if the CUDA shared memory buffer is not on the GPU the inference is scheduled on, a device-to-device memcpy happens. This can degrade performance on a multi-GPU server when using CUDA shared memory, which is our setup.
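One way to sidestep the cross-GPU copy is to pin the model's instances to the same GPU the pipeline uses, so Triton has nothing to load-balance onto. This is a sketch of a Triton model configuration (config.pbtxt) restricting instances to GPU 0; the instance count is a placeholder and whether this fits depends on how much of the server the model may occupy:

```
# config.pbtxt (sketch) — pin all instances of this model to GPU 0
# so requests with CUDA shared memory on GPU 0 never cross devices.
# count is a placeholder; tune it for your throughput needs.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

The trade-off is that the other three GPUs are no longer used for this model, so this only helps if each GPU runs its own pipeline with its own pinned model instances.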
At this point this is no longer a DeepStream issue, so this topic can be closed.
There is, however, a critical bug in the nvinferserver plugin when using gpu_ids: the plugin always allocates memory on GPU 0, no matter what gpu_ids is set to. I will describe this in a new post.
The bug is client-side, in the DeepStream plugin. I will provide detailed instructions to reproduce it.
The bug currently keeps us from using the plugin in production, because memory on GPU 0 fills up too quickly.
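For reference, this is roughly the kind of nvinferserver configuration involved (model name and gRPC URL are placeholders, not our actual values). With gpu_ids set to 3, one would expect the plugin to allocate its buffers on GPU 3, but they end up on GPU 0 instead:

```
# nvinferserver config (sketch) — gpu_ids requests GPU 3,
# yet the plugin's allocations land on GPU 0.
infer_config {
  unique_id: 1
  gpu_ids: [3]
  backend {
    triton {
      model_name: "my_model"   # placeholder
      version: -1
      grpc {
        url: "localhost:8001"  # placeholder
      }
    }
  }
}
```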