Nvinferserver always allocates memory on GPU ID 0 and ignores gpu_ids configuration

When using nvinferserver on a multi-GPU server, configuring gpu_ids to anything other than [0] still allocates memory on GPU ID 0, in addition to the specified GPU.

• Hardware Platform: dGPU
• DeepStream Version: 6.2-triton docker
• NVIDIA GPU Driver Version: 525.105.17
• Issue Type: bug
• How to reproduce the issue?

Reproduce with the following pipeline on a server with e.g. 4 GPUs in deepstream 6.2-triton docker:

export USE_NEW_NVSTREAMMUX=yes
export VIDEO=/opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264

gst-launch-1.0 filesrc location=$VIDEO ! h264parse ! nvv4l2decoder gpu-id=2 ! mux.sink_0 nvstreammux name=mux ! nvinferserver config-file-path="./config_triton_grpc_infer.txt" ! fakesink

Notice nvv4l2decoder gpu-id=2 sets decoding to GPU ID 2. Set gpu_ids: [2] in config_triton_grpc_infer.txt to match that.
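For reference, the only change in config_triton_grpc_infer.txt should be the gpu_ids field inside the infer_config block (all other fields left as shipped; layout as in the sample configs):

    infer_config {
      gpu_ids: [2]
      ...
    }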

This results in 144 MB allocated on GPU 2, but an additional 102 MB on GPU 0:


For nvv4l2decoder gpu-id=1 and nvinferserver gpu_ids: [1]


For nvv4l2decoder gpu-id=0 and nvinferserver gpu_ids: [0]

No additional memory is allocated now, since both elements are on GPU ID 0 anyway.


For nvv4l2decoder gpu-id=2 and nvinferserver gpu_ids: [1]

0:00:00.653846585  2030 0x55d37a57e980 WARN           nvinferserver gstnvinferserver.cpp:628:gst_nvinfer_server_submit_input_buffer:<nvinferserver0> error: Memory Compatibility Error:Input surface gpu-id doesn't match with configured gpu-id for element, please allocate input using unified memory, or use same gpu-ids OR, if same gpu-ids are used ensure appropriate Cuda memories are used
0:00:00.653882336  2030 0x55d37a57e980 WARN           nvinferserver gstnvinferserver.cpp:628:gst_nvinfer_server_submit_input_buffer:<nvinferserver0> error: surface-gpu-id=2,nvinferserver0-
[ERROR push 333] push failed [-5]

This is an expected outcome, because the buffers are on GPU 2 while nvinferserver is configured for GPU 1.


To cross-check that nvinferserver is the plugin allocating the additional memory on GPU 0, we omit it from the pipeline:

gst-launch-1.0 filesrc location=$VIDEO ! h264parse ! nvv4l2decoder gpu-id=3 ! mux.sink_0 nvstreammux name=mux ! fakesink

This only allocates memory on GPU ID 3.

• Requirement details

Specifying gpu_ids should not additionally allocate memory on GPU ID 0. Buffers should stay on the specified GPU, and processing should also take place only on that GPU.

If possible, please confirm the issue and provide a workaround for DeepStream 6.2.

Did you modify config_triton_grpc_infer.txt? If yes, please share the configuration file.

The problem is not related to our config. Please use the following DeepStream samples to reproduce the same issue (everything is already shipped in the docker image):

# Start deepstream triton docker, allow all gpus
docker run --gpus all -it -e CUDA_CACHE_DISABLE=0 nvcr.io/nvidia/deepstream:6.2-triton

To start a valid tritonserver follow the sample instructions:

# Init the sample triton model repo
cd /opt/nvidia/deepstream/deepstream/samples
./prepare_ds_triton_model_repo.sh
# Start tritonserver with this sample repo
tritonserver --model-repository=/opt/nvidia/deepstream/deepstream/samples/triton_model_repo

Attach a new bash to the same docker container via docker exec -it CONTAINER-ID /bin/bash and run:

export USE_NEW_NVSTREAMMUX=yes
export VIDEO=/opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
export MODEL_CONFIG=/opt/nvidia/deepstream/deepstream/samples/configs/deepstream-app-triton-grpc/config_infer_plan_engine_primary.txt

gst-launch-1.0 filesrc location=$VIDEO ! h264parse ! nvv4l2decoder gpu-id=0 ! mux.sink_0 nvstreammux name=mux ! nvinferserver config-file-path=$MODEL_CONFIG ! fakesink

This should give you a perfectly working pipeline that runs until the video file ends.

Now, to observe the actual issue, open the sample model config with nano $MODEL_CONFIG and change gpu_ids: [0] to gpu_ids: [2]. Also change the pipeline above to use nvv4l2decoder gpu-id=2.

So the new pipeline to observe the issue:

gst-launch-1.0 filesrc location=$VIDEO ! h264parse ! nvv4l2decoder gpu-id=2 ! mux.sink_0 nvstreammux name=mux ! nvinferserver config-file-path=$MODEL_CONFIG ! fakesink

See nvidia-smi now on the same server while this pipeline runs:

The 102 MB of additional memory on GPU 0 is a critical issue. nvv4l2decoder is set to GPU 2 and nvinferserver is set to GPU 2, yet memory is still allocated on GPU 0. Every single DeepStream pipeline that uses nvinferserver will allocate 102 MB on GPU ID 0, so the VRAM of GPU 0 becomes the overall bottleneck of the system and the other GPUs cannot be fully utilized. We need a fix or workaround before we can use the nvinferserver plugin in production in its current state.

Furthermore, I don’t understand why nvinferserver would need 100 MB of VRAM on GPU 0 in the first place. Copying the buffers from GPU 2 to GPU 0 for preprocessing clearly makes no sense given the statement in the nvinferserver documentation:

[screenshot of the relevant statement from the nvinferserver documentation]

Thanks for sharing. I can reproduce this issue; it is related to nvinferserver, and we are investigating. BTW, the nvinferserver plugin is open source.
This command will use one GPU:
gst-launch-1.0 filesrc location=$VIDEO ! h264parse ! nvv4l2decoder gpu-id=2 ! mux.sink_0 nvstreammux name=mux ! fakesink
This command will use two GPUs:
gst-launch-1.0 filesrc location=$VIDEO ! h264parse ! nvv4l2decoder gpu-id=2 ! mux.sink_0 nvstreammux name=mux ! nvinferserver config-file-path=$MODEL_CONFIG ! fakesink

Thanks for addressing this issue. I wasn’t aware the gst-nvinferserver plugin was open source; that’s great! Where can I find the sources? We may work on a workaround on our own in the meantime.

In the DeepStream SDK: /opt/nvidia/deepstream/deepstream/sources/gst-plugins/gst-nvinferserver/
and /opt/nvidia/deepstream/deepstream/sources/libs/nvdsinferserver/


We couldn’t locate the issue in the source code of the plugin. Is there an update on Nvidia’s end?

Here is a bug that needs to be fixed; cudaSetDevice needs to be called to set the gpuId:

TrtISBackend::specifyInputDims {
    int gpuId = 0;
    UniqCudaTensorBuf tensor = createGpuTensorBuf(
        dims.dims, layer->dataType, dims.batchSize, name, gpuId, false);
}
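
For reference, a minimal sketch of how such a fix could look at that call site (the accessor for the configured GPU ID is hypothetical and error handling is omitted; the actual nvdsinferserver code may differ):

    int prevDev = 0;
    cudaGetDevice(&prevDev);              // remember the currently active device
    int gpuId = getConfiguredGpuId();     // hypothetical accessor for the first entry of gpu_ids
    cudaSetDevice(gpuId);                 // allocations below now target the configured GPU
    UniqCudaTensorBuf tensor = createGpuTensorBuf(
        dims.dims, layer->dataType, dims.batchSize, name, gpuId, false);
    cudaSetDevice(prevDev);               // restore the previously active device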

Hi @fanzh

We have changed the gpuId to 1 to verify, but the memory is still allocated on GPU 0.

    std::cout << "I WAS HERE" << std::endl;

    int gpuId = 1;
    CONTINUE_CUDA_ERR(
        cudaGetDevice(&gpuId), "CudaDeviceMem failed to get dev-id:%d", gpuId);

We can see the debug output, so the change is in effect. 102 MB of memory is still on GPU 0, though.

We have read the documentation on cudaGetDevice: it returns gpuId 0 because that seems to be the device in use for the current context, so it overwrites our value of 1.
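
That matches the documented behaviour: cudaGetDevice reports whichever device is currently set for the calling host thread, and that defaults to 0 unless cudaSetDevice has been called. A minimal standalone illustration (plain CUDA runtime API, nothing nvinferserver-specific; it needs at least two GPUs to run as shown):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int dev = 1;                              // the GPU we intended to use
        cudaGetDevice(&dev);                      // overwritten with the thread's current device
        printf("current device: %d\n", dev);      // prints 0 on a fresh thread

        cudaSetDevice(1);                         // explicitly select GPU 1 instead
        cudaGetDevice(&dev);
        printf("current device: %d\n", dev);      // now prints 1
        return 0;
    }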

So skipping the check we set GPU ID 1 directly:

    int gpuId = 1;
    // CONTINUE_CUDA_ERR(
    //     cudaGetDevice(&gpuId), "CudaDeviceMem failed to get dev-id:%d", gpuId);

    SharedBatchArray allInputs = std::make_shared<BaseBatchArray>();
    for (const auto& in : shapes) {
        ...
        UniqCudaTensorBuf tensor = createGpuTensorBuf(
            dims.dims, layer->dataType, dims.batchSize, name, gpuId, false);
        RETURN_IF_FAILED(
            tensor, NVDSINFER_CUDA_ERROR, "failed to create GPU tensor buffer.");
        ...
    }

But now we have CUDA contexts on GPUs 0, 1 and 2 for a pipeline that, according to the config, only uses GPU 2.

We have attached a debug message to createGpuTensorBuf, and get the following output:

CREATING GPU BUFFER ON 1
CREATING GPU BUFFER ON 2
CREATING GPU BUFFER ON 2

So apparently no buffers are allocated on GPU 0 there, but the 102 MB of memory is still present.
The buffer on GPU 1 is from the lines of code above; the buffers on GPU 2 seem to be using the config correctly.
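
As an extra sanity check, the CUDA runtime can be asked directly which device a given allocation belongs to (a generic debugging sketch, not part of nvdsinferserver; buf stands for the raw device pointer held by the tensor buffer):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Prints the GPU that a CUDA device allocation resides on.
    void printBufferDevice(const void* buf) {
        cudaPointerAttributes attr{};
        if (cudaPointerGetAttributes(&attr, buf) == cudaSuccess &&
            attr.type == cudaMemoryTypeDevice) {
            printf("buffer resides on GPU %d\n", attr.device);
        }
    }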

@fanzh please advise on how to proceed here, and whether we need to raise this with our direct contacts at Nvidia to prioritize it.
And thanks for the helpful pointers so far.

int gpuId = 1;
cudaSetDevice(gpuId);
Please try this; all GPU usage bugs need to be found and fixed.
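
For anyone patching this locally, one systematic way to cover all such call sites is a small RAII device guard (a sketch under the assumption that each call site knows the configured GPU ID; this helper is not part of the shipped nvdsinferserver sources):

    #include <cuda_runtime.h>

    // Switches to the requested GPU for the lifetime of the object and
    // restores the previously active device on destruction.
    class ScopedCudaDevice {
    public:
        explicit ScopedCudaDevice(int gpuId) {
            cudaGetDevice(&m_prev);
            cudaSetDevice(gpuId);
        }
        ~ScopedCudaDevice() { cudaSetDevice(m_prev); }
    private:
        int m_prev = 0;
    };

    // Usage around an allocation, e.g.:
    //   ScopedCudaDevice guard(configuredGpuId);   // configuredGpuId: hypothetical, from gpu_ids
    //   UniqCudaTensorBuf tensor = createGpuTensorBuf(...);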

There are quite a lot of occurrences of cudaSetDevice calls in the code.

How long would it take to create a patch for this? Is this something we can get support for? Happy to go through the appropriate channels. We need a solution here for multiple customer projects.

We could work with a git patch that we apply for now, it doesn’t have to be included in a release.

workaround.txt (1.2 KB)
Please try this workaround code in DS 6.2. In particular, please rebuild nvdsinferserver and replace the old /opt/nvidia/deepstream/deepstream/lib/libnvds_infer_server.so.

@philipp.schmidt does the code above work? I tested it on a T4 with nvcr.io/nvidia/deepstream:6.2-triton and it works fine.

Hello @fanzh, thanks for the quick help. I will have the opportunity to try in a few hours and let you know asap. Thanks!

Hello @fanzh

I can confirm the patch works; thanks for the solution and the great support.
Attached is the git patch in case somebody wants to apply it directly with git apply.

libnvdsinferserver.patch (1.5 KB)

Will this fix make it into a DS release anytime soon?

Also really great that this is open source and we can just apply a patch. Thumbs up for that.
