- Hardware Platform: dGPU, 8x A100
- System Memory: 2 TB
- Ubuntu Version: 22.04
- NVIDIA GPU Driver Version (valid for GPU only): 535.54.03
- Issue Type (questions, new requirements, bugs): bugs
- How to reproduce the issue? (This is for bugs. Include the command line used and other details for reproducing.)
- Requirement details (This is for new requirements. Include the logs for the pods and the description for the pods.)
Hi, I would like to report an error I hit while deploying the VSS blueprint, and hopefully solve it together :)
My GPU spec is listed above; system memory is 2 TB.
I ran VSS with the following command:
sudo microk8s helm upgrade --install vss-blueprint nvidia-blueprint-vss-2.1.0.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret --set vss.applicationSpecs.vss-deployment.containers.vss.startupProbe.failureThreshold=360 -f override.yaml
I have attached my override.yaml file. I used this override because the nemo-rerank pod kept restarting with a CUDA allocation error; please check here for more information.
When I check the log of vss-vss-deployment-POD-NAME, I only see:
Defaulted container "vss" out of: vss, check-milvus-up (init), check-neo4j-up (init), check-llm-up (init)
Error from server (BadRequest): container "vss" in pod "vss-vss-deployment-55689d569d-ndwl5" is waiting to start: PodInitializing
and the pod never gets past PodInitializing. I also attach the log of vss-blueprint-0; it keeps restarting with the following error:
{"level": "ERROR", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "Traceback (most recent call last):\n File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/opt/nim/llm/nim_llm_sdk/entrypoints/openai/api_server.py\", line 885, in <module>\n trt_llm_engine = create_trt_executor(\n File \"/opt/nim/llm/nim_llm_sdk/trtllm/utils.py\", line 437, in create_trt_executor\n trtllm_exec = trtllm.Executor(\nRuntimeError: [TensorRT-LLM][ERROR] Assertion failed: *std::max_element(mDeviceIds.begin(), mDeviceIds.end()) < mGpusPerNode (/home/jenkins/agent/workspace/LLM/release-0.12/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:56)\n1 0x7f03e7eeb874 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100\n2 0x7f03e95e931f tensorrt_llm::runtime::WorldConfig::WorldConfig(int, int, int, int, std::optional<std::vector<int, std::allocator<int> > > const&) + 3375\n3 0x7f03e95e9ad5 tensorrt_llm::runtime::WorldConfig::mpi(int, std::optional<int>, std::optional<int>, std::optional<std::vector<int, std::allocator<int> > > const&) + 1365\n4 0x7f03e9a0092c tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 460\n5 0x7f03e9a06c2f tensorrt_llm::executor::Executor::Impl::Impl(std::vector<unsigned char, std::allocator<unsigned char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, 
std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1663\n6 0x7f03e99fb7fb tensorrt_llm::executor::Executor::Executor(std::vector<unsigned char, std::allocator<unsigned char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 91\n7 0x7f044e76efdb /opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xbcfdb) [0x7f044e76efdb]\n8 0x7f044e708bb7 /opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x56bb7) [0x7f044e708bb7]\n9 0x555850525b2e /opt/nim/llm/.venv/bin/python3(+0x15cb2e) [0x555850525b2e]\n10 0x55585051c2db _PyObject_MakeTpCall + 603\n11 0x55585053455b /opt/nim/llm/.venv/bin/python3(+0x16b55b) [0x55585053455b]\n12 0x5558505350c8 _PyObject_Call + 280\n13 0x555850530ad7 /opt/nim/llm/.venv/bin/python3(+0x167ad7) [0x555850530ad7]\n14 0x55585051c68b /opt/nim/llm/.venv/bin/python3(+0x15368b) [0x55585051c68b]\n15 0x7f068420196b /opt/nim/llm/.venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x45e96b) [0x7f068420196b]\n16 0x55585051c2db _PyObject_MakeTpCall + 603\n17 0x5558505154fa _PyEval_EvalFrameDefault + 29418\n18 0x55585052642c _PyFunction_Vectorcall + 124\n19 0x55585050fb93 _PyEval_EvalFrameDefault + 6531\n20 0x55585050b016 /opt/nim/llm/.venv/bin/python3(+0x142016) [0x55585050b016]\n21 0x5558506008b6 PyEval_EvalCode + 134\n22 0x5558506065fd /opt/nim/llm/.venv/bin/python3(+0x23d5fd) [0x5558506065fd]\n23 0x555850526689 /opt/nim/llm/.venv/bin/python3(+0x15d689) [0x555850526689]\n24 0x55585050e8cc _PyEval_EvalFrameDefault + 1724\n25 0x55585052642c _PyFunction_Vectorcall + 124\n26 0x55585050e8cc _PyEval_EvalFrameDefault + 1724\n27 0x55585052642c _PyFunction_Vectorcall + 124\n28 0x55585061e48d 
/opt/nim/llm/.venv/bin/python3(+0x25548d) [0x55585061e48d]\n29 0x55585061d138 Py_RunMain + 296\n30 0x5558505f370d Py_BytesMain + 45\n31 0x7f0693ec3d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f0693ec3d90]\n32 0x7f0693ec3e40 __libc_start_main + 128\n33 0x5558505f3605 _start + 37", "stack_info": "None"}
{"level": "ERROR", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "Traceback (most recent call last):\n File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/opt/nim/llm/nim_llm_sdk/entrypoints/openai/api_server.py\", line 790, in <module>\n engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)\n File \"/opt/nim/llm/nim_llm_sdk/engine/async_trtllm_engine_factory.py\", line 43, in from_engine_args\n engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)\n File \"/opt/nim/llm/nim_llm_sdk/engine/async_trtllm_engine.py\", line 315, in from_engine_args\n return cls(\n File \"/opt/nim/llm/nim_llm_sdk/engine/async_trtllm_engine.py\", line 285, in __init__\n self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)\n File \"/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py\", line 842, in _init_engine\n return engine_class(*args, **kwargs)\n File \"/opt/nim/llm/nim_llm_sdk/engine/async_trtllm_engine.py\", line 137, in __init__\n self._tllm_engine = TrtllmModelRunner(\n File \"/opt/nim/llm/nim_llm_sdk/engine/trtllm_model_runner.py\", line 278, in __init__\n self._tllm_exec, self._cfg = self._create_engine(\n File \"/opt/nim/llm/nim_llm_sdk/engine/trtllm_model_runner.py\", line 585, in _create_engine\n return create_trt_executor(\n File \"/opt/nim/llm/nim_llm_sdk/trtllm/utils.py\", line 437, in create_trt_executor\n trtllm_exec = trtllm.Executor(\nRuntimeError: [TensorRT-LLM][ERROR] Assertion failed: *std::max_element(mDeviceIds.begin(), mDeviceIds.end()) < mGpusPerNode (/home/jenkins/agent/workspace/LLM/release-0.12/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:56)\n1 0x7fb9fc6eb874 
tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100\n2 0x7fb9fdde931f tensorrt_llm::runtime::WorldConfig::WorldConfig(int, int, int, int, std::optional<std::vector<int, std::allocator<int> > > const&) + 3375\n3 0x7fb9fdde9ad5 tensorrt_llm::runtime::WorldConfig::mpi(int, std::optional<int>, std::optional<int>, std::optional<std::vector<int, std::allocator<int> > > const&) + 1365\n4 0x7fb9fe20092c tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 460\n5 0x7fb9fe206c2f tensorrt_llm::executor::Executor::Impl::Impl(std::vector<unsigned char, std::allocator<unsigned char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1663\n6 0x7fb9fe1fb7fb tensorrt_llm::executor::Executor::Executor(std::vector<unsigned char, std::allocator<unsigned char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 91\n7 0x7fba62f6efdb /opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xbcfdb) [0x7fba62f6efdb]\n8 0x7fba62f08bb7 /opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x56bb7) [0x7fba62f08bb7]\n9 0x559acf13cb2e /opt/nim/llm/.venv/bin/python3(+0x15cb2e) [0x559acf13cb2e]\n10 0x559acf1332db _PyObject_MakeTpCall + 
603\n11 0x559acf14b55b /opt/nim/llm/.venv/bin/python3(+0x16b55b) [0x559acf14b55b]\n12 0x559acf14c0c8 _PyObject_Call + 280\n13 0x559acf147ad7 /opt/nim/llm/.venv/bin/python3(+0x167ad7) [0x559acf147ad7]\n14 0x559acf13368b /opt/nim/llm/.venv/bin/python3(+0x15368b) [0x559acf13368b]\n15 0x7fbca0c2d96b /opt/nim/llm/.venv/lib/python3.10/site-packages/torch/lib/libtorch_python.so(+0x45e96b) [0x7fbca0c2d96b]\n16 0x559acf1332db _PyObject_MakeTpCall + 603\n17 0x559acf12c4fa _PyEval_EvalFrameDefault + 29418\n18 0x559acf13d42c _PyFunction_Vectorcall + 124\n19 0x559acf126b93 _PyEval_EvalFrameDefault + 6531\n20 0x559acf13d42c _PyFunction_Vectorcall + 124\n21 0x559acf125abb _PyEval_EvalFrameDefault + 2219\n22 0x559acf13d42c _PyFunction_Vectorcall + 124\n23 0x559acf13251d _PyObject_FastCallDictTstate + 365\n24 0x559acf147555 /opt/nim/llm/.venv/bin/python3(+0x167555) [0x559acf147555]\n25 0x559acf13327c _PyObject_MakeTpCall + 508\n26 0x559acf12c4fa _PyEval_EvalFrameDefault + 29418\n27 0x559acf13d42c _PyFunction_Vectorcall + 124\n28 0x559acf13251d _PyObject_FastCallDictTstate + 365\n29 0x559acf1474b4 /opt/nim/llm/.venv/bin/python3(+0x1674b4) [0x559acf1474b4]\n30 0x559acf13368b /opt/nim/llm/.venv/bin/python3(+0x15368b) [0x559acf13368b]\n31 0x559acf14bebb PyObject_Call + 187\n32 0x559acf127a6e _PyEval_EvalFrameDefault + 10334\n33 0x559acf14b281 /opt/nim/llm/.venv/bin/python3(+0x16b281) [0x559acf14b281]\n34 0x559acf14bf22 PyObject_Call + 290\n35 0x559acf127a6e _PyEval_EvalFrameDefault + 10334\n36 0x559acf13d42c _PyFunction_Vectorcall + 124\n37 0x559acf13251d _PyObject_FastCallDictTstate + 365\n38 0x559acf1474b4 /opt/nim/llm/.venv/bin/python3(+0x1674b4) [0x559acf1474b4]\n39 0x559acf13368b /opt/nim/llm/.venv/bin/python3(+0x15368b) [0x559acf13368b]\n40 0x559acf14bebb PyObject_Call + 187\n41 0x559acf127a6e _PyEval_EvalFrameDefault + 10334\n42 0x559acf14b281 /opt/nim/llm/.venv/bin/python3(+0x16b281) [0x559acf14b281]\n43 0x559acf12b34a _PyEval_EvalFrameDefault + 24890\n44 0x559acf14b281 
/opt/nim/llm/.venv/bin/python3(+0x16b281) [0x559acf14b281]\n45 0x559acf126b93 _PyEval_EvalFrameDefault + 6531\n46 0x559acf122016 /opt/nim/llm/.venv/bin/python3(+0x142016) [0x559acf122016]\n47 0x559acf2178b6 PyEval_EvalCode + 134\n48 0x559acf21d5fd /opt/nim/llm/.venv/bin/python3(+0x23d5fd) [0x559acf21d5fd]\n49 0x559acf13d689 /opt/nim/llm/.venv/bin/python3(+0x15d689) [0x559acf13d689]\n50 0x559acf1258cc _PyEval_EvalFrameDefault + 1724\n51 0x559acf13d42c _PyFunction_Vectorcall + 124\n52 0x559acf1258cc _PyEval_EvalFrameDefault + 1724\n53 0x559acf13d42c _PyFunction_Vectorcall + 124\n54 0x559acf23548d /opt/nim/llm/.venv/bin/python3(+0x25548d) [0x559acf23548d]\n55 0x559acf234138 Py_RunMain + 296\n56 0x559acf20a70d Py_BytesMain + 45\n57 0x7fbcb07d4d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fbcb07d4d90]\n58 0x7fbcb07d4e40 __libc_start_main + 128\n59 0x559acf20a605 _start + 37", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "--------------------------------------------------------------------------", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "Primary job terminated normally, but 1 process returned", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "a non-zero exit code. Per user-direction, the job has been aborted.", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "--------------------------------------------------------------------------", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "--------------------------------------------------------------------------", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "mpirun detected that one or more processes exited with non-zero status, thus causing", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "the job to be terminated. The first process to do so was:", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": " Process name: [[23129,1],0]", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": " Exit code: 1", "exc_info": "None", "stack_info": "None"}
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "--------------------------------------------------------------------------", "exc_info": "None", "stack_info": "None"}
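For what it's worth, the failing TensorRT-LLM assertion (`*std::max_element(mDeviceIds.begin(), mDeviceIds.end()) < mGpusPerNode` in worldConfig.cpp) seems to require that every device ID the engine is configured with is smaller than the number of GPUs the runtime believes the node has. A minimal Python sketch of that check (illustrative only, not VSS or TensorRT-LLM code; the 4-GPU scenario below is a hypothetical example):

```python
# Illustrative sketch of the worldConfig.cpp assertion that aborts the pod:
# max(device_ids) < gpus_per_node must hold, otherwise a RuntimeError is raised.
def world_config_ok(device_ids, gpus_per_node):
    """Return True if every configured device ID fits on this node."""
    return max(device_ids) < gpus_per_node

# Hypothetical: a config that asks for device IDs 0..7 on a node where the
# runtime only sees 4 GPUs would trip the assertion...
print(world_config_ok(list(range(8)), 4))  # → False (assertion would fire)
# ...while the same IDs on a node with all 8 GPUs visible would pass.
print(world_config_ok(list(range(8)), 8))  # → True
```

So the error suggests a mismatch between the device IDs the LLM NIM is told to use and the GPUs actually visible inside the container, which is why I suspect the override/GPU assignment rather than the engine itself.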
error_log_vss_blueprint.txt (67.8 KB)
override.txt (1.3 KB)