LoRA swapping inference with Llama-3.1-8b-instruct | Exception: lora format could not be determined

I'm looking to use NIM to deploy the Llama-3.1-8b-instruct base model with Hugging Face-trained LoRA adapters. I'm deploying the nvcr.io/nim/meta/llama3-8b-instruct:latest image on an H100 SXM on Lambda Labs. You can reproduce this error using the NVIDIA example: Deploy Multilingual LLMs with NVIDIA NIM | NVIDIA Technical Blog.

I'm able to deploy this NIM successfully, which I verify by sending a request to the models endpoint: curl -X GET 'http://IP_ADDRESS:8000/v1/models'. This correctly returns the base model and the two mounted LoRA adapters.

When I try to generate a model response using the completions endpoint, I get back an internal server error: Exception: lora format could not be determined. I think this error is new, because I have been able to run LoRA swapping successfully with NIM in the past. The full error from the Docker logs is below.
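For reference, this is roughly the request that produces the 500 (the adapter name here is a placeholder for one of the adapters returned by the models endpoint above):

curl -X POST 'http://IP_ADDRESS:8000/v1/completions' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "llama3-8b-instruct-lora-example-adapter",
        "prompt": "What is machine learning?",
        "max_tokens": 128
      }'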

What is causing the LoRA adapters not to be recognized as a valid format? Any help would be greatly appreciated.

INFO 10-15 07:50:06.582 httptools_impl.py:481] 46.231.244.214:64380 - "POST /v1/completions HTTP/1.1" 500
ERROR 10-15 07:50:06.582 httptools_impl.py:416] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 389, in create_completion
    generator = await openai_serving_completion.create_completion(request, raw_request)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 169, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 228, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 213, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 391, in generate
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", line 385, in generate
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
    raise result
Exception: lora format could not be determined

I actually think the error originates from the trtllm_model_runner.py file:

ERROR 2024-10-16 07:46:38.910 trtllm_model_runner.py:517] Error while postprocessing request
Traceback (most recent call last):
  File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 502, in _requeue_lora
    trtllm_lora = typing.cast(LoraSource, self._lora_source).get_lora(
  File "/opt/nim/llm/vllm_nvext/lora/source.py", line 250, in get_lora
    lora, format = self._load_lora(lora_bytes)
  File "/opt/nim/llm/vllm_nvext/lora/source.py", line 172, in _load_lora
    format = LoraSource._detect_format(raw_lora)
  File "/opt/nim/llm/vllm_nvext/lora/source.py", line 148, in _detect_format
    raise Exception("lora format could not be determined")

Inspecting the code, it seems that the _detect_format method is incomplete. For Hugging Face models, the preferred format for the adapter_model file is actually safetensors, but this file type is not handled in the code.

Here are the HF docs: PEFT checkpoint format

By default, the model is saved in the safetensors format, a secure alternative to the bin format, which is known to be susceptible to security vulnerabilities because it uses the pickle utility under the hood. Both formats store the same state_dict though, and are interchangeable.
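To illustrate the gap, here is a rough sketch of what a safetensors-aware detection/loading path could look like. This is my own sketch based on the traceback, not the actual NIM source; it assumes the method receives the raw adapter bytes (raw_lora) and just needs to work out which serialization was used:

import io

import torch
from safetensors.torch import load as safetensors_load


def load_hf_lora_state_dict(raw_lora: bytes) -> dict:
    # PEFT saves adapter_model.safetensors by default, so try that first.
    try:
        return safetensors_load(raw_lora)
    except Exception:
        pass
    # Fall back to the legacy pickle-based adapter_model.bin format.
    try:
        return torch.load(io.BytesIO(raw_lora), map_location="cpu")
    except Exception:
        raise Exception("lora format could not be determined")

As a stop-gap on my side, re-saving the adapter with PEFT's save_pretrained(output_dir, safe_serialization=False) writes an adapter_model.bin instead of adapter_model.safetensors, which I believe the current detection does handle.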


Thanks for raising this @robert-moyai, we'll see if we can add support for the safetensors format for the adapters.
