Hi NVIDIA team,
I’ve encountered an issue regarding the maximum input token length while working with the nv-embed-qa-1b-v2 model for text embeddings.
According to the model card (llama-3.2-nv-embedqa-1b-v2 Model by NVIDIA | NVIDIA NIM), the NVIDIA NeMo Retriever Llama3.2 embedding model should support “long documents (up to 8192 tokens)” with dynamic embedding size through Matryoshka Embeddings.
However, when attempting to process text with 597 tokens, I received the following error from the Triton server:
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Input length 597 exceeds maximum allowed token size 512', 'detail': {}, 'type': 'invalid_request_error'}
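For reference, the request is made with the OpenAI Python client pointed at the deployed endpoint, roughly like the sketch below (simplified; the base URL, API key, and document text are placeholders, and the extra_body fields are the NIM-specific parameters we pass, as described in the embedding API docs):

```python
# Simplified sketch of the embedding request (URL, key, and text are placeholders).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder for our NIM/Triton endpoint
    api_key="not-used",                   # local deployment; the key is not checked
)

document_text = "..."  # the ~597-token document that triggers the error

response = client.embeddings.create(
    model="nv-embed-qa-1b-v2",
    input=[document_text],
    # NIM-specific parameters passed through the OpenAI client:
    extra_body={"input_type": "passage", "truncate": "NONE"},
)
print(len(response.data[0].embedding))
```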
This error message indicates a maximum token limit of 512, which contradicts the documented 8192 token limit. Could you please:
- Clarify the actual maximum token limit for this model
- Explain if there are any specific configuration settings needed to utilize the full 8192 token capacity
- Provide guidance on handling longer documents if the 512 token limit is indeed correct (a rough sketch of the chunking fallback we have in mind follows this list)
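For context, if 512 tokens is indeed the hard limit for now, the fallback we are considering is client-side chunking before embedding, roughly along these lines (a minimal sketch; the tokenizer name and chunking strategy are our own assumptions, not something taken from the model card):

```python
# Client-side fallback: split a long document into <=512-token pieces before
# sending each piece to the embedding endpoint.
# The tokenizer below is only an illustrative stand-in for whatever tokenizer
# actually matches the deployed embedding model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def chunk_by_tokens(text: str, max_tokens: int = 512) -> list[str]:
    """Split `text` into consecutive pieces of at most `max_tokens` tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[start : start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]

# Each chunk would then be embedded separately and the vectors aggregated
# (e.g. averaged) on our side.
```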
Environment details:
- Model: nv-embed-qa-1b-v2
- Deployment: Triton Server
- Input: Text document (597 tokens)
Thank you for your assistance in resolving this discrepancy.
Thanks for bringing this to our attention. We have recreated the error and can confirm that the wrong tokenizer version was deployed on our side. We are working to fix this, and I will get back to you as soon as I have an ETA for the fix or we have resolved the problem. Thanks so much for your patience, and for raising this in the forum! Best, Sophie.
Thank you for the quick response and confirmation of the tokenizer issue!
While we await the fix, I'd like to raise a related query about another model we're using: ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8, which is used in VSS (NVIDIA's Video Search and Summarization agent). This is particularly relevant because VSS uses nv-embed-qa-1b-v2 to generate embeddings for the text summaries produced by VILA 1.5.
Currently, we haven’t encountered any token-limit errors with nv-embed-qa when processing video chunk summaries, as our summaries have remained under 512 tokens. However, I’d appreciate clarification on whether:
- This 512-token limit is the intended design specification for VILA-1.5’s summary outputs
- VILA-1.5 is meant to generate longer text outputs
This information would help us better plan our implementation and avoid potential issues in the future.
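In the meantime, the interim guard we have in mind on our side is to count tokens in each VILA summary before handing it to the embedder, along these lines (sketch only; the tokenizer name is an illustrative assumption and the summaries list is placeholder data):

```python
# Flag any VILA chunk summary that would exceed the embedder's current
# 512-token limit before it is sent for embedding.
# The tokenizer is an illustrative stand-in; `summaries` is placeholder data.
import logging

from transformers import AutoTokenizer

logging.basicConfig(level=logging.WARNING)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

MAX_EMBED_TOKENS = 512

summaries = ["..."]  # chunk summaries produced by VILA-1.5 inside VSS

for i, summary in enumerate(summaries):
    n_tokens = len(tokenizer.encode(summary, add_special_tokens=False))
    if n_tokens > MAX_EMBED_TOKENS:
        logging.warning(
            "Summary %d has %d tokens (> %d) and would be rejected by the "
            "embedding endpoint.", i, n_tokens, MAX_EMBED_TOKENS,
        )
```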
Thank you for your continued support!
@jhpark26 I’m reaching out to the team that works on VILA to get you an answer to this ASAP! Thanks for your patience.