model: deepseek-ai/deepseek-v4-flash
I’m using DeepSeek through the NVIDIA API endpoint, and I noticed that the usage field returned by the backend contains extremely inflated token counts compared with the actual character count.
Example response:
"usage": {
"inputCharacters": 21864,
"inputTokens": 878912,
"outputCharacters": 84,
"outputTokens": 1914,
"totalTokens": 880826
}
The reported inputTokens value works out to about 40 tokens per input character (878,912 / 21,864 ≈ 40.2), which seems unreasonable. With around 21k input characters, I would expect a far lower token count, not close to 879k tokens.
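As a quick sanity check, the ratios can be recomputed directly from the usage block above. This is only a minimal sketch: the helper name tokens_per_char is hypothetical and not part of any API, and the numbers are copied verbatim from the example response.

```python
# Figures copied verbatim from the usage block in this report.
usage = {
    "inputCharacters": 21864,
    "inputTokens": 878912,
    "outputCharacters": 84,
    "outputTokens": 1914,
    "totalTokens": 880826,
}


def tokens_per_char(tokens: int, chars: int) -> float:
    """Hypothetical helper: reported tokens divided by reported characters."""
    return tokens / chars


input_ratio = tokens_per_char(usage["inputTokens"], usage["inputCharacters"])
output_ratio = tokens_per_char(usage["outputTokens"], usage["outputCharacters"])

print(f"input:  {input_ratio:.1f} tokens per character")
print(f"output: {output_ratio:.1f} tokens per character")

# The total is at least internally consistent: input + output == total.
print(usage["inputTokens"] + usage["outputTokens"] == usage["totalTokens"])
```

The total equals the sum of the input and output counts, so the inflation appears to be in the per-field counts themselves rather than in how they are added up.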
In addition, the model sometimes prints raw DeepSeek tool-calling markers in the response content, such as:
<|DSML|tool_calls
This makes me suspect that the DeepSeek tool-calling format may not be fully parsed or converted correctly by the NVIDIA compatibility layer. It may also be related to the inflated token usage if tool call chunks, hidden tool-call markup, or streaming chunks are being counted repeatedly.
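Until the parsing question is answered, a client-side check can at least flag affected responses. The sketch below is an assumption-heavy workaround, not a fix: it assumes leaked markers begin with the same prefix as the sample above, it also accepts a fullwidth bar (U+FF5C) in case the raw token uses that character, and the function name has_leaked_dsml is hypothetical.

```python
import re
from typing import Optional

# Assumption: leaked markers start with "<|DSML|" as in the sample above.
# The fullwidth bar (U+FF5C) is accepted too, in case the raw token uses it.
DSML_MARKER = re.compile(r"<[|\uFF5C]DSML[|\uFF5C]")


def has_leaked_dsml(content: Optional[str]) -> bool:
    """Return True if the message content contains a raw DSML marker."""
    if not content:
        return False
    return DSML_MARKER.search(content) is not None


print(has_leaked_dsml("Sure, calling the tool now. <|DSML|tool_calls"))
print(has_leaked_dsml("A normal assistant reply."))
```

Logging the request settings (tools on or off, streaming on or off) for every flagged response should help narrow down which combination triggers the leak.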
Could you please help confirm:
- Is the usage.inputTokens value returned by the NVIDIA API expected to represent the actual billable token count?
- Is NVIDIA currently parsing DeepSeek DSML tool calls into OpenAI-compatible tool_calls fields?
- Could this be a bug in token accounting, especially when tools or streaming are enabled?
- Is there any recommended request format to avoid DSML markers leaking into content?
This issue makes it difficult to estimate cost and context usage accurately.
Thanks.