Phi-4-Multimodal-Instruct: NetworkError When Sending Image and Audio Requests via the NVIDIA API (900+ Credits Available)

Description:

Issue Summary:

I am attempting to call the Phi-4-Multimodal-Instruct model via the NVIDIA API (https://build.nvidia.com/microsoft/phi-4-multimodal-instruct) with both image and audio inputs. However, my requests fail with a “NetworkError when attempting to fetch resource.”

Despite having 900+ credits in my account, every request that includes image and audio input fails.

Steps to Reproduce:

  1. Encode an image (image.png) and an audio file (audio.wav) in base64.
  2. Construct a request following NVIDIA’s API documentation.
  3. Send the request to https://integrate.api.nvidia.com/v1/chat/completions.
  4. The request fails with a NetworkError.

Code Snippet:

import requests, base64

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"
stream = True

# Read both files and base64-encode them so they can be embedded inline.
with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
with open("audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# Inline uploads are limited by the combined length of the base64 strings.
assert len(image_b64) + len(audio_b64) < 180_000, \
    "To upload larger images and/or audios, use the assets API (see docs)"

headers = {
    "Authorization": "Bearer <API_KEY>",
    "Accept": "text/event-stream" if stream else "application/json"
}

# The image and audio are passed as data URIs inside the message content.
payload = {
    "model": "microsoft/phi-4-multimodal-instruct",
    "messages": [
        {
            "role": "user",
            "content": f'Answer the spoken query about the image.<img src="data:image/png;base64,{image_b64}" /><audio src="data:audio/wav;base64,{audio_b64}" />'
        }
    ],
    "max_tokens": 512,
    "temperature": 0.10,
    "top_p": 0.70,
    "stream": stream
}

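# stream=stream lets requests yield server-sent events as they arrive instead of
# buffering the whole body; the timeout keeps a stalled connection from hanging.
response = requests.post(invoke_url, headers=headers, json=payload, stream=stream, timeout=60)

# On an HTTP error, print the status and body so the actual cause is visible.
if not response.ok:
    print(response.status_code, response.text)
elif stream:
    for line in response.iter_lines():
        if line:
            print(line.decode("utf-8"))
else:
    print(response.json())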

Error Message:

NetworkError when attempting to fetch resource.

Additional Information:

  • Credits Available: 900+ (not an issue of running out of credits).
  • Text-only requests work, but adding image and audio results in a NetworkError.
  • The combined base64-encoded image and audio are under 180,000 characters, so the request should not require the assets API.
  • I have verified the API key and endpoint URL are correct.
  • I have tested from multiple networks, ruling out local connectivity issues.
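
The size claim above can be checked without calling the API. The following is a minimal sketch, assuming the 180,000 limit from the assert in the snippet applies to the combined length of the two base64 strings. The names b64_len and fits_inline are hypothetical helpers for illustration, not part of NVIDIA's API.

```python
INLINE_LIMIT = 180_000  # assumption: the limit used in the snippet's assert


def b64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes raw bytes."""
    # Base64 emits 4 characters for every 3 input bytes, rounding up.
    return 4 * ((n_bytes + 2) // 3)


def fits_inline(raw_sizes, limit: int = INLINE_LIMIT) -> bool:
    """True if the files' combined base64 length stays strictly under limit."""
    return sum(b64_len(n) for n in raw_sizes) < limit
```

For example, fits_inline([os.path.getsize("image.png"), os.path.getsize("audio.wav")]) reports whether the two files can be sent inline. Because base64 inflates data by a factor of about 4/3, the raw files together need to stay under roughly 135,000 bytes.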

Request for Support:

  1. Is there an issue with the NVIDIA API handling multimodal requests with both image and audio?
  2. Are there any API limits or restrictions causing this NetworkError?
  3. Could you provide guidance on troubleshooting or an alternative approach?
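
One way to narrow down question 1 would be to send the same prompt with each modality on its own and compare the results. The following is a minimal sketch, assuming the inline <img> and <audio> content format from the snippet above. The names build_payload and modality_variants are hypothetical helpers, not part of NVIDIA's API.

```python
def build_payload(prompt, image_b64=None, audio_b64=None, stream=False):
    """Build a chat-completions payload with optional inline image and audio."""
    content = prompt
    if image_b64 is not None:
        content += f'<img src="data:image/png;base64,{image_b64}" />'
    if audio_b64 is not None:
        content += f'<audio src="data:audio/wav;base64,{audio_b64}" />'
    return {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 512,
        "temperature": 0.10,
        "top_p": 0.70,
        "stream": stream,
    }


def modality_variants(prompt, image_b64, audio_b64):
    """Return one payload per input combination, keyed by a descriptive name."""
    return {
        "text_only": build_payload(prompt),
        "image_only": build_payload(prompt, image_b64=image_b64),
        "audio_only": build_payload(prompt, audio_b64=audio_b64),
        "image_and_audio": build_payload(prompt, image_b64, audio_b64),
    }
```

Posting each payload in turn to the endpoint used in the snippet would show whether the failure is tied to the image, the audio, or only the combination of the two.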

Thank you for your assistance!