Models with VLM support, structured output, and tool calling

Hi, I am building an agentic application that requires an LLM that accepts both image and text input and supports tool calling and structured output. Since I am using LangChain's ChatNVIDIA, I used this snippet to find models that fit these criteria:

from langchain_nvidia_ai_endpoints import ChatNVIDIA

models = [
    model.id
    for model in ChatNVIDIA.get_available_models()
    if model.model_type == 'vlm'
    and model.supports_tools
    and model.supports_structured_output
]

print(models)

This gives me an empty list. Is that correct? Are there really no models that support all three features?

Also, I saw that some other providers implement with_structured_output via function calling instead of json_mode or a native structured-output mode. Is this also possible with ChatNVIDIA?
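
For context, here is the kind of call I mean, using ChatOpenAI as an example of a provider that exposes a method parameter (the schema, model name, and prompt are just illustrative):

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class Weather(BaseModel):
    """Weather report for a city."""
    city: str = Field(description="Name of the city")
    conditions: str = Field(description="Short weather description")

llm = ChatOpenAI(model="gpt-4o-mini")
# method="function_calling" builds the structured output on top of tool calls
structured_llm = llm.with_structured_output(Weather, method="function_calling")
structured_llm.invoke("Give me a plausible weather report for Tokyo.")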

NVIDIA also supports multimodal inputs, meaning you can provide both images and text for the model to reason over. One example of such a model is nvidia/neva-22b.

https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/#multimodal
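
For example, following the pattern in those docs, you can pass an image inline as a base64 data URL (the file path here is a placeholder):

import base64
from langchain_core.messages import HumanMessage
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Read a local image and encode it as base64 ("example.png" is a placeholder).
with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

llm = ChatNVIDIA(model="nvidia/neva-22b")
llm.invoke([
    HumanMessage(content=[
        {"type": "text", "text": "Describe this image:"},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ])
])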

Hello! Thank you for your answer. I took a look at this model, and it does not seem to match my needs, as it does not support tool calling. Here is a simple reproducible script that yields an error due to the lack of tool support:

from langchain_core.tools import tool
from langchain_nvidia_ai_endpoints import ChatNVIDIA

@tool
def get_weather(city: str) -> str:
    """Get the weather for a given city."""
    return f"The weather in {city} is sunny."

llm = ChatNVIDIA(model="nvidia/neva-22b", api_key="<your_api_key>").bind_tools([get_weather])

# This call errors because nvidia/neva-22b does not support tool calling.
llm.invoke("What is the weather in Tokyo?")

As I mentioned, I need a model that supports tool calling, is multimodal, and supports structured output.
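
To check whether the empty list comes from the model_type == 'vlm' filter specifically, one sketch is to relax that condition and print what each tool-capable model reports (same attributes as in my first snippet):

from langchain_nvidia_ai_endpoints import ChatNVIDIA

# List every model that reports tool support, along with its type and
# whether it also reports structured output support.
for model in ChatNVIDIA.get_available_models():
    if model.supports_tools:
        print(model.id, model.model_type, model.supports_structured_output)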