Cosmos-Reason2-2B running on Jetson Orin Nano

Hi all!

We see a lot of interest in the Cosmos Reason2 model, and it is currently only supported on Jetson AGX Thor and more powerful devices. We wanted to share that we got Cosmos-Reason2-2B quantized running on the full Jetson lineup, including Orin Nano 8GB. This includes memory and latency numbers, instructions, and some practical adjustments we found necessary on these constrained devices.

What’s here:

Questions for the community:

  • What serving stack do others use for VLMs on Jetson? (vLLM / TensorRT-LLM / custom)

  • For vision + reasoning workloads, where do you hit the first bottleneck?

  • Any further memory optimization on Nano series you recommend?

Thanks!

Quickstart (vLLM Jetson container):

-gpu-memory-utilization and --max-num-seqs should be adapted to system specifications (i.e., available RAM).

docker run --rm -it
–network host
–shm-size=8g
–ulimit memlock=-1
–ulimit stack=67108864
–runtime=nvidia
–name=vllm-serve
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin
vllm serve “embedl/Cosmos-Reason2-2B-W4A16”
–max-model-len 8192
–gpu-memory-utilization 0.75
–max-num-seqs 2

Hi,

Thanks a lot for sharing.

Usually, we recommend the frameworks that can be found in the link below:

Below is a tutorial to reduce memory usage from display for your reference:

Thanks.