FlashAttention on Jetson Orin

Hello all,
I’m currently evaluating the deployment of a Transformer-based model that uses FlashAttention for real-time inference on NVIDIA Jetson hardware.
Could someone from the community share:

  • Experiences running FlashAttention / HuggingFace Transformers on Jetson Orin?
  • Recommendations for model size limits or batch sizes for Orin NX 16GB vs AGX Orin?
  • Any caveats with FlashAttention’s Triton/CUDA kernels or memory usage in these embedded environments?
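For context on the batch-size question, here is the rough KV-cache sizing arithmetic I’ve been using to compare the 16 GB Orin NX against AGX Orin. This is only a back-of-the-envelope sketch; the layer/head/dim numbers below are assumed (roughly 7B-class Llama-style), not taken from any specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Estimate KV-cache size: K and V each store
    n_layers * n_kv_heads * head_dim elements per token per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed 7B-class dims: 32 layers, 32 KV heads, head_dim 128, fp16.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB for a single 4096-token sequence
```

On these assumptions, each extra sequence in the batch costs about 2 GiB of KV cache at 4K context, which is why batch sizes on the 16 GB Orin NX get tight quickly once weights are loaded.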

Thanks in advance!

Hi,

You can find our notes on running different models below.

The prebuilt FlashAttention package for JetPack 6.2 is available at the link below:
https://pypi.jetson-ai-lab.dev/jp6/cu126
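After installing the wheel from that index (e.g. `pip3 install flash-attn --extra-index-url https://pypi.jetson-ai-lab.dev/jp6/cu126`; the exact package name served by the index is an assumption), one hedged pattern is to gate on FlashAttention availability at runtime and fall back otherwise. The helper name below is mine, not part of any library:

```python
import importlib.util

def flash_attn_available() -> bool:
    """Return True if the flash-attn package can be imported.

    Handy on Jetson, where the prebuilt wheel may not be installed yet;
    callers can fall back to PyTorch's built-in scaled_dot_product_attention.
    """
    return importlib.util.find_spec("flash_attn") is not None

if flash_attn_available():
    print("flash-attn found: enabling FlashAttention kernels")
else:
    print("flash-attn missing: falling back to torch SDPA")
```

Checking availability up front avoids an import error deep inside model construction on devices where the wheel was never installed.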

Thanks.

Thank you very much for sharing this!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.