Hello all,
I’m currently evaluating deploying a Transformer-based model (using FlashAttention) for real-time inference on NVIDIA Jetson hardware.
Could someone from the community share:
- Experiences running FlashAttention / HuggingFace Transformers on Jetson Orin?
- Recommendations on model-size or batch-size limits for the Orin NX 16GB vs. the AGX Orin?
- Any caveats with FlashAttention’s Triton/CUDA kernels or memory usage in these embedded environments?
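For context, here's the rough back-of-the-envelope memory math I've been using to sanity-check candidate model sizes against the 16GB Orin NX (assuming fp16/bf16 weights and KV cache; the example shapes are Llama-2-7B-like and purely illustrative) — happy to be corrected if this misses Jetson-specific overheads like the unified memory shared with the OS:

```python
def model_memory_gb(n_params_billion, bytes_per_param=2):
    # Weight memory in GB: fp16/bf16 uses 2 bytes per parameter.
    return n_params_billion * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    # KV cache: K and V tensors per layer, each of shape
    # (batch, seq_len, n_kv_heads, head_dim), in fp16/bf16.
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_el / 1e9

# Hypothetical 7B model with Llama-2-7B-like shapes:
# 32 layers, 32 KV heads, head_dim 128, 4k context, batch 1.
weights_gb = model_memory_gb(7)                # 14.0 GB -- already tight on a 16GB board
cache_gb = kv_cache_gb(32, 32, 128, 4096, 1)   # ~2.15 GB on top of the weights
```

By this math a 7B fp16 model barely fits before activations and the OS's share of unified memory, which is why I'm leaning toward quantized or smaller checkpoints for the NX — but I'd value real-world numbers.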
Thanks in advance!