PyTorch CUDA attention kernels not loading properly; not seeing expected speedups on EdgeTAM

I am trying to build EdgeTAM from source (a model built on top of Facebook's Segment Anything), and I'm getting a bunch of UserWarnings. I have CUDA enabled; here is my system information:

PyTorch version: 2.8.0
torchvision version: 0.23.0
CUDA available: True
CUDA version: 12.6
GPU: Orin
cuDNN version: 90300
BFloat16 supported: True
torch.backends.cuda.flash_sdp_enabled: True
torch.backends.cuda.mem_efficient_sdp_enabled: True
torch.backends.cuda.cudnn_sdp_enabled: True

Why are my kernels not loading properly? Does it have to do with my CUDA build? Also, when I convert the model to bfloat16, the UserWarnings go away, but the model is still an order of magnitude slower than the expected speed (that others have replicated).
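For reference, this is roughly the kind of bfloat16 setup I mean, using autocast rather than converting the weights permanently. The nn.Linear here is just a stand-in model so the snippet is self-contained:

```python
import torch
import torch.nn as nn

# Run inference under autocast so matmuls and attention see bfloat16
# inputs without permanently converting the model weights. On the Jetson
# this would be "cuda"; "cpu" is the fallback so the snippet runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(16, 4).to(device)   # stand-in for the real model
x = torch.randn(2, 16, device=device)

with torch.inference_mode(), torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16 inside the autocast region
```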

UserWarning: Memory efficient kernel not used because: (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:863.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:552.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Flash attention kernel not used because: (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:865.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead. (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:91.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: CuDNN attention kernel not used because: (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:867.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.
Falling back to all available kernels for scaled_dot_product_attention (which may have a slower speed).
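For what it's worth, the dtype warning above can be reproduced in isolation: the fused flash / memory-efficient kernels only accept half or bfloat16 inputs, while the math fallback also accepts float32. A minimal sketch (shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Arbitrary small attention shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)

# float32 inputs: SDPA still returns a correct result, but on CUDA it can
# only do so via the slow math fallback, which is what the warnings report.
out_fp32 = F.scaled_dot_product_attention(q, k, v)

# bfloat16 inputs satisfy the {Half, BFloat16} requirement from the warning,
# making the fused kernels eligible.
out_bf16 = F.scaled_dot_product_attention(q.bfloat16(), k.bfloat16(), v.bfloat16())

print(out_fp32.shape, out_bf16.dtype)
```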

for anyone who wants to see the code that I am running, here:

```python
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

device = torch.device("cuda")

image = Image.open('./left.png')
image = np.array(image.convert("RGB"))

checkpoint = "./checkpoints/edgetam.pt"
model_cfg = "edgetam.yaml"

model = build_sam2(model_cfg, checkpoint, device=device)
mask_generator = SAM2AutomaticMaskGenerator(model)

for i in range(10):
    now = time.time()
    masks = mask_generator.generate(image)
    torch.cuda.synchronize()  # wait for async CUDA work before reading the clock
    print(time.time() - now, "seconds")
```
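One caveat for anyone benchmarking this: CUDA kernels launch asynchronously, so time.time() only gives honest numbers if the GPU is synchronized first, and the first iterations include warmup. A small helper along those lines (the matmul is just a stand-in for mask_generator.generate(image)):

```python
import time
import torch

def time_fn(fn, warmup=3, iters=10):
    """Average the wall time of a callable, synchronizing the GPU so
    asynchronously launched CUDA kernels are actually counted."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / iters

# Stand-in workload; on the Jetson this would be
# lambda: mask_generator.generate(image).
avg = time_fn(lambda: torch.randn(64, 64) @ torch.randn(64, 64))
print(f"{avg * 1e3:.3f} ms/iter")
```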

Thank you so much!!

Hi

UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.

The error indicates that the installed flash-attn library wasn't built with CUDA support.
Please reinstall it with the package shared in the link below and try again.

Thanks.

Hi!

What is interesting is that I installed all of the relevant wheels I could from that index (thank you for sending!), but the warnings are still there, unchanged. I did this all in a brand-new virtual environment, on a factory-reset Jetson. I also installed torch and torchvision from that index, so these should all be compatible with each other.

However, I am still getting all of the same UserWarnings. Does installing them all together in the same uv venv mean that torch can access those kernels? Do I need to import the kernels and explicitly use them in my code, or will PyTorch automatically use them when it runs its internal code, like scaled_dot_product_attention? That is the function everything fails on, even after I pip installed each wheel.

Hi,

Could you try our PyTorch container instead?

We couldn't test it since we don't have the required edgetam.pt and edgetam.yaml files.
However, we can load all the dependencies correctly inside the container after installing sam2 with pip.

Thanks.

Hi! I'm so sorry, here is the link to EdgeTAM. This is the link to the checkpoints and a link to the yaml. I will try running PyTorch in the container. It would also be great for you to try EdgeTAM as-is: it is only two lines different from using sam2 on its own, and it would be really helpful if you could reproduce my results.

Hi! I ran this in the PyTorch container and got roughly the same performance, maybe a couple of seconds better. But this model is supposed to run at 10-20 ms/image, and it is running at 11 s/image at best. It seems like it's still not able to access the dependencies.

How did you load the dependencies properly? Did you see the same warnings when you tried running EdgeTAM?

/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Memory efficient kernel not used because: (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:906.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:552.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Flash attention kernel not used because: (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:908.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:91.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: cuDNN attention kernel not used because: (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:910.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1786: UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.
Falling back to all available kernels for scaled_dot_product_attention (which may have a slower speed).
return forward_call(*args, **kwargs)

Hi,

We will test it and update you with more info.
Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, could you run tegrastats in another console to check the GPU utilization?

$ sudo tegrastats

Thanks.

I have maximized device performance first.

Running jtop shows near 100% GPU utilization during inference, yet the kernel warnings still haven't gone away. I wonder if having the right kernels would mean the same GPU utilization but somehow faster inference. Intuitively that doesn't make sense, but it's hard to believe that 13 s/image is the maximum performance.

Hi,

We tried to run it with our latest PyTorch container but met some compatibility issues:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 644, in _locate
    obj = getattr(obj, part)
          ^^^^^^^^^^^^^^^^^^
AttributeError: module 'sam2.modeling.backbones' has no attribute 'timm'

Which sam2 version do you use?
Also, have you tried the sample in the container shared above?

Thanks.

Hi! As I mentioned in my previous response, I installed the PyTorch container you linked (25.10-py3-igpu), then reinstalled EdgeTAM and ran my example script. That is where I got the set of "missing kernel" warnings. You don't need to install SAM2 on its own, as EdgeTAM bundles SAM2 inside of it. You can just follow the instructions in the GitHub repository: GitHub - facebookresearch/EdgeTAM: [CVPR 2025] Official PyTorch implementation of "EdgeTAM: On-Device Track Anything Model".

Here is my directory setup:

There is an outer directory called edgeTAM which I created. I cd into it, clone the git repo (EdgeTAM), and ```pip install -e .``` inside it. Back in the outer directory, I create the script we are testing with, ex.py. It has a line at the top (after ```import sys``` and ```import os```) to make sure we can import from the cloned directory:

sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'EdgeTAM'))

I've attached my requirements.txt. It has more packages than what you would strictly need for EdgeTAM, but this setup does work. Make sure you install hydra-core and timm separately if you run into dependency issues. Here are some other details:

I attached my ex.py (saved as a .txt) if you want to recreate my warnings. Save it in the outer directory (edgeTAM) and run it.

requirements.txt (5.4 KB)

ex.txt (822 Bytes)

Hi,

Sorry for the late update.
If the GPU is already at high utilization, the inference should already be deployed on the GPU.

But, this model is supposed to run at 10-20 ms/image, and it is running at 11 s/image at best. It seems like it's still not able to access the dependencies.

Is this tested on the same Orin Nano device?
If not, the performance might not be identical, since the Orin Nano is limited to 8 GiB of memory.

Thanks.
