PyTorch CUDA attention kernels not loading properly; not seeing expected speedups on EdgeTAM

I am trying to build EdgeTAM from source (a model built on top of Facebook's Segment Anything), and I'm getting a bunch of UserWarnings. I have CUDA enabled; here is my system information:

PyTorch version: 2.8.0
torchvision version: 0.23.0
CUDA available: True
CUDA version: 12.6
GPU: Orin
cuDNN version: 90300
BFloat16 supported: True
torch.backends.cuda.flash_sdp_enabled: True
torch.backends.cuda.mem_efficient_sdp_enabled: True
torch.backends.cuda.cudnn_sdp_enabled: True

Why are my kernels not loading properly? Does it have to do with my CUDA build? Also, when I convert the model to bfloat16, the UserWarnings go away, but the model is still an order of magnitude slower than the expected speed (that others have replicated).
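For reference, this is roughly the kind of bfloat16 setup I mean, using autocast rather than converting the weights permanently. The nn.Linear here is just a stand-in model so the snippet is self-contained:

```python
import torch
import torch.nn as nn

# Run inference under autocast so matmuls and attention see bfloat16
# inputs without permanently converting the model weights. On the Jetson
# this would be "cuda"; "cpu" is the fallback so the snippet runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(16, 4).to(device)   # stand-in for the real model
x = torch.randn(2, 16, device=device)

with torch.inference_mode(), torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16 inside the autocast region
```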

UserWarning: Memory efficient kernel not used because: (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:863.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:552.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Flash attention kernel not used because: (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:865.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead. (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:91.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: CuDNN attention kernel not used because: (Triggered internally at /opt/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:867.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.
Falling back to all available kernels for scaled_dot_product_attention (which may have a slower speed).
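For what it's worth, the dtype warning above can be reproduced in isolation: the fused flash / memory-efficient kernels only accept half or bfloat16 inputs, while the math fallback also accepts float32. A minimal sketch (shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# Arbitrary small attention shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(1, 2, 8, 16)
k = torch.randn(1, 2, 8, 16)
v = torch.randn(1, 2, 8, 16)

# float32 inputs: SDPA still returns a correct result, but on CUDA it can
# only do so via the slow math fallback, which is what the warnings report.
out_fp32 = F.scaled_dot_product_attention(q, k, v)

# bfloat16 inputs satisfy the {Half, BFloat16} requirement from the warning,
# making the fused kernels eligible.
out_bf16 = F.scaled_dot_product_attention(q.bfloat16(), k.bfloat16(), v.bfloat16())

print(out_fp32.shape, out_bf16.dtype)
```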

for anyone who wants to see the code that I am running, here:

```python
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import torch
from PIL import Image

from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

device = torch.device("cuda")

image = Image.open('./left.png')
image = np.array(image.convert("RGB"))

checkpoint = "./checkpoints/edgetam.pt"
model_cfg = "edgetam.yaml"

model = build_sam2(model_cfg, checkpoint, device=device)
mask_generator = SAM2AutomaticMaskGenerator(model)

for i in range(10):
    now = time.time()
    masks = mask_generator.generate(image)
    torch.cuda.synchronize()  # wait for async CUDA work before reading the clock
    print(time.time() - now, "seconds")
```
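One caveat for anyone benchmarking this: CUDA kernels launch asynchronously, so time.time() only gives honest numbers if the GPU is synchronized first, and the first iterations include warmup. A small helper along those lines (the matmul is just a stand-in for mask_generator.generate(image)):

```python
import time
import torch

def time_fn(fn, warmup=3, iters=10):
    """Average the wall time of a callable, synchronizing the GPU so
    asynchronously launched CUDA kernels are actually counted."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / iters

# Stand-in workload; on the Jetson this would be
# lambda: mask_generator.generate(image).
avg = time_fn(lambda: torch.randn(64, 64) @ torch.randn(64, 64))
print(f"{avg * 1e3:.3f} ms/iter")
```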

Thank you so much!!

Hi

UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.

The error indicates that the installed flash-attn library wasn't built with CUDA support.
Please reinstall it with the package shared in the link below and try again.

Thanks.

Hi!

What is interesting is that I installed all of the relevant wheels I could from that index (thank you for sending!), but the warnings are still there, unchanged. I did this all in a brand-new virtual environment, on a factory-reset Jetson. I also installed torch and torchvision from that index, so these should all be compatible with each other.

However, I am still getting all of the same UserWarnings. Does installing them all together in the same uv venv mean that torch can access those kernels? Do I need to import the kernels and explicitly use them in my code, or will PyTorch automatically use them when it runs its internal code, like scaled_dot_product_attention? That is the function everything fails on, even after I pip installed each wheel.

Hi,

Could you try our PyTorch container instead?

We couldn't test it since we don't have the required edgetam.pt and edgetam.yaml files.
However, we can load all the dependencies correctly inside the container after installing sam2 with pip.

Thanks.

Hi! I'm so sorry, here is the link to EdgeTAM. This is the link to the checkpoints and a link to the yaml. I will try running PyTorch in the container. It would also be great for you to try EdgeTAM as-is: it is only two lines different from using sam2 on its own, and it would be really helpful if you could reproduce my results.

Hi! I ran this in the PyTorch container and got roughly the same performance, maybe a couple of seconds better. But this model is supposed to run at 10-20 ms/image, and it is running at 11 s/image at best. It seems like it's still not able to access the dependencies.

How did you load the dependencies properly? Did you see the same warnings when you tried running EdgeTAM?

/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Memory efficient kernel not used because: (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:906.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:552.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Flash attention kernel not used because: (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:908.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: Expected query, key and value to all be of dtype: {Half, BFloat16}. Got Query dtype: float, Key dtype: float, and Value dtype: float instead. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/sdp_utils_cpp.h:91.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/workspace/EdgeTAM/sam2/modeling/sam/transformer.py:274: UserWarning: cuDNN attention kernel not used because: (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:910.)
out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)
/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1786: UserWarning: Flash Attention kernel failed due to: No available kernel. Aborting execution.
Falling back to all available kernels for scaled_dot_product_attention (which may have a slower speed).
return forward_call(*args, **kwargs)

Hi,

We will test it and update you with more info.
Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, could you run tegrastats in another console to check the GPU utilization?

$ sudo tegrastats

Thanks.

I have maximized device performance first.

Running jtop shows near 100% GPU utilization during inference, yet the kernel warnings still haven't gone away. I wonder if having the right kernels would mean the same GPU utilization but somehow faster inference. Intuitively that doesn't make sense, but it's hard to believe that 13 s/image is the maximum performance.

Hi,

We tried to run it with our latest PyTorch container but met some compatibility issues:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 644, in _locate
    obj = getattr(obj, part)
          ^^^^^^^^^^^^^^^^^^
AttributeError: module 'sam2.modeling.backbones' has no attribute 'timm'

Which sam2 version do you use?
Also, have you tried the sample in the container shared above?

Thanks.

Hi! As I mentioned in my previous response, I installed the PyTorch container you linked (25.10-py3-igpu), then reinstalled EdgeTAM and ran my example script. That is where I got the set of "missing kernel" warnings. You don't need to install SAM2 on its own, as EdgeTAM bundles SAM2 inside of it. You can just follow the instructions in the GitHub repository: GitHub - facebookresearch/EdgeTAM: [CVPR 2025] Official PyTorch implementation of "EdgeTAM: On-Device Track Anything Model".

Here is my directory setup:

There is an outer directory called edgeTAM which I created. I cd into it, clone the git repo (EdgeTAM), and ```pip install -e .``` inside it. Back in the outer directory, I create the script we are testing with, ex.py. It has a line at the top (after ```import sys``` and ```import os```) to make sure we can import from the cloned directory:

sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'EdgeTAM'))

I've attached my requirements.txt. It has more packages than what you would strictly need for EdgeTAM, but this setup does work. Make sure you install hydra-core and timm separately if you run into dependency issues. Here are some other details:

I attached my ex.py (saved as a .txt) if you want to recreate my warnings. Save it in the outer directory (edgeTAM) and run it.

requirements.txt (5.4 KB)

ex.txt (822 Bytes)

Hi,

Sorry for the late update.
If the GPU is already at high utilization, the inference should already be deployed on the GPU.

But, this model is supposed to run at 10-20 ms/image, and it is running at 11 s/image at best. It seems like it's still not able to access the dependencies.

Is this tested on the same Orin Nano device?
If not, the performance might not be identical, since the Orin Nano is limited to 8 GiB of memory.

Thanks.
