RuntimeError: CUDA error: device kernel image is invalid

I am trying to get a repository (PAPO) working, but I keep running into

RuntimeError: CUDA error: device kernel image is invalid.

I have looked at the other threads, but the solutions did not help.

  1. My torch version is correct: 2.6.0+cu124. I installed it with pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124, and I have verified that it works properly (see the sanity check after this list).

  2. My CUDA toolkit and runtime versions match. Running nvcc --version gives:

    nvcc: NVIDIA (R) Cuda compiler driver
    nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
    Copyright (c) 2005-2024 NVIDIA Corporation
    Built on Thu_Mar_28_02:18:24_PDT_2024
    Cuda compilation tools, release 12.4, V12.4.131
    Build cuda_12.4.r12.4/compiler.34097967_0

    and conda list | grep -E "cudatoolkit|cuda-toolkit" gives:

    cuda-toolkit 12.4.1 hb982923_0

  3. I cannot run sudo because I am on a campus cluster and do not have sudo permission.
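
By "works properly" in item 1 I mean a basic sanity check along these lines (a minimal sketch, not from the repo), which also confirms that the wheel's CUDA build matches the 12.4 toolkit from item 2:

import torch

print(torch.__version__)              # expect 2.6.0+cu124
print(torch.version.cuda)             # expect 12.4
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))

# a small matmul forces a real kernel launch; a binary built for the wrong
# architecture would fail here with "device kernel image is invalid"
x = torch.randn(1024, 1024, device="cuda")
print(float((x @ x).sum()))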

So, how can I fix my issue? Forcing conda to use cuda-toolkit=12.4 and cuda-runtime=12.4 when creating the conda environment (conda create -n env cuda-toolkit=12.4 cuda-runtime=12.4 python=3.10) did not work.

The issue happens when I call trainer.init_workers() in the following script:

import json
import os

import ray
from omegaconf import OmegaConf

from ..single_controller.ray import RayWorkerGroup
from ..utils.tokenizer import get_processor, get_tokenizer
from ..workers.fsdp_workers import FSDPWorker
from ..workers.reward import BatchFunctionRewardManager, SequentialFunctionRewardManager
from .config import PPOConfig
from .data_loader import create_dataloader
from .ray_trainer import RayPPOTrainer, ResourcePoolManager, Role


@ray.remote(num_cpus=1)
class Runner:
    """A runner for RL training."""

    def run(self, config: PPOConfig):
        # print config
        print(json.dumps(config.to_dict(), indent=2))

        # instantiate tokenizer
        tokenizer = get_tokenizer(
            config.worker.actor.model.model_path,
            override_chat_template=config.data.override_chat_template,
            trust_remote_code=config.worker.actor.model.trust_remote_code,
            use_fast=True,
        )
        processor = get_processor(
            config.worker.actor.model.model_path,
            override_chat_template=config.data.override_chat_template,
            trust_remote_code=config.worker.actor.model.trust_remote_code,
            use_fast=True,
        )

        ray_worker_group_cls = RayWorkerGroup
        role_worker_mapping = {
            Role.ActorRollout: ray.remote(FSDPWorker),
            Role.Critic: ray.remote(FSDPWorker),
            Role.RefPolicy: ray.remote(FSDPWorker),
        }
        global_pool_id = "global_pool"
        resource_pool_spec = {
            global_pool_id: [config.trainer.n_gpus_per_node] * config.trainer.nnodes,
        }
        mapping = {
            Role.ActorRollout: global_pool_id,
            Role.Critic: global_pool_id,
            Role.RefPolicy: global_pool_id,
        }
        resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)

        if config.worker.reward.reward_type == "sequential":
            RewardManager = SequentialFunctionRewardManager
        elif config.worker.reward.reward_type == "batch":
            RewardManager = BatchFunctionRewardManager
        else:
            raise NotImplementedError(f"Unknown reward type {config.worker.reward.reward_type}.")

        RemoteRewardManager = ray.remote(RewardManager).options(num_cpus=config.worker.reward.num_cpus)
        reward_fn = RemoteRewardManager.remote(config.worker.reward, tokenizer)
        val_reward_fn = RemoteRewardManager.remote(config.worker.reward, tokenizer)

        train_dataloader, val_dataloader = create_dataloader(config.data, tokenizer, processor)

        trainer = RayPPOTrainer(
            config=config,
            tokenizer=tokenizer,
            processor=processor,
            train_dataloader=train_dataloader,
            val_dataloader=val_dataloader,
            role_worker_mapping=role_worker_mapping,
            resource_pool_manager=resource_pool_manager,
            ray_worker_group_cls=ray_worker_group_cls,
            reward_fn=reward_fn,
            val_reward_fn=val_reward_fn,
        )
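        # this next call is where the RuntimeError (device kernel image is invalid) is raised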
        trainer.init_workers()
        trainer.fit()


def main():
    cli_args = OmegaConf.from_cli()
    default_config = OmegaConf.structured(PPOConfig())

    if hasattr(cli_args, "config"):
        config_path = cli_args.pop("config", None)
        file_config = OmegaConf.load(config_path)
        default_config = OmegaConf.merge(default_config, file_config)

    ppo_config = OmegaConf.merge(default_config, cli_args)
    ppo_config: PPOConfig = OmegaConf.to_object(ppo_config)
    ppo_config.deep_post_init()

    if not ray.is_initialized():
        runtime_env = {
            "env_vars": {
                "TOKENIZERS_PARALLELISM": "true",
                "NCCL_DEBUG": "WARN",
                "VLLM_LOGGING_LEVEL": "WARN",
                "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
                "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:False",
                "PYTHONUNBUFFERED": "1",
            }
        }
        ray.init(
            local_mode=os.getenv("RAY_LOCAL_MODE", "false").lower() == "true",
            runtime_env=runtime_env
        )

    runner = Runner.remote()
    ray.get(runner.run.remote(ppo_config))


if __name__ == "__main__":
    main()
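
In case it helps with debugging, a stripped-down check along these lines (my own sketch, not part of PAPO; it only assumes ray and torch are installed) launches a tiny CUDA kernel inside a Ray worker, i.e. the same place init_workers() runs, so it can separate a broken Ray/GPU setup from a broken model stack:

import ray
import torch

@ray.remote(num_gpus=1)
def cuda_smoke_test():
    # a small matmul forces a real kernel launch inside the Ray worker
    x = torch.randn(1024, 1024, device="cuda")
    return torch.__version__, torch.version.cuda, float((x @ x).sum())

ray.init()
print(ray.get(cuda_smoke_test.remote()))
ray.shutdown()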

I fixed the nvcc warning (incompatible redefinition for option 'compiler-bindir', the last value of this option was used) by running conda install cuda -c nvidia/label/cuda-12.4.0.

It turned out that the issue was an incompatible combination of vLLM, PyTorch, and flash-attn. After installing the right version of flash-attn (I was using 2.8.1, but I needed flash_attn-2.7.4.post1), I no longer got the error. I found the correct flash-attn version by asking people who had gotten the repo working for their environment details.
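
If you hit the same thing, a quick way to test the combination before launching the full trainer is a small flash-attn smoke test like this (my sketch; it assumes a CUDA-enabled torch and flash-attn are installed, and it forces an actual kernel launch rather than just an import):

import torch
import flash_attn
from flash_attn import flash_attn_func

print(torch.__version__, torch.version.cuda, flash_attn.__version__)

# tiny fp16 attention call; with a flash-attn wheel built against an
# incompatible torch/CUDA, this is where "device kernel image is invalid" shows up
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 128, 8, 64])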

Hope this can help someone else out too.
