CUDA error: device-side assert triggered

I am using Qwen2 7B, loaded like this:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_DIR,
    max_seq_length=MAX_SEQ_LEN,
    dtype=torch.float16,
    load_in_4bit=True,
    device_map='cuda:0'
)

I run a typical generation pipeline with this script:

inputs = tokenizer(prompt, return_tensors="pt", padding=False, truncation=False)
input_ids = inputs["input_ids"].to('cuda:0')
attention_mask = inputs["attention_mask"].to('cuda:0')
prompt_len = input_ids.shape[1]

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        use_cache=True,
        return_dict_in_generate=True
    )

outputs.sequences

When I run it the first time, it works fine. But when I run the script again to generate another response, it raises an error:

AcceleratorError                          Traceback (most recent call last)
Cell In[24], line 2
      1 inputs = tokenizer(prompt, return_tensors="pt", padding=False, truncation=False)
----> 2 input_ids = inputs["input_ids"].to('cuda:0')
      3 attention_mask = inputs["attention_mask"].to('cuda:0')
      4 prompt_len = input_ids.shape[1]

AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I followed the link mentioned above, looked up cudaErrorAssert, and found:

cudaErrorAssert = 710

An assert triggered in device code during kernel execution. The device cannot be used again. All existing allocations are invalid. To continue using CUDA, the process must be terminated and relaunched.

I am using a Tesla V100-PCIE-32GB GPU with:

| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |

I searched online for a fix and found these settings:

os.environ['TORCH_USE_CUDA_DSA'] = "1"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

But they didn't fix the issue in my case. I am using two GPUs:

os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

How can I fix this error?

Hi — a couple of suggestions that might help, not sure if any will land but worth a shot.

The “works the first time, fails after” pattern might be worth looking at. Once a device-side assert fires on the GPU, the CUDA context usually stays in a bad state for the rest of the Python process, so any error after the first one can be misleading. Restarting Python between test runs (not just re-running the cell) sometimes makes CUDA_LAUNCH_BLOCKING=1 actually useful, since the first traceback is the real one.
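To make that concrete, here is a rough sketch of how the debug flag could be set at the very top of a freshly restarted kernel. The helper name `enable_blocking_launches` is made up for illustration, and the "torch already imported" check is deliberately conservative: the variable is read when CUDA initializes, so setting it before any torch import is simply the safest ordering.

```python
import os
import sys


def enable_blocking_launches() -> bool:
    """Request synchronous CUDA kernel launches for this process.

    Returns False when torch is already imported, as a hint that the
    process should be restarted so the setting is certain to apply.
    (Hypothetical helper, not part of any library.)
    """
    torch_already_loaded = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_loaded


if enable_blocking_launches():
    print("CUDA_LAUNCH_BLOCKING=1 set before torch import")
else:
    print("torch was imported first; restart the kernel and run this cell first")

# import torch  # import torch only after the variable is set
```

With launches made synchronous, the first traceback after a restart should point much closer to the kernel that actually asserted.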

One thing that comes up with V100 + 4-bit loading: bitsandbytes’ 4-bit kernels were mainly built around Turing (sm_75) and newer cards. V100 is Volta (sm_70) and sometimes has partial support that works for a forward pass and then trips on later ones. Might not be your issue, but easy to rule out.

A couple of things that are quick to check:

# GPU and compute capability
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Installed versions
python -c "import torch, bitsandbytes, transformers; print(torch.__version__, torch.version.cuda, bitsandbytes.__version__, transformers.__version__)"

# Tokenizer vs model vocab size — mismatch here can also cause device-side asserts
python -c "from transformers import AutoTokenizer, AutoConfig; t=AutoTokenizer.from_pretrained('YOUR_MODEL_DIR'); c=AutoConfig.from_pretrained('YOUR_MODEL_DIR'); print(len(t), c.vocab_size)"
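The vocab check can also be done per prompt, on the CPU, before anything touches the GPU. This is only a sketch: `out_of_range_ids` is a hypothetical helper, and the numbers below are invented for illustration rather than taken from Qwen2.

```python
from typing import List, Sequence


def out_of_range_ids(token_ids: Sequence[int], embedding_rows: int) -> List[int]:
    """Return the token ids that would index past the embedding table.

    An embedding lookup with such an id is a classic trigger for a
    device-side assert, so it is cheaper to catch it on the CPU.
    """
    return [t for t in token_ids if t < 0 or t >= embedding_rows]


# Illustrative values only (not real Qwen2 numbers).
example_ids = [12, 151643, 151700]
bad = out_of_range_ids(example_ids, embedding_rows=151646)
print(bad)  # [151700]
```

In the setup from the question, the ids would presumably come from `inputs["input_ids"][0].tolist()` and the row count from `model.get_input_embeddings().num_embeddings`, assuming the loaded model exposes a standard embedding layer.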

If the 4-bit path is the problem, two fallbacks that often work on V100:

  • Swap load_in_4bit=True for load_in_8bit=True (8-bit has better Volta support historically).
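  • If you have the VRAM for it, skip quantization (load_in_4bit=False) and load in fp16: dtype=torch.float16 in the loader call from your post, or torch_dtype=torch.float16 in plain transformers. A 32GB V100 can usually fit Qwen2-7B in fp16 with some room for context.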
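As a back-of-the-envelope check on the fp16 option, weight memory is roughly parameter count times bytes per parameter. The parameter count below is an approximation I am assuming for a "7B" model, and the estimate covers weights only (no KV cache, activations, or quantization overhead).

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough weight-only memory estimate in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


approx_params = 7.6e9  # assumption: approximate size of a "7B" model

print(f"fp16 : {weight_memory_gb(approx_params, 2):.1f} GB")
print(f"8-bit: {weight_memory_gb(approx_params, 1):.1f} GB")
print(f"4-bit: {weight_memory_gb(approx_params, 0.5):.1f} GB")
```

Under that assumption the fp16 weights land around 15 GB, which leaves a fair amount of a 32 GB card for the KV cache at moderate context lengths.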

Just passing by and figured I’d throw some ideas out in case any of it helps — could be totally off-base since I’m not on your setup. Not sure I’ll be able to recheck or do much follow-up, but good luck either way.

After investigating this issue, I found that the device-side assert error was not actually caused by the cell where it appeared, but by a previous computation step.

The root cause was the presence of NaN/Inf values in tensors, which originated earlier in the training process. Specifically, during loss computation, exploding gradients caused some values to grow beyond the representable range of float16. Since float16 has a limited dynamic range, these large values were converted into NaN or Inf.
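For anyone who wants to see the float16 limit without a GPU, here is a small standard-library sketch. `fits_in_float16` is a helper name I made up; it relies on `struct`'s half-precision format raising `OverflowError` for values that are too large. Note the difference in behavior: a cast on the GPU does not raise, it quietly produces Inf instead.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(x: float) -> bool:
    """True if x is finite and can be stored as a finite float16."""
    if not math.isfinite(x):
        return False
    try:
        struct.pack("<e", x)  # "e" is the IEEE 754 binary16 format
        return True
    except OverflowError:
        # struct refuses values that would round to infinity;
        # a tensor cast would silently become Inf here.
        return False


print(fits_in_float16(FLOAT16_MAX))  # True
print(fits_in_float16(70000.0))      # False
```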

These invalid values were then stored in tensors and silently propagated through subsequent operations. Eventually, they triggered a CUDA device-side assert, but at that point the error location was misleading.
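A guard along these lines would have caught the problem at the step where it was introduced. This is a plain-Python sketch with a hypothetical helper name; for real tensors the equivalent check would be something like `torch.isfinite(tensor).all()`.

```python
import math
from typing import Iterable, Optional


def first_non_finite(values: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN/Inf value, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


losses = [0.71, 0.64, float("inf"), float("nan")]  # made-up values
bad_index = first_non_finite(losses)
if bad_index is not None:
    print(f"non-finite value at step {bad_index}; stop here and inspect")
```

Failing fast at that point keeps the invalid values from reaching later kernels, where the resulting assert is much harder to trace back.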

You said:

Once a device-side assert fires on the GPU, the CUDA context usually stays in a bad state for the rest of the Python process, so any error after the first one can be misleading.

This explains why the error seemed to originate from the current cell, while the actual issue was introduced earlier.