Segmentation fault when calling .backward() after moving data to GPU (PyTorch + CUDA 12.1)

Hi everyone,
I’m running into a segmentation fault (core dumped) error while training a model using PyTorch on a CUDA-enabled GPU.
I’m not sure what’s going wrong, and would really appreciate any guidance.

My Environment
GPU: 2× NVIDIA GeForce RTX 4060 Ti
Driver Version: 550.120
CUDA Version (Driver-side): 12.4
cuDNN Version: 8902
PyTorch Version: 2.2.0+cu121
Python: 3.10.12
CUDA available: True
Detected CUDA from PyTorch: 12.1
Host OS: Ubuntu 24.04
Docker Image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
Kernel: Linux 6.8.0-55-generic x86_64 with glibc 2.35
Running inside Docker: Yes

The Problem
During training, the script suddenly crashes with a segmentation fault.
The crash does not happen at the same line every time: sometimes it happens in .backward(), sometimes while moving a tensor to the GPU with .to(device).
It usually occurs after a few training batches, not at the very beginning.
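
From what I've read, CUDA kernel launches are asynchronous, so the Python line reported at crash time may not be the real fault site; that could explain why the location moves around. Here is a minimal sketch of the debugging switches I've been trying (faulthandler is what produces the "Fatal Python error" dump shown below; treating CUDA_LAUNCH_BLOCKING as helpful here is an assumption on my part):

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches; must be set before CUDA initializes

import faulthandler
faulthandler.enable()  # dump Python tracebacks on fatal signals such as SIGSEGV

import torch  # imported only after the env var is set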

Here’s a simplified version of the code:

import datetime

import numpy as np
import torch

# device is defined at module level in the full script ("cuda:0" there)
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for j, i in enumerate(slices):  # i is the index slice for this batch, j the batch counter
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # targets = torch.from_numpy(np.array(targets)).long().to('cuda:1')
        targets = torch.tensor(targets).long().to(device)
        loss = model.loss_function(scores, targets - 1)
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)  # <- or here

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)

Error Excerpt

Training Progress:  20%|██        | 6/30 [16:14<1:04:58, 162.43s/it]
Fatal Python error: Segmentation fault

Current thread 0x00007... (most recent call first):
  <no Python frame>

Thread 0x00007...:
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  ...
  File "/usr/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266 in backward

My Question
I’m still new to CUDA programming and PyTorch internals, so I’m not sure:
Why might this segmentation fault occur?
Am I doing something wrong when moving data to the GPU?
Is there a safer or more proper way to handle tensors before calling .backward()?
Any help or explanation would be really appreciated. Thank you in advance!

Hi @sytak,
This forum focuses on issues related to TensorRT (TRT). That said, here are some possible causes of segmentation faults:

  1. Memory Issues:
  • Accessing unallocated memory, writing to read-only memory, or buffer overflows can lead to segmentation faults.
  2. Incompatible Versions:
  • Using incompatible versions of CUDA and PyTorch may result in errors; for instance, your PyTorch version (2.2.0+cu121) might not be optimally aligned with your CUDA version (12.1 detected).
  3. Faulty GPU Hardware:
  • Sometimes, the GPU hardware can be the cause of segmentation faults, especially if it’s malfunctioning.

Possible Solutions:

  1. Review Memory Management:
  • Check your code for any potential memory-related errors (e.g., ensuring tensors are correctly allocated and deallocated).
  2. Update Dependencies:
  • Make sure you are using compatible and up-to-date versions of both CUDA and PyTorch. Consider updating both to their latest releases (see the version-check snippet after this list).
  3. Test on Different Hardware:
  • Run the code on a different CUDA-enabled GPU to determine if the fault is specific to your current hardware.
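
As a quick sanity check for item 2, the following prints the versions PyTorch was actually built against (all standard torch attributes), which you can compare with the driver-side versions reported by nvidia-smi:

import torch

print("torch:", torch.__version__)                  # e.g. 2.2.0+cu121
print("CUDA (build):", torch.version.cuda)          # toolkit PyTorch was compiled with
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("device 0:", torch.cuda.get_device_name(0))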

Additional Tips for Handling CUDA and PyTorch

  • Tensor Management: Ensure that you are managing tensors safely before calling .backward(). This includes verifying that they have the correct shape and allocation on the GPU (a minimal sketch follows below).
  • Error Handling during Tensor Operations: Implement error-handling mechanisms during tensor operations, especially when moving data to and from the GPU.
  • Memory Allocation Check: Monitor memory usage during training to ensure you are not exceeding the GPU’s memory capacity, which could also lead to segmentation faults.
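
A minimal sketch of such checks (check_before_backward is just an illustrative name; scores and targets are assumed to be the logits and labels from your snippet):

import torch

def check_before_backward(scores, targets):
    # devices and batch sizes must line up
    assert scores.device == targets.device, f"device mismatch: {scores.device} vs {targets.device}"
    assert scores.shape[0] == targets.shape[0], f"batch mismatch: {scores.shape} vs {targets.shape}"
    # an out-of-range class index fails inside the CUDA kernel instead of raising
    # a clean Python error, so check the label range explicitly
    assert targets.min() >= 0 and targets.max() < scores.shape[1], "targets out of range"
    # memory snapshot from PyTorch's allocator (not identical to nvidia-smi numbers)
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved: {torch.cuda.memory_reserved() / 2**20:.1f} MiB")

Note that your loss is computed on targets - 1, so any target equal to 0 would become label -1 and trip exactly this kind of device-side failure, which often surfaces only at a later, unrelated call.

If the issue is still there, please reach out to the PyTorch forums.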

Thanks

Hi, thank you for the detailed response.

I’ve carefully reviewed all the suggested causes and solutions:

  1. Memory Issues:
    I’ve checked my code thoroughly for memory management problems. GPU memory utilization stayed at roughly 10-20 percent while I monitored it, and the GPU and CPU temperatures were fairly stable (around 50-60 °C).

  2. Incompatible Versions:
    I’m aware that I’m using PyTorch 2.2.0+cu121 and CUDA 12.1. I’ve verified compatibility, and both PyTorch and CUDA are up-to-date. I also tested with other version combinations, but the issue persists.

  3. Faulty GPU Hardware:
    I may have to try this, but I want to find the root cause first. It is just my reasoning, but the errors began after the NVIDIA 550 driver was automatically updated on 16 Jan 2025. (I also tried other drivers: 535 and 570, including the open and server variants, but the error persists.)

I’ve also followed the additional tips you mentioned:

  • Ensured all tensors are valid before calling .backward(), including checking their shapes and device allocation.

  • Wrapped tensor operations in error-handling blocks.
    → I added exception handling for shape mismatches, but the crash occurs before any exception can be caught (see the sketch after this list).

  • Monitored memory usage during training to make sure it stays within GPU capacity limits.
    → As far as I could tell, only about 10-20 percent of GPU memory was in use.
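
The guard I added looks roughly like this (self-contained toy version); as I understand it, a segmentation fault is a SIGSEGV signal raised in native code, so a Python try/except can never reach the except block:

import torch
import torch.nn as nn

scores = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(1, 10, (4,))

try:
    loss = nn.CrossEntropyLoss()(scores, targets - 1)
    loss.backward()
except RuntimeError as e:
    # shape/device problems raised by PyTorch arrive here as Python exceptions...
    print(f"caught: {e}")
# ...but a segfault terminates the process before any except block can run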

Is it possible to reproduce this error on your end? It would be really appreciated.
I also reached out to the PyTorch team, but they haven’t answered yet, so any advice is appreciated for now…

Thanks again for your support.

Here is the standalone script with dummy data that I used to try to reproduce the crash:

import torch
import torch.nn as nn
import numpy as np
import math
import time
import traceback
from tqdm import tqdm

#device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda:0")
print(f"[Info] Using device: {device}", flush=True)

# Dummy Config class
class DummyOpt:
    def __init__(self):
        self.hiddenSize = 100
        self.step = 1
        self.batchSize = 100
        self.nonhybrid = False
        self.lr = 0.001
        self.l2 = 1e-5
        self.lr_dc_step = 3
        self.lr_dc = 0.1

# GNN model definition (can be skipped if imported above)
class GNN(nn.Module):
    def __init__(self, hidden_size, step=1):
        super(GNN, self).__init__()
        self.step = step
        self.hidden_size = hidden_size
        self.edge_proj = nn.Linear(hidden_size, hidden_size)
        self.update = nn.Linear(hidden_size * 2, hidden_size)

    def GNNCell(self, A, hidden):
        N = A.shape[1]
        A_in = A[:, :, :N]   # [B, N, N]
        A_out = A[:, :, N:]  # [B, N, N]

        edge_in = torch.matmul(A_in, self.edge_proj(hidden))   # [B, N, H]
        edge_out = torch.matmul(A_out, self.edge_proj(hidden)) # [B, N, H]

        edge_msg = (edge_in + edge_out) / 2

        combined = torch.cat([hidden, edge_msg], dim=-1)  # [B, N, 2H]
        out = self.update(combined)
        return torch.relu(out)

    def forward(self, A, hidden):
        for _ in range(self.step):
            hidden = self.GNNCell(A, hidden)
        return hidden

class SessionGraph(nn.Module):
    def __init__(self, opt, n_node):
        super(SessionGraph, self).__init__()
        self.hidden_size = opt.hiddenSize
        self.batch_size = opt.batchSize
        self.nonhybrid = opt.nonhybrid
        self.embedding = nn.Embedding(n_node, self.hidden_size)
        self.gnn = GNN(self.hidden_size, step=opt.step)
        self.linear_one = nn.Linear(self.hidden_size, self.hidden_size)
        self.linear_two = nn.Linear(self.hidden_size, self.hidden_size)
        self.linear_three = nn.Linear(self.hidden_size, 1)
        self.linear_transform = nn.Linear(self.hidden_size * 2, self.hidden_size)

    def compute_scores(self, hidden, mask):
        ht = hidden[torch.arange(mask.shape[0]), torch.sum(mask, 1) - 1]
        q1 = self.linear_one(ht).unsqueeze(1)
        q2 = self.linear_two(hidden)
        alpha = self.linear_three(torch.sigmoid(q1 + q2))
        a = torch.sum(alpha * hidden * mask.unsqueeze(-1).float(), dim=1)
        if not self.nonhybrid:
            a = self.linear_transform(torch.cat([a, ht], dim=1))
        b = self.embedding.weight[1:]  # exclude padding idx
        scores = torch.matmul(a, b.transpose(1, 0))
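        # scores has n_node - 1 columns because embedding row 0 (padding) is
        # excluded above, which is why the training loop uses "targets - 1"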
        return scores

    def forward(self, inputs, A):
        hidden = self.embedding(inputs)
        hidden = self.gnn(A, hidden)
        return hidden

# Generate dummy input
def generate_dummy_data(batch_size, seq_len, n_node):
    alias_inputs = np.tile(np.arange(seq_len), (batch_size, 1))  # (batch, seq)
    #A = np.random.rand(batch_size, seq_len, seq_len * 2)  # shape: 100 * 10 * 20 (20000)
    A = np.random.rand(batch_size, seq_len, seq_len * 2).astype(np.float32)
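    # A: (batch, seq_len, 2*seq_len); GNNCell slices it into in-adjacency
    # ([:, :, :N]) and out-adjacency ([:, :, N:]) halves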
    items = np.random.randint(1, n_node, size=(batch_size, seq_len))
    mask = (items != 0).astype(int)
    targets = np.random.randint(1, n_node, size=(batch_size,))
    return alias_inputs, A, items, mask, targets

# Main loop
def run_dummy_loop():
    opt = DummyOpt()
    n_node = 1000
    model = SessionGraph(opt, n_node).to(device)
    model.train()

    num_iterations = 5000
    seq_len = 10

    for i in range(num_iterations):
        try:
            ##########################################
            # Create variables on CPU
            alias_inputs, A, items, mask, targets = generate_dummy_data(opt.batchSize, seq_len, n_node)

            ##########################################
            # Move variables from CPU to GPU
            alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
            #A = torch.tensor(A, dtype=torch.float32, device=device)
            A = torch.tensor(A, device=device)  # dtype can now be omitted since A is already float32
            items = torch.tensor(items, dtype=torch.long, device=device)
            mask = torch.tensor(mask, dtype=torch.long, device=device)
            targets = torch.tensor(targets, dtype=torch.long, device=device)

            # Check for NaNs or Infs
            assert not torch.isnan(targets).any(), "Targets contain NaNs"
            assert not torch.isinf(targets).any(), "Targets contain Infs"
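            # (isnan/isinf are always False on integer tensors, so the asserts
            #  above can never fire; a range check such as targets.max() < n_node
            #  would be the meaningful guard here)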

            ##########################################
            # GPU-side computation (GNN message passing)
            hidden = model(items, A)

            # Reorder sequence using loop + indexing
            #seq_hidden = torch.stack([hidden[i][alias_inputs[i]] for i in range(alias_inputs.shape[0])])

            # Reorder sequence using gather -> No error initially, but got illegal instruction at loop 97
            alias_idx = alias_inputs.unsqueeze(-1).expand(-1, -1, hidden.size(2))  # (batch, seq_len, hidden_size)
            seq_hidden = torch.gather(hidden, dim=1, index=alias_idx)
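            # gather indices must lie in [0, hidden.size(1)); on CUDA an
            # out-of-range index fails inside the kernel (device-side assert or
            # illegal memory access) rather than raising a Python IndexError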

            ##########################################
            # GPU-side computation (GNN message passing)
            scores = model.compute_scores(seq_hidden, mask)

            torch.cuda.synchronize()
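            # synchronizing forces any pending asynchronous kernel error to
            # surface here rather than at an unrelated later call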

            ##########################################
            # GPU-side computation (prediction and loss update)
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(scores, targets - 1)

            assert not torch.isnan(loss), "Loss is NaN"

            if scores.shape[0] != targets.shape[0]:
                print(f"[{i}] scores: {scores.shape}, targets: {targets.shape}", flush=True)
                
            if scores.device != targets.device:
                print(f"[{i}] score device: {scores.device}", flush=True)
                print(f"[{i}] target device: {targets.device}", flush=True)
            #assert scores.shape[0] == targets.shape[0], "Mismatch between scores and targets"
            #assert scores.device == targets.device, "scores and targets must be on the same device"
            

            loss.backward()
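            # (no optimizer.step() here: the repro only exercises forward/backward;
            #  .grad buffers accumulate across iterations since zero_grad() is
            #  never called)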

            ##########################################
            # Move prediction values back to CPU

            # Test NumPy conversion
            result_np = loss.detach().cpu().numpy()  # changed from loss.item() but still causes error
            _ = np.log(np.clip(result_np, 1e-8, None))

            #loss_value = max(loss.item(), 1e-8)
            #_ = math.log(loss_value)

            #loss_value = torch.clamp(loss, min=1e-8)
            #_ = torch.log(loss_value)

            torch.cuda.synchronize()

            if i % 1000 == 0:
                print(f"[{i}] Loss: {result_np:.6f}", flush=True)

            #if i % 1000 == 0:
            #    print(f"[{i}] loss: {loss.item():.6f}, hidden max: {hidden.max().item():.4f}, has NaN: {torch.isnan(hidden).any()}", flush=True)

            #print(f"[{i}] scores shape: {scores.shape}, targets shape: {targets.shape}", flush=True)
            #print(f"[{i}] loss dtype: {loss.dtype}, device: {loss.device}, value: {loss.item():.6f}", flush=True)

            if i % 1000 == 0:
                loss_value = loss.detach()  # no item()
                hidden_max = hidden.max().detach()
                has_nan = torch.isnan(hidden).any().detach()

                print(f"[{i}] loss: {loss_value:.6f}, hidden max: {hidden_max:.4f}, has NaN: {bool(has_nan)}", flush=True)

        except Exception as e:
            print(f"\n🔥 Exception at iteration {i}: {traceback.format_exc()}", flush=True)
            break

for i in tqdm(range(0, 2000), desc='progress'):
    print(f'loop {i}th')
    run_dummy_loop()