RTX PRO 6000 Blackwell Card Crash

Hi there folks. I encountered an issue with an RTX PRO 6000 Blackwell Max-Q Workstation Edition card.

My server has been running for months without issue. All of a sudden, one of my GPUs stopped working and got stuck in an unrecoverable state (Power Off Hard Reset Required). After a hard reset the GPU becomes visible again, but it crashes as soon as a workload runs on it.

I have a second, identical GPU in the same system, and it does not crash.

I reached out to the support team and they directed me to create a thread here for help before they can process an RMA. (Case Reference Number: 260413-000085)

System

Hardware:

  • CPU: AMD EPYC 9354P
  • MOBO: Asrock GENOAD8X-2T/BCM
  • GPUs:
    • GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (UUID: GPU-ce3431d8-112d-6587-ea45-a298b6737575)
    • GPU 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (UUID: GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3)

Software:

  • OS: Debian 13.4
  • DOCKER: 29.4.0
  • NVIDIA_DRIVER: nvidia-open 595.45.04 (Also tested on 580.95.05)
  • NVIDIA_CUDA: 13.2.0 (Also tested on 13.0.2)
  • NVIDIA_CONTAINER_TOOLKIT: 1.19.0-1 (Also tested on 1.17.8-1)

I use these official instructions to install and update my driver & cuda:

I use these official instructions to install and update my nvidia container toolkit:

Issue:

On April 12, 2026, I began encountering an issue when using vLLM to run AI models.
The AI model would load into memory, and then vLLM would crash.
I captured the dmesg output, which showed a GSP-CrashCat Report.
I found that 1 of my 2 GPUs was unavailable and needed a full hard reset to bring it back.
I powered off the system, disconnected power, pressed the power button with the power disconnected to drain residual charge, waited 30 seconds, reconnected power, and then started the server back up.
I tried to launch my normal vLLM workload again and received the same error.

vLLM isn’t a great way to verify whether a GPU works, so I needed a quick, simple way to reproduce the issue.

While looking at other forum posts here, I saw some that referenced the gpu-burn tool as a simple way to put load on a GPU.

git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
docker build -t gpu_burn .
docker run --rm --gpus all gpu_burn

Running this workload crashes my GPU the same way vLLM did.

docker run --rm --gpus 'device=GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3' gpu_burn # GPU Crashes
docker run --rm --gpus 'device=GPU-ce3431d8-112d-6587-ea45-a298b6737575' gpu_burn # No Issue
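
The per-GPU runs above can be wrapped in a small loop so each card is burned on its own and a failure can be attributed to a single UUID. This is only a sketch: the function name `burn_each_gpu` and the `DOCKER` override (handy for a dry run with `DOCKER=echo`) are assumptions of this example, not part of gpu-burn or Docker.

```shell
# Run the gpu_burn image against one GPU at a time, identified by UUID.
# DOCKER is a hypothetical override for this sketch; it defaults to "docker".
burn_each_gpu() {
    for uuid in "$@"; do
        echo "=== $uuid ==="
        "${DOCKER:-docker}" run --rm --gpus "device=$uuid" gpu_burn
        # Report the container's exit status rather than assuming what it means.
        echo "exit status: $?"
    done
}
```

For example, `burn_each_gpu GPU-ce3431d8-112d-6587-ea45-a298b6737575 GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3` runs the two cards back to back. If `nvidia-smi` is available on the host, `nvidia-smi --query-gpu=uuid --format=csv,noheader` lists the UUIDs to pass in.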

Working GPU:

root@asrock-02:~/gpu-burn# docker run --rm --gpus 'device=GPU-ce3431d8-112d-6587-ea45-a298b6737575' gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (UUID: GPU-ce3431d8-112d-6587-ea45-a298b6737575)
cuInit returned 0 (no error)
Using compare file: compare.ptx
Burning for 60 seconds.
38.3%  proc'd: 337 (15972 Gflop/s)   errors: 0   temps: 63 C 
        Summary at:   Mon Apr 13 20:17:01 UTC 2026

50.0%  proc'd: 337 (15972 Gflop/s)   errors: 0   temps: 66 C 
        Summary at:   Mon Apr 13 20:17:08 UTC 2026

66.7%  proc'd: 337 (15972 Gflop/s)   errors: 0   temps: 71 C 
        Summary at:   Mon Apr 13 20:17:18 UTC 2026

83.3%  proc'd: 674 (16239 Gflop/s)   errors: 0   temps: 74 C 
        Summary at:   Mon Apr 13 20:17:28 UTC 2026

100.0%  proc'd: 674 (16239 Gflop/s)   errors: 0   temps: 77 C 
        Summary at:   Mon Apr 13 20:17:38 UTC 2026

100.0%  proc'd: 674 (16239 Gflop/s)   errors: 0   temps: 78 C 
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 97249 MB of memory (96430 MB available, using 86787 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 337 iterations
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK

Failing GPU:

root@asrock-02:~/gpu-burn# docker run --rm --gpus 'device=GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3' gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (UUID: GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3)
cuInit returned 0 (no error)
Using compare file: compare.ptx
Burning for 60 seconds.
40.0%  proc'd: 337 (15726 Gflop/s)   errors: 839470969  (WARNING!)  temps: 59 C 
        Summary at:   Mon Apr 13 20:19:04 UTC 2026

58.3%  proc'd: 337 (15726 Gflop/s)   errors: 0   temps: 64 C 
        Summary at:   Mon Apr 13 20:19:15 UTC 2026

58.3%  proc'd: 337 (15726 Gflop/s)   errors: 0   temps: 67 C Failure during compute: Error in SGEMM (gpu_burn-drv.cpp:233): 


No clients are alive!  Aborting
61.7%  proc'd: -1 (15726 Gflop/s)   errors: -1  (DIED!)  temps: 67 C Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 97249 MB of memory (96430 MB available, using 86787 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 337 iterations
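
To compare many runs without reading each log by eye, the markers visible in the two outputs above (`GPU 0: OK`, `(WARNING!)`, `(DIED!)`, `No clients are alive`) can be reduced to a one-word verdict. This is a rough sketch that only knows about those strings as they appear in this post; the function name `burn_verdict` and the verdict words are made up for illustration.

```shell
# Read gpu_burn output on stdin and print one verdict word.
# The patterns are taken from the logs in this post, nothing more.
burn_verdict() {
    awk '
        /\(DIED!\)/ || /No clients are alive/ { died = 1 }
        /\(WARNING!\)/                         { warned = 1 }
        /GPU [0-9]+: OK/                       { ok = 1 }
        END {
            if (died)        print "DIED"
            else if (warned) print "ERRORS"
            else if (ok)     print "OK"
            else             print "UNKNOWN"
        }
    '
}
```

Usage would look like `docker run --rm --gpus 'device=GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3' gpu_burn 2>&1 | burn_verdict`. A crash takes priority over a compare-error warning because a run can show both, as the failing log does.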

With a way to reproduce the issue, I was able to adjust variables and start narrowing down the culprits.

Troubleshooting

Software Update

First, I tried updating my NVIDIA driver, CUDA, and container toolkit.
The issue still occurred, affecting the same single GPU.

The issue started with this software configuration:

  • OS: Debian 13.4
  • DOCKER: 29.4.0
  • NVIDIA_DRIVER: nvidia-open 580.95.05
  • NVIDIA_CUDA: 13.0.2
  • NVIDIA_CONTAINER_TOOLKIT: 1.17.8-1

The issue continued after updating the NVIDIA driver, CUDA, and the container toolkit:

  • OS: Debian 13.4
  • DOCKER: 29.4.0
  • NVIDIA_DRIVER: nvidia-open 595.45.04 (Also tested on 580.95.05)
  • NVIDIA_CUDA: 13.2.0 (Also tested on 13.0.2)
  • NVIDIA_CONTAINER_TOOLKIT: 1.19.0-1 (Also tested on 1.17.8-1)

Physical Connections

Next, I tried:

  • reseating the GPU in the motherboard slot
  • moving the GPU to another PCIe slot
  • swapping the PCIe slots and power connections of the 2 GPUs

The issue persisted, each time affecting the same GPU.

Logs

I have attached 2 nvidia-bug-report.sh logs, one generated before the crash, and one generated after.

You can see the crash in the journalctl -b -0 portion of the nvidia-bug-report-after-crash.log.gz.

Apr 13 13:15:03 asrock-02 systemd[1]: nvidia-cdi-refresh.service: Deactivated successfully.
Apr 13 13:15:03 asrock-02 systemd[1]: Finished nvidia-cdi-refresh.service - Refresh NVIDIA CDI specification file.
Apr 13 13:19:17 asrock-02 kernel: NVRM: GPU1 kgspHealthCheck_TU102: ****************************** GSP-CrashCat Report *******************************
Apr 13 13:19:17 asrock-02 kernel: NVRM: GPU1 kgspPrintGspBinBuildId_IMPL: GSP bin buildId: ce2068d24a08608352bb5893f6c46395dcd3146d
Apr 13 13:19:17 asrock-02 kernel: NVRM: GPU at PCI:0000:41:00: GPU-ea39a15d-637f-b457-5f9e-6c8946ec2ef3
Apr 13 13:19:17 asrock-02 kernel: NVRM: GPU Board Serial Number: 1792425054377
Apr 13 13:19:17 asrock-02 kernel: NVRM: Xid (PCI:0000:41:00): 120, GSP task exception: load access page fault (cause:0xd) @ pc:0x15275c0, partition:4#0, task:3
...
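
For anyone scanning a long kernel log for the same kind of event, the `NVRM: Xid` lines can be pulled out and reduced to a PCI address and Xid number. A minimal sketch, assuming the line format shown in the excerpt above; `xid_events` is a hypothetical helper name, not an NVIDIA tool.

```shell
# Print "<pci-address> <xid-number>" for every NVRM Xid line on stdin.
# The pattern matches lines shaped like the one in this post:
#   NVRM: Xid (PCI:0000:41:00): 120, GSP task exception: ...
xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}
```

Piping the current boot's kernel messages through it, for example `journalctl -k -b 0 | xid_events`, gives one line per event, which makes it easy to confirm that every event points at the same PCI address.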

Conclusion

I believe the GPU is faulty because:

  • The issue occurs on only 1 of the 2 identical GPUs.
  • Swapping the PCIe slots and power connections of the 2 GPUs does not change which GPU crashes.
  • The failure occurred suddenly on a system that had been stable for months.

Attached here are the logs.

nvidia-bug-report-before-crash.log.gz (1.3 MB)
nvidia-bug-report-after-crash.log.gz (1.3 MB)

IMO this is quite obviously a faulty GPU, just as you said.
The fact that official NV support requires you to also post it all here is a joke in itself…

I tried with only the bad GPU connected to the system and the crash still occurred. Here are the logs from that event.

nvidia-bug-report-single-after-crash.log.gz (2.4 MB)
nvidia-bug-report-single-before-crash.log.gz (2.4 MB)

The support team is still processing my claim. One of their requests was to test the GPU in a separate system, which I have done.

The GPU continues to crash in this separate system. Posting logs here.

nvidia-bug-report-alternate-system-after-crash.log.gz (309.7 KB)
nvidia-bug-report-alternate-system-before-crash.log.gz (326.3 KB)

I had the exact same issue with an RTX PRO 5000 Blackwell. I figured out that the cause was a new kernel version installed by Ubuntu unattended upgrades. I switched back from “6.8.0-111.111” to “6.8.0-110-generic”, and it now works perfectly under heavy load.