Linux kernel 5.10+ CUDA_ERROR_MISALIGNED_ADDRESS

NVIDIA-SMI 460.91.03
Driver Version 460.91.03
CUDA Version 11.2
Linux kernel 5.10 or 5.11 (tried both)
Ubuntu 20.04 LTS
Dell Precision 7550 Mobile Workstation
Quadro RTX 5000

I am trying to diagnose a (presumably) CUDA problem.

Case 1:
Fresh reboot, running nothing but a web browser and a terminal. After some time nvidia-smi reports ERR for power consumption, the GPU temperature reaches ~60 C, and eventually everything freezes.

Case 2:
Run Blender and turn on CUDA rendering with Cycles. After a while Blender reports a CUDA error. If I switch to CPU rendering, rendering continues on the CPU, but eventually everything freezes anyway. This happens much more quickly than in case 1.

Case 3:
Trying to diagnose what is going on, I installed CuPy and wrote a little Python script that crunches some numbers on the GPU.

import numpy as np
import cupy as cp
import time
print('Running Test...')
s = time.time()
x_gpu = cp.ones((10000, 10000))
for i in range(2):
    w, v = cp.linalg.eigh(x_gpu)
e = time.time()
print('Time consumed by cupy {}'.format(e-s))

I verified that this does run fine: nvidia-smi reports 100% GPU utilization, a power draw of 80 W, and roughly 5 GB of memory in use.

It even runs multiple times without error. But eventually we get errors like:

CUSOLVERError: CUSOLVER_STATUS_EXECUTION_FAILED
CUDADriverError: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

After that, nvidia-smi reports ERR for power consumption and 0% utilization, even when the script is rerun and completes properly.
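
To pin down when the power reading first turns into ERR, the nvidia-smi values could be logged with timestamps while the test runs. The following is only a minimal sketch of a separate helper: it assumes nvidia-smi is on the PATH and accepts the power.draw, temperature.gpu and utilization.gpu query fields, and the names FIELDS, parse_reading and poll are made up for illustration. Any field that is not a plain number is recorded as None, on the assumption that a failed sensor read is printed as some non-numeric marker.

```python
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu (assumed to be supported
# by the installed driver).
FIELDS = ["power.draw", "temperature.gpu", "utilization.gpu"]


def parse_reading(line):
    """Turn one CSV line from nvidia-smi into a dict of floats.

    A field that is missing or not a plain number (for example an error or
    N/A marker) is stored as None, so a failed sensor read stands out.
    """
    reading = dict.fromkeys(FIELDS)  # every field starts as None
    values = [v.strip() for v in line.split(",")]
    for name, value in zip(FIELDS, values):
        try:
            reading[name] = float(value)
        except ValueError:
            pass  # leave None for non-numeric output
    return reading


def poll(interval_s=5.0, count=10):
    """Print a timestamped reading for the first GPU, count times."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    for _ in range(count):
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        lines = out.strip().splitlines()
        first_gpu = lines[0] if lines else ""
        print(time.strftime("%H:%M:%S"), parse_reading(first_gpu))
        time.sleep(interval_s)

# Hypothetical use, in a second terminal while the CuPy test runs:
#     poll(interval_s=5.0, count=120)
```

Running the poller in a second terminal would show whether the readings go bad before, during, or after the first CUDA error.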

I then tried running the script with cuda-memcheck as was advised in another thread.

The first time it finds an error I get:

(scripting) o2d@LAP124573:~$ cuda-memcheck python ./cupy_text.py 
========= CUDA-MEMCHECK
Running Test... 
Traceback (most recent call last):
  File "/home/o2d/./cupy_text.py", line 8, in <module>
    w, v = cp.linalg.eigh(x_gpu)
  File "/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy/linalg/_eigenvalue.py", line 133, in eigh
    return _syevd(a, UPLO, True)
  File "/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy/linalg/_eigenvalue.py", line 82, in _syevd
    syevd(
  File "cupy_backends/cuda/libs/cusolver.pyx", line 2079, in cupy_backends.cuda.libs.cusolver.dsyevd
  File "cupy_backends/cuda/libs/cusolver.pyx", line 2088, in cupy_backends.cuda.libs.cusolver.dsyevd
  File "cupy_backends/cuda/libs/cusolver.pyx", line 700, in cupy_backends.cuda.libs.cusolver.check_status
cupy_backends.cuda.libs.cusolver.CUSOLVERError: CUSOLVER_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 253, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_MISALIGNED_ADDRESS: misaligned address
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 253, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_MISALIGNED_ADDRESS: misaligned address
========= ERROR SUMMARY: 0 errors

This seems to say that cuda-memcheck doesn’t find any errors even though the python script did.
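
Since the tool's summary and the Python exceptions can disagree, it might also be worth checking the numbers from the runs that complete without an exception. The input is an all-ones matrix, which has rank 1, so its eigenvalues are known in advance: one equals n and the remaining n - 1 are zero. Below is a minimal sketch of such a known-answer check. The function name check_ones_spectrum and the tolerance are my own assumptions, not anything CuPy provides.

```python
def check_ones_spectrum(eigenvalues, n, rel_tol=1e-6):
    """Return True if eigenvalues match those of an n-by-n all-ones matrix.

    That matrix has rank 1: one eigenvalue equals n and the other n - 1
    are zero. The values are sorted here so the input order does not matter.
    """
    values = sorted(float(v) for v in eigenvalues)
    if n < 1 or len(values) != n:
        return False
    tol = rel_tol * n  # absolute tolerance scaled by the largest eigenvalue
    zeros_ok = all(abs(v) <= tol for v in values[:-1])
    top_ok = abs(values[-1] - n) <= tol
    return zeros_ok and top_ok  # NaN fails both comparisons

# Hypothetical use inside the loop of the test script:
#     w, v = cp.linalg.eigh(x_gpu)
#     if not check_ones_spectrum(cp.asnumpy(w), 10000):
#         print('iteration', i, 'returned wrong eigenvalues')
```

A run that finishes but fails this check would suggest that results are being corrupted silently, not only when an exception is raised.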

Rerunning the test, we get 74 errors, which all appear to look like this:

(scripting) o2d@LAP124573:~$ cuda-memcheck python ./cupy_text.py 
========= CUDA-MEMCHECK
Running Test...
========= Invalid __global__ read of size 8
=========     at 0x00000ba0 in void cuds_symv_alg6_stage1_lower<double, int=5, int=8>(int, double const *, unsigned long, double const *, int, double*)
=========     by thread (12,1,0) in block (120,0,0)
=========     Address 0x7f257286120e is misaligned
=========     Device Frame:void cuds_symv_alg6_stage1_lower<double, int=5, int=8>(int, double const *, unsigned long, double const *, int, double*) (void cuds_symv_alg6_stage1_lower<double, int=5, int=8>(int, double const *, unsigned long, double const *, int, double*) : 0xba0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x2235d8]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0xa767f9]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0xa76887]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0xaacbd5]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0x2690ab]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0x26d1e9]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0x249eb9]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 [0x243bca]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/../../../../../libcusolver.so.10 (cusolverDnDsyevd + 0x4c8) [0x25dee8]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/cusolver.cpython-39-x86_64-linux-gnu.so [0x1da0a]
=========     Host Frame:/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy_backends/cuda/libs/cusolver.cpython-39-x86_64-linux-gnu.so [0x8a724]
=========     Host Frame:python [0x142964]
=========     Host Frame:python (_PyObject_MakeTpCall + 0x37f) [0x13cc0f]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x4a4) [0x1c82a4]
=========     Host Frame:python (_PyFunction_Vectorcall + 0x413) [0x1832e3]
=========     Host Frame:python [0xf9fb9]
=========     Host Frame:python [0x182552]
=========     Host Frame:python (_PyFunction_Vectorcall + 0x1e7) [0x1830b7]
=========     Host Frame:python [0xfbff4]
=========     Host Frame:python [0x182552]
=========     Host Frame:python (PyEval_EvalCodeEx + 0x4c) [0x231fcc]
=========     Host Frame:python (PyEval_EvalCode + 0x1b) [0x18357b]
=========     Host Frame:python [0x23207b]
=========     Host Frame:python [0x2684d5]
=========     Host Frame:python [0x10ef34]
=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x1bf) [0x26d1cf]
=========     Host Frame:python (Py_RunMain + 0x3fc) [0x26d91c]
=========     Host Frame:python (Py_BytesMain + 0x39) [0x26daa9]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:python [0x1efec4]

And then Python throws an error at the end:

Traceback (most recent call last): 
  File "/home/o2d/./cupy_text.py", line 8, in <module>
    w, v = cp.linalg.eigh(x_gpu)
  File "/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy/linalg/_eigenvalue.py", line 133, in eigh
    return _syevd(a, UPLO, True)
  File "/home/o2d/anaconda3/envs/scripting/lib/python3.9/site-packages/cupy/linalg/_eigenvalue.py", line 82, in _syevd
    syevd(
  File "cupy_backends/cuda/libs/cusolver.pyx", line 2079, in cupy_backends.cuda.libs.cusolver.dsyevd
  File "cupy_backends/cuda/libs/cusolver.pyx", line 2088, in cupy_backends.cuda.libs.cusolver.dsyevd
  File "cupy_backends/cuda/libs/cusolver.pyx", line 700, in cupy_backends.cuda.libs.cusolver.check_status
cupy_backends.cuda.libs.cusolver.CUSOLVERError: CUSOLVER_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 253, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy_backends/cuda/api/driver.pyx", line 253, in cupy_backends.cuda.api.driver.moduleUnload
  File "cupy_backends/cuda/api/driver.pyx", line 124, in cupy_backends.cuda.api.driver.check_status
cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
========= ERROR SUMMARY: 74 errors

The GPU temperature never climbs much above 60 C. Power consumption appears to be capped at 80 W, but trying to lower the power limit with nvidia-smi -pl gives me the error “Changing power management limit is not supported…”

Disabling the NVIDIA driver and using the default Ubuntu driver seems to solve the problem encountered in case 1, but the whole point of this laptop was to have the CUDA capabilities.
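
Because the freezes seem tied to the NVIDIA driver being loaded, the kernel log may hold driver fault reports from just before a freeze. The sketch below scans log lines for Xid reports. It assumes the driver writes them as lines containing "NVRM: Xid (<device>): <code>", and the helper name find_xid_events is made up. I have not confirmed that any such lines appear on this machine.

```python
import re

# Assumed shape of a driver fault line: "NVRM: Xid (<device>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(lines):
    """Return a list of (device, code) tuples for every Xid line found."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events

# Hypothetical use on a saved kernel log (for example the output of dmesg
# redirected to a file named kernel.log):
#     with open("kernel.log") as f:
#         for device, code in find_xid_events(f):
#             print(device, code)
```

The codes it prints, if any, could then be looked up in NVIDIA's Xid documentation and included in a warranty request.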

Can anyone give me advice? The laptop is still under warranty but I’m not sure whether it is defective or if I have done something stupid (which is highly possible).