Hi there, I am running the sanity test and got this error in the data_validation test:
word content expected
13 a5a5a5a5 3f4c7e6a
14 a5a5a5a5 3f4c1e6a
15 a5a5a5a5 3f4cde6a
16 a5a5a5a5 3f4d5e6a
17 a5a5a5a5 3f4e5e6a
18 a5a5a5a5 3f485e6a
19 a5a5a5a5 3f445e6a
20 a5a5a5a5 3f5c5e6a
21 a5a5a5a5 3f6c5e6a
22 a5a5a5a5 3f0c5e6a
I debugged this myself and found that in the function init_hbuf_walking_bit, buf_ptr cannot be written: its content is always 0xffffffff. The corresponding GPU memory never changes either (it stays at a5a5a5a5 in this case). There are no other errors during the whole process, and all function calls return success.
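For anyone comparing against the table above: the expected column looks consistent with a walking-bit pattern, where a fixed base word has one bit flipped per word index. The sketch below is only inferred from the ten reported rows, not copied from the test source; the base value 0x3F4C5E6A, the formula, and the helper names are my assumptions.

```python
# Assumption: base value and formula are inferred from the reported table,
# not taken from the sanity test's source code.
BASE = 0x3F4C5E6A


def walking_bit_word(index: int) -> int:
    """Return BASE with bit (index mod 32) flipped."""
    return BASE ^ (1 << (index % 32))


# (word, content, expected) rows copied from the error output above.
REPORTED = [
    (13, 0xA5A5A5A5, 0x3F4C7E6A),
    (14, 0xA5A5A5A5, 0x3F4C1E6A),
    (15, 0xA5A5A5A5, 0x3F4CDE6A),
    (16, 0xA5A5A5A5, 0x3F4D5E6A),
    (17, 0xA5A5A5A5, 0x3F4E5E6A),
    (18, 0xA5A5A5A5, 0x3F485E6A),
    (19, 0xA5A5A5A5, 0x3F445E6A),
    (20, 0xA5A5A5A5, 0x3F5C5E6A),
    (21, 0xA5A5A5A5, 0x3F6C5E6A),
    (22, 0xA5A5A5A5, 0x3F0C5E6A),
]


def rows_not_matching(rows):
    """Return the word indices whose expected value differs from the inferred pattern."""
    return [word for word, _content, expected in rows
            if walking_bit_word(word) != expected]


if __name__ == "__main__":
    print("rows not matching inferred pattern:", rows_not_matching(REPORTED))
    for word, content, expected in REPORTED:
        # XOR shows which bits differ between what was read and what was expected.
        print(f"{word:4d} content={content:08x} expected={expected:08x} "
              f"diff={content ^ expected:08x}")
```

If the pattern assumption holds, the expected values are what the test intended to write, and the a5a5a5a5 content is just the untouched initial fill, which fits the observation that the write through the mapping never lands.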
Here are the settings on my server (bare-metal machine):
OS: Ubuntu 18.04.6 LTS
Linux kernel version: 4.15.0-213-generic
GPU: Tesla T4
Driver:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:21:00.0 Off |                    0 |
| N/A   29C    P8    14W /  70W |      4MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
$ ofed_info -s
MLNX_OFED_LINUX-5.4-3.4.0.0:
Kernel modules:
$ lsmod|grep nv
nvidia_peermem 16384 0
nvidia_uvm 1216512 6
ib_core 311296 10 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_iser,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia_drm 57344 0
nvidia_modeset 1241088 1 nvidia_drm
nvidia 56418304 49 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
drm_kms_helper 172032 2 mgag200,nvidia_drm
drm 401408 6 drm_kms_helper,nvidia,mgag200,nvidia_drm,ttm