Bug Report: GPU Driver Hang with Specific Workloads on H100 with NVIDIA Drivers 550 and 555

Dear NVIDIA Support,

I am writing to report a recurring issue we have encountered on our H100 systems (across different servers), which appears to be related to NVIDIA GPU driver versions 550 and 555 and their interaction with certain workloads. This time the workload is the vllm/vllm-openai:v0.5.3.post1 Docker image. Below are the detailed steps and observations:

Issue Summary

Steps to Reproduce

  1. Running Docker Image:
    • The GPU got stuck while running the following Docker container:
    containers:
    - args:
      - huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --exclude
        "*consolidated*"* && python3 -m vllm.entrypoints.openai.api_server --model
        meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --gpu-memory-utilization
        0.98 --max-model-len 32768 --enforce-eager
      command:
      - bash
      - -c

Logs and Observations

  1. Container Logs:
$ kubectl -n ef37568f7hgrreeo0puc634h472ucfgre908ar09piv0s logs vllm-0 --tail=10000
Fetching 130 files: 100%|██████████| 130/130 [00:00<00:00, 17674.54it/s]
/root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-405B-Instruct-FP8/snapshots/a8f01524ffd5c05a7de914a51fae0b5afe738d3b
INFO 07-24 22:01:04 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 07-24 22:01:04 api_server.py:220] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=32768, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.98, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, 
speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-24 22:01:04 config.py:715] Defaulting to use mp for distributed inference
INFO 07-24 22:01:04 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fbgemm_fp8, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta-llama/Meta-Llama-3.1-405B-Instruct-FP8, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-24 22:01:05 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=87) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=86) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=88) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=89) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=90) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=91) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=92) INFO 07-24 22:01:05 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
  2. Kernel Logs:
[Wed Jul 24 21:50:32 2024] NVRM: Xid (PCI:0000:00:05): 31, pid='<unknown>', name=<unknown>, Ch 0000000a, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC7 GPCCLIENT_T1_1 faulted @ 0x7392_e2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[Wed Jul 24 21:50:32 2024] show_signal: 71 callbacks suppressed
[Wed Jul 24 21:50:32 2024] traps: pt_main_thread[1162989] general protection fault ip:73a1db85b941 sp:73a103d34b30 error:0 in libc-2.31.so[73a1db85b000+178000]
[Wed Jul 24 21:50:57 2024] NVRM: GPU at PCI:0000:00:08: GPU-0f2248ea-9a45-5498-0dd8-17d5288e2779
[Wed Jul 24 21:50:57 2024] NVRM: GPU Board Serial Number: 1652923017387
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Ch 0000000a
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000b
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000c
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000d
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000e
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000f
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 00000010
[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 00000011
[Wed Jul 24 22:01:08 2024] NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
  3. Driver Version:
[Wed Jun  5 08:45:50 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  555.42.02  Mon May 13 17:24:29 UTC 2024
[Wed Jun  5 08:45:51 2024] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  555.42.02  Mon May 13 16:48:14 UTC 2024
  4. Linux Version:
root@node6:~# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
root@node6:~# uname -a
Linux node6 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  5. Logs collected with nvidia-bug-report.sh:

node6.h100.nvidia-bug-report.log.gz (289.4 KB)
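For anyone triaging similar kernel logs, here is a minimal sketch of how the Xid events could be tallied per error code and per GPU. The helper name tally_xids and the regular expression are my own assumptions, based only on the line format visible in the kernel log excerpt above; this is not an NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line shape, taken from the kernel log excerpt in this report:
#   NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000b
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def tally_xids(lines):
    """Return (events per Xid code, events per (PCI address, Xid code))."""
    by_code = Counter()
    by_gpu = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line, e.g. "GPU Board Serial Number"
        pci, code = match.group(1), int(match.group(2))
        by_code[code] += 1
        by_gpu[(pci, code)] += 1
    return by_code, by_gpu


# A few lines copied from the kernel log in this report.
sample = [
    "[Wed Jul 24 21:50:32 2024] NVRM: Xid (PCI:0000:00:05): 31, pid='<unknown>', name=<unknown>, Ch 0000000a, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC7 GPCCLIENT_T1_1 faulted @ 0x7392_e2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ",
    "[Wed Jul 24 21:50:57 2024] NVRM: GPU at PCI:0000:00:08: GPU-0f2248ea-9a45-5498-0dd8-17d5288e2779",
    "[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No",
    "[Wed Jul 24 21:50:57 2024] NVRM: Xid (PCI:0000:00:08): 95, pid=1237622, name=pt_main_thread, Ch 0000000b",
]

by_code, by_gpu = tally_xids(sample)
for code, count in by_code.most_common():  # most frequent code first
    print(f"Xid {code}: {count} event(s)")
for (pci, code), count in sorted(by_gpu.items()):
    print(f"{pci} Xid {code}: {count} event(s)")
```

The same function should work on the output of dmesg or journalctl -k read line by line, since the pattern ignores whatever timestamp prefix precedes "NVRM:". Note that one fault is logged once per affected channel, so the counts are log lines rather than distinct incidents.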

Additional Information

  • The nvidia-bug-report.sh tool also hangs during execution.
  • Retrying with the --safe-mode --extra-system-data CLI arguments, as suggested, worked.
  • We have also encountered the same issue with driver version 550 on H100 GPUs across different servers:
    • The issue is described in Issue 164 on the xai-org/grok-1 GitHub repository. The grok-1 application frequently crashes NVIDIA drivers, particularly with version 550.
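As a rough sketch of the retry described in the bullets above, the collection step could be wrapped so that a hung run is abandoned and the safe-mode invocation is tried next. The function names and the 600-second timeout are illustrative assumptions; only the script name and the two flags come from this report.

```python
import subprocess

BUG_REPORT = "nvidia-bug-report.sh"
# Flags quoted in this report; nothing else about the tool is assumed here.
SAFE_ARGS = ["--safe-mode", "--extra-system-data"]


def bug_report_commands():
    """Commands to try in order: the plain run, then the safe-mode fallback."""
    return [[BUG_REPORT], [BUG_REPORT, *SAFE_ARGS]]


def collect_bug_report(timeout_s=600, run=subprocess.run):
    """Try each command in turn and return the one that succeeded, else None.

    timeout_s is an arbitrary illustrative value, not an NVIDIA recommendation.
    The run parameter exists only so the logic can be exercised without the tool.
    """
    for cmd in bug_report_commands():
        try:
            run(cmd, check=True, timeout=timeout_s)
            return cmd
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError,
                FileNotFoundError):
            continue  # hung, failed, or not installed: try the next variant
    return None
```

Calling collect_bug_report() as root on an affected node would attempt the plain run first. One caveat: if the hung process is stuck in an uninterruptible driver call, the timeout may not be able to kill it, in which case running the safe-mode command by hand is still the fallback.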

Proper NVIDIA drivers for H100

We are installing the drivers recommended by Ubuntu, i.e. nvidia-driver-555:

root@node6:~# ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:0a.0 ==
modalias : pci:v000010DEd00002331sv000010DEsd00001626bc03sc02i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-550-open - third-party non-free
driver   : nvidia-driver-555 - third-party non-free recommended
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-555-open - third-party non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-545-open - distro non-free
driver   : nvidia-driver-550 - third-party non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Are these drivers (version 555) suitable for use with H100 and server workloads (AI processing), or should we consider installing the nvidia-driver-535-server or nvidia-driver-535-server-open versions instead? We are prioritizing stability and performance, with a greater emphasis on stability.

Request for Support

We would appreciate your assistance in diagnosing and resolving this issue. It appears to be related to the GPU driver, possibly involving MMU faults and general protection faults. Any insights or recommendations would be highly valued.

Please let me know if you require any further information or additional logs.

Thank you for your support.

Kind regards,
Andrey Arapov
Overclock Labs, creators of Akash Network

Additional info

We’ve downgraded the NVIDIA drivers from 555 back to nvidia-driver-535-server across all H100s on this cluster, as we have additionally been seeing “RuntimeError: CUDA error: uncorrectable ECC error encountered” errors with vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 8 --enable-chunked-prefill=False --distributed-executor-backend ray.

I’ve seen the same issues in configurations where the H100 PCIe devices are passed through to a guest VM in QEMU/KVM, with driver 535.183.06.

I have not been able to reproduce the issue on bare metal, but once the workloads are run inside the VM, the “ECC” errors occur (Xid 94 and 95).

Outside of vLLM I’ve been able to trigger it with pytorch-benchmark as well, when using the default number of workers (32) and a large batch size (896/904) with synthetic data. When limiting num_workers to 8, the issue does not seem to appear.

Is your system virtualised or bare-metal?

Thank you, @eugene.debeste!
The provider’s system is indeed virtualized (with QEMU) and has PCIe H100s; they are now working on reproducing the issue themselves.

Just for the record, here are the types of errors we’ve been dealing with (NVIDIA driver 535.183.01):

Xid errors 94, 140, 31, 43, 95 and 63 occurred, listed here in descending order of frequency:

NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 00000008
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 00000009
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000a
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000b
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000c
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000d
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000e
NVRM: Xid (PCI:0000:00:09): 94, pid=3550278, name=pt_main_thread, Ch 0000000f
NVRM: Xid (PCI:0000:00:07): 43, pid=1441160, name=pt_main_thread, Ch 00000008
NVRM: Xid (PCI:0000:00:07): 43, pid=1443554, name=pt_main_thread, Ch 00000008
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 2
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 2
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: Xid (PCI:0000:00:09): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0xb:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU 0000:00:09.0: RmInitAdapter failed! (0x62:0x40:2404)
NVRM: GPU 0000:00:09.0: rm_init_adapter failed, device minor number 3
NVRM: GPU at PCI:0000:00:08: GPU-835ba0d7-7218-84be-af2e-2beba13420e8
NVRM: GPU Board Serial Number: 1650623020125
NVRM: Xid (PCI:0000:00:08): 31, pid=2908924, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0c: GPU-999bc7c9-951c-cb6c-d72f-ab611abdc2fc
NVRM: GPU Board Serial Number: 1650623020082
NVRM: Xid (PCI:0000:00:0c): 31, pid=2908928, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0b: GPU-a2fba816-ad60-752b-aedd-376ac341745f
NVRM: GPU Board Serial Number: 1650623020058
NVRM: Xid (PCI:0000:00:0b): 31, pid=2908927, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0d: GPU-d8103984-1a5b-388b-d045-02b765eca3cd
NVRM: GPU Board Serial Number: 1650623020016
NVRM: Xid (PCI:0000:00:0d): 31, pid=2908929, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:07: GPU-2c390139-d85f-7eee-365e-8a00244025ad
NVRM: GPU Board Serial Number: 1650623011793
NVRM: Xid (PCI:0000:00:07): 31, pid=2908923, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:0a: GPU-69a138b9-be4e-603f-ed74-6d10844329f5
NVRM: GPU Board Serial Number: 1650223015186
NVRM: Xid (PCI:0000:00:0a): 31, pid=2908926, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: GPU at PCI:0000:00:09: GPU-d547b3e3-89c4-0c85-b81d-b6b15e62e10a
NVRM: GPU Board Serial Number: 1650723017029
NVRM: Xid (PCI:0000:00:09): 31, pid=2908925, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7d19_12a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0d): 31, pid=3404743, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_72a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0c): 31, pid=3404742, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:08): 31, pid=3404738, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:09): 31, pid=3404739, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:07): 31, pid=3404737, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0b): 31, pid=3404741, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0a): 31, pid=3404740, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x77b6_6ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0b): 31, pid=3684110, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0a): 31, pid=3684109, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:07): 31, pid=3684106, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:08): 31, pid=3684107, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0d): 31, pid=3684112, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_92a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:0c): 31, pid=3684111, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:00:09): 31, pid=3684108, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x775f_96a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 11 04:48:16 node3 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:09: GPU-f9d74506-a27c-4168-bb62-0910f23e9a31
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650623011938
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2110660, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:07: GPU-1e778966-789f-658a-726c-34e5253f7b31
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017460
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2110658, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-6bd76402-49c7-14b0-cf4e-9706661f2b14
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017058
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2110663, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:08: GPU-c8544924-fd73-e0ed-2644-d83eb7dd7658
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723016962
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2110659, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-6e8751ce-fecf-53b6-7682-10facc66681b
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017519
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2110662, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-23981421-5eb6-13b9-312b-8e01bbbcec23
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723017408
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2110661, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 13 14:12:36 node3 kernel: NVRM: GPU at PCI:0000:00:0d: GPU-000e1b97-b118-337c-71a2-e67b64f05220
Aug 13 14:12:36 node3 kernel: NVRM: GPU Board Serial Number: 1650723016956
Aug 13 14:12:36 node3 kernel: NVRM: Xid (PCI:0000:00:0d): 31, pid=2110664, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x733f_76a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 00000008
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 00000009
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000a
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000b
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000c
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000d
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000e
Aug 18 19:42:40 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=875206, name=pt_main_thread, Ch 0000000f
Aug 19 18:38:13 node3 kernel: NVRM: GPU at PCI:0000:00:09: GPU-f9d74506-a27c-4168-bb62-0910f23e9a31
Aug 19 18:38:13 node3 kernel: NVRM: GPU Board Serial Number: 1650623011938
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 00000008
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 00000009
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000a
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000b
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000c
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000d
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000e
Aug 19 18:38:13 node3 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=2189357, name=pt_main_thread, Ch 0000000f
Jul 27 20:33:31 node4 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
Jul 28 08:50:35 node4 kernel: NVRM: GPU at PCI:0000:00:07: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
Jul 28 08:50:35 node4 kernel: NVRM: GPU Board Serial Number: 1652923017935
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid='<unknown>', name=<unknown>, Contained: CE User Channel (0xb). RST: No, D-RST: No
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 00000008
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 00000009
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000a
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000b
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000c
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000d
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000e
Jul 28 08:50:35 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=941956, name=python, Ch 0000000f
Jul 28 08:51:33 node4 kernel: NVRM: GPU at PCI:0000:00:08: GPU-c24840ec-8de1-83d5-b126-08000173ae32
Jul 28 08:51:33 node4 kernel: NVRM: GPU Board Serial Number: 1652923018111
Jul 28 08:51:33 node4 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
Jul 28 09:15:32 node4 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
Jul 31 23:14:17 node4 kernel: NVRM: GPU at PCI:0000:00:05: GPU-efbdfde9-5798-a6e7-4c46-12518fa15375
Jul 31 23:14:17 node4 kernel: NVRM: GPU Board Serial Number: 1650423013443
Jul 31 23:14:17 node4 kernel: NVRM: Xid (PCI:0000:00:05): 43, pid=1045242, name=pt_main_thread, Ch 00000008
Jul 31 23:15:25 node4 kernel: NVRM: Xid (PCI:0000:00:05): 43, pid=1084984, name=pt_main_thread, Ch 00000008
Aug 05 03:54:11 node4 kernel: NVRM: GPU at PCI:0000:00:07: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
Aug 05 03:54:11 node4 kernel: NVRM: GPU Board Serial Number: 1652923017935
Aug 05 03:54:11 node4 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2512125, name=python3.10, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x73ac_1c000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
Aug 09 13:32:51 node4 kernel: NVRM: GPU at PCI:0000:00:08: GPU-c24840ec-8de1-83d5-b126-08000173ae32
Aug 09 13:32:51 node4 kernel: NVRM: GPU Board Serial Number: 1652923018111
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=3780430, name=pt_main_thread, Ch 00000008
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=3780429, name=pt_main_thread, Ch 00000008
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=3780430, name=pt_main_thread, Ch 00000009
Aug 09 13:32:51 node4 kernel: NVRM: Xid (PCI:0000:00:07): 94, pid=3780429, name=pt_main_thread, Ch 00000009
...
...
Aug 10 11:04:07 node4 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:08: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017935
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=3253951, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:09: GPU-c24840ec-8de1-83d5-b126-08000173ae32
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923018111
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=3253952, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0d: GPU-f790ae43-c5a4-fe11-d524-657843e0c85d
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1650623011704
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0d): 31, pid=3253956, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_cea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:07: GPU-adbf98be-b5d4-cdff-4807-e5c096c81db8
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017890
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=3253950, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-95733769-1b5a-ab5f-ca42-8da0237cf8d7
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017827
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=3253955, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-3747688c-1804-eadc-5bf7-525bf0e97233
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017985
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=3253953, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 12 22:04:37 node4 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-c2a5fa97-ed5f-41b7-8afc-3107c9aeabb2
Aug 12 22:04:37 node4 kernel: NVRM: GPU Board Serial Number: 1652923017448
Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=3253954, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7b33_d2a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 00000008
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 00000009
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000a
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000b
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000c
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000d
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000e
Aug 18 21:58:34 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3226355, name=pt_main_thread, Ch 0000000f
Aug 18 22:11:01 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 22:11:01 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3282927, name=pt_main_thread, Ch 00000008
Aug 18 22:11:01 node4 kernel: NVRM: Xid (PCI:0000:00:09): 94, pid=3282927, name=pt_main_thread, Ch 00000009
...
Jul 27 21:03:39 node6 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
Jul 28 12:57:48 node6 kernel: NVRM: GPU at PCI:0000:00:05: GPU-7033aba4-bd61-b232-aefd-82b60b5bad52
Jul 28 12:57:48 node6 kernel: NVRM: GPU Board Serial Number: 1652923017484
Jul 28 12:57:48 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=58448, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted @ 0x76f5_6aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 28 15:58:00 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1143621, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7898_3aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 29 07:59:23 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1305565, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7c86_0ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 29 15:45:07 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2283656, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x708b_bea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 29 18:03:48 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2747406, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7266_5aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 31 10:16:22 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2875767, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7c41_2aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Jul 31 11:50:40 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=981508, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7464_02a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 04:16:15 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1062351, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x748c_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 08:19:20 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2032503, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7534_0aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 12:18:15 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2237507, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7cbe_cea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 16:34:35 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2466105, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x755e_e6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 01 20:40:52 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2698993, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7602_c6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-0f4c17d8-d2a5-bb1c-610a-48700ce11a3a
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011855
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2906358, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-9bda9cc7-dc78-2b6f-f206-6ab8ea8dcad5
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1652923017933
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2906357, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-bd281832-e44f-e4ee-377b-d5807fc3a5eb
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011647
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2906356, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:07: GPU-286883c5-ad43-eb3e-2259-257aeb552296
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011605
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2906353, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:09: GPU-87f9a74c-f99a-9888-8026-350fb3070740
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011670
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2906355, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:06: GPU-d371d7d9-87e9-62fd-dc48-ded1953889ae
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1650623011690
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=2906352, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 17:09:41 node6 kernel: NVRM: GPU at PCI:0000:00:08: GPU-0f2248ea-9a45-5498-0dd8-17d5288e2779
Aug 02 17:09:41 node6 kernel: NVRM: GPU Board Serial Number: 1652923017387
Aug 02 17:09:41 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2906354, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x7145_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=4004299, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=4004301, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7cc4_8aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=4004300, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=4004298, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=4004297, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=4004303, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7cc4_8aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 02 19:01:44 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=4004302, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7cc4_8ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 04 11:26:59 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=4116123, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x7276_16a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 04 23:04:39 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=1982354, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x70d6_4ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 04 23:12:58 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2581605, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x798b_26a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 05 19:21:06 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2591483, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x729f_3ea00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 07 21:19:17 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=187395, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_10 faulted @ 0x7c8c_3aa00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 00:07:39 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2271239, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_9 faulted @ 0x7305_92a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2418069, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2418068, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2418075, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_7 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=2418067, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_6 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2418073, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_15 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2418070, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_3 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 01:32:25 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2418074, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_13 faulted @ 0x71b1_56a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 08:25:55 node6 kernel: NVRM: Xid (PCI:0000:00:05): 31, pid=2514252, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_10 faulted @ 0x730d_f6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=2836999, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_14 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=2836997, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_15 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=2837000, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=2836996, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_13 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:06): 31, pid=2836994, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_12 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=2836995, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 08 14:37:17 node6 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=2836998, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_11 faulted @ 0x7a38_b6a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
...
Jul 28 20:21:31 node7 kernel: NVRM: GPU at PCI:0000:00:08: GPU-174dc3d1-ee4a-ab76-e7c5-089c14a3b4b2
Jul 28 20:21:31 node7 kernel: NVRM: GPU Board Serial Number: 1650723017142
Jul 28 20:21:31 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
Jul 28 20:21:31 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 00000008
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 00000009
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000a
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000b
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000c
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000d
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000e
Jul 28 20:21:32 node7 kernel: NVRM: Xid (PCI:0000:00:08): 95, pid=1678362, name=python3, Ch 0000000f
Jul 29 07:54:16 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:17 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:18 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:19 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:19 node7 kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:19 node7 kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
Jul 29 07:54:20 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:21 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:22 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:23 node7 kernel: NVRM: Xid (PCI:0000:00:07): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:23 node7 kernel: NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:23 node7 kernel: NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 2
Jul 29 07:54:24 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:25 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:26 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:27 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:27 node7 kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:27 node7 kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
Jul 29 07:54:28 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:29 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:30 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:31 node7 kernel: NVRM: Xid (PCI:0000:00:08): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:0, LTC:0, MMU:0, PCIE:0
Jul 29 07:54:31 node7 kernel: NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x62:0xb:2404)
Jul 29 07:54:31 node7 kernel: NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 3
...
...
Aug 10 10:12:31 node8 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0d: GPU-624baca6-99f9-1923-de58-c1c3e2127948
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1650723017118
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0d): 31, pid=321832, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_82a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0b: GPU-c08c7e03-968e-a9d4-41df-6bd227312ccc
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017329
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0b): 31, pid=321830, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0c: GPU-7b0ead78-b96a-ac40-c015-6e57d219eb1b
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1650623011835
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0c): 31, pid=321831, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:07: GPU-264bcba1-c564-3380-77da-2c0a01a37e90
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923018062
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:07): 31, pid=321826, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:08: GPU-74e4ee32-858d-0f8d-f2ad-ba474a3d4819
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017434
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=321827, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:0a: GPU-b5d19501-13d9-ca5d-6b6a-8139f44d9bbd
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017315
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:0a): 31, pid=321829, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 19:14:44 node8 kernel: NVRM: GPU at PCI:0000:00:09: GPU-fd5efc82-dde2-5451-7959-99dc0d46a6b3
Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017698
Aug 18 19:14:44 node8 kernel: NVRM: Xid (PCI:0000:00:09): 31, pid=321828, name=pt_main_thread, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x728e_86a00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 00000008
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 00000009
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000a
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000b
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000c
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000d
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000e
Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, pid=719935, name=pt_main_thread, Ch 0000000f
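
To make the pattern in the excerpts above easier to compare across nodes, the Xid lines can be tallied per node, PCI address and Xid code. The script below is only a minimal sketch, assuming the kernel log lines keep the exact layout shown above; the regular expression and the helper name `summarize_xids` are our own and are not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Matches kernel log lines of the form seen in the excerpts above, e.g.
#   Aug 12 22:04:37 node4 kernel: NVRM: Xid (PCI:0000:00:08): 31, pid=...
# The layout is an assumption based on those excerpts only.
XID_RE = re.compile(
    r"^\w{3}\s+\d{1,2} [\d:]{8} (?P<node>\S+) kernel: NVRM: Xid "
    r"\(PCI:(?P<pci>[0-9a-fA-F:]+)\): (?P<xid>\d+),"
)


def summarize_xids(lines):
    """Count Xid events keyed by (node, PCI address, Xid code).

    Lines that are not Xid reports (module load messages, GPU UUID and
    serial number lines, "..." elisions) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = XID_RE.match(line)
        if match:
            key = (match["node"], match["pci"], int(match["xid"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the excerpts above, used as sample input.
    sample = [
        "Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, "
        "pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No",
        "Aug 18 20:45:34 node8 kernel: NVRM: Xid (PCI:0000:00:08): 94, "
        "pid=719935, name=pt_main_thread, Ch 00000008",
        "Aug 18 19:14:44 node8 kernel: NVRM: GPU Board Serial Number: 1652923017434",
    ]
    for (node, pci, xid), n in sorted(summarize_xids(sample).items()):
        print(f"{node} {pci} Xid {xid}: {n}")
```

`summarize_xids` accepts any iterable of lines, so it can also be pointed at an open file containing saved kernel log output from each node.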