GROMACS on T4 issues

We have an intermittent issue on some of our T4 nodes. When a GPU allocation is requested via Slurm (which works fine on other nodes), a T4 node will sometimes lock up as soon as the GPU is touched: user code will not run, and nvidia-smi never returns.
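
For reference, the allocations are requested roughly like this (the partition name is a placeholder, not our exact config):

srun -p gpu --gres=gpu:1 --pty nvidia-smi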

System load climbs as an irq kernel thread pins a CPU at 100%, and the kernel logs a stream of messages similar to the ones below.
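
The spinning thread shows up at the top of a standard CPU sort, e.g.:

ps -eo pid,comm,pcpu --sort=-pcpu | head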

Could it be a driver version issue?

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.64.00 Wed Feb 26 16:26:08 UTC 2020
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)

To get better performance out of GROMACS, we recently enabled persistence
mode and unrestricted application clocks with:

nvidia-smi -pm ENABLED
nvidia-smi -acp UNRESTRICTED
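
Note that -acp only unrestricts permission to change the application clocks; the clocks themselves are set with -ac. As a sketch (the <memMHz>,<grMHz> pair is a placeholder and must be one of the pairs the SUPPORTED_CLOCKS query reports):

nvidia-smi -q -d SUPPORTED_CLOCKS
nvidia-smi -ac <memMHz>,<grMHz>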

Output from dmesg:

[6458428.253918] [] ? __fput+0xec/0x260
[6458428.259662] [] ? ____fput+0xe/0x10
[6458428.265402] [] ? task_work_run+0xbb/0xe0
[6458428.271664] [] ? do_exit+0x2d1/0xa40
[6458428.277582] [] ? poll_select_copy_remaining+0x150/0x150
[6458428.285192] [] ? do_group_exit+0x3f/0xa0
[6458428.291460] [] ? get_signal_to_deliver+0x1ce/0x5e0
[6458428.298614] [] ? do_signal+0x57/0x6e0
[6458428.304626] [] ? ktime_get_ts64+0x52/0xf0
[6458428.310979] [] ? do_notify_resume+0x72/0xc0
[6458428.317507] [] ? int_signal+0x12/0x17
[6458548.032408] INFO: task python:575657 blocked for more than 120 seconds.
[6458548.039880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
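
The hung_task knob the log mentions lives under /proc/sys/kernel; setting it to 0 only silences the warning, it does not fix the hang:

cat /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_timeout_secs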

Any input would be welcome!
Thank you!

Best,
Delilah

Hi everyone,
Could anyone offer some input?
This exact bug is now showing up on all of our T4s and becoming a real problem.
Any advice will be appreciated!
Thank you!
Best,
Delilah