Tesla T4 GPU suddenly slows to under 1% of its processing power; only a reboot fixes it

I have a VM running under VMware, with the GPU provided through a vGPU setup. The OS is Ubuntu 22.04 with CUDA 12.4 and NVIDIA driver 550.90.07, and the machine has one Tesla T4 GPU.

When I start a PyTorch script using cuda:0 as the device, it runs normally and the GPU behaves as expected. But after 15-20 minutes the GPU suddenly becomes extremely slow, and by slow I mean it uses less than 1% of its processing power. If I reboot the VM and run the script again, it is fast and uses the full GPU capability once more, but after the same 15-20 minute period it slows down dramatically again.
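To pin down exactly when the slowdown kicks in, it can help to time each step and flag the first one that is far slower than a warm-up baseline. The sketch below is a framework-agnostic illustration (the `run_with_slowdown_check` helper and its parameters are made-up names, not a standard API); for real GPU work the step function should call `torch.cuda.synchronize()` before returning so the timing is accurate.

```python
import time

def run_with_slowdown_check(step_fn, n_steps, warmup=10, warn_factor=5.0):
    """Run step_fn(t) n_steps times; return the first step that takes more
    than warn_factor times the median of the warm-up steps, or None."""
    warmup_times = []
    baseline = None
    for t in range(n_steps):
        start = time.perf_counter()
        step_fn(t)
        elapsed = time.perf_counter() - start
        if t < warmup:
            warmup_times.append(elapsed)
        elif baseline is None:
            baseline = sorted(warmup_times)[len(warmup_times) // 2]
        if baseline is not None and elapsed > warn_factor * baseline:
            return t, elapsed, baseline  # first suspiciously slow step
    return None

# Demo with a dummy step whose 20th iteration is artificially slow
# (a real step would run the forward/backward pass and synchronize):
result = run_with_slowdown_check(
    lambda t: time.sleep(0.3 if t == 20 else 0.001),
    n_steps=30, warn_factor=10.0)
print(result)  # (index of the first slow step, its time, the baseline)
```

Logging the step index and wall-clock time when this triggers makes it easy to correlate the slowdown with driver or hypervisor logs.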

This happens whether I run a complex model such as the Llama-2-7B LLM or a really simple PyTorch script. Below is the system information, captured while the GPU was running at 98% utilization.

I should mention that in all cases, whether running fast or slow, PyTorch recognizes that CUDA is available and can allocate tensors in GPU memory.

Here is some data that may be helpful:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

llm-lnx1:~$ lshw -c display
WARNING: you should run this program as super-user.
*-display
description: VGA compatible controller
product: SVGA II Adapter
vendor: VMware
physical id: f
bus info: pci@0000:00:0f.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=vmwgfx latency=64
resources: irq:16 ioport:1070(size=16) memory:e8000000-efffffff memory:fe000000-fe7fffff memory:c0000-dffff
*-display
description: VGA compatible controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 1
bus info: pci@0000:02:01.0
version: a1
width: 64 bits
clock: 66MHz
capabilities: vga_controller bus_master cap_list
configuration: driver=nvidia latency=0
resources: irq:69 memory:fc000000-fcffffff memory:d0000000-dfffffff memory:fa000000-fbffffff

Here is an example script:

import torch
import math

dtype = torch.float
device = torch.device("cuda:0")  # run on GPU
torch.backends.cudnn.benchmark = True

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    
    print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

The 98% utilization figure comes from another experiment that runs a Llama-2 model, but in practice the problem is not the amount of processing; it is the slowdown that happens after 15-20 minutes. I should mention that once the GPU becomes slow it never recovers; only a sudo reboot fixes it.

Here are my Python libraries:

Package Version


absl-py 2.1.0
accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
cmake 3.29.5.1
cuda-python 12.5.0
dataclasses-json 0.6.7
datasets 2.20.0
dill 0.3.8
exceptiongroup 1.2.1
expecttest 0.2.1
filelock 3.15.1
flatbuffers 24.3.25
frozenlist 1.4.1
fsspec 2024.5.0
gast 0.5.4
google-pasta 0.2.0
greenlet 3.0.3
grpcio 1.64.1
h5py 3.11.0
huggingface-hub 0.23.4
hypothesis 6.103.2
idna 3.7
Jinja2 3.1.4
jsonpatch 1.33
jsonpointer 3.0.0
keras 3.3.3
langchain 0.1.11
langchain-community 0.0.38
langchain-core 0.1.52
langchain-text-splitters 0.0.2
langsmith 0.1.81
libclang 18.1.1
lintrunner 0.12.5
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
ml-dtypes 0.3.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-extensions 1.0.0
namex 0.0.8
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvcc-cu12 12.3.107
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
opt-einsum 3.3.0
optree 0.11.0
orjson 3.10.5
packaging 23.2
pandas 2.2.2
pillow 10.2.0
pip 24.0
protobuf 4.25.3
psutil 5.9.8
psycopg2-binary 2.9.9
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pyodbc 5.1.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
quanto 0.2.0
redis 5.0.1
regex 2024.5.15
requests 2.32.3
rich 13.7.1
safetensors 0.4.3
setuptools 65.5.0
six 1.16.0
sortedcontainers 2.4.0
SQLAlchemy 2.0.31
sympy 1.12.1
tenacity 8.4.1
tensorboard 2.16.2
tensorboard-data-server 0.7.2
tensorflow 2.16.1
tensorflow-io-gcs-filesystem 0.37.0
termcolor 2.4.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.42.0.dev0
triton 2.3.1
types-dataclasses 0.6.6
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2024.1
urllib3 2.2.2
Werkzeug 3.0.3
wheel 0.43.0
wrapt 1.16.0
xxhash 3.4.1
yarl 1.9.4

Any help would be really appreciated.

Hi, @fermin_reyes

Sorry for the issue you are seeing.
Please note that this forum is for supporting Developer Tools.

Your problem is “Running a PyTorch script becomes slow in a vGPU environment”. I would suggest asking in a vGPU-related forum or a deep-learning-related forum to get better support.

Thanks!

This could be a symptom of clock throttling being applied due to overheating or insufficient power supply. Use nvidia-smi to monitor temperature, power, and all Slowdown metrics while your application is running. Ideally you would run this with administrative rights for access to all available metrics.
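For example, a small wrapper can poll nvidia-smi periodically and log these metrics alongside the training output. The query fields below are standard nvidia-smi fields; the `gpu_metrics` helper is an illustrative name, and its `raw` parameter exists only so the parsing can be demonstrated on a captured sample line without a GPU present.

```python
# Poll nvidia-smi for throttling clues while the workload runs.
# Assumes nvidia-smi is on PATH.
import subprocess

FIELDS = [
    "temperature.gpu",
    "power.draw",
    "clocks.sm",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]

def gpu_metrics(raw=None):
    """Return the queried fields for GPU 0 as a dict of strings.

    `raw` lets a captured output line be parsed without a GPU present.
    """
    if raw is None:
        raw = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=" + ",".join(FIELDS),
             "--format=csv,noheader,nounits"],
            text=True,
        )
    values = [v.strip() for v in raw.strip().split(",")]
    return dict(zip(FIELDS, values))

# Parsing an illustrative sample line (healthy state, no throttling active):
sample = "71, 68.05, 1590, Not Active, Not Active"
print(gpu_metrics(sample)["clocks_throttle_reasons.hw_thermal_slowdown"])
```

If clocks.sm collapses at the same moment the script slows down, or any throttle reason flips to Active, that points at thermal or power capping (or a vGPU licensing/scheduling issue on the host) rather than a problem in the script itself.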

The Tesla T4 is a datacenter GPU that normally comes pre-installed in host systems that are configured by an NVIDIA-approved system integrator. For problems such as this one, you would want to turn to the system integrator for assistance.