Tesla T4 GPU suddenly slows to under 1% of its processing power; only a reboot fixes it

I have a VM running under VMware, with the GPU provided through a vGPU setup. The OS is Ubuntu 22.04 with CUDA 12.4 and NVIDIA driver 550.90.07, and the machine has one Tesla T4 GPU.

When I start a PyTorch script using cuda:0 as the device, it runs normally and the GPU behaves as expected. But after 15-20 minutes the GPU suddenly becomes extremely slow, and by slow I mean it uses less than 1% of its processing power. If I reboot the VM and run the script again, it is fast and uses the full GPU capability once more, but after the same 15-20 minute period it slows down dramatically again.
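To pin down exactly when the slowdown kicks in, it can help to time each step and flag the first one that is far slower than a warm-up baseline. The sketch below is a framework-agnostic illustration (the `run_with_slowdown_check` helper and its parameters are made-up names, not a standard API); for real GPU work the step function should call `torch.cuda.synchronize()` before returning so the timing is accurate.

```python
import time

def run_with_slowdown_check(step_fn, n_steps, warmup=10, warn_factor=5.0):
    """Run step_fn(t) n_steps times; return the first step that takes more
    than warn_factor times the median of the warm-up steps, or None."""
    warmup_times = []
    baseline = None
    for t in range(n_steps):
        start = time.perf_counter()
        step_fn(t)
        elapsed = time.perf_counter() - start
        if t < warmup:
            warmup_times.append(elapsed)
        elif baseline is None:
            baseline = sorted(warmup_times)[len(warmup_times) // 2]
        if baseline is not None and elapsed > warn_factor * baseline:
            return t, elapsed, baseline  # first suspiciously slow step
    return None

# Demo with a dummy step whose 20th iteration is artificially slow
# (a real step would run the forward/backward pass and synchronize):
result = run_with_slowdown_check(
    lambda t: time.sleep(0.3 if t == 20 else 0.001),
    n_steps=30, warn_factor=10.0)
print(result)  # (index of the first slow step, its time, the baseline)
```

Logging the step index and wall-clock time when this triggers makes it easy to correlate the slowdown with driver or hypervisor logs.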

This happens whether I run a complex model such as the Llama-2-7B LLM or a really simple PyTorch script. Below is the system information, captured while the GPU was running at 98% utilization.

I should mention that in all cases, whether running fast or slow, PyTorch recognizes that CUDA is available and can allocate tensors in GPU memory.

Here is some data that may be helpful:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

llm-lnx1:~$ lshw -c display
WARNING: you should run this program as super-user.
*-display
description: VGA compatible controller
product: SVGA II Adapter
vendor: VMware
physical id: f
bus info: pci@0000:00:0f.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=vmwgfx latency=64
resources: irq:16 ioport:1070(size=16) memory:e8000000-efffffff memory:fe000000-fe7fffff memory:c0000-dffff
*-display
description: VGA compatible controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 1
bus info: pci@0000:02:01.0
version: a1
width: 64 bits
clock: 66MHz
capabilities: vga_controller bus_master cap_list
configuration: driver=nvidia latency=0
resources: irq:69 memory:fc000000-fcffffff memory:d0000000-dfffffff memory:fa000000-fbffffff

Here is an example script:

import torch
import math

dtype = torch.float
device = torch.device("cuda:0")  # run on GPU
torch.backends.cudnn.benchmark = True

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    
    print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

The 98% utilization figure comes from another experiment that runs a Llama-2 model, but in practice the problem is not the amount of processing; it is the slowdown that happens after 15-20 minutes. I should mention that once the GPU becomes slow it never recovers; only a sudo reboot fixes it.

Here are my Python libraries:

Package Version


absl-py 2.1.0
accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
cmake 3.29.5.1
cuda-python 12.5.0
dataclasses-json 0.6.7
datasets 2.20.0
dill 0.3.8
exceptiongroup 1.2.1
expecttest 0.2.1
filelock 3.15.1
flatbuffers 24.3.25
frozenlist 1.4.1
fsspec 2024.5.0
gast 0.5.4
google-pasta 0.2.0
greenlet 3.0.3
grpcio 1.64.1
h5py 3.11.0
huggingface-hub 0.23.4
hypothesis 6.103.2
idna 3.7
Jinja2 3.1.4
jsonpatch 1.33
jsonpointer 3.0.0
keras 3.3.3
langchain 0.1.11
langchain-community 0.0.38
langchain-core 0.1.52
langchain-text-splitters 0.0.2
langsmith 0.1.81
libclang 18.1.1
lintrunner 0.12.5
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
ml-dtypes 0.3.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
mypy-extensions 1.0.0
namex 0.0.8
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvcc-cu12 12.3.107
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
opt-einsum 3.3.0
optree 0.11.0
orjson 3.10.5
packaging 23.2
pandas 2.2.2
pillow 10.2.0
pip 24.0
protobuf 4.25.3
psutil 5.9.8
psycopg2-binary 2.9.9
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pyodbc 5.1.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
quanto 0.2.0
redis 5.0.1
regex 2024.5.15
requests 2.32.3
rich 13.7.1
safetensors 0.4.3
setuptools 65.5.0
six 1.16.0
sortedcontainers 2.4.0
SQLAlchemy 2.0.31
sympy 1.12.1
tenacity 8.4.1
tensorboard 2.16.2
tensorboard-data-server 0.7.2
tensorflow 2.16.1
tensorflow-io-gcs-filesystem 0.37.0
termcolor 2.4.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.42.0.dev0
triton 2.3.1
types-dataclasses 0.6.6
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2024.1
urllib3 2.2.2
Werkzeug 3.0.3
wheel 0.43.0
wrapt 1.16.0
xxhash 3.4.1
yarl 1.9.4

Any help would be really appreciated.

Hi, @fermin_reyes

Sorry for the issue you are seeing.
Please note that this forum is for supporting Developer Tools.

Your problem is “Running a PyTorch script becomes slow in a vGPU environment”. I would suggest asking in a vGPU-related forum or a deep-learning-related forum to get better support.

Thanks!

This could be a symptom of clock throttling being applied due to overheating or insufficient power supply. Use nvidia-smi to monitor temperature, power, and all Slowdown metrics while your application is running. Ideally you would run this with administrative rights for access to all available metrics.
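For example, a small wrapper can poll nvidia-smi periodically and log these metrics alongside the training output. The query fields below are standard nvidia-smi fields; the `gpu_metrics` helper is an illustrative name, and its `raw` parameter exists only so the parsing can be demonstrated on a captured sample line without a GPU present.

```python
# Poll nvidia-smi for throttling clues while the workload runs.
# Assumes nvidia-smi is on PATH.
import subprocess

FIELDS = [
    "temperature.gpu",
    "power.draw",
    "clocks.sm",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
]

def gpu_metrics(raw=None):
    """Return the queried fields for GPU 0 as a dict of strings.

    `raw` lets a captured output line be parsed without a GPU present.
    """
    if raw is None:
        raw = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=" + ",".join(FIELDS),
             "--format=csv,noheader,nounits"],
            text=True,
        )
    values = [v.strip() for v in raw.strip().split(",")]
    return dict(zip(FIELDS, values))

# Parsing an illustrative sample line (healthy state, no throttling active):
sample = "71, 68.05, 1590, Not Active, Not Active"
print(gpu_metrics(sample)["clocks_throttle_reasons.hw_thermal_slowdown"])
```

If clocks.sm collapses at the same moment the script slows down, or any throttle reason flips to Active, that points at thermal or power capping (or a vGPU licensing/scheduling issue on the host) rather than a problem in the script itself.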

The Tesla T4 is a datacenter GPU that normally comes pre-installed in host systems that are configured by an NVIDIA-approved system integrator. For problems such as this one, you would want to turn to the system integrator for assistance.