We have run into an intermittent problem. When we use a service on Thor to run inference with a VLA model on the same data, the outputs sometimes differ. After investigating, we found that on the first call into the service node, the loaded model parameters are sometimes changed, as if the memory holding them had been overwritten. We call the service node as follows.
We also found that the problem does not occur when FastAPI is not used, and it does not occur when using FastAPI on Orin. The base environment is the officially recommended version: we use the Docker image nvcr.io/nvidia/pytorch:25.08-py3.
Reloading the model weights on the first FastAPI call works around the problem temporarily. However, we would like to know what could cause the parameters to be overwritten when FastAPI is called. We can provide further information if required.
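One way to confirm that the weights really change between loading and the first request is to compare a checksum of the parameters at both points. The following is only a minimal sketch, not the code used in our service: the helper names `fingerprint` and `model_named_bytes` are hypothetical, and it assumes the model is a regular `torch.nn.Module`.

```python
import hashlib
from typing import Iterable, Tuple


def fingerprint(named_bytes: Iterable[Tuple[str, bytes]]) -> str:
    """Return a SHA-256 hex digest over (name, raw bytes) pairs, sorted by name."""
    digest = hashlib.sha256()
    for name, raw in sorted(named_bytes, key=lambda item: item[0]):
        digest.update(name.encode("utf-8"))
        digest.update(raw)
    return digest.hexdigest()


def model_named_bytes(model):
    """Yield (name, bytes) for every entry in a PyTorch model's state_dict.

    Assumption: `model` is a torch.nn.Module. Tensors are copied to the CPU,
    so this is slow for large models and is meant for debugging only.
    """
    import torch  # imported lazily so fingerprint() works without PyTorch

    for name, tensor in model.state_dict().items():
        flat = tensor.detach().cpu().contiguous().flatten()
        # Reinterpret as uint8 so dtypes NumPy lacks (e.g. bfloat16) still work.
        yield name, flat.view(torch.uint8).numpy().tobytes()
```

Calling `fingerprint(model_named_bytes(model))` right after loading and again at the start of the first request handler should give the same digest. If the two digests differ, the parameters were modified in between.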
We have observed a phenomenon that may be related to this issue. When the server starts, the model is loaded onto the GPU, and jtop displays the GPU temperature normally. However, if the server is running but receives no inference data for a while (about 5-10 seconds), jtop shows the GPU as offline.
When inference data arrives again, the GPU is reactivated. The parameter problem may be related to this behavior. We have two questions. First, why does the GPU go offline while a process is still holding it? Second, is there a configuration that keeps the GPU active all the time, or is there something wrong with the way we are using it?
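As a possible workaround while waiting for an answer, the server could launch a tiny kernel at a fixed interval so that the GPU never sits idle long enough to go offline. This is an untested sketch under assumptions: the names `KeepAlive` and `cuda_tick` are hypothetical, the 1-second interval is only a guess chosen to be shorter than the 5-10 second window above, and we have not verified that this actually prevents the offline state on Thor. It will also increase idle power draw.

```python
import threading
import time
from typing import Callable


class KeepAlive:
    """Call `tick` every `interval` seconds on a daemon thread until stopped."""

    def __init__(self, tick: Callable[[], None], interval: float = 1.0) -> None:
        self._tick = tick
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self._tick()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


def cuda_tick() -> None:
    """Hypothetical tick: launch a tiny kernel and wait for it to finish."""
    import torch  # lazy import; requires a CUDA-enabled PyTorch build

    torch.zeros(1, device="cuda").add_(1)
    torch.cuda.synchronize()
```

At server startup this would be used as `keeper = KeepAlive(cuda_tick, interval=1.0)` followed by `keeper.start()`, with `keeper.stop()` on shutdown.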
Could you record the scenario so we can learn more about the issue?
If GPU resources are available (that is, the load is not saturated), the task should be deployed.
Is it possible that the GPU is waiting for something, for example a memory-related bottleneck?
We recently found that the cache might cause errors when running LLM-related use cases.
Could you try running the script below (which periodically drops the cache) concurrently to see if the issue is still present?
The issue is still present. We have a sample script below:
import time
import torch

for idx in range(100):
    # Pause for 20 s before every 20th iteration so the GPU can go idle.
    if (idx + 1) % 20 == 0:
        time.sleep(20)
        print('============sleep 20s=========')
    # Time a single tensor allocation on the GPU.
    start_time = time.time()
    x = torch.randn(1024, 1024, device='cuda')
    end_time = time.time()
    print("=====idx {}: spend {} second".format(idx, end_time - start_time))
The output looks like this:
=====idx 0: spend 2.061129331588745 second
=====idx 1: spend 8.916854858398438e-05 second
=====idx 2: spend 1.8358230590820312e-05 second
=====idx 3: spend 1.1682510375976562e-05 second
=====idx 4: spend 1.049041748046875e-05 second
=====idx 5: spend 1.0251998901367188e-05 second
=====idx 6: spend 1.0013580322265625e-05 second
=====idx 7: spend 9.298324584960938e-06 second
=====idx 8: spend 8.58306884765625e-06 second
=====idx 9: spend 9.059906005859375e-06 second
=====idx 10: spend 9.5367431640625e-06 second
=====idx 11: spend 9.059906005859375e-06 second
=====idx 12: spend 8.58306884765625e-06 second
=====idx 13: spend 9.059906005859375e-06 second
=====idx 14: spend 9.059906005859375e-06 second
=====idx 15: spend 8.58306884765625e-06 second
=====idx 16: spend 8.58306884765625e-06 second
=====idx 17: spend 8.344650268554688e-06 second
=====idx 18: spend 8.58306884765625e-06 second
============sleep 20s=========
=====idx 19: spend 1.9319243431091309 second
=====idx 20: spend 0.00013780593872070312 second
=====idx 21: spend 1.52587890625e-05 second
=====idx 22: spend 1.2874603271484375e-05 second
=====idx 23: spend 1.0013580322265625e-05 second
=====idx 24: spend 9.775161743164062e-06 second
=====idx 25: spend 9.059906005859375e-06 second
=====idx 26: spend 1.2159347534179688e-05 second
=====idx 27: spend 8.58306884765625e-06 second
=====idx 28: spend 8.58306884765625e-06 second
=====idx 29: spend 8.106231689453125e-06 second
=====idx 30: spend 7.62939453125e-06 second
=====idx 31: spend 7.867813110351562e-06 second
=====idx 32: spend 8.344650268554688e-06 second
=====idx 33: spend 7.867813110351562e-06 second
=====idx 34: spend 7.62939453125e-06 second
=====idx 35: spend 7.62939453125e-06 second
=====idx 36: spend 7.867813110351562e-06 second
=====idx 37: spend 7.867813110351562e-06 second
=====idx 38: spend 7.867813110351562e-06 second
The first iteration after each sleep is much slower: about 1.9 seconds, compared with roughly 10 microseconds for the others.
With jtop, we can see that the GPU changes from online to offline during the sleep period. Please close all other programs and applications that use CUDA when running this experiment.
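One caveat about the timings above: CUDA operations in PyTorch are launched asynchronously, so reading the clock right after `torch.randn` returns does not necessarily include the time the kernel takes to finish. The sketch below is a variant that synchronizes before stopping the clock. The helper name `timed` is hypothetical, and the numbers it prints may differ from the output shown above.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[], object],
          sync: Callable[[], None] = lambda: None) -> Tuple[object, float]:
    """Run fn, call sync to wait for its queued work, return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    sync()  # wait for the work fn queued to actually finish
    return result, time.perf_counter() - start


def main() -> None:
    import torch  # lazy import; requires a CUDA-enabled PyTorch build

    for idx in range(40):
        # Same idle pattern as the sample script: sleep before every 20th step.
        if (idx + 1) % 20 == 0:
            time.sleep(20)
        _, elapsed = timed(lambda: torch.randn(1024, 1024, device="cuda"),
                           sync=torch.cuda.synchronize)
        print(f"idx {idx}: {elapsed:.6f} s")
```

No synchronization is done before starting the clock on purpose, so that any wake-up latency after the sleep is charged to the measured iteration rather than hidden.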
There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks ~1021
Hi,
Sorry for the late update.
The behavior looks expected.
During the 20-second sleep, no CUDA kernel is launched, so the GPU enters the idle state.
Could you share more details about how the idle state affects the VLA inference?
Ideally, the state is restored once a kernel is launched again.