We recently ran into a problem with one of the NVIDIA MPS log files, control.log.
The file grew to ~400GB and filled the disk, causing our server to fail.
I looked online to see whether anyone has had similar issues and couldn’t find any reports.
The bug occurred at a critical time for our client, so the AE had to delete the file (so we don’t have the logs). I’ll mention that he couldn’t open the file either,
but this is probably due to its massive size.
Since it was deleted, the regenerated file hasn’t grown drastically.
You may wish to file a bug with whatever info you have. I would include two things in the bug:
1. What you have discussed here, i.e. the control log grew large but you don’t have it because it was large and you had to delete it. Also include as much other info as possible, such as your operating system, the GPU you are running on, the CUDA version, and how you have MPS configured. Any additional data, like the number of MPS clients that were running when this happened, may be useful.
2. An enhancement request to allow the size of the control.log file to be limited, e.g. to a fixed value like 1GB, or perhaps to a user-configurable value.
That’s just my suggestion of course. Whether you wish to file a bug or include both items 1 and 2 above is your decision.
I don’t think there is much that can be done immediately simply with the report that the control.log file grew large. More information will be needed to resolve it, and without reproduction instructions, it may not be resolvable. That is why I suggested item 2. Also, if you run into any more occurrences and possibly make additional observations, you can update the bug with those observations.
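If it happens again, it would help to keep at least part of the log before deleting it. Here is a rough sketch of a check you could run periodically (e.g. from cron): once control.log passes a size limit, it copies the last few megabytes to a separate file. It assumes the default MPS log directory `/var/log/nvidia-mps` (or whatever `CUDA_MPS_LOG_DIRECTORY` is set to); the function name `save_tail`, the thresholds, and the output path are just placeholders I made up. It does not truncate or rotate the log itself.

```python
import os
from pathlib import Path

# Assumed location: the default MPS log directory, unless
# CUDA_MPS_LOG_DIRECTORY is set in this process's environment.
LOG_DIR = Path(os.environ.get("CUDA_MPS_LOG_DIRECTORY", "/var/log/nvidia-mps"))


def save_tail(log_path, out_path, limit_bytes, tail_bytes):
    """If log_path is larger than limit_bytes, copy its last tail_bytes
    to out_path. Returns True if a snapshot was written, else False."""
    try:
        size = os.path.getsize(log_path)
    except OSError:
        # Missing or unreadable log: nothing to save.
        return False
    if size <= limit_bytes:
        return False
    with open(log_path, "rb") as src, open(out_path, "wb") as dst:
        # Seek to the start of the tail without reading the whole file.
        src.seek(max(0, size - tail_bytes))
        dst.write(src.read(tail_bytes))
    return True


if __name__ == "__main__":
    # Hypothetical thresholds: act above 1 GiB, keep the last 10 MiB.
    saved = save_tail(LOG_DIR / "control.log", "/tmp/control.log.tail",
                      limit_bytes=1 << 30, tail_bytes=10 << 20)
    print("snapshot saved" if saved else "control.log is within the limit")
```

A small tail like that is easy to open and attach to the bug, even when the full file is too big to handle.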
Also, if you are not using the latest CUDA version and driver for your GPU, you may wish to update to the latest. Bugs get fixed all the time. I don’t know of any bugs related to your observation; this is a general statement.
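To gather the details I suggested for the bug (OS, GPU, driver, CUDA toolkit, MPS settings), and to see which driver you are actually on, something like the following might be a starting point. It is only a sketch: it assumes `nvidia-smi` and `nvcc` are on the PATH (it just notes it if they are not), and it reads the MPS environment variables from the shell it runs in, which may differ from the environment the MPS control daemon was started with. The helper names `run` and `collect_report` are made up for this example.

```python
import os
import platform
import shutil
import subprocess

# MPS-related environment variables worth recording in a bug report.
MPS_ENV_VARS = (
    "CUDA_MPS_PIPE_DIRECTORY",
    "CUDA_MPS_LOG_DIRECTORY",
    "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE",
    "CUDA_VISIBLE_DEVICES",
)


def run(cmd):
    """Return the command's output, or a short note if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]}: {exc}"
    return (result.stdout or result.stderr).strip()


def collect_report():
    """Build a plain-text summary of the OS, GPU/driver, toolkit and MPS env."""
    lines = [f"OS: {platform.platform()}"]
    lines.append("GPU / driver: " + run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]))
    lines.append("CUDA toolkit: " + run(["nvcc", "--version"]))
    for name in MPS_ENV_VARS:
        lines.append(f"{name}={os.environ.get(name, '<unset>')}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(collect_report())
```

You would still need to add by hand the things a script can't know, such as how many MPS clients were running at the time.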