Hardware - GPU: A100
Operating System: Ubuntu 20.04.2 LTS
Triton Version: 21.02-py3
Clara Version: v4.0
I’m using clara-train-example and the triton container keeps restarting.
I attached a screenshot of the errors.
The GPU is enabled; when I run nvidia-smi, it appears:
I have tried GPUs 3, 5, and 6. The GPU has MIG instances, so I have also tried values such as “3:1” and “6:0” in device_ids.
Thanks for your interest in Clara Train SDK. Please note that we have recently released Clara Train V4.1, based on MONAI 0.8, which uses PyTorch.
I am not sure why the Triton container is restarting. MIG may be causing the issue. Can you try with a single GPU that doesn’t have MIG enabled? You can specify that in the docker compose files:
- driver: nvidia
  capabilities: [ gpu ]
  # To specify certain GPU uncomment line below
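For context, here is a rough sketch of how a single-GPU reservation could look in a compose file. The service name and the GPU index "0" are placeholders for illustration, not values taken from clara-train-examples; use the index of a GPU that nvidia-smi shows without MIG enabled.

```yaml
# Sketch only: service name and GPU index are assumptions, adjust to your setup.
services:
  tritonserver:                      # hypothetical service name
    image: nvcr.io/nvidia/tritonserver:21.02-py3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
              # Pin the container to one GPU by index ("0" is an example)
              device_ids: [ "0" ]
```

Note that Compose treats device_ids and count as mutually exclusive, so only one of them should be set in a device reservation.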
Please note that Triton is only needed for AIAA. If you are interested in training, AutoML, or FL, you should just start the Train SDK container.
If this is still giving you trouble, you could also use AIAA without Triton by starting it with the engine flag set to AIAA:
AIAA start -w /claraDevDay/AIAA/workspace/ --engine AIAA