I'm fine-tuning TrafficCamNet. Until now everything worked, and the model trained for as long as I needed. But after I increased the dataset (it is now somewhere around 250k), I started getting the messages below after the first epoch, before validation. I suspect they mean that my system is killing the process.
c50db1067799:53:79 [0] NCCL INFO Connected all rings
c50db1067799:53:79 [0] NCCL INFO Connected all trees
c50db1067799:53:79 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
c50db1067799:53:79 [0] NCCL INFO comm 0x7f37d69f4000 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
2022-02-16 00:27:26,909 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 7973, 0.00s/step
Killed
2022-02-16 02:27:33,176 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
What should I do?