Killing the training process

I’m fine-tuning TrafficCamNet. Previously everything worked and the model trained for as many epochs as I needed. But after I enlarged the dataset (it is now around 250k images), right after the first epoch and before validation I started getting messages like the ones below, which seem to mean that my system is killing the process.

c50db1067799:53:79 [0] NCCL INFO Connected all rings
c50db1067799:53:79 [0] NCCL INFO Connected all trees
c50db1067799:53:79 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
c50db1067799:53:79 [0] NCCL INFO comm 0x7f37d69f4000 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE
2022-02-16 00:27:26,909 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 7973, 0.00s/step
2022-02-16 02:27:33,176 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What should I do?

It ran into an out-of-memory (OOM) error.
Please try to use a GPU with more memory.
Or try to train with a lower batch size or a lower input size.

Got it, thanks. Could you tell me how to change the input size in the TrafficCamNet training specs?

output_image_width: xxx
output_image_height: xxx
enable_auto_resize: True
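For context, in a detectnet_v2 training spec these fields sit under the preprocessing block of augmentation_config; a sketch of that section, where the width/height values are only illustrative placeholders, not a recommendation:

```
augmentation_config {
  preprocessing {
    # Set these to a smaller resolution to reduce GPU memory usage
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    # Automatically resize input images to the dimensions above
    enable_auto_resize: true
  }
}
```

Note that the width and height generally need to satisfy the model's stride constraints (multiples of 16 for detectnet_v2).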

I suggest you first try running with a lower batch size.
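If it helps, the batch size lives in the training_config section of the same spec; a sketch, assuming that section of the detectnet_v2 spec format (the values shown are examples only):

```
training_config {
  # Lower this first when hitting OOM; it trades memory for more steps per epoch
  batch_size_per_gpu: 2
}
```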

Ok, thank you
