Description
I am running DIGITS on WSL2 to train a Caffe model using an A3000 GPU. I can train models in DIGITS using the nvidia/digits image.
To train in FP32 or FP16, I use the nvcr.io/nvidia/caffe:20.03-py3 container and run:

```
caffe train --solver=solver.prototxt
```
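For completeness, I launch the container roughly like this (the volume mount and working directory match my job folder; your paths will differ):

```shell
# Launch the NGC NVCaffe container with GPU access and the job
# directory mounted, then start training directly.
docker run --gpus all -it --rm \
  -v /media/STORAGE/digits-jobs:/media/STORAGE/digits-jobs \
  -w /media/STORAGE/digits-jobs/Test \
  nvcr.io/nvidia/caffe:20.03-py3 \
  caffe train --solver=solver.prototxt
```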
To enable FP16, I edited solver.prototxt to add these fields:

```
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
```
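For context, here is a minimal solver.prototxt with those fields in place. Only the four `default_*` lines are taken from my actual file; the net path and hyperparameter values below are illustrative placeholders:

```protobuf
# Illustrative solver -- adjust net path and hyperparameters to your job
net: "train_val.prototxt"
test_iter: 100
test_interval: 500
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 5000
max_iter: 10000
momentum: 0.9
weight_decay: 0.0005
solver_mode: GPU

# Mixed precision (NVCaffe): store blobs in FP16, accumulate math in FP32
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
```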
The training process then hangs for about 30 minutes at this step:
```
I0805 22:44:10.131574 158 data_reader.cpp:322] Restarting data pre-fetching
I0805 22:44:10.210132 128 solver.cpp:581] (0.0) Test net output #0: accuracy = 0.504774
I0805 22:44:10.210203 128 solver.cpp:581] (0.0) Test net output #1: loss = 0.693341 (* 1 = 0.693341 loss)
I0805 22:44:10.210237 128 caffe.cpp:258] Solver performance on device 0: 47.54 * 256 = 1.217e+04 img/sec (9500 itr in 199.8 sec)
I0805 22:44:10.210253 128 caffe.cpp:262] Optimization Done in 22m 57s
root@9b5ade8b7000:/media/STORAGE/digits-jobs/Test# caffe train --solver=solver.prototxt
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0805 22:48:39.182775 169 parallel.cpp:48] P2PManager::Init, global rank: [0 of 1] @ 9b5ade8b7000
I0805 22:51:48.901777 169 gpu_memory.cpp:82] GPUMemory::Manager initialized
```
Training eventually continues, but I am not confident in the resulting model: the quality seems off.
I was able to follow the same steps on a Titan X GPU without issues.
My question: Is there a more up-to-date Caffe container for training? My end goal is to perform binary patch classification on a Jetson TX2. I already have this working on the TX2, but my Titan X server died, and I need to continue training on RTX cards.
Any suggestions are appreciated.
Environment
TensorRT Version:
GPU Type: Titan X vs A3000
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version: Windows (WSL2)
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Container (nvcr.io/nvidia/caffe:20.03-py3; DIGITS via nvidia/digits)