Caffe training is very slow on an RTX GPU

Description

I am running DIGITS on WSL2 to train a Caffe model using an A3000 GPU. I can train models in DIGITS using the nvidia/digits image.

To train in FP32/16, I use the nvcr.io/nvidia/caffe:20.03-py3 container and run:
caffe train --solver=solver.prototxt

To enable FP16, I edited the solver.prototxt file to include these settings at the top:
default_forward_type: FLOAT16
default_backward_type: FLOAT16
default_forward_math: FLOAT
default_backward_math: FLOAT
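
For anyone reproducing this, the edit can be scripted. Below is a minimal sketch in Python, assuming the solver file is plain text with one `key: value` field per line. The four field names and values are the ones listed above. The helper name `with_fp16_settings` and the `train_val.prototxt` file name in the usage example are hypothetical. It is a simple line-based edit, not a protobuf parser.

```python
# Sketch only: the field names/values come from the post above;
# the helper name and sample solver content are illustrative assumptions.
FP16_SETTINGS = {
    "default_forward_type": "FLOAT16",
    "default_backward_type": "FLOAT16",
    "default_forward_math": "FLOAT",
    "default_backward_math": "FLOAT",
}


def with_fp16_settings(solver_text):
    """Return solver text with the four precision fields placed at the top.

    Any existing line that sets one of these fields is dropped first,
    so each field ends up defined exactly once.
    """
    kept = [
        line
        for line in solver_text.splitlines()
        if line.split(":", 1)[0].strip() not in FP16_SETTINGS
    ]
    header = [f"{key}: {value}" for key, value in FP16_SETTINGS.items()]
    return "\n".join(header + kept) + "\n"


if __name__ == "__main__":
    # Hypothetical solver content, used only to demonstrate the helper.
    sample = 'net: "train_val.prototxt"\ndefault_forward_type: FLOAT\n'
    print(with_fp16_settings(sample))
```

To apply it to a real file, read solver.prototxt, pass the text through the helper, and write the result back. Keeping a copy of the FP32 solver makes it easy to compare the two runs.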

The training process then hangs for about 30 minutes at this step:

I0805 22:44:10.131574 158 data_reader.cpp:322] Restarting data pre-fetching
I0805 22:44:10.210132 128 solver.cpp:581] (0.0) Test net output #0: accuracy = 0.504774
I0805 22:44:10.210203 128 solver.cpp:581] (0.0) Test net output #1: loss = 0.693341 (* 1 = 0.693341 loss)
I0805 22:44:10.210237 128 caffe.cpp:258] Solver performance on device 0: 47.54 * 256 = 1.217e+04 img/sec (9500 itr in 199.8 sec)
I0805 22:44:10.210253 128 caffe.cpp:262] Optimization Done in 22m 57s
root@9b5ade8b7000:/media/STORAGE/digits-jobs/Test# caffe train --solver=solver.prototxt
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0805 22:48:39.182775 169 parallel.cpp:48] P2PManager::Init, global rank: [0 of 1] @ 9b5ade8b7000
I0805 22:51:48.901777 169 gpu_memory.cpp:82] GPUMemory::Manager initialized
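
To pin down where the time goes, the pause can be measured from the log timestamps. The sketch below assumes the usual glog prefix (severity letter, MMDD, then HH:MM:SS.microseconds); glog omits the year, so a placeholder year is used. The function names are hypothetical. For the two startup lines above it reports a gap of about three minutes between P2PManager::Init and GPUMemory::Manager initialization.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def parse_glog_time(line, year=2000):
    """Return the timestamp of a glog line, or None if the line has no prefix.

    The year is a placeholder because glog does not record it.
    """
    match = GLOG_PREFIX.match(line)
    if match is None:
        return None
    stamp = f"{year}{match.group(1)} {match.group(2)}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap, earlier_line, later_line) for the longest pause
    between consecutive timestamped lines, or None if fewer than two."""
    stamped = [(parse_glog_time(line), line) for line in lines]
    stamped = [(t, line) for t, line in stamped if t is not None]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best


if __name__ == "__main__":
    log = [
        "WARNING: Logging before InitGoogleLogging() is written to STDERR",
        "I0805 22:48:39.182775 169 parallel.cpp:48] P2PManager::Init, global rank: [0 of 1] @ 9b5ade8b7000",
        "I0805 22:51:48.901777 169 gpu_memory.cpp:82] GPUMemory::Manager initialized",
    ]
    gap, before, after = largest_gap(log)
    print(gap)  # 0:03:09.719002
```

Feed it the lines from a single run only; otherwise the idle time between two separate `caffe train` invocations shows up as the largest gap.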

It eventually continues, but I'm not sure about the quality of the resulting model; the behavior seems odd.

I was able to follow the same steps on a Titan X GPU without issues.

My question: Is there a more up-to-date Caffe container for training? My end goal is to perform binary patch classification on a Jetson TX2. I already have this working on the TX2, but my Titan X server died, and I need to continue training on RTX cards.

Any suggestions are appreciated.

Environment

TensorRT Version:
GPU Type: TitanX vs A3000
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Hi,

This looks like a DIGITS-related issue. We would request that you raise it on the relevant forum.