Caffe training on Tesla K80 with DIGITS 5 breaks with SIGFPE

I’m trying to use the DIGITS 5 install on a host with a Tesla K80 GPU, 64 GB of RAM, and 8 cores.
I’ve downloaded and prepared the CIFAR10 data set, and want to train a classification model.
I’m using Caffe.

Trying both the AlexNet and GoogLeNet models, edited to set crop_size to 32, I get an error like this when training:

(From GoogLeNet cropped at 32):
I0428 02:53:35.457660 2520 net.cpp:144] Setting up inception_4a/output_inception_4a/output_0_split
I0428 02:53:35.457664 2520 net.cpp:151] Top shape: 32 512 2 2 (65536)
I0428 02:53:35.457669 2520 net.cpp:151] Top shape: 32 512 2 2 (65536)
I0428 02:53:35.457672 2520 net.cpp:151] Top shape: 32 512 2 2 (65536)
I0428 02:53:35.457675 2520 net.cpp:151] Top shape: 32 512 2 2 (65536)
I0428 02:53:35.457679 2520 net.cpp:151] Top shape: 32 512 2 2 (65536)
I0428 02:53:35.457681 2520 net.cpp:159] Memory required for data: 25707008
I0428 02:53:35.457684 2520 layer_factory.hpp:77] Creating layer loss1/ave_pool
I0428 02:53:35.457690 2520 net.cpp:94] Creating Layer loss1/ave_pool
I0428 02:53:35.457692 2520 net.cpp:435] loss1/ave_pool <- inception_4a/output_inception_4a/output_0_split_0
I0428 02:53:35.457698 2520 net.cpp:409] loss1/ave_pool -> loss1/ave_pool
*** Aborted at 1493348015 (unix time) try "date -d @1493348015" if you are using GNU date ***
PC: @ 0x7fb9ac7e66f0 (unknown)
*** SIGFPE (@0x7fb9ac7e66f0) received by PID 2520 (TID 0x7fb9ad3c6a40) from PID 18446744072308549360; stack trace: ***
@ 0x7fb9aad9acb0 (unknown)
@ 0x7fb9ac7e66f0 (unknown)
@ 0x7fb9ac7e6b09 (unknown)
@ 0x7fb9ac87a18c (unknown)
@ 0x7fb9ac902ff6 (unknown)
@ 0x7fb9ac92cf75 (unknown)
@ 0x7fb9ac92de9a (unknown)
@ 0x7fb9ac9112ba (unknown)
@ 0x7fb9ac91225c (unknown)
@ 0x7fb9ac9125a3 (unknown)
@ 0x7fb9ac98a7d9 (unknown)
@ 0x411d0c (unknown)
@ 0x40a97c (unknown)
@ 0x40867c (unknown)
@ 0x7fb9aad85f45 (unknown)
@ 0x408e4d (unknown)
@ 0x0 (unknown)

(From AlexNet cropped at 32):
I0428 02:41:19.800724 2429 net.cpp:435] relu5 <- conv5
I0428 02:41:19.800729 2429 net.cpp:396] relu5 -> conv5 (in-place)
I0428 02:41:19.800735 2429 net.cpp:144] Setting up relu5
I0428 02:41:19.800740 2429 net.cpp:151] Top shape: 128 256 1 1 (32768)
I0428 02:41:19.800741 2429 net.cpp:159] Memory required for data: 12042752
I0428 02:41:19.800745 2429 layer_factory.hpp:77] Creating layer pool5
I0428 02:41:19.800751 2429 net.cpp:94] Creating Layer pool5
I0428 02:41:19.800755 2429 net.cpp:435] pool5 <- conv5
I0428 02:41:19.800758 2429 net.cpp:409] pool5 -> pool5
*** Aborted at 1493347279 (unix time) try "date -d @1493347279" if you are using GNU date ***
PC: @ 0x7fd867e616f0 (unknown)
*** SIGFPE (@0x7fd867e616f0) received by PID 2429 (TID 0x7fd868a41a40) from PID 1743132400; stack trace: ***
@ 0x7fd866415cb0 (unknown)
@ 0x7fd867e616f0 (unknown)
@ 0x7fd867e61b09 (unknown)
@ 0x7fd867ef518c (unknown)
@ 0x7fd867fa7f75 (unknown)
@ 0x7fd867fa8e9a (unknown)
@ 0x7fd867f8c2ba (unknown)
@ 0x7fd867f8d25c (unknown)
@ 0x7fd867f8d5a3 (unknown)
@ 0x7fd8680057d9 (unknown)
@ 0x411d0c (unknown)
@ 0x40a97c (unknown)
@ 0x40867c (unknown)
@ 0x7fd866400f45 (unknown)
@ 0x408e4d (unknown)
@ 0x0 (unknown)

There are no symbols. Why strip these binaries? A symbol table adds little size, costs nothing at run time, and would make debugging far easier.

Anyway, this output tells me nothing that helps me figure out what’s going wrong. What is failing, and how do I debug and fix it?

Here’s the whole AlexNet (cropped) log, which includes the actual network prototxt:
libdc1394 error: Failed to initialize libdc1394
I0428 02:41:18.867152 2429 upgrade_proto.cpp:1044] Attempting to upgrade input file specified using deprecated 'solver_type' field (enum): /var/lib/digits/jobs/20170428-024117-870a/solver.prototxt
I0428 02:41:18.867419 2429 upgrade_proto.cpp:1051] Successfully upgraded file specified using deprecated 'solver_type' field (enum) to 'type' field (string).
W0428 02:41:18.867427 2429 upgrade_proto.cpp:1053] Note that future Caffe releases will only support 'type' field (string) for a solver's type.
I0428 02:41:18.998275 2429 caffe.cpp:197] Using GPUs 0
I0428 02:41:18.998476 2429 caffe.cpp:202] GPU 0: Tesla K80
I0428 02:41:19.314980 2429 solver.cpp:48] Initializing solver from parameters:
test_iter: 1563
test_interval: 391
base_lr: 0.001
display: 40
max_iter: 5865
lr_policy: "step"
gamma: 0.3
weight_decay: 1e-05
stepsize: 1173
snapshot: 391
snapshot_prefix: "snapshot"
solver_mode: GPU
device_id: 0
random_seed: 123
net: "train_val.prototxt"
type: "AdaGrad"
I0428 02:41:19.315131 2429 solver.cpp:91] Creating training net from net file: train_val.prototxt
I0428 02:41:19.315707 2429 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer val-data
I0428 02:41:19.315731 2429 net.cpp:323] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0428 02:41:19.315888 2429 net.cpp:52] Initializing net from parameters:
state {
  phase: TRAIN
}
layer {
  name: "train-data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 32
    mean_file: "/var/lib/digits/jobs/20170428-015357-2072/mean.binaryproto"
  }
  data_param {
    source: "/var/lib/digits/jobs/20170428-015357-2072/train_db"
    batch_size: 128
    backend: LMDB
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "norm1"
  type: "LRN"
  bottom: "conv1"
  top: "norm1"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "norm1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu2"
  type: "ReLU"
  bottom: "conv2"
  top: "conv2"
}
layer {
  name: "norm2"
  type: "LRN"
  bottom: "conv2"
  top: "norm2"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "norm2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv3"
  type: "Convolution"
  bottom: "pool2"
  top: "conv3"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3"
  type: "ReLU"
  bottom: "conv3"
  top: "conv3"
}
layer {
  name: "conv4"
  type: "Convolution"
  bottom: "conv3"
  top: "conv4"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu4"
  type: "ReLU"
  bottom: "conv4"
  top: "conv4"
}
layer {
  name: "conv5"
  type: "Convolution"
  bottom: "conv4"
  top: "conv5"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    pad: 1
    kernel_size: 3
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu5"
  type: "ReLU"
  bottom: "conv5"
  top: "conv5"
}
layer {
  name: "pool5"
  type: "Pooling"
  bottom: "conv5"
  top: "pool5"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 4096
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu6"
  type: "ReLU"
  bottom: "fc6"
  top: "fc6"
}
layer {
  name: "drop6"
  type: "Dropout"
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "fc7"
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 4096
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu7"
  type: "ReLU"
  bottom: "fc7"
  top: "fc7"
}
layer {
  name: "drop7"
  type: "Dropout"
  bottom: "fc7"
  top: "fc7"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}
I0428 02:41:19.316000 2429 layer_factory.hpp:77] Creating layer train-data
I0428 02:41:19.316679 2429 net.cpp:94] Creating Layer train-data
I0428 02:41:19.316690 2429 net.cpp:409] train-data -> data
I0428 02:41:19.316707 2429 net.cpp:409] train-data -> label
I0428 02:41:19.316721 2429 data_transformer.cpp:25] Loading mean file from: /var/lib/digits/jobs/20170428-015357-2072/mean.binaryproto
I0428 02:41:19.318146 2436 db_lmdb.cpp:35] Opened lmdb /var/lib/digits/jobs/20170428-015357-2072/train_db
I0428 02:41:19.318440 2429 data_layer.cpp:76] output data size: 128,3,32,32
I0428 02:41:19.323140 2429 net.cpp:144] Setting up train-data
I0428 02:41:19.323160 2429 net.cpp:151] Top shape: 128 3 32 32 (393216)
I0428 02:41:19.323164 2429 net.cpp:151] Top shape: 128 (128)
I0428 02:41:19.323168 2429 net.cpp:159] Memory required for data: 1573376
I0428 02:41:19.323173 2429 layer_factory.hpp:77] Creating layer conv1
I0428 02:41:19.323187 2429 net.cpp:94] Creating Layer conv1
I0428 02:41:19.323191 2429 net.cpp:435] conv1 <- data
I0428 02:41:19.323202 2429 net.cpp:409] conv1 -> conv1
I0428 02:41:19.589437 2429 net.cpp:144] Setting up conv1
I0428 02:41:19.589464 2429 net.cpp:151] Top shape: 128 96 6 6 (442368)
I0428 02:41:19.589468 2429 net.cpp:159] Memory required for data: 3342848
I0428 02:41:19.589488 2429 layer_factory.hpp:77] Creating layer relu1
I0428 02:41:19.589501 2429 net.cpp:94] Creating Layer relu1
I0428 02:41:19.589507 2429 net.cpp:435] relu1 <- conv1
I0428 02:41:19.589514 2429 net.cpp:396] relu1 -> conv1 (in-place)
I0428 02:41:19.589530 2429 net.cpp:144] Setting up relu1
I0428 02:41:19.589535 2429 net.cpp:151] Top shape: 128 96 6 6 (442368)
I0428 02:41:19.589539 2429 net.cpp:159] Memory required for data: 5112320
I0428 02:41:19.589541 2429 layer_factory.hpp:77] Creating layer norm1
I0428 02:41:19.589553 2429 net.cpp:94] Creating Layer norm1
I0428 02:41:19.589557 2429 net.cpp:435] norm1 <- conv1
I0428 02:41:19.589561 2429 net.cpp:409] norm1 -> norm1
I0428 02:41:19.589650 2429 net.cpp:144] Setting up norm1
I0428 02:41:19.589684 2429 net.cpp:151] Top shape: 128 96 6 6 (442368)
I0428 02:41:19.589686 2429 net.cpp:159] Memory required for data: 6881792
I0428 02:41:19.589689 2429 layer_factory.hpp:77] Creating layer pool1
I0428 02:41:19.589701 2429 net.cpp:94] Creating Layer pool1
I0428 02:41:19.589704 2429 net.cpp:435] pool1 <- norm1
I0428 02:41:19.589709 2429 net.cpp:409] pool1 -> pool1
I0428 02:41:19.589959 2429 net.cpp:144] Setting up pool1
I0428 02:41:19.589972 2429 net.cpp:151] Top shape: 128 96 3 3 (110592)
I0428 02:41:19.589977 2429 net.cpp:159] Memory required for data: 7324160
I0428 02:41:19.589983 2429 layer_factory.hpp:77] Creating layer conv2
I0428 02:41:19.590000 2429 net.cpp:94] Creating Layer conv2
I0428 02:41:19.590006 2429 net.cpp:435] conv2 <- pool1
I0428 02:41:19.590016 2429 net.cpp:409] conv2 -> conv2
I0428 02:41:19.617810 2429 net.cpp:144] Setting up conv2
I0428 02:41:19.617821 2429 net.cpp:151] Top shape: 128 256 3 3 (294912)
I0428 02:41:19.617825 2429 net.cpp:159] Memory required for data: 8503808
I0428 02:41:19.617833 2429 layer_factory.hpp:77] Creating layer relu2
I0428 02:41:19.617839 2429 net.cpp:94] Creating Layer relu2
I0428 02:41:19.617843 2429 net.cpp:435] relu2 <- conv2
I0428 02:41:19.617847 2429 net.cpp:396] relu2 -> conv2 (in-place)
I0428 02:41:19.617856 2429 net.cpp:144] Setting up relu2
I0428 02:41:19.617859 2429 net.cpp:151] Top shape: 128 256 3 3 (294912)
I0428 02:41:19.617861 2429 net.cpp:159] Memory required for data: 9683456
I0428 02:41:19.617864 2429 layer_factory.hpp:77] Creating layer norm2
I0428 02:41:19.617871 2429 net.cpp:94] Creating Layer norm2
I0428 02:41:19.617874 2429 net.cpp:435] norm2 <- conv2
I0428 02:41:19.617880 2429 net.cpp:409] norm2 -> norm2
I0428 02:41:19.617923 2429 net.cpp:144] Setting up norm2
I0428 02:41:19.617930 2429 net.cpp:151] Top shape: 128 256 3 3 (294912)
I0428 02:41:19.617933 2429 net.cpp:159] Memory required for data: 10863104
I0428 02:41:19.617936 2429 layer_factory.hpp:77] Creating layer pool2
I0428 02:41:19.617943 2429 net.cpp:94] Creating Layer pool2
I0428 02:41:19.617945 2429 net.cpp:435] pool2 <- norm2
I0428 02:41:19.617949 2429 net.cpp:409] pool2 -> pool2
I0428 02:41:19.617980 2429 net.cpp:144] Setting up pool2
I0428 02:41:19.617985 2429 net.cpp:151] Top shape: 128 256 1 1 (32768)
I0428 02:41:19.617987 2429 net.cpp:159] Memory required for data: 10994176
I0428 02:41:19.617990 2429 layer_factory.hpp:77] Creating layer conv3
I0428 02:41:19.618000 2429 net.cpp:94] Creating Layer conv3
I0428 02:41:19.618003 2429 net.cpp:435] conv3 <- pool2
I0428 02:41:19.618008 2429 net.cpp:409] conv3 -> conv3
I0428 02:41:19.710875 2429 net.cpp:144] Setting up conv3
I0428 02:41:19.710888 2429 net.cpp:151] Top shape: 128 384 1 1 (49152)
I0428 02:41:19.710892 2429 net.cpp:159] Memory required for data: 11190784
I0428 02:41:19.710901 2429 layer_factory.hpp:77] Creating layer relu3
I0428 02:41:19.710907 2429 net.cpp:94] Creating Layer relu3
I0428 02:41:19.710911 2429 net.cpp:435] relu3 <- conv3
I0428 02:41:19.710916 2429 net.cpp:396] relu3 -> conv3 (in-place)
I0428 02:41:19.710922 2429 net.cpp:144] Setting up relu3
I0428 02:41:19.710927 2429 net.cpp:151] Top shape: 128 384 1 1 (49152)
I0428 02:41:19.710929 2429 net.cpp:159] Memory required for data: 11387392
I0428 02:41:19.710932 2429 layer_factory.hpp:77] Creating layer conv4
I0428 02:41:19.710944 2429 net.cpp:94] Creating Layer conv4
I0428 02:41:19.710947 2429 net.cpp:435] conv4 <- conv3
I0428 02:41:19.710952 2429 net.cpp:409] conv4 -> conv4
I0428 02:41:19.764622 2429 net.cpp:144] Setting up conv4
I0428 02:41:19.764634 2429 net.cpp:151] Top shape: 128 384 1 1 (49152)
I0428 02:41:19.764638 2429 net.cpp:159] Memory required for data: 11584000
I0428 02:41:19.764645 2429 layer_factory.hpp:77] Creating layer relu4
I0428 02:41:19.764652 2429 net.cpp:94] Creating Layer relu4
I0428 02:41:19.764654 2429 net.cpp:435] relu4 <- conv4
I0428 02:41:19.764662 2429 net.cpp:396] relu4 -> conv4 (in-place)
I0428 02:41:19.764668 2429 net.cpp:144] Setting up relu4
I0428 02:41:19.764688 2429 net.cpp:151] Top shape: 128 384 1 1 (49152)
I0428 02:41:19.764690 2429 net.cpp:159] Memory required for data: 11780608
I0428 02:41:19.764693 2429 layer_factory.hpp:77] Creating layer conv5
I0428 02:41:19.764703 2429 net.cpp:94] Creating Layer conv5
I0428 02:41:19.764706 2429 net.cpp:435] conv5 <- conv4
I0428 02:41:19.764711 2429 net.cpp:409] conv5 -> conv5
I0428 02:41:19.800688 2429 net.cpp:144] Setting up conv5
I0428 02:41:19.800699 2429 net.cpp:151] Top shape: 128 256 1 1 (32768)
I0428 02:41:19.800704 2429 net.cpp:159] Memory required for data: 11911680
I0428 02:41:19.800711 2429 layer_factory.hpp:77] Creating layer relu5
I0428 02:41:19.800720 2429 net.cpp:94] Creating Layer relu5
I0428 02:41:19.800724 2429 net.cpp:435] relu5 <- conv5
I0428 02:41:19.800729 2429 net.cpp:396] relu5 -> conv5 (in-place)
I0428 02:41:19.800735 2429 net.cpp:144] Setting up relu5
I0428 02:41:19.800740 2429 net.cpp:151] Top shape: 128 256 1 1 (32768)
I0428 02:41:19.800741 2429 net.cpp:159] Memory required for data: 12042752
I0428 02:41:19.800745 2429 layer_factory.hpp:77] Creating layer pool5
I0428 02:41:19.800751 2429 net.cpp:94] Creating Layer pool5
I0428 02:41:19.800755 2429 net.cpp:435] pool5 <- conv5
I0428 02:41:19.800758 2429 net.cpp:409] pool5 -> pool5
*** Aborted at 1493347279 (unix time) try "date -d @1493347279" if you are using GNU date ***
PC: @ 0x7fd867e616f0 (unknown)
*** SIGFPE (@0x7fd867e616f0) received by PID 2429 (TID 0x7fd868a41a40) from PID 1743132400; stack trace: ***
@ 0x7fd866415cb0 (unknown)
@ 0x7fd867e616f0 (unknown)
@ 0x7fd867e61b09 (unknown)
@ 0x7fd867ef518c (unknown)
@ 0x7fd867fa7f75 (unknown)
@ 0x7fd867fa8e9a (unknown)
@ 0x7fd867f8c2ba (unknown)
@ 0x7fd867f8d25c (unknown)
@ 0x7fd867f8d5a3 (unknown)
@ 0x7fd8680057d9 (unknown)
@ 0x411d0c (unknown)
@ 0x40a97c (unknown)
@ 0x40867c (unknown)
@ 0x7fd866400f45 (unknown)
@ 0x408e4d (unknown)
@ 0x0 (unknown)

My best guess is that the SIGFPE is triggered by an integer division by zero. In any event the SIGFPE happens in host code, so it has nothing to do with the GPU, and as far as I can tell the stack trace above does not point at NVIDIA-provided code as the source.
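Working through the layer geometry in the log supports that guess. Here is a quick sketch (my own arithmetic, not from the thread, using the usual BVLC Caffe output-size conventions: floor for convolution, ceil for pooling):

```python
# Sketch: replay Caffe's conv/pool output-size arithmetic for AlexNet
# with a 32x32 crop. The floor/ceil conventions are an assumption about
# the exact Caffe branch, but they match the shapes in the log above.
import math

def conv_out(size, kernel, stride=1, pad=0):
    # Convolution: floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride=1, pad=0):
    # Pooling: ceil((size + 2*pad - kernel) / stride) + 1
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

s = 32                       # crop_size
s = conv_out(s, 11, 4)       # conv1 -> 6 (matches "Top shape: 128 96 6 6")
s = pool_out(s, 3, 2)        # pool1 -> 3
s = conv_out(s, 5, 1, 2)     # conv2 -> 3
s = pool_out(s, 3, 2)        # pool2 -> 1
s = conv_out(s, 3, 1, 1)     # conv3 -> 1
s = conv_out(s, 3, 1, 1)     # conv4 -> 1
s = conv_out(s, 3, 1, 1)     # conv5 -> 1
print(pool_out(s, 3, 2))     # pool5 -> 0: a zero-sized blob
```

pool5 applied to a 1x1 input with kernel 3 / stride 2 yields a zero-sized output, and the crash happens right after "pool5 -> pool5" in the log; an integer division by that zero dimension during setup would produce exactly this SIGFPE. Where precisely the division occurs is an assumption, since the binaries are stripped.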

Does the program work with the unmodified model? If so, your modifications may violate assumptions built into the model. The experts on these models are not likely to be found in this forum; there may be better venues for the question.

The data set I’m using (and the one recommended by the DIGITS getting-started documentation) is 32x32, while the unmodified net expects 256x256. The instructions say to change the cropping, so presumably somebody tested this at some point and it’s supposed to work.

I went looking for an NVIDIA DIGITS forum but didn’t find one, and there isn’t really one for NVIDIA’s optimized Caffe training either; this forum seemed the closest available. (I think it’s more a problem with the caffe tool than with the model itself, as the model is "just data.")

So, my guess that debugging this without symbols is a lost cause seems correct-ish?
I’m trying to find a better avenue to engage with the DIGITS developers.

There is a pretty active digits-users Google group.

I think the changes needed for what you want to do go beyond just cropping. I’m not sure which instructions you are referring to. Cropping alone might be sufficient if you were using LeNet; I believe other models such as GoogLeNet need further changes.
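For illustration only (these numbers are mine, not from any model in this thread), networks designed for CIFAR-scale 32x32 inputs typically start with a much smaller, denser first convolution than AlexNet’s 11x11/stride-4, along the lines of Caffe’s bundled cifar10 examples:

```
# Hypothetical conv1 for 32x32 inputs (sketch, cf. Caffe's cifar10 models):
convolution_param {
  num_output: 32
  pad: 2
  kernel_size: 5
  stride: 1
}
```

With stride 1 and padding, the spatial resolution survives long enough for the later pooling stages to have something left to pool.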

DIGITS can use Caffe, and it is certainly possible to build that with symbols. Your package-based install presumably pulled in stripped binaries, but you could use an alternate install method that builds the necessary components from source with symbols.
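For reference, a source build can keep symbols; the exact switches depend on the Caffe branch and build system, so treat this as a sketch:

```
# Makefile build: uncomment this line in Makefile.config, then rebuild
DEBUG := 1

# CMake build: request a configuration that retains debug info
#   cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
```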

I wouldn’t tackle this problem by low level debugging. What you are doing is common enough that specific tutorials have been written about it:

https://blog.kickview.com/training-a-cnn-with-the-cifar-10-dataset-using-digits-5-1/

so the problem more likely lies in what you are doing than in a fundamental bug in the code. I think there is enough information out there that, with a bit of searching, you can figure out how to do this correctly without resorting to debugging.

What I am suggesting is that the problem may be in Caffe itself, thus not specific to GPU acceleration with NVIDIA GPUs. One way to test that hypothesis would be to turn off GPU acceleration in Caffe and see what happens.
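One low-effort way to run that experiment against the job above would be to rerun training with the solver switched to CPU mode (a sketch; this edits the solver.prototxt shown in the log):

```
# In solver.prototxt, replace the GPU lines
#   solver_mode: GPU
#   device_id: 0
# with:
solver_mode: CPU
```

If the crash reproduces on CPU, that points at Caffe’s own setup code rather than anything GPU-specific.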

Have you tried asking in a forum or mailing list dedicated to Caffe? The Caffe website appears to suggest their Google group: https://groups.google.com/forum/#!forum/caffe-users

You may also want to try instrumenting net.cpp (that’s part of the open source code base, correct?) to pinpoint where exactly it dies and why.

FWIW, the most appropriate forum for this question on this site is probably “GPU-Accelerated Libraries”. I see questions related to deep-learning libraries there all the time.

Thanks for the suggestions.

I have, in fact, tried to follow the kickview blog post, and it fails in the way I describe.
The blog post is for DIGITS 4.1; the current release of DIGITS is 5.0.

I will go look in the GPU-Accelerated Libraries forum next time; thanks for the pointer!