Simple off-the-shelf MNIST classifier using Keras and TensorFlow will not learn on a GeForce RTX 3090 GPU

I recently acquired a new machine with a GeForce RTX 3090 GPU. I installed the appropriate drivers, and I am using an Anaconda environment with Python 3.8, TensorFlow 2.3.1, Keras 2.4.3, and CUDA release 10.1, V10.1.105.
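For reference, a quick way to confirm what the installed TensorFlow binary was built against and whether the card is visible at all (a diagnostic sketch, not from the repo below; tf.sysconfig.get_build_info() exists in TF 2.3+, though the exact dictionary keys vary between releases):

import tensorflow as tf

print(tf.__version__)
#CUDA/cuDNN versions this TF binary was compiled against
print(tf.sysconfig.get_build_info())
#confirms TensorFlow can actually see the RTX 3090
print(tf.config.list_physical_devices('GPU'))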

I was able to train a model using Conv3D layers, but for some reason, when switching over to Conv2D layers, the network is unable to learn anything (loss/accuracy stay at their post-initialization values). I trained the same network on my old machine, where it learned without issue. I tried matching all Python/TensorFlow/Keras versions and still saw the same discrepancy.

I then tried replicating an MNIST classifier from this public repo: https://github.com/sambit9238/Deep-Learning/blob/master/cnn_mnist.ipynb. The code is copied below:

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras import backend as k
import matplotlib.pyplot as plt
import numpy as np

#load mnist dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data() #everytime loading data won't be so easy :)

img_rows, img_cols = 28, 28

#reshaping
#this assumes our data format
#For 3D data, "channels_last" assumes (conv_dim1, conv_dim2, conv_dim3, channels) while
#"channels_first" assumes (channels, conv_dim1, conv_dim2, conv_dim3).
if k.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
#more reshaping
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

#set number of categories
num_category = 10

#convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_category)
y_test = keras.utils.to_categorical(y_test, num_category)

##model building
model = Sequential()
#convolutional layer with rectified linear unit activation
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
#32 convolution filters used each of size 3x3
#again
model.add(Conv2D(64, (3, 3), activation='relu'))
#64 convolution filters used each of size 3x3
#choose the best features via pooling
model.add(MaxPooling2D(pool_size=(2, 2)))
#randomly turn neurons on and off to improve convergence
model.add(Dropout(0.25))
#flatten since too many dimensions, we only want a classification output
model.add(Flatten())
#fully connected to get all relevant data
model.add(Dense(128, activation='relu'))
#one more dropout for convergence's sake :)
model.add(Dropout(0.5))
#output a softmax to squash the matrix into output probabilities
model.add(Dense(num_category, activation='softmax'))
#Adaptive learning rate (adaDelta) is a popular form of gradient descent rivaled only by adam and adagrad
#categorical ce since we have multiple classes (10)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(learning_rate=0.001),
              metrics=['accuracy'])

batch_size = 128
num_epoch = 10
#model training
model_log = model.fit(X_train, y_train,
                      batch_size=batch_size,
                      epochs=num_epoch,
                      verbose=1,
                      validation_data=(X_test, y_test))
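As a quick sanity check of the data pipeline itself (not part of the original repo), the inputs should come out as float32 in [0, 1] with one-hot labels:

#data-pipeline sanity check: float32 inputs in [0, 1], one-hot labels
print(X_train.shape, X_train.dtype, X_train.min(), X_train.max())
print(y_train.shape, y_train.sum(axis=1)[:5]) #each row should sum to 1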

And during the training phase, the network is unable to learn anything on my machine:

LEARNING RATE = 0.001

Epoch 1/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1037 - val_loss: 2.3026 - val_accuracy: 0.1078

Epoch 2/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1068 - val_loss: 2.3026 - val_accuracy: 0.1118

Epoch 3/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1097 - val_loss: 2.3026 - val_accuracy: 0.1129

Epoch 4/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1115 - val_loss: 2.3025 - val_accuracy: 0.1134

Epoch 5/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1120 - val_loss: 2.3025 - val_accuracy: 0.1134

Epoch 6/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1111 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 7/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1122 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 8/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3027 - accuracy: 0.1123 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 9/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3028 - accuracy: 0.1126 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 10/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3030 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135

LEARNING RATE = 0.01

Epoch 1/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1066 - val_loss: 2.3026 - val_accuracy: 0.1128

Epoch 2/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1122 - val_loss: 2.3026 - val_accuracy: 0.1118

Epoch 3/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1116 - val_loss: 2.3026 - val_accuracy: 0.1118

Epoch 4/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1120 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 5/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 6/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 7/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 8/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 9/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135

Epoch 10/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1124 - val_loss: 2.3024 - val_accuracy: 0.1135

LEARNING RATE = 0.1

Epoch 1/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3054 - accuracy: 0.1119 - val_loss: 2.3016 - val_accuracy: 0.1135

Epoch 2/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3048 - accuracy: 0.1124 - val_loss: 2.3013 - val_accuracy: 0.1135

Epoch 3/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3052 - accuracy: 0.1123 - val_loss: 2.3011 - val_accuracy: 0.1135

Epoch 4/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3087 - accuracy: 0.1123 - val_loss: 2.3011 - val_accuracy: 0.1135

Epoch 5/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3069 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135

Epoch 6/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3108 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135

Epoch 7/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3133 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135

Epoch 8/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3192 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135

Epoch 9/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3184 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135

Epoch 10/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3284 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135

As you can see, the network does not learn. I am quite stumped. Does anyone know how I might be able to progress?
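One diagnostic that should narrow this down (a sketch I have not verified on the 3090; Keras 2.4 is a thin wrapper around tf.keras, so GradientTape works with these models): run a single training step by hand and inspect the gradients. All-NaN or all-zero gradients would implicate the GPU kernels rather than the model or the data.

import numpy as np
import tensorflow as tf

#run one manual training step and inspect the gradients
x_batch, y_batch = X_train[:128], y_train[:128]
with tf.GradientTape() as tape:
    preds = model(x_batch, training=True)
    loss = tf.reduce_mean(keras.losses.categorical_crossentropy(y_batch, preds))
grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    g = grad.numpy()
    print(var.name, 'nan:', np.isnan(g).any(), 'max|g|:', float(np.abs(g).max()))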

Hello,

Sadly, there's no reply yet. However, I'm facing a similar problem: I've got several systems I run my code on, among them two RTX systems (one with an RTX 2080 Ti and another with a new RTX 3090).

I wrote some code for a publication, and on the Titan Xp, the Tesla V100, and my personal GTX 1080 Ti it runs just fine and learns properly; the same used to be true of the RTX 2080 Ti. At some point in the past half year, however, the RTX cards must have stopped producing reliable results. This only became apparent when I received the new RTX 3090: during hyperparameter optimization, most training losses were simply NaN right from the beginning of the first epoch. For some hyperparameter combinations the results looked "meaningful" at first, but on closer analysis they were unexpectedly low, differing only in the 4th or 5th decimal place, so they were anything but useful. The same now applies to the RTX 2080 Ti, so I assume the problem was introduced with a certain driver version.
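In case it helps anyone chasing the same NaNs: TensorFlow can be told to fail loudly at the first op that produces a NaN/Inf instead of letting it propagate silently (available since roughly TF 2.1):

import tensorflow as tf

#raise an error at the first op that emits NaN/Inf instead of
#letting it silently propagate through training
tf.debugging.enable_check_numerics()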

I then read that the RTX 3090 requires CUDA 11.1 or newer and cuDNN 8.0.4 or newer to operate properly. And that's where I've been struggling for three weeks now: since the systems are also used by other people who need their own particular setups, I'm pretty much bound to Windows and Conda, and I could not figure out a combination of Conda packages to get this sh… eer nerve irritation to work.
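A quick check of whether an installed build even knows the card (a sketch; tf.config.experimental.get_device_details needs TF 2.4+, so it only runs in an environment that already has a CUDA-11-based build): the RTX 3090 reports compute capability 8.6, and TF binaries built against CUDA 10.x simply ship no kernels for that architecture.

import tensorflow as tf

#Ampere cards (RTX 30xx) report compute capability 8.6;
#TF builds against CUDA 10.x ship no kernels for that architecture
for gpu in tf.config.list_physical_devices('GPU'):
    print(tf.config.experimental.get_device_details(gpu))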

Have you found out anything new so far? Or did you get it to work?

Kind regards