I recently acquired a new machine with a GeForce RTX 3090 GPU. I installed the appropriate drivers and am using an Anaconda environment with Python 3.8, TensorFlow 2.3.1, Keras 2.4.3, and CUDA release 10.1, V10.1.105.
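In case it helps, here is a quick way to confirm that TensorFlow sees the GPU at all (a minimal sketch; tf.config.list_physical_devices is the standard TF 2.x call):

import tensorflow as tf
#print the TF version and any GPUs TensorFlow has registered
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))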
I was able to train a model using Conv3D layers, but for some reason, after switching to Conv2D layers, the network is unable to learn anything (loss and accuracy remain at their initial values). I trained the same network on my old machine, where it learned without issue, and I still see this discrepancy after matching the Python/TensorFlow/Keras versions across both machines.
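To narrow things down, one sanity check is to compare the raw conv2d op on CPU and GPU with the same random input, independent of any model (a minimal sketch; if the GPU result is wildly different or NaN, the problem would be in the CUDA stack rather than the network):

import numpy as np
import tensorflow as tf

#same random image and kernel on both devices
x = np.random.rand(1, 28, 28, 1).astype('float32')
w = np.random.rand(3, 3, 1, 8).astype('float32')

with tf.device('/CPU:0'):
    cpu_out = tf.nn.conv2d(x, w, strides=1, padding='SAME')
with tf.device('/GPU:0'):
    gpu_out = tf.nn.conv2d(x, w, strides=1, padding='SAME')

#should be ~0 if the GPU convolution is computed correctly
print(np.abs(cpu_out.numpy() - gpu_out.numpy()).max())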
I then tried replicating an MNIST classifier from this public repo: https://github.com/sambit9238/Deep-Learning/blob/master/cnn_mnist.ipynb. The code is copied below:
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras import backend as k
import matplotlib.pyplot as plt
import numpy as np

#load mnist dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()  #every time, loading data won't be so easy :)
img_rows, img_cols = 28, 28
#reshaping
#this assumes our data format
#For 3D data, "channels_last" assumes (conv_dim1, conv_dim2, conv_dim3, channels) while
#"channels_first" assumes (channels, conv_dim1, conv_dim2, conv_dim3).
if k.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
#more reshaping
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

#set number of categories
num_category = 10

#convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_category)
y_test = keras.utils.to_categorical(y_test, num_category)

##model building
model = Sequential()
#convolutional layer with rectified linear unit activation
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
#32 convolution filters used each of size 3x3
#again
model.add(Conv2D(64, (3, 3), activation='relu'))
#64 convolution filters used each of size 3x3
#choose the best features via pooling
model.add(MaxPooling2D(pool_size=(2, 2)))
#randomly turn neurons on and off to improve convergence
model.add(Dropout(0.25))
#flatten since too many dimensions, we only want a classification output
model.add(Flatten())
#fully connected to get all relevant data
model.add(Dense(128, activation='relu'))
#one more dropout for convergence's sake :)
model.add(Dropout(0.5))
#output a softmax to squash the matrix into output probabilities
model.add(Dense(num_category, activation='softmax'))
#Adadelta is a popular adaptive-learning-rate form of gradient descent, alongside Adam and Adagrad
#categorical crossentropy since we have multiple classes (10)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(learning_rate=0.001),
              metrics=['accuracy'])

batch_size = 128
num_epoch = 10
#model training
model_log = model.fit(X_train, y_train,
                      batch_size=batch_size,
                      epochs=num_epoch,
                      verbose=1,
                      validation_data=(X_test, y_test))
And during the training phase, the network is unable to learn anything on my machine, regardless of the learning rate:
LEARNING RATE = 0.001
Epoch 1/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1037 - val_loss: 2.3026 - val_accuracy: 0.1078
Epoch 2/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1068 - val_loss: 2.3026 - val_accuracy: 0.1118
Epoch 3/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1097 - val_loss: 2.3026 - val_accuracy: 0.1129
Epoch 4/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1115 - val_loss: 2.3025 - val_accuracy: 0.1134
Epoch 5/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1120 - val_loss: 2.3025 - val_accuracy: 0.1134
Epoch 6/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1111 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 7/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1122 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 8/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3027 - accuracy: 0.1123 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 9/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3028 - accuracy: 0.1126 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 10/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3030 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135
LEARNING RATE = 0.01
Epoch 1/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1066 - val_loss: 2.3026 - val_accuracy: 0.1128
Epoch 2/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1122 - val_loss: 2.3026 - val_accuracy: 0.1118
Epoch 3/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1116 - val_loss: 2.3026 - val_accuracy: 0.1118
Epoch 4/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1120 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 5/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 6/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 7/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3025 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 8/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 9/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1124 - val_loss: 2.3025 - val_accuracy: 0.1135
Epoch 10/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3026 - accuracy: 0.1124 - val_loss: 2.3024 - val_accuracy: 0.1135
LEARNING RATE = 0.1
Epoch 1/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3054 - accuracy: 0.1119 - val_loss: 2.3016 - val_accuracy: 0.1135
Epoch 2/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3048 - accuracy: 0.1124 - val_loss: 2.3013 - val_accuracy: 0.1135
Epoch 3/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3052 - accuracy: 0.1123 - val_loss: 2.3011 - val_accuracy: 0.1135
Epoch 4/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3087 - accuracy: 0.1123 - val_loss: 2.3011 - val_accuracy: 0.1135
Epoch 5/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3069 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135
Epoch 6/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3108 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135
Epoch 7/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3133 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135
Epoch 8/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3192 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135
Epoch 9/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3184 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135
Epoch 10/10 469/469 [==============================] - 1s 3ms/step - loss: 2.3284 - accuracy: 0.1124 - val_loss: 2.3010 - val_accuracy: 0.1135
As you can see, the network does not learn. I am quite stumped. Does anyone know how I might be able to progress?
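One diagnostic I can still try is hiding the GPU and rerunning the exact same script on the CPU (a sketch; the environment variable must be set before TensorFlow is imported):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  #hide the GPU from TensorFlow
#...then run the training code above; if the loss starts decreasing on CPU,
#the model is fine and the issue is somewhere in the GPU/CUDA setup.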