Segmentation fault (core dumped) while training a CNN model using a SLURM script

I am facing a segmentation fault when I attempt to train a CNN model. I train on our organization's supercomputer, using an NVIDIA Tesla V100 GPU.

The model trains fine for smaller numbers of epochs. For example, with epochs=50 the run completes and writes the trained model file (via a ModelCheckpoint callback passed to model.fit). But with epochs=75, on the same dataset, Jupyter reports that the kernel disconnected at the 54th epoch.

I then ran the same code as a .py file submitted as a job through a SLURM script. This also gave a 'Segmentation fault (core dumped)' error at the 54th epoch, the same place where the Jupyter kernel died. What could be the reason?
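To check whether the crash correlates with memory growth, I plan to log the process's resident memory once per epoch with a small callback. A minimal sketch, assuming psutil is available on the node (it is not in my environment list below):

import os

import psutil  # assumption: may need to be installed separately
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    # Prints this process's resident memory at the end of every epoch.
    def on_epoch_end(self, epoch, logs=None):
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
        print("epoch %d: resident memory %.2f GB" % (epoch + 1, rss_gb))

# added alongside the existing ModelCheckpoint callback:
# callbacks = [save_model, MemoryLogger()]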

The dataset has shape 80,000 × 5,600 and is around 9 GB. Following is the code:

import os
import os.path
import sys

import h5py
import numpy as np
import tensorflow as tf
from tensorflow import keras

def load_file(database_file, load_data=False):
    in_file = h5py.File(database_file, "r")

    # load the train/test arrays into memory; only 'train/data' and
    # 'test/data' are certain, the label key names are placeholders
    X_train = np.array(in_file['train/data'])
    Y_train = np.array(in_file['train/label'])
    X_test = np.array(in_file['test/data'])
    Y_test = np.array(in_file['test/label'])

    if not load_data:
        return (X_train, Y_train), (X_test, Y_test)
    else:
        return (X_train, Y_train), (X_test, Y_test), (in_file['train/data'], in_file['test/data'])

database = "/home/..../dataset.h5"
trained_model = "/home..../trained_epoch_test.h5"

(X_train, Y_train), (X_test, Y_test) = load_file(database)

# add a channel axis so each trace has shape (length, 1) for Conv1D
X_train_scaled = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test_scaled = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# one-hot encode the labels for categorical_crossentropy
y_train_categorical = keras.utils.to_categorical(Y_train, num_classes=256)
y_test_categorical = keras.utils.to_categorical(Y_test, num_classes=256)

batch_size = 800
epochs = 1  # was set to 50 / 75 for the runs described above

classes = 256
input_shape = (10000, 1)
img_input = keras.layers.Input(shape=input_shape)

x = keras.layers.Conv1D(64, 11, activation='relu', padding='same', name='block1_conv1')(img_input)
x = keras.layers.AveragePooling1D(2, strides=2, name='block1_pool')(x)

x = keras.layers.Conv1D(128, 11, activation='relu', padding='same', name='block2_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block2_pool')(x)

x = keras.layers.Conv1D(256, 11, activation='relu', padding='same', name='block3_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block3_pool')(x)

x = keras.layers.Conv1D(512, 11, activation='relu', padding='same', name='block4_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block4_pool')(x)

x = keras.layers.Conv1D(512, 11, activation='relu', padding='same', name='block5_conv1')(x)
x = keras.layers.AveragePooling1D(2, strides=2, name='block5_pool')(x)

x = keras.layers.Flatten(name='flatten')(x)
x = keras.layers.Dense(4096, activation='relu', name='fc1')(x)
x = keras.layers.Dense(4096, activation='relu', name='fc2')(x)
x = keras.layers.Dense(classes, activation='softmax', name='predictions')(x)
  
model = keras.models.Model(img_input, x, name='cnn_best')
optimizer = keras.optimizers.RMSprop(learning_rate=0.00001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
save_model = keras.callbacks.ModelCheckpoint(trained_model)
callbacks=[save_model]
    
model.fit(x=X_train_scaled, y=y_train_categorical, batch_size=batch_size, verbose=1, epochs=epochs, callbacks=callbacks)
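Since the whole training array, plus the one-hot copy produced by to_categorical, is held in RAM at once, one variant I am considering streams batches straight from the HDF5 file through keras.utils.Sequence (which model.fit accepts in TF 2.x). A minimal sketch, assuming the file layout above; 'train/label' is a placeholder key name:

import h5py
import numpy as np
from tensorflow import keras

class H5Sequence(keras.utils.Sequence):
    # Reads one batch at a time from the HDF5 file instead of loading it all.
    def __init__(self, h5_path, data_key, label_key, batch_size, num_classes=256):
        self.file = h5py.File(h5_path, "r")
        self.data = self.file[data_key]      # stays on disk; read per batch
        self.labels = self.file[label_key]
        self.batch_size = batch_size
        self.num_classes = num_classes

    def __len__(self):
        return int(np.ceil(self.data.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.asarray(self.data[sl], dtype=np.float32)[..., np.newaxis]
        y = keras.utils.to_categorical(self.labels[sl], num_classes=self.num_classes)
        return x, y

# usage sketch:
# train_seq = H5Sequence(database, 'train/data', 'train/label', batch_size)
# model.fit(train_seq, epochs=epochs, callbacks=callbacks)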

Following is the SLURM script I used when running the .py file outside Jupyter Notebook:

#!/bin/bash
#SBATCH --partition=GPU_three
#SBATCH --nodelist=Node_03
#SBATCH --output=output

python3 train.py
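The script requests no explicit resources from SLURM. A variant with explicit GPU, memory, and wall-time requests would look like this sketch (--gres, --mem, and --time are standard SLURM directives, but the values here are placeholders and the right ones depend on the cluster):

#!/bin/bash
#SBATCH --partition=GPU_three
#SBATCH --nodelist=Node_03
#SBATCH --output=output
#SBATCH --gres=gpu:1        # request one GPU explicitly
#SBATCH --mem=64G           # placeholder: CPU RAM for the job
#SBATCH --time=24:00:00     # placeholder: wall-time limit

python3 train.py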

Other details:

OS: Linux
Python version: 3.9.7
conda list (packages in environment at /home/…/test; preferred channel: WMLCE):

cudatoolkit           10.2.89     684.g752c550
cudnn                 7.6.5_10.2  650.g338a052
h5py                  2.8.0       py37h8d01980_0
hdf5                  1.10.2      hba1933b_1
ipython               7.29.0      py37he95b402_0
jupyter_client        7.1.2       pyhd3eb1b0_0
jupyter_core          4.9.1       py37h6ffa863_0
jupyterlab_pygments   0.1.2       py_0
keras                 2.3.1       690.gf2fc3f6
keras-applications    1.0.8       py_1
keras-base            2.3.1       py37_690.gf2fc3f6
keras-gpu             2.3.1       690.gf2fc3f6
keras-preprocessing   1.1.0       py_1
matplotlib            3.4.3       py37h6ffa863_0
matplotlib-base       3.4.3       py37he087750_0
matplotlib-inline     0.1.2       pyhd3eb1b0_2
python                3.7.11      h836d2c2_0
python-dateutil       2.8.2       pyhd3eb1b0_0
pyyaml                5.4.1       py37h140841e_1
pyzmq                 22.3.0      py37h29c3540_2
readline              8.1.2       h140841e_1
requests              2.22.0      py37_1
scipy                 1.3.1       py37he2b7bc3_0
send2trash            1.8.0       pyhd3eb1b0_1
setuptools            58.0.4      py37h6ffa863_0
six                   1.13.0      py37_0
sqlite                3.37.0      hd7247d8_0
tensorboard           2.1.1       py37_66d10d7_4000.g7f90012
tensorflow            2.1.3       gpu_py37_945.g7f90012
tensorflow-base       2.1.3       gpu_py37_77f47d6_72821.g5e36fbc
tensorflow-estimator  2.1.0       py37_7ec4e5d_1493.g7f90012
tensorflow-gpu        2.1.3       945.g7f90012
tensorrt              7.0.0.11    py37_698.g9922bde
termcolor             1.1.0       py37h6ffa863_1
terminado             0.9.4       py37h6ffa863_0
testpath              0.5.0       pyhd3eb1b0_0
tk                    8.6.11      h7e00dab_0
tornado               6.1         py37h140841e_0
traitlets             5.1.1       pyhd3eb1b0_0
typing_extensions     3.10.0.2    pyh06a4308_0
uff                   0.6.5       py37_698.g9922bde