Got exception error on data in validation key set

I tried to train a model using the command in clara_xray_classification_chest_amp/commands with my own dataset.
I changed several settings in the config directory to point to the new dataset.
My dataset has only one class (positive or negative).
When I run the training script, training on the training key set seems to work fine, but when it reaches the validation key set it throws the following error:

Exception: <class 'ValueError'>: Input contains NaN, infinity or a value too large for dtype('float32').
  File "workflows/fitters/supervised_fitter.py", line 224, in fit
  File "workflows/fitters/supervised_fitter.py", line 625, in _do_fit
  File "components/metrics/metric.py", line 89, in get
  File "libs/metrics/auc.py", line 64, in get
  File "/usr/local/lib/python3.6/dist-packages/sklearn/metrics/ranking.py", line 355, in roc_auc_score
    sample_weight=sample_weight)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/metrics/base.py", line 80, in _average_binary_score
    y_score = check_array(y_score)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))

I found that the error is thrown when the data in the validation key set are labeled with different classes.
I am not sure how to fix this problem.

Hi
Thanks for your interest in Clara Train SDK.
The error occurs in the validation step during training, or when running the validation script. The root cause is usually an AUC metric configured for a class index that doesn’t exist in your data, so the computation runs into a NaN / division-by-zero error.
If you get this error from the validation script (after fully training a network), make sure the metrics in validation.json (which is separate from training.json) match the validation section of training.json.
If you get this error during training, fix the validation section of training.json by removing the AUC metrics for indices that are not in your dataset/problem.
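As a rough way to see which AUC indices are safe to keep, the sketch below checks each class index for the two conditions that usually break the metric: a label column that contains only one class, and non-finite predicted scores. The function name `auc_index_report` and the sample data are hypothetical and not part of Clara Train SDK; you would have to export labels and predictions yourself to use it.

```python
import math

def auc_index_report(labels, scores):
    """For each class index, report whether an AUC could be computed.

    labels: per-sample multi-hot label lists, e.g. [[1, 0], [0, 0]]
    scores: per-sample predicted score lists with the same shape
    """
    num_classes = len(labels[0])
    report = {}
    for idx in range(num_classes):
        y_true = [row[idx] for row in labels]
        y_score = [row[idx] for row in scores]
        report[idx] = {
            # AUC is undefined if only positives or only negatives are present
            "has_both_classes": len(set(y_true)) == 2,
            # NaN or infinite scores trigger the ValueError seen in the traceback
            "scores_finite": all(math.isfinite(s) for s in y_score),
        }
    return report

# Hypothetical validation data: index 1 never occurs as a positive label
labels = [[1, 0], [0, 0], [1, 0]]
scores = [[0.9, 0.1], [0.2, float("nan")], [0.7, 0.3]]

report = auc_index_report(labels, scores)
for idx, status in report.items():
    print(idx, status)
```

Any index that fails either check is a candidate for removal from the AUC metrics in the validation section.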

If you still face a problem, please attach the configuration that causes the error.

Hi
Thanks for your help. I got this error during training. I tried removing some of the AUC metrics from the validation section, but the error still occurs, so I have attached the configuration I used for training here: https://drive.google.com/file/d/1yb__Rw0wS_G6nxcUXYfTxvLrMIre-3OH/view?usp=sharing

Other problems we found:

  1. validate.sh shows a score of zero for every metric in config_validation.json
  2. infer.sh writes only one line of results to preds_model.csv

I used a model trained without the validation section to test the scripts above. I don’t know whether these problems are related to the validation section exception.

Hi
For #1, I see you only ran for 3 epochs, so a zero/low score is expected.
For #2, you should get one line per image under the validation tag in the dataset.

I am not sure you can/should train without a validation section, as you will always get an error once you get to the validation step. Can you run train.sh with a full config and share the full error log?

Thanks

Hi
For #1, I tried a model trained for 10 epochs, but the scores are still 0. Note that this model was trained without the validation section because of the exception I encountered.
In train.sh there is a num_training_epoch_per_valid argument that was set to 1. Does this mean validation runs after every training epoch? Should I change it to a higher number?
For #2, there is only one result line in preds_model.csv even though DATA_LIST_KEY contains more than one item. Should the number of result lines equal the number of items in DATA_LIST_KEY?

I can train the model without the validation section by removing the whole section from config_train.sh, and no error is shown during training. However, this produces only model_final.ckpt, because without a metric score there is no way to determine which epoch's model is the best.

Hi
I think I fixed the exception by increasing the number of epochs from 3 to 40 and num_training_epoch_per_valid from 1 to 10.

However, problems #1 and #2 still exist.

I found a mistake in datalist.json that caused problems #1 and #2: the entries under DATA_LIST_KEY all had the same values.
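For anyone hitting the same issue, a quick duplicate check along these lines may help catch it early. This is only a sketch: it assumes the data list is a JSON object whose keys (such as "validation") map to lists of entries with an "image" field, and the helper name `find_duplicate_entries` and the file names are made up for illustration.

```python
import json
from collections import Counter

def find_duplicate_entries(datalist, key):
    """Return image paths that appear more than once under `key`."""
    counts = Counter(item["image"] for item in datalist[key])
    return {path: n for path, n in counts.items() if n > 1}

# Hypothetical data list contents; in practice, load your own file with
# json.load(open("datalist.json")) and pass the key used as DATA_LIST_KEY.
datalist = json.loads("""
{
  "validation": [
    {"image": "img_001.png", "label": [1]},
    {"image": "img_001.png", "label": [1]},
    {"image": "img_002.png", "label": [0]}
  ]
}
""")

duplicates = find_duplicate_entries(datalist, "validation")
print(duplicates)
```

An empty result means every image path under that key is unique.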

Thanks for your help.