Data problem when fine-tuning ESM2 with BioNeMo

Hello everyone,

We have encountered some difficulties while trying to fine-tune the ESM2 model.

  • We tried to run the pretrain.py script from /examples/protein/esm2nv with the example data as follows:

python pretrain.py ++model.data.dataset.train=/workspace/bionemo/examples/tests/test_data/protein/train/x000.csv ++model.data.dataset.val=/workspace/bionemo/examples/tests/test_data/protein/val/x000.csv ++model.data.dataset.test=/workspace/bionemo/examples/tests/test_data/protein/test/x000.csv ++trainer.devices=1

  • With the ESM1 model, this worked.
  • With the ESM2 model, which is the one we actually want to use, the run failed because some files were not found, and we do not understand why they would be needed: AssertionError: Following files do not exist /workspace/bionemo/data/uniref202104_esm2/uf50/train/x000.csv , …
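
In case it helps with debugging, a quick way to confirm which of these paths actually exist inside the container is a small check like the one below. It is only a sketch: `missing_files` is our own hypothetical helper, not part of BioNeMo, and the paths are simply the ones from the command and the error message above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


if __name__ == "__main__":
    # Paths copied from the command line and from the AssertionError above.
    candidates = [
        "/workspace/bionemo/examples/tests/test_data/protein/train/x000.csv",
        "/workspace/bionemo/examples/tests/test_data/protein/val/x000.csv",
        "/workspace/bionemo/examples/tests/test_data/protein/test/x000.csv",
        "/workspace/bionemo/data/uniref202104_esm2/uf50/train/x000.csv",
    ]
    for path in missing_files(candidates):
        print("missing:", path)
```

If only the uniref202104_esm2 path is reported as missing, the overrides we passed point at valid files and the failing path must come from somewhere else in the configuration.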

Why is this file needed for the ESM2 model? Is there perhaps a bug somewhere in the configuration?
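
To see where the path might be coming from, one option is to search the example's config files for the dataset name from the error message. The snippet below is a rough sketch under a few assumptions: `find_references` is a hypothetical helper of ours, the configs are assumed to be YAML files somewhere under the esm2nv example directory, and the /workspace/bionemo prefix is inferred from the paths in our command.

```python
from pathlib import Path


def find_references(config_root, needle, pattern="*.yaml"):
    """Yield (file, line number, line) for each config line containing `needle`."""
    for path in sorted(Path(config_root).rglob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if needle in line:
                    yield str(path), lineno, line.strip()


if __name__ == "__main__":
    # Assumed location of the example inside the container; adjust as needed.
    root = "/workspace/bionemo/examples/protein/esm2nv"
    for path, lineno, line in find_references(root, "uniref202104_esm2"):
        print(f"{path}:{lineno}: {line}")
```

Any config key that shows up here and is not covered by our three overrides would be a candidate for the setting that still points at the default dataset.
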
Also, in the future we would prefer to use a data preprocessing script from BioNeMo instead of our own, but we noticed that several preprocessing scripts are available. Is there one you would specifically recommend for fine-tuning ESM2?

Any input is very much appreciated!