Regarding NVIDIA Merlin Examples

I am trying to run one of the NVIDIA Merlin examples (NVTabular/examples at main · NVIDIA-Merlin/NVTabular · GitHub), specifically NVTabular/examples/getting-started-movielens/03-Training-with-HugeCTR.ipynb, but I get an error when training the HugeCTR model.

Error message:

====================================================Model Init=====================================================

[15d11h59m35s][HUGECTR][INFO]: Global seed is 3188599617

[15d11h59m35s][HUGECTR][INFO]: Device to NUMA mapping:

GPU 0 -> node 0

[15d11h59m37s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.

[15d11h59m37s][HUGECTR][INFO]: Start all2all warmup

[15d11h59m37s][HUGECTR][INFO]: End all2all warmup

[15d11h59m37s][HUGECTR][INFO]: Using All-reduce algorithm OneShot

Device 0: Tesla V100-SXM2-16GB

[15d11h59m37s][HUGECTR][INFO]: num of DataReader workers: 1

[HCDEBUG][ERROR] Runtime error: file list open failed: /root/nvt-examples/movielens/data/train/_file_list.txt /var/tmp/HugeCTR/HugeCTR/include/data_readers/file_list.hpp:63

RuntimeError Traceback (most recent call last)

/tmp/ipykernel_663/ in

 23 model = hugectr.Model(solver, reader, optimizer)


---> 25 model.add(

 26     hugectr.Input(

 27         label_dim=1,

RuntimeError: [HCDEBUG][ERROR] Runtime error: file list open failed: /root/nvt-examples/movielens/data/train/_file_list.txt /var/tmp/HugeCTR/HugeCTR/include/data_readers/file_list.hpp:63
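The error says the data reader could not open `_file_list.txt` in the training data directory. A quick check along the following lines may help narrow down whether the file is missing or whether it points at data files that are not there. This is only a sketch: `check_file_list` is a hypothetical helper, not part of HugeCTR or NVTabular, and it assumes the file list holds a file count on the first line followed by one data file path per line.

```python
import os


def check_file_list(path):
    """Return a list of problems found with a file list (empty if none)."""
    if not os.path.isfile(path):
        return [f"file list not found: {path}"]

    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]

    if not lines:
        return [f"file list is empty: {path}"]

    # Assumed layout: line 1 = number of files, remaining lines = one path each.
    try:
        declared = int(lines[0])
    except ValueError:
        return [f"first line is not a file count: {lines[0]!r}"]

    problems = []
    entries = lines[1:]
    if declared != len(entries):
        problems.append(
            f"header declares {declared} files but {len(entries)} are listed"
        )
    # Relative entries are resolved against the current working directory.
    for entry in entries:
        if not os.path.isfile(entry):
            problems.append(f"listed data file not found: {entry}")
    return problems


if __name__ == "__main__":
    # Path taken from the error message above.
    target = "/root/nvt-examples/movielens/data/train/_file_list.txt"
    for problem in check_file_list(target):
        print(problem)
```

If this reports that the file list is not found, the preprocessing output may not have been written to that directory inside the container, which would be worth checking before the training notebook is run again.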

To run this code, we used the Docker image from Merlin Training | NVIDIA NGC and followed all the steps given at that link.


We are using an EC2 instance with the AWS Deep Learning AMI (Ubuntu 18.04).

The instance type is p3.2xlarge, with 61 GiB of memory, 8 virtual CPU cores, and one NVIDIA V100 GPU. Storage is EBS, with an S3 bucket and 500 GB volumes added.

To clarify: what version of the NGC Merlin Training container are you running, i.e. the one on your docker command line? What version of the Data Science Workbench software are you running?

docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host /bin/bash

I am running the docker command above, and yes, I am running the 22.02 version.