Things I’ve done:
- Pull the Clara Train SDK v2.0 docker image (nvcr.io/nvidia/clara-train-sdk:v2.0).
- Use the following script to launch the container:
    sudo docker run \
        --runtime=nvidia \
        --shm-size=1G \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        -it --rm \
        -v /home/myuserename:/workspace/home \
        nvcr.io/nvidia/clara-train-sdk:v2.0 /bin/bash
- Download the MMAR demo, extract it from the archive, and run train.sh.
It first gave an error saying “medical.tlt2” was not found, so I changed $PYTHONPATH to point to a Deploy SDK folder that contains the tlt2 package.
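For reference, here is a minimal sketch of how the $PYTHONPATH change could be double-checked from inside the container. The helper name `is_importable` is my own and not part of the SDK; it only reports whether Python can locate a module on the current search path, and "medical.tlt2" is the name taken from the error message.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if Python can locate module_name on the current sys.path."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package (for example "medical") cannot be imported.
        return False
    return spec is not None


if __name__ == "__main__":
    # Show the effective search path, which includes entries from $PYTHONPATH.
    for entry in sys.path:
        print("search path entry:", entry)
    # Module name taken from the error message; adjust if yours differs.
    print("medical.tlt2 importable:", is_importable("medical.tlt2"))
```

If this prints False after changing $PYTHONPATH, the folder added is probably not the one that directly contains the package.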
Then it gave a confusing error, something like this:
2020-03-19 09:27:10.393142: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "apps/train.py", line 71, in <module>
  File "apps/train.py", line 61, in main
  File "workflows/workflow_factory.py", line 29, in create_trainer
  File "workflows/workflow_factory.py", line 29, in
  File "workflows/workflow_factory.py", line 191, in build_component
  File "utils/compo_module_names.py", line 14, in __init__
  File "utils/compo_module_names.py", line 25, in _create_classes_table
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "components/transforms/transforms.py", line 3, in <module>
  File "tlt2/src/components/transforms/libs/transforms.py", line 25, in <module>
  File "tlt2/src/components/transforms/libs/cupyhelper.py", line 61, in __init__
  File "cupy/cuda/function.pyx", line 178, in cupy.cuda.function.Module.load_file
  File "cupy/cuda/function.pyx", line 182, in cupy.cuda.function.Module.load_file
  File "cupy/cuda/driver.pyx", line 177, in cupy.cuda.driver.moduleLoad
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_FILE_NOT_FOUND: file not found
Since the VM is built on NVIDIA’s NGC image from the GCP Marketplace and I have run the MNIST demo on it successfully, I don’t understand why a CUDA error still comes up. Besides, the output did say that a CUDA library (libcuda.so.1) was loaded successfully.
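One thing I can check in the meantime: the last frames of the traceback go through cupy's Module.load_file, which hands a file path to the driver's module-loading call, so the missing file may be a compiled kernel file rather than a CUDA library. The sketch below walks a folder and lists candidate kernel files. This is only a guess on my part: the extensions are an assumption about how the kernels might be shipped, and the folder to scan (for example the one I added to $PYTHONPATH) is passed on the command line.

```python
import os
import sys

# Assumed extensions for precompiled CUDA kernels; not confirmed for this SDK.
KERNEL_EXTENSIONS = (".cubin", ".ptx", ".fatbin")


def find_kernel_files(root):
    """Return a sorted list of files under root whose extension looks like a CUDA kernel."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(KERNEL_EXTENSIONS):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    # Scan the folder given as the first argument, or the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    found = find_kernel_files(root)
    if not found:
        print("no kernel files found under", root)
    for path in found:
        print(path)
```

If nothing turns up under the folder that holds the tlt2 package, that would at least be consistent with a file path that does not resolve inside the container.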
nvcc --version (with an NVIDIA P100) reports version 440, which satisfies the requirements.
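To be precise about where that number comes from, here is a small sketch that reads the driver version from nvidia-smi, since nvcc reports the CUDA toolkit release rather than the driver version. The helper names are my own. The script assumes nvidia-smi is on the PATH and just prints a message if it is not.

```python
import subprocess


def parse_driver_query(line):
    """Split one 'driver_version, name' CSV row from nvidia-smi into its two fields."""
    driver_version, gpu_name = [field.strip() for field in line.split(",", 1)]
    return driver_version, gpu_name


def driver_major(driver_version):
    """Return the leading number of a driver version string such as '440.x'."""
    return int(driver_version.split(".")[0])


def query_gpus():
    """Ask nvidia-smi for the driver version and name of every visible GPU."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return [parse_driver_query(line) for line in output.splitlines() if line.strip()]


if __name__ == "__main__":
    try:
        for version, name in query_gpus():
            print(name, "| driver", version, "| major", driver_major(version))
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print("could not query nvidia-smi:", error)
```

Running it both on the host and inside the container would show whether the two see the same driver.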