/usr/local/bin/nvidia_entrypoint.sh using 'find -L' causes major problems

paul.raines · August 17, 2021, 10:04pm

In the /usr/local/bin/nvidia_entrypoint.sh script, one of the first steps is a “find -L /usr -name libcuda.so.1”. If the user happens to bind mount a large local volume of their own under /usr, this find walks the whole volume, resulting in huge startup times just to bootstrap the container.

In my case, /usr/pubsw is a mount on our systems holding a huge (1TB+) trove of third-party software. To replicate our environment in the container, we bind mount it at the same location in docker/singularity when running the NGC container, but the container then spends over 15 minutes doing this find -L.

Instead, I think the entrypoint script should search only the directories that are in LD_LIBRARY_PATH, or just test for /dev/nvidiactl.
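
To make the suggestion concrete, here is a rough sketch of what such a check could look like. This is not the actual NGC entrypoint code; the function names find_libcuda and gpu_driver_present are hypothetical, and it assumes that the presence of /dev/nvidiactl or of libcuda.so.1 in an LD_LIBRARY_PATH directory is a good enough signal for whatever the script does with the find result.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not taken from nvidia_entrypoint.sh.

# Print the first libcuda.so.1 found in a colon-separated directory list.
# Only the listed directories are checked; nothing is searched recursively.
find_libcuda() {
    _rest=$1
    while [ -n "$_rest" ]; do
        _dir=${_rest%%:*}            # first entry of the remaining list
        if [ -n "$_dir" ] && [ -e "$_dir/libcuda.so.1" ]; then
            printf '%s\n' "$_dir/libcuda.so.1"
            return 0
        fi
        case $_rest in
            *:*) _rest=${_rest#*:} ;;   # drop the entry just checked
            *)   _rest= ;;              # that was the last entry
        esac
    done
    return 1
}

# Succeed if the NVIDIA control device exists, or if libcuda.so.1 is in
# one of the LD_LIBRARY_PATH directories.
gpu_driver_present() {
    [ -e /dev/nvidiactl ] || find_libcuda "${LD_LIBRARY_PATH:-}" >/dev/null
}
```

A script could then call gpu_driver_present in place of the find and branch on its exit status. Either check is constant-time with respect to the size of anything mounted under /usr, which is the point of the change.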

paul.raines · August 18, 2021, 6:12pm

For anyone who has the same problem and finds this: our workaround is to override the /usr/local/bin/nvidia_entrypoint.sh script in the container with a bind mount of our own nvidia_entrypoint.sh script, from which we removed the “find”.

singularity run --nv -B /usr/pubsw -B /usr/pubsw/IMAGES/nvidia_entrypoint.sh:/usr/local/bin/nvidia_entrypoint.sh …