Hi all.
(First of all, sorry in advance for the long message. Additionally, I’m not quite sure this is the proper place in the forums, so please forgive me if my choice proves unwise.)
I’m trying to install gdrcopy on an openSUSE Leap 15.6 workstation equipped with two GPU cards, an RTX 3060 12 GB and an RTX 4060 Ti 8 GB. As far as I could verify, RTX cards should be compatible with it, correct?
The reason for this is my suspicion that the systems I mainly simulate on it are underusing the GPU (although I have no idea how to actually measure how much of the GPU is being used), so I would be better served by running more simultaneous processes on a single GPU. I am building from source because openSUSE Leap 15.6 does not seem to have some of the RPMs needed for a “simple” install (rpmbuild and dkms; for the latter I’m assuming dkms-nvidia, which fails for the G06 driver).
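On the measurement question, here is a rough sketch of one way utilization could be sampled, assuming `nvidia-smi` (shipped with the NVIDIA driver) is on the PATH. The function names `sample_gpu_util` and `avg_gpu_util` are made up for this illustration.

```shell
# Print "index, utilization" once per second for the given number of seconds.
# Assumes nvidia-smi is on PATH; timeout(1) stops the otherwise endless loop.
sample_gpu_util() {
    timeout "${1:-10}" nvidia-smi \
        --query-gpu=index,utilization.gpu \
        --format=csv,noheader,nounits -l 1
}

# Average the utilization column per GPU from lines such as "0, 37".
avg_gpu_util() {
    awk -F', *' '{ sum[$1] += $2; n[$1]++ }
         END { for (g in sum) printf "GPU %s: %.1f%%\n", g, sum[g] / n[g] }' | sort
}
```

Running `sample_gpu_util 30 | avg_gpu_util` while a simulation is active would print one average per GPU. Note that `utilization.gpu` reports the share of time in which at least one kernel was running, so it is only a coarse indicator of how busy the GPU really is.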
As a consequence, I’m interested in running completely independent simulations of very similar size (i.e. computational cost) at the same time on the same GPU. This is officially supported by GROMACS (the program I’m using), but it requires a special build of GROMACS against an Open MPI that is itself compiled with UCX, which in turn needs to be compiled with gdrcopy.
Right now I’m stuck in the first stage, gdrcopy compilation. :(
I downloaded the latest source code from its Git repository (GitHub - NVIDIA/gdrcopy: A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology), then built and installed it with the commands:
sudo make prefix=/usr/local/chem/gdrcopy CUDA=/usr/local/cuda all install
sudo ./insmod.sh
The output of the make command was as follows:
~/Downloads/gdrcopy-2.4.4> sudo make prefix=/usr/local/chem/gdrcopy CUDA=/usr/local/cuda all install
GDRAPI_ARCH=X86
cd src/gdrdrv && \
make
make[1]: Entering directory '/home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv'
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/kernel-modules/nvidia-565.57.01-default/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
Setting NVIDIA_IS_OPENSOURCE=
Setting HAVE_VM_FLAGS_SET=y
make[2]: Entering directory '/usr/src/linux-6.4.0-150600.21-obj/x86_64/default'
CC [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/nv-p2p-dummy.o
CC [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/gdrdrv.o
MODPOST /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/Module.symvers
CC [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/nv-p2p-dummy.mod.o
LD [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/nv-p2p-dummy.ko
BTF [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/nv-p2p-dummy.ko
Skipping BTF generation for /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/nv-p2p-dummy.ko due to unavailability of vmlinux
CC [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/gdrdrv.mod.o
LD [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/gdrdrv.ko
BTF [M] /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/gdrdrv.ko
Skipping BTF generation for /home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv/gdrdrv.ko due to unavailability of vmlinux
make[2]: Leaving directory '/usr/src/linux-6.4.0-150600.21-obj/x86_64/default'
make[1]: Leaving directory '/home/johannes/Downloads/gdrcopy-2.4.4/src/gdrdrv'
cd src && \
make LIB_MAJOR_VER=2 LIB_MINOR_VER=4
make[1]: Entering directory '/home/johannes/Downloads/gdrcopy-2.4.4/src'
GDRAPI_ARCH=X86
cc -O2 -fPIC -I ../include -I gdrdrv/ -D GDRAPI_ARCH=X86 -c -o gdrapi.o gdrapi.c
cc -O2 -fPIC -I ../include -I gdrdrv/ -D GDRAPI_ARCH=X86 -c -mavx -o memcpy_avx.o memcpy_avx.c
cc -O2 -fPIC -I ../include -I gdrdrv/ -D GDRAPI_ARCH=X86 -c -msse -o memcpy_sse.o memcpy_sse.c
cc -O2 -fPIC -I ../include -I gdrdrv/ -D GDRAPI_ARCH=X86 -c -msse4.1 -o memcpy_sse41.o memcpy_sse41.c
cc -shared -Wl,-soname,libgdrapi.so.2 -o libgdrapi.so.2.4 gdrapi.o memcpy_avx.o memcpy_sse.o memcpy_sse41.o
PATH=/sbin:/usr/sbin:$PATH; ldconfig -n /home/johannes/Downloads/gdrcopy-2.4.4/src
ln -sf libgdrapi.so.2.4 libgdrapi.so.2
ln -sf libgdrapi.so.2 libgdrapi.so
make[1]: Leaving directory '/home/johannes/Downloads/gdrcopy-2.4.4/src'
cd tests && \
make CUDA=/usr/local/cuda
make[1]: Entering directory '/home/johannes/Downloads/gdrcopy-2.4.4/tests'
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o copybw.o copybw.cpp
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o common.o common.cpp
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64/stubs -L /usr/local/cuda/lib64 -L ../src -o gdrcopy_copybw copybw.o common.o -lcuda -lpthread -ldl -lgdrapi -lrt
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o sanity.o sanity.cpp
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o testsuites/testsuite.o testsuites/testsuite.cpp
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64/stubs -L /usr/local/cuda/lib64 -L ../src -o gdrcopy_sanity sanity.o common.o testsuites/testsuite.o -lcuda -lpthread -ldl -lgdrapi
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o copylat.o copylat.cpp
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64/stubs -L /usr/local/cuda/lib64 -L ../src -o gdrcopy_copylat copylat.o common.o -lcuda -lpthread -ldl -lgdrapi -lrt
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -c -o apiperf.o apiperf.cpp
g++ -O2 -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64/stubs -L /usr/local/cuda/lib64 -L ../src -o gdrcopy_apiperf apiperf.o common.o -lcuda -lpthread -ldl -lgdrapi -lrt
/usr/local/cuda/bin/nvcc -o pplat.o -c pplat.cu -lcuda -lpthread -ldl -lgdrapi -I /usr/local/cuda/include -I ../include -I ../src -I /usr/local/cuda/include
/usr/local/cuda/bin/nvcc -o gdrcopy_pplat pplat.o common.o -L /usr/local/cuda/lib64 -L /usr/local/cuda/lib -L /usr/lib64/nvidia -L /usr/lib/nvidia -L /usr/local/cuda/lib64/stubs -L /usr/local/cuda/lib64 -L ../src -lgdrapi -lcuda
make[1]: Leaving directory '/home/johannes/Downloads/gdrcopy-2.4.4/tests'
installing in /usr/local/chem/gdrcopy/lib /usr/local/chem/gdrcopy/include...
'src/libgdrapi.so.2.4' -> '/usr/local/chem/gdrcopy/lib/libgdrapi.so.2.4'
'include/gdrapi.h' -> '/usr/local/chem/gdrcopy/include/gdrapi.h'
'include/gdrconfig.h' -> '/usr/local/chem/gdrcopy/include/gdrconfig.h'
cd tests && make install DESTBIN=/usr/local/chem/gdrcopy/bin
make[1]: Entering directory '/home/johannes/Downloads/gdrcopy-2.4.4/tests'
installing exes in /usr/local/chem/gdrcopy/bin...
'gdrcopy_copybw' -> '/usr/local/chem/gdrcopy/bin/gdrcopy_copybw'
'gdrcopy_copylat' -> '/usr/local/chem/gdrcopy/bin/gdrcopy_copylat'
'gdrcopy_apiperf' -> '/usr/local/chem/gdrcopy/bin/gdrcopy_apiperf'
'gdrcopy_sanity' -> '/usr/local/chem/gdrcopy/bin/gdrcopy_sanity'
'gdrcopy_pplat' -> '/usr/local/chem/gdrcopy/bin/gdrcopy_pplat'
cd /usr/local/chem/gdrcopy/bin && \
ln -sf gdrcopy_copybw copybw && \
ln -sf gdrcopy_copylat copylat && \
ln -sf gdrcopy_apiperf apiperf && \
ln -sf gdrcopy_sanity sanity
make[1]: Leaving directory '/home/johannes/Downloads/gdrcopy-2.4.4/tests'
~/Downloads/gdrcopy-2.4.4> sudo ./insmod.sh
INFO: driver major is 236
INFO: creating /dev/gdrdrv inode
No errors so far. Note that at the very beginning it states:
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/kernel-modules/nvidia-565.57.01-default/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
This seems to be right, and the directory does look like the driver source:
~/Downloads/gdrcopy-2.4.4> ls /usr/src/kernel-modules/nvidia-565.57.01-default/nvidia
detect-self-hosted.h ioctl_nvswitch.h libspdm_hkdf_sha.c libspdm_shash.c nv-caps-imex.c nv-i2c.c nvlink_caps.c nvlink_os.h nv-nano-timer.c nv-pci-table.h nvspdm_cryptlib_extensions.h os-pci.c
export_nvswitch.h library libspdm_hmac_sha.c libspdm_x509.c nv-caps-imex.h nv-ibmnpu.c nvlink_caps.h nvlink_pci.h nv-p2p.c nv-procfs.c nv-usermap.c os-registry.c
hal libspdm_aead_aes_gcm.c libspdm_internal_crypt_lib.c linux_nvswitch.c nv-cray.c nv-ibmnpu.h nvlink_common.h nvlink_proto.h nv-p2p.h nv-reg.h nv_uvm_interface.c os-usermap.c
i2c_nvswitch.c libspdm_aead.c libspdm_rand.c linux_nvswitch.h nv-dmabuf.c nvidia.Kbuild nvlink_errors.h nv-memdbg.c nv-pat.c nv-report-err.c nv-vm.c procfs_nvswitch.c
internal libspdm_ec.c libspdm_rsa.c nv-acpi.c nv-dma.c nvidia-sources.Kbuild nvlink_export.h nv-mmap.c nv-pat.h nv-report-err.h nv-vtophys.c rmp2pdefines.h
internal_crypt_lib.h libspdm_ecc.c libspdm_rsa_ext.c nv.c nv_gpu_ops.h nv-kernel.o_binary nvlink_linux.c nv-modeset-interface.c nv-pci.c nv-rsync.c os-interface.c
ioctl_common_nvswitch.h libspdm_hkdf.c libspdm_sha.c nv-caps.c nv-host1x.c nv-kthread-q.c nvlink_linux.h nv-msi.c nv-pci-table.c nv-rsync.h os-mlock.c
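As an extra check before running the tests, the following sketch confirms that the module is loaded and that the device node exists. It assumes the module registers itself under the name `gdrdrv` in `/proc/modules` and `/proc/devices` (which matches the `/dev/gdrdrv` inode reported by `insmod.sh` above); the helper names are hypothetical.

```shell
# Succeed if module $1 appears in a /proc/modules-style file ($2, default /proc/modules).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Print the major number of character device $1 from a /proc/devices-style file.
device_major() {
    awk -v name="$1" '$2 == name { print $1 }' "${2:-/proc/devices}"
}

# Report the state of the gdrdrv module and show its device node.
check_gdrdrv() {
    if module_loaded gdrdrv; then
        echo "gdrdrv loaded, major $(device_major gdrdrv)"
    else
        echo "gdrdrv not loaded"
    fi
    ls -l /dev/gdrdrv
}
```

Calling `check_gdrdrv` after `insmod.sh` should report the same major number that the script printed, and `ls` shows whether the current user has permission to open the device.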
However, when I move on to the tests, starting with the sanity test, everything fails:
~/Downloads/gdrcopy-2.4.4> gdrcopy_sanity
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(read(read_fd, &cont, sizeof(int))) == (sizeof(int))" failed at sanity.cpp:997
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(read(read_fd, &cont, sizeof(int))) == (sizeof(int))" failed at sanity.cpp:997
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(read(read_fd, &d_A, sizeof(CUdeviceptr))) == (sizeof(CUdeviceptr))" failed at sanity.cpp:1829
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Assertion "(check_gdr_support(dev)) == (true)" failed at sanity.cpp:73
Total: 28, Passed: 0, Failed: 26, Waived: 2
List of failed tests:
basic_child_thread_pins_buffer_cumemalloc
basic_child_thread_pins_buffer_vmmalloc
basic_cumemalloc
basic_small_buffers_mapping
basic_unaligned_mapping
basic_vmmalloc
basic_with_tokens
data_validation_cumemalloc
data_validation_vmmalloc
invalidation_access_after_gdr_close_cumemalloc
invalidation_access_after_gdr_close_vmmalloc
invalidation_fork_access_after_free_cumemalloc
invalidation_fork_access_after_free_vmmalloc
invalidation_fork_after_gdr_map_cumemalloc
invalidation_fork_after_gdr_map_vmmalloc
invalidation_fork_child_gdr_map_parent_cumemalloc
invalidation_fork_child_gdr_map_parent_vmmalloc
invalidation_fork_child_gdr_pin_parent_with_tokens
invalidation_fork_map_and_free_cumemalloc
invalidation_fork_map_and_free_vmmalloc
invalidation_two_mappings_cumemalloc
invalidation_two_mappings_vmmalloc
invalidation_unix_sock_shared_fd_gdr_map_cumemalloc
invalidation_unix_sock_shared_fd_gdr_map_vmmalloc
invalidation_unix_sock_shared_fd_gdr_pin_buffer_cumemalloc
invalidation_unix_sock_shared_fd_gdr_pin_buffer_vmmalloc
List of waived tests:
invalidation_access_after_free_cumemalloc
invalidation_access_after_free_vmmalloc
Error: Encountered an error or a test failure with status=1
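Since the output above is dominated by a single assertion, a small helper like the following can condense it by source location. This is only a sketch: the function name `summarize_assertions` is made up, and it assumes each failure line ends with `file:line` as in the output above.

```shell
# Count assertion failures per source location (the last field, e.g. sanity.cpp:73),
# most frequent location first.
summarize_assertions() {
    awk '/^Assertion / { count[$NF]++ }
         END { for (loc in count) print count[loc], loc }' | sort -k1,1nr -k2,2
}
```

It could be used as `gdrcopy_sanity 2>&1 | summarize_assertions`, which makes it easier to see that nearly every test stops at the `check_gdr_support(dev)` check rather than at many unrelated places.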
Does anybody have any idea what could be going wrong here? Any clue or advice would be very welcome!
Thanks a lot in advance.
(and again, sorry for the long message)