[Solved] GPUDirect RDMA help

I am trying to use the functions declared in nv-p2p.h for GPUDirect RDMA (nvidia_p2p_get_pages(), etc.). Does anyone know which libraries I need to link against when compiling? Linking with -lcuda, -lcudart, and /kernel/nvidia.o is not working for me. Thanks.

This may be of interest:

http://stackoverflow.com/questions/21565751/undefined-symbols-with-linux-kernel-driver-build-nvidia

Using the info from the above link, I have built the kernel module in ~/NVIDIA-Linux-x86_64-340.32/kernel and have been trying to link to nvidia.o but have still been getting errors.

-sh-4.1$ nvcc gpuRDMA.cu -lcuda -L ~/NVIDIA-Linux-x86_64-340.32/kernel/nvidia.o -o gpuRDMA
gpuRDMA.cu(75): warning: variable "devptr" is used before its value is set

gpuRDMA.cu(75): warning: variable "devptr" is used before its value is set

/tmp/tmpxft_00006f41_00000000-16_gpuRDMA.o: In function `free_callback(void*)':
tmpxft_00006f41_00000000-3_gpuRDMA.cudafe1.cpp:(.text+0xa2): undefined reference to `nvidia_p2p_free_page_table(nvidia_p2p_page_table*)'
/tmp/tmpxft_00006f41_00000000-16_gpuRDMA.o: In function `kmd_pin_memory(kmd_state*, void*, unsigned long)':
tmpxft_00006f41_00000000-3_gpuRDMA.cudafe1.cpp:(.text+0x11b): undefined reference to `nvidia_p2p_get_pages(unsigned long, unsigned int, unsigned long, unsigned long, nvidia_p2p_page_table**, void (*)(void*), void*)'
collect2: ld returned 1 exit status

I also got this error when building the kernel module:
-sh-4.1$ make module
NVIDIA: calling KBUILD...
make[1]: Entering directory `/usr/src/kernels/2.6.32-431.el6.x86_64'
make -C /lib/modules/2.6.32-431.el6.x86_64/build
KBUILD_SRC=/usr/src/kernels/2.6.32-431.el6.x86_64
KBUILD_EXTMOD="/pcshome/clementm/NVIDIA-Linux-x86_64-340.32/kernel" -f /usr/src/kernels/2.6.32-431.el6.x86_64/Makefile
modules
test -e include/linux/autoconf.h -a -e include/config/auto.conf || (
echo;
echo " ERROR: Kernel configuration is invalid.";
echo " include/linux/autoconf.h or include/config/auto.conf are missing.";
echo " Run 'make oldconfig && make prepare' on kernel src to fix it.";
echo;
/bin/false)
The kernel module was still compiled, though. Any idea where things are going awry?

I have written a simple program that uses some of the GPUDirect RDMA functions declared in nv-p2p.h (I attach the code and Makefile). I have built an nvidia kernel module in a directory and am trying to link against the kernel object files.

nvidia.o should contain the nvidia_p2p functions:
-sh-4.1$ nm ~/NVIDIA-Linux-x86_64-340.32/kernel/nvidia.o | grep nvidia_p2p
000000004c9ba34e A __crc_nvidia_p2p_destroy_mapping
0000000088765bb5 A __crc_nvidia_p2p_free_page_table
00000000f487b36a A __crc_nvidia_p2p_get_pages
00000000c28548aa A __crc_nvidia_p2p_init_mapping
00000000eacba72c A __crc_nvidia_p2p_put_pages
0000000000000018 r __kcrctab_nvidia_p2p_destroy_mapping
0000000000000008 r __kcrctab_nvidia_p2p_free_page_table
0000000000000010 r __kcrctab_nvidia_p2p_get_pages
0000000000000020 r __kcrctab_nvidia_p2p_init_mapping
0000000000000000 r __kcrctab_nvidia_p2p_put_pages
0000000000000045 r __kstrtab_nvidia_p2p_destroy_mapping
0000000000000015 r __kstrtab_nvidia_p2p_free_page_table
0000000000000030 r __kstrtab_nvidia_p2p_get_pages
0000000000000060 r __kstrtab_nvidia_p2p_init_mapping
0000000000000000 r __kstrtab_nvidia_p2p_put_pages
0000000000000030 r __ksymtab_nvidia_p2p_destroy_mapping
0000000000000010 r __ksymtab_nvidia_p2p_free_page_table
0000000000000020 r __ksymtab_nvidia_p2p_get_pages
0000000000000040 r __ksymtab_nvidia_p2p_init_mapping
0000000000000000 r __ksymtab_nvidia_p2p_put_pages
0000000000572740 T nvidia_p2p_destroy_mapping
00000000005722f0 T nvidia_p2p_free_page_table
0000000000572440 T nvidia_p2p_get_pages
00000000005727e0 T nvidia_p2p_init_mapping
0000000000003a70 B nvidia_p2p_page_t_cache
0000000000572360 T nvidia_p2p_put_pages

But when I try to link against the directory ~/NVIDIA-Linux-x86_64-340.32/kernel/ in my make command, I get:
nvcc -c gpuRDMA.cu
gpuRDMA.cu(75): warning: variable "devptr" is used before its value is set

gpuRDMA.cu(75): warning: variable "devptr" is used before its value is set

nvcc gpuRDMA.o -L/pcshome/clementm/NVIDIA-Linux-x86_64-340.32/kernel/ -lcuda -lcudart -o gpuRDMA
gpuRDMA.o: In function `free_callback(void*)':
tmpxft_00006fb8_00000000-3_gpuRDMA.cudafe1.cpp:(.text+0xa2): undefined reference to `nvidia_p2p_free_page_table(nvidia_p2p_page_table*)'
gpuRDMA.o: In function `kmd_pin_memory(kmd_state*, void*, unsigned long)':
tmpxft_00006fb8_00000000-3_gpuRDMA.cudafe1.cpp:(.text+0x11b): undefined reference to `nvidia_p2p_get_pages(unsigned long, unsigned int, unsigned long, unsigned long, nvidia_p2p_page_table**, void (*)(void*), void*)'
collect2: ld returned 1 exit status

Does anyone have any bright ideas? Thanks.
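For reading the nm listing above, a small filter can help separate the actual function definitions (type T, in the text section) from the bookkeeping entries. The sketch below is only an illustration: it runs on a few sample lines copied from the listing, and the helper names defined_functions and exported_to_modules are made up for this example. The __ksymtab_ prefix is what the kernel's EXPORT_SYMBOL macro generates, so those entries mark symbols that are exported for use by other kernel modules.

```shell
#!/bin/sh
# Sample lines copied from the nm listing above (abridged).
nm_sample='00000000f487b36a A __crc_nvidia_p2p_get_pages
0000000000000020 r __ksymtab_nvidia_p2p_get_pages
0000000000572440 T nvidia_p2p_get_pages
0000000000003a70 B nvidia_p2p_page_t_cache
0000000000572360 T nvidia_p2p_put_pages'

# Hypothetical helper: print names of symbols defined in the text (code)
# section, which nm marks with type "T".
defined_functions() {
    awk '$2 == "T" { print $3 }'
}

# Hypothetical helper: print names that have a __ksymtab_ entry, i.e. symbols
# exported to other kernel modules via EXPORT_SYMBOL.
exported_to_modules() {
    awk '$3 ~ /^__ksymtab_/ { sub(/^__ksymtab_/, "", $3); print $3 }'
}

printf '%s\n' "$nm_sample" | defined_functions
printf '%s\n' "$nm_sample" | exported_to_modules
```

Against the real object file, the same filters could be fed from nm directly, for example: nm ~/NVIDIA-Linux-x86_64-340.32/kernel/nvidia.o | defined_functions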

If you want to link in another object (to satisfy those undefined references), it’s not enough to just “link against a directory” (to use your words).

You have to specify all the objects you want to link together. You specify directories using the -L switch when you want to direct the linker where to look for libraries (specified using the -l switch). An object like nvidia.o is not a library. Even if it were, you haven’t told the linker to include it.

(This has nothing to do with CUDA, by the way.)

Try something like this:

nvcc gpuRDMA.o /pcshome/clementm/NVIDIA-Linux-x86_64-340.32/kernel/nvidia.o -lcuda -o gpuRDMA

(When compiling with nvcc, it’s not necessary to specify -lcudart.)

I haven’t actually tested this. Bringing in the nvidia.o object may then bring in a whole bunch of new dependencies that may have to be resolved by linking in other objects or libraries. But if you intend to satisfy an undefined reference using an entry point in nvidia.o, it’s necessary to explicitly include nvidia.o in your link command. The directory alone won’t get you there.
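The difference between pointing -L at a directory and naming an object file can be reproduced with two throwaway C files. This is a minimal sketch, assuming a C compiler is available as cc; helper.o and main.o are hypothetical stand-ins for nvidia.o and gpuRDMA.o, and link_demo is a name invented for the example.

```shell
#!/bin/sh
# Hypothetical stand-ins: helper.o plays the role of nvidia.o,
# main.o plays the role of gpuRDMA.o.
link_demo() {
    dir=$(mktemp -d) || return 1
    printf 'int helper(void) { return 42; }\n' > "$dir/helper.c"
    printf 'int helper(void);\nint main(void) { return helper() == 42 ? 0 : 1; }\n' > "$dir/main.c"
    cc -c "$dir/helper.c" -o "$dir/helper.o" || return 1
    cc -c "$dir/main.c" -o "$dir/main.o" || return 1

    # -L only adds a library search directory; helper.o is never pulled in,
    # so the reference to helper() stays undefined.
    if cc "$dir/main.o" -L"$dir" -o "$dir/app_dir" 2>/dev/null; then
        dir_only=linked
    else
        dir_only=failed
    fi

    # Naming the object file on the command line makes it part of the link.
    if cc "$dir/main.o" "$dir/helper.o" -o "$dir/app_obj" 2>/dev/null; then
        explicit=linked
    else
        explicit=failed
    fi

    rm -rf "$dir"
    echo "directory only: $dir_only; explicit object: $explicit"
}

if command -v cc >/dev/null 2>&1; then
    link_demo
else
    echo "no C compiler found; skipping demo"
fi
```

The second cc invocation has the same shape as the nvcc command suggested above: every object that should contribute symbols is listed by path. As already noted, whether nvidia.o itself then links cleanly is a separate question.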

As an aside, it’s unlikely that you would want to ignore warnings like these:

gpuRDMA.cu(75): warning: variable "devptr" is used before its value is set

They almost always indicate a significant problem.

I’ve tried making a simple character device driver that can use the function nvidia_p2p_get_pages(). Every time I run my user-space program (ioctl_user), I get a kernel panic and my computer crashes. I’ve attached my source code and a README file that explains what I am trying to do. If anyone has any ideas about why my kernel module is crashing my computer, they would be greatly appreciated.

I have also attached the vmcore-dmesg.txt output from one of the crashes.