cuFFT Callbacks With Host Compiler

Hi, I’m trying to get cuFFT callbacks working in an existing application that is built with both host and device compilers and cross-linking. I’ve looked at the simpleCUFFT_callback sample, and that compiles/links fine, but it only uses the nvcc compiler. When I cross-link, I see a number of errors like:

set_callback.compute_75.cudafe1.cpp:(.text+0x504): undefined reference to `__cudaRegisterLinkedBinary_31_set_callback_compute_75_cpp1_ii_d966aea7'

This is despite linking with -lcufft_static -lculibos, and I can see with "strings" that the symbol is present in libcufft_static.a, so I don’t understand what’s going on. I’m following the examples here exactly:

https://stackoverflow.com/questions/22076052/cuda-dynamic-parallelism-makefile

I’m compiling all host files with g++, all device files with nvcc and -dc, combining all the device objects into a single object with a -dlink step, and finally linking with g++ on that object. I’ve also tried nvcc for the final link step, and that appears to hit an issue where the linker reports:

nvlink fatal : Memory allocation failure

Anyone have an idea?

There would appear to be at least two issues:

  1. A problem with your device link step. Without seeing a full example, I wouldn’t be able to say more.
  2. A problem with nvcc (the nvlink fatal message). That should probably be filed as a bug. For a bug report to be useful, you would also need to provide a short but complete example that demonstrates the failure.

For item 2 above, I also recommend checking against the latest CUDA version, if you have not already done so.

Thanks Robert. I noticed the out-of-memory error only occurs when I link in the dpdk library, which is quite large. If I leave that out, I don’t appear to hit that bug. The nvlink message seems very odd, though. I’m guessing nvlink refers to the device linker rather than the NVLink interconnect, and was presumably named first.

It’s going to be fairly hard to post a full example; there are hundreds of files on both the CPU and GPU sides. I reworked the makefile to follow your instructions from Stack Overflow, so it performs those three distinct compilation steps. I don’t know if some flags are wrong, but the CUDA sample seems extremely simple by comparison. The biggest difference I can see is that I’m only targeting compute architectures 6.x and above, where the sample does 3.x and above.

I’m using CUDA 10.1.

It often does require substantial effort to create a good-quality MWE/MCVE (a minimal, complete, verifiable example). Many folks don’t want to put in that effort, but it’s not a ridiculous request or expectation, IMO. Some sites like Stack Overflow make it a requirement (at least according to their published rules).

I suggest it merely because, in my experience, if you can do so it is usually the best and quickest way to resolve the issue from that point forward. Of course, it may not be “quick” to create it.

You’re welcome to do what you wish, of course. These are just suggestions for the best chance of forward progress, based on my experience. I’m reasonably certain that if you filed a bug without a way to reproduce the error, it would be less likely to lead to a resolution.

Robert, understood. In the interest of time, I’ll post here what I think is relevant, since the exact same flags are repeated for all compilation steps. Here is one of the CPU steps (I removed all include paths since they’re not pertinent):

/usr/bin/g++ -std=c++17 -g -O2 -march=haswell -c -o obj/DpdkMgr.o ./src/DpdkMgr.cpp

And here is the GPU compile step:

/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC -dc -o obj/GPU/GpuUtils.o ./src/GPU/GpuUtils.cu

-dlink step:

/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC -dlink obj/GPU/GpuUtils.o -o ./obj/GPU//device_objects.o

Final Link:

/usr/bin/g++ -std=c++14 -g -O2 -pthread -Wl,--whole-archive -Wl,-ldpdk -Wl,--start-group -Wl,-lrt -Wl,-lm -Wl,-ldl -Wl,--end-group -Wl,--no-whole-archive -libverbs -lnuma -lmnl -o src/testbed ./obj/all_cpu.o
./obj/GPU//device_objects.o -L/usr/local/cuda-10.1/targets/x86_64-linux/lib -L./obj/ -L./obj/GPU/ -lcudart -lcublas -lcurand -l:libcufft_static.a -l:libculibos.a -lnvToolsExt -l:libz.a -l:libssl.a -l:libcrypto.a

I will keep playing around and try to create an example that reproduces it in the meantime. Thanks.

The cufft callbacks are not being device-linked.

That won’t work.

Robert, the -dc flag is on the second command I pasted, which compiles GpuUtils.cu. That’s where the FFT callbacks are compiled. Is that what you mean, or did I miss something?

That is your callback code. It needs to be connected to the cufft library itself. That is not happening in your device link step.

When you have cufft callbacks, your main code is calling into the cufft library.

The cufft library routine will eventually launch one or more kernels that need to be connected to your provided callback routines. That connection of device code, from a global kernel (in the cufft library) to your device routines in a separate compilation unit, requires device linking. That device-link connection could not possibly be happening, because you are giving no indication of cufft to your device-link step. It’s not automatic; it would be equivalent to omitting a library from your final link step. And to be clear, this connection of a cufft library global routine to your provided device routine cannot be made during the final link by g++, because g++ knows nothing about CUDA.
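To make concrete what needs to be device-linked: a cufft callback is just a __device__ function in your code whose address you register with the library. A minimal sketch of a load callback, with hypothetical names and a made-up scale factor (plan creation and error checking omitted):

```cuda
#include <cufft.h>
#include <cufftXt.h>

// Device callback: cufft calls this for every input element it loads.
__device__ cufftComplex scale_load(void *dataIn, size_t offset,
                                   void *callerInfo, void *sharedPtr)
{
    cufftComplex v = ((cufftComplex *)dataIn)[offset];
    v.x *= 0.5f;  // hypothetical scaling
    v.y *= 0.5f;
    return v;
}

// Device-side copy of the function pointer; retrieved on the host below.
__device__ cufftCallbackLoadC d_loadPtr = scale_load;

void attach_callback(cufftHandle plan)  // hypothetical helper
{
    cufftCallbackLoadC h_loadPtr;
    cudaMemcpyFromSymbol(&h_loadPtr, d_loadPtr, sizeof(h_loadPtr));
    // This registration is why device linking against libcufft_static is
    // required: cufft's own kernels must be able to call scale_load.
    cufftXtSetCallback(plan, (void **)&h_loadPtr, CUFFT_CB_LD_COMPLEX, 0);
}
```

With something like this in a .cu file compiled with -dc, the device link step (with -lcufft_static) is what allows cufft’s internal kernels to reach scale_load.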

Hopefully that gives you enough info to make changes. There may already be answered questions like this on the web, or you may want to look at the makefiles for CUDA/cufft sample projects that use callbacks. If they do the compile/link all in one step (I don’t remember if they do), that won’t be very instructive, but some of them may have the device link phase split out separately.

Otherwise when I have time later I’ll put together a sample project that demonstrates the compile/link sequence.

Robert, thanks! I’ll spend some time working on it using what you said, and hopefully find a fix.

Starting with the cuda sample code simpleCUFFT_callback, I was able to perform the following sequence:

nvcc -I/usr/local/cuda/samples/common/inc  -dc simpleCUFFT_callback.cu
nvcc -dlink simpleCUFFT_callback.o -o dlink.o -lculibos -lcufft_static
g++ -o test1 simpleCUFFT_callback.o dlink.o -L/usr/local/cuda/lib64 -lcufft_static -lculibos -ldl -lpthread  -lcudart

to build the code successfully (and the test1 executable runs correctly) using CUDA 10.0.

Hi Robert, sorry for the delay. I’m trying to follow exactly as you have it, but it still seems to be having a problem:

[obj/GPU/GpuTests.o] from [./unit_tests/GpuTests.cu]
/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC, -isystem /usr/local/cuda-10.1/targets/x86_64-linux/include/  -isystem ../cots/cub/ -dc -o obj/GPU/GpuTests.o ./unit_tests/GpuTests.cu
[src/app]
/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC,     -include ../cots/dpdk/x86_64-native-linuxapp-gcc/include/rte_config.h  -dlink  ./obj/GPU/GpuTests.o  -o ./obj/GPU//device_objects.o -lcudart -lcublas -lcurand -l:libcufft_static.a -l:libculibos.a -lnvToolsExt
/usr/bin/g++   -std=c++14 -g -O2 -pthread -Wl,--whole-archive -Wl,-ldpdk -Wl,--start-group -Wl,-lrt -Wl,-lm -Wl,-ldl -Wl,--end-group -Wl,--no-whole-archive -libverbs -lnuma -lmnl    -o src/app 
   ./obj/GPU//device_objects.o  ./obj/CPU_objs.o  -L/usr/local/cuda-10.1/targets/x86_64-linux/lib -L./obj/ -L./obj/GPU/  -lcudart -lcublas -lcurand -l:libcufft_static.a -l:libculibos.a -lnvToolsExt -l:libgtest.a -l:libconfig++.a -l:libprotobuf.a -l:libgrpc++_reflection.a -l:libgrpc++.a -l:libgrpc.a -lbenchmark -l:libz.a -l:libssl.a -l:libcrypto.a -lmlx5

The errors start with:

/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft_static.a(set_callback.o): In function `__sti____cudaRegisterAll()':
set_callback.compute_75.cudafe1.cpp:(.text+0x10d): undefined reference to `__cudaRegisterLinkedBinary_31_set_callback_compute_75_cpp1_ii_d966aea7'
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft_static.a(set_callback.o): In function `global constructors keyed to Visitors::Callback::replace(fftDevice const&, cufftXtCallbackType_t, callback_t, KernelLaunch*, cufftXtCallbackType_t*, KernelLaunch* (Queue::*)(KernelLaunch*), Queue*, KernelLaunch**)':

Why does:

obj/GPU/GpuTests.o

not appear anywhere in your final link line?

The final link must include the original object created by the -dc compilation step as well as the device-linked object created by the -dlink step.
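Using the object names from your commands above, the final link should look something like this sketch (most of your other flags and libraries omitted for brevity):

```shell
# Sketch only: include BOTH the -dc output and the -dlink output,
# plus the cufft static libraries, in the host link.
/usr/bin/g++ -std=c++14 -o src/app \
    ./obj/GPU/GpuTests.o \
    ./obj/GPU/device_objects.o \
    ./obj/CPU_objs.o \
    -L/usr/local/cuda-10.1/targets/x86_64-linux/lib \
    -l:libcufft_static.a -l:libculibos.a -lcudart -ldl -lpthread
```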

Hi Robert, I just included that and got the same results. I will try to make an MWE as you suggested, since I can’t seem to get it working by following your example.

Different versions of the gnu toolchain are more or less picky about library link order. Your link order does not match mine. If a subsequent library (e.g. cufft static) depends on a previous library (e.g. cudart) in the link order that can be a problem depending on your specific gnu version. In my experience newer gnu versions are more picky than older gnu versions.

You might want to fiddle with the link order to match mine exactly. Linking twice against e.g. cudart may also not be a good idea.

If you post a MWE, be sure to indicate exact versions of the CPU and GPU compiler toolchains.

For compilers that are picky, I believe a basic rule of thumb is that dependencies should be satisfied from left to right in the supplied library order:

https://eli.thegreenplace.net/2013/07/09/library-order-in-static-linking

Hi Robert, someone else here took a stab at reordering the linking lines, and it worked. Thanks for all your help.

Hi Robert, while the callbacks are indeed working, we’re now seeing a fairly noticeable performance hit in other kernels that appears to be caused by using -rdc. This seems to have been hit before:

https://devtalk.nvidia.com/default/topic/991617/the-cost-of-relocatable-device-code-rdc-true-/

The performance hit on some kernels is about 20%, which is higher than what that other thread reported. The callbacks aren’t really worth using with that large an impact on the other kernels. Is there a way around this, or do the callbacks need relocatable device code?

Yes, -rdc can result in slower code.
Yes, callbacks require -rdc.

I indicated the reason why (callbacks require -rdc) earlier in this thread. There is device-code linking required between code you supply and code in the CUFFT library.

My suggestion:
If you want to partition your code, it’s possible to compile some compilation units/modules/files with -dc and some without, and link them together. You would perform a device link step collectively on the modules that were compiled with -dc, and feed the other CUDA-compiled objects as-is to the final link step. This should prevent the -rdc slowdown effect for any kernels in the modules that were compiled without -dc.

Since the callback functionality is contained entirely in device routines, it’s not obvious to me that you couldn’t put these in their own module. However I’m not familiar with your code organization or what obstacles there may be.

Here’s an example demonstrating the general idea, albeit not with cufft callbacks:

$ cat t1.cu
#include <stdio.h>
void k2();
__global__ void k1(){
  printf(" hello from k1\n");
}

int main(){

  k1<<<1,1>>>();
  k2();
  cudaDeviceSynchronize();
}
$ cat t2.cu
#include <stdio.h>
__device__ void my3();
__global__ void myk2(){
  my3();
}

void k2(){

  myk2<<<1,1>>>();
}

$ cat t3.cu
#include <stdio.h>

__device__ void my3(){

  printf("hello\n");
}
$ nvcc -c t1.cu
$ nvcc -dc t2.cu
$ nvcc -dc t3.cu
$ nvcc -dlink t2.o t3.o -o dlink.o
$ g++ t1.o t2.o t3.o dlink.o -o test -L/usr/local/cuda/lib64 -lcudart
$ cuda-memcheck ./test
========= CUDA-MEMCHECK
 hello from k1
hello
========= ERROR SUMMARY: 0 errors
$

Thanks Robert! With some makefile reorganization that worked perfectly. The other kernels are back to their normal runtime.