cuFFT Callbacks With Host Compiler

Hi, I’m trying to get cuFFT callbacks working in an existing application that is built with both host and device compilers and cross-linking. I’ve looked at the simpleCUFFT_callback sample, and that compiles/links fine, but it only uses the nvcc compiler. When I cross-link, I see a number of errors like:

set_callback.compute_75.cudafe1.cpp:(.text+0x504): undefined reference to `__cudaRegisterLinkedBinary_31_set_callback_compute_75_cpp1_ii_d966aea7'

This is despite linking with -lcufft_static -lculibos, and I can see with "strings" that the symbol is present in libcufft_static.a, so I don’t understand what’s going on. I’m following the examples here exactly:

https://stackoverflow.com/questions/22076052/cuda-dynamic-parallelism-makefile

I’m compiling all host files with g++, all device files with nvcc and -dc, combining all the device objects into a single object with a -dlink step, and finally linking with g++ on that object. I’ve also tried nvcc for the final link step, and that appears to hit an issue where the linker reports:

nvlink fatal : Memory allocation failure

Anyone have an idea?

There would appear to be at least two issues:

  1. A problem with your device link step. Without seeing a full example, I wouldn’t be able to say more.
  2. A problem with nvcc (the nvlink fatal message). That should probably be filed as a bug. For a bug report to be useful, you would also need to provide a short but complete example that demonstrates the failure.

For item 2 above, I also recommend checking against the latest CUDA version, if you have not already done so.

Thanks Robert. I noticed the out-of-memory error only occurs when I link in the dpdk library, which is quite large. If I leave that out, I don’t appear to hit that bug. The nvlink message seems very odd, though. I’m guessing nvlink refers to the device linker rather than the NVLink interconnect, and was presumably named first.

It’s going to be fairly hard to post a full example; there are hundreds of files on both the CPU and GPU sides. I reworked the makefile to follow your instructions from Stack Overflow, so it performs those three distinct compilation steps. I don’t know if some flags are wrong, but the CUDA sample seems extremely simple by comparison. The biggest difference I can see is that I’m only targeting compute architectures 6.x and above, where the sample does 3.x and above.

I’m using CUDA 10.1.

It often does require substantial effort to create a good-quality MWE/MCVE (a minimal, complete, verifiable example). Many folks don’t want to put in that effort, but it’s not a ridiculous request or expectation, IMO. Some sites like Stack Overflow make it a requirement (at least according to their published rules).

I suggest it merely because, in my experience, if you can do so it is usually the best and quickest way to resolve the issue from that point forward. Of course, it may not be “quick” to create it.

You’re welcome to do what you wish, of course. These are just suggestions for the best chance of forward progress, based on my experience. I’m reasonably certain that if you filed a bug without a way to reproduce the error, it would be less likely to lead to a resolution.

Robert, understood. In the interest of time, I’ll post here what I think is relevant, since the exact same flags are repeated for all compilation steps. Here is one of the CPU steps (I removed all include paths since they’re not pertinent):

/usr/bin/g++ -std=c++17 -g -O2 -march=haswell -c -o obj/DpdkMgr.o ./src/DpdkMgr.cpp

And here is the GPU compile step:

/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC -dc -o obj/GPU/GpuUtils.o ./src/GPU/GpuUtils.cu

-dlink step:

/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC -dlink obj/GPU/GpuUtils.o -o ./obj/GPU//device_objects.o

Final Link:

/usr/bin/g++ -std=c++14 -g -O2 -pthread -Wl,--whole-archive -Wl,-ldpdk -Wl,--start-group -Wl,-lrt -Wl,-lm -Wl,-ldl -Wl,--end-group -Wl,--no-whole-archive -libverbs -lnuma -lmnl -o src/testbed ./obj/all_cpu.o
./obj/GPU//device_objects.o -L/usr/local/cuda-10.1/targets/x86_64-linux/lib -L./obj/ -L./obj/GPU/ -lcudart -lcublas -lcurand -l:libcufft_static.a -l:libculibos.a -lnvToolsExt -l:libz.a -l:libssl.a -l:libcrypto.a

I will keep playing around and try to create an example that reproduces it in the meantime. Thanks.

The cufft callbacks are not being device-linked.

That won’t work.

Robert, the -dc flag is on the second command I pasted, which compiles GpuUtils.cu. That’s where the FFT callbacks are compiled. Is that what you mean, or did I miss something?

That is your callback code. It needs to be connected to the cufft library itself. That is not happening in your device link step.

When you have cufft callbacks, your main code is calling into the cufft library.

The cufft library routine will eventually launch one or more kernels that need to be connected to your provided callback routines. That connection of device code, from a global kernel (in the cufft library) to your device routines in a separate compilation unit, requires device linking. That device-link connection could not possibly be happening, because you are giving no indication of cufft to your device-link step. It’s not automatic; it would be equivalent to omitting a library from your final link step. And to be clear, this connection of a cufft library global routine to your provided device routine cannot be made during the final link by g++, because g++ knows nothing about CUDA.
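To make concrete what needs to be device-linked: a cufft callback is just a __device__ function in your code whose address you register with the library. A minimal sketch of a load callback, with hypothetical names and a made-up scale factor (plan creation and error checking omitted):

```cuda
#include <cufft.h>
#include <cufftXt.h>

// Device callback: cufft calls this for every input element it loads.
__device__ cufftComplex scale_load(void *dataIn, size_t offset,
                                   void *callerInfo, void *sharedPtr)
{
    cufftComplex v = ((cufftComplex *)dataIn)[offset];
    v.x *= 0.5f;  // hypothetical scaling
    v.y *= 0.5f;
    return v;
}

// Device-side copy of the function pointer; retrieved on the host below.
__device__ cufftCallbackLoadC d_loadPtr = scale_load;

void attach_callback(cufftHandle plan)  // hypothetical helper
{
    cufftCallbackLoadC h_loadPtr;
    cudaMemcpyFromSymbol(&h_loadPtr, d_loadPtr, sizeof(h_loadPtr));
    // This registration is why device linking against libcufft_static is
    // required: cufft's own kernels must be able to call scale_load.
    cufftXtSetCallback(plan, (void **)&h_loadPtr, CUFFT_CB_LD_COMPLEX, 0);
}
```

With something like this in a .cu file compiled with -dc, the device link step (with -lcufft_static) is what allows cufft’s internal kernels to reach scale_load.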

Hopefully that gives you enough info to make changes. There may already be answered questions like this on the web, or you may want to look at the makefiles for CUDA/cufft sample projects that use callbacks. If they do the compile/link all in one step (I don’t remember if they do), that won’t be very instructive, but some of them may have the device link phase split out separately.

Otherwise when I have time later I’ll put together a sample project that demonstrates the compile/link sequence.

Robert, thanks! I’ll spend some time working on it using what you said, and hopefully find a fix.

Starting with the cuda sample code simpleCUFFT_callback, I was able to perform the following sequence:

nvcc -I/usr/local/cuda/samples/common/inc  -dc simpleCUFFT_callback.cu
nvcc -dlink simpleCUFFT_callback.o -o dlink.o -lculibos -lcufft_static
g++ -o test1 simpleCUFFT_callback.o dlink.o -L/usr/local/cuda/lib64 -lcufft_static -lculibos -ldl -lpthread  -lcudart

to build the code successfully (and the test1 executable runs correctly) using CUDA 10.0.

Hi Robert, sorry for the delay. I’m trying to follow exactly as you have it, but it still seems to be having a problem:

[obj/GPU/GpuTests.o] from [./unit_tests/GpuTests.cu]
/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC, -isystem /usr/local/cuda-10.1/targets/x86_64-linux/include/  -isystem ../cots/cub/ -dc -o obj/GPU/GpuTests.o ./unit_tests/GpuTests.cu
[src/app]
/usr/local/cuda-10.1/bin/nvcc -ccbin /usr/bin/g++ -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -std=c++14 --compiler-options -march=haswell,-Wall,-Wno-unused-function,-fPIC,     -include ../cots/dpdk/x86_64-native-linuxapp-gcc/include/rte_config.h  -dlink  ./obj/GPU/GpuTests.o  -o ./obj/GPU//device_objects.o -lcudart -lcublas -lcurand -l:libcufft_static.a -l:libculibos.a -lnvToolsExt
/usr/bin/g++   -std=c++14 -g -O2 -pthread -Wl,--whole-archive -Wl,-ldpdk -Wl,--start-group -Wl,-lrt -Wl,-lm -Wl,-ldl -Wl,--end-group -Wl,--no-whole-archive -libverbs -lnuma -lmnl    -o src/app 
   ./obj/GPU//device_objects.o  ./obj/CPU_objs.o  -L/usr/local/cuda-10.1/targets/x86_64-linux/lib -L./obj/ -L./obj/GPU/  -lcudart -lcublas -lcurand -l:libcufft_static.a -l:libculibos.a -lnvToolsExt -l:libgtest.a -l:libconfig++.a -l:libprotobuf.a -l:libgrpc++_reflection.a -l:libgrpc++.a -l:libgrpc.a -lbenchmark -l:libz.a -l:libssl.a -l:libcrypto.a -lmlx5

The errors start with:

/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft_static.a(set_callback.o): In function `__sti____cudaRegisterAll()':
set_callback.compute_75.cudafe1.cpp:(.text+0x10d): undefined reference to `__cudaRegisterLinkedBinary_31_set_callback_compute_75_cpp1_ii_d966aea7'
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcufft_static.a(set_callback.o): In function `global constructors keyed to Visitors::Callback::replace(fftDevice const&, cufftXtCallbackType_t, callback_t, KernelLaunch*, cufftXtCallbackType_t*, KernelLaunch* (Queue::*)(KernelLaunch*), Queue*, KernelLaunch**)':

Why does:

obj/GPU/GpuTests.o

not appear anywhere in your final link line?

The final link must include the original object created by the -dc compilation step as well as the device-linked object created by the -dlink step.
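Using the object names from your commands above, the final link should look something like this sketch (most of your other flags and libraries omitted for brevity):

```shell
# Sketch only: include BOTH the -dc output and the -dlink output,
# plus the cufft static libraries, in the host link.
/usr/bin/g++ -std=c++14 -o src/app \
    ./obj/GPU/GpuTests.o \
    ./obj/GPU/device_objects.o \
    ./obj/CPU_objs.o \
    -L/usr/local/cuda-10.1/targets/x86_64-linux/lib \
    -l:libcufft_static.a -l:libculibos.a -lcudart -ldl -lpthread
```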

Hi Robert, I just included that and got the same results. I will try to make an MWE as you suggested, since I can’t seem to get it working by following your example.

Different versions of the gnu toolchain are more or less picky about library link order. Your link order does not match mine. If a subsequent library (e.g. cufft static) depends on a previous library (e.g. cudart) in the link order that can be a problem depending on your specific gnu version. In my experience newer gnu versions are more picky than older gnu versions.

You might want to fiddle with the link order to match mine exactly. Linking twice against e.g. cudart may also not be a good idea.

If you post a MWE, be sure to indicate exact versions of the CPU and GPU compiler toolchains.

For compilers that are picky, I believe a basic rule of thumb is that dependencies should be satisfied from left to right in the supplied library order:

https://eli.thegreenplace.net/2013/07/09/library-order-in-static-linking

Hi Robert, someone else here took a stab at reordering the linking lines, and it worked. Thanks for all your help.

Hi Robert, while the callbacks are indeed working, we’re now seeing a fairly noticeable performance hit in other kernels that appears to be caused by using -rdc. This seems to have been hit before:

https://devtalk.nvidia.com/default/topic/991617/the-cost-of-relocatable-device-code-rdc-true-/

The performance hit on some kernels is about 20%, which is higher than what that other thread reported. The callbacks aren’t really worth using with that large an impact on the other kernels. Is there a way around this, or do the callbacks need relocatable device code?

Yes, -rdc can result in slower code.
Yes, callbacks require -rdc.

I indicated the reason why (callbacks require -rdc) earlier in this thread. There is device-code linking required between code you supply and code in the CUFFT library.

My suggestion:
If you want to partition your code, it’s possible to compile some compilation units/modules/files with -dc and some without, and link them together. You would perform a device link step collectively on the modules that were compiled with -dc, and feed the other CUDA-compiled objects as-is to the final link step. This should prevent the -rdc slowdown effect for any kernels in the modules that were compiled without -dc.

Since the callback functionality is contained entirely in device routines, it’s not obvious to me that you couldn’t put these in their own module. However I’m not familiar with your code organization or what obstacles there may be.

Here’s an example demonstrating the general idea, albeit not with cufft callbacks:

$ cat t1.cu
#include <stdio.h>
void k2();
__global__ void k1(){
  printf(" hello from k1\n");
}

int main(){

  k1<<<1,1>>>();
  k2();
  cudaDeviceSynchronize();
}
$ cat t2.cu
#include <stdio.h>
__device__ void my3();
__global__ void myk2(){
  my3();
}

void k2(){

  myk2<<<1,1>>>();
}

$ cat t3.cu
#include <stdio.h>

__device__ void my3(){

  printf("hello\n");
}
$ nvcc -c t1.cu
$ nvcc -dc t2.cu
$ nvcc -dc t3.cu
$ nvcc -dlink t2.o t3.o -o dlink.o
$ g++ t1.o t2.o t3.o dlink.o -o test -L/usr/local/cuda/lib64 -lcudart
$ cuda-memcheck ./test
========= CUDA-MEMCHECK
 hello from k1
hello
========= ERROR SUMMARY: 0 errors
$

Thanks Robert! With some makefile reorganization that worked perfectly. The other kernels are back to their normal runtime.