Unavoidable register spilling with cuFFT callbacks

I just sent this bug report through the registered developer program. I copy it here in order to discuss workarounds, if any.

The current method to attach cuFFT callback causes unavoidable spilling even with tiny functions. This is because the minimum number of usable registers per kernel by ABI is small, and the function has to stick to that in order to be compatible with every possible calling cuFFT kernel. If one compiles the very same function but called from an own kernel, spilling does not happen because the kernel is free to allocate as many register it needs. See the self-contained test case code (tested CUDA 7.0 and 7.5).

#include <cufft.h>

extern "C"{

//complex multiplication
__device__ static cufftDoubleComplex operator*(const cufftDoubleComplex& a, const cufftDoubleComplex& b){
	return { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };

//e^(I x), x is real
__device__ static cufftDoubleComplex e_pow_I(double x){
	cufftDoubleComplex e;
	sincos(x, &e.y, &e.x);
	return e;

//cuFFT store callback, register number limited by ABI
__device__ void spilling(void* _dataOut, size_t offset, cufftDoubleComplex element, void* = 0, void* = 0){
	//calculate psi -> psi * e^(I |psi|^2), just for fun
	double square = element.x * element.x + element.y * element.y;
	((cufftDoubleComplex*)_dataOut)[offset] = element * e_pow_I(square);

//same function called as a kernel, will be inlined and enjoy unlimited register allocation
__global__ void nonspilling(cufftDoubleComplex* psis){
	size_t offset = size_t(blockIdx.x) * blockDim.x + threadIdx.x;
	spilling(psis, offset, psis[offset]);

//compile with
//nvcc -std c++11 -rdc true -Xptxas -v -gencode arch=compute_35,code=sm_35 -cubin -o spilling_test.cubin spilling_test.cu
//relevant output:
//ptxas info    : Function properties for nonspilling
//    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
//ptxas info    : Function properties for spilling
//    40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads


The issue is not that the number of registers available is small in general when the ABI is used. The issue is that a device function called from a kernel cannot use more registers than have been allocated to that kernel at launch time. In this case, the CUFFT kernel seems to use registers quite economically, and the called device function uses double-precision sincos() which requires quite a few registers. The difference, which seems to amount to nine registers (= 36 bytes) needs to be spilled to local memory. Accesses to local memory are cached in the top-most cache of the memory hierarchy, so while the spilling will cause some performance degradation I wouldn’t expect it to be horrible.

GPUs operate with a variable-size register file as seen by user code, and the registers are allocated at kernel launch time, based on the register required by the kernel’s code as recorded in its object file. Given that, I do not see a way to get around the restrictions placed on separately compiled device functions called from such kernels. One ad-hoc idea would be to allow a user-configurable parameter to over-allocate registers at kernel launch time, beyond what the object code for the kernel requires. That might even be quite do-able when using the CUDA driver API (as opposed to the CUDA runtime API). But it would still pose issues when the kernel in question is part of a pre-compiled library like CUFFT.

If you can think of a scheme that would solve the various issues, and does not conflict with existing code base, you may want to propose it to NVIDIA in the form of an enhancement request (RFE), which you can file through the bug reporting form linked from the CUDA registered developer website.

I don’t see how this is relevant, in fact the problem arises also in a standalone snippet such as the one in the first post

that is imo something that should be improved on its own. I reckon that there is a slowpath if the argument must be reduced, and it would be nice to have a version of sincos() that doesn’t do it and gives UB if the argument would actually need it.

Also, the logic in the test program, if ported to a load callback, spills 16 bytes less (a double2), for some obscure reason that I cannot figure out looking at the assembly. However, this gives me a significant performance gain, worth the extra code complication of porting the store to a load callback. I am working toward a simple standalone benchmark of the two ways. By extrapolation I would say that not having spills at all would give another performance boost.

I am not quite experienced enough with CUDA to come up with a solution myself. I was talking to someone at NVIDIA and he said

Do you think that this could be viable?

I agree that the underlying issue is not specific to CUFFT, it applies to any situation where you have a separately compiled kernel calling a separately compiled device function. I referenced CUFFT because according to the subject line of this thread, that is the actual context of your question, i.e. the relevant kernel is pre-compiled inside the CUFFT library which means you cannot change the kernel code.

I am afraid I don’t know what you mean by “a load callback”. One of my last contributions to the CUDA math library prior to my departure from NVIDIA was a reduction in the code and register footprint of the trigonometric functions. The chances of improving it further are likely slim. Libraries in general have the attractive feature of having a single set of specifications, with the disadvantage that this means the implementation cannot be tailored to every specific use case.

I do not know who you talked to at NVIDIA, and I am a bit puzzled why you started this thread if you are already in direct contact with them. In practical terms I would say your best course of action would be to continue to work through your contacts at NVIDIA. You might consider filing a generic RFE for addressing spilling in CUFFT callbacks, but I am not sure whether that would help much. Callbacks could be arbitrarily complex, much more complex than a single call to sincos().

Based on my understanding of the issue at hand, JIT compiling the callback from PTX may achieve some reduction in spilling by recomputation of common subexpressions (increasing dynamic instruction count in the process) to reduce register pressure. But the number of registers needed for a device function cannot be reduced arbitrarily without spilling, so as long as the number of registers for the function exceeds what was allocated for the kernel, I would think spilling has to take place.

One might imagine some sort of link-time or even run-time compilation of kernels based on the register needs of all separately compiled device functions they call, but that would probably require a lot of infrastructure in driver and toolchain, and may give rise to other performance issues, e.g. JIT overhead in the case of run-time compilation.

CUFFT has two types of callbacks:

  1. a callback that will be invoked upon initial loading of data to be used in a transform
  2. a callback that will be invoked immediately prior to storing transformed data


Having said that, I agree completely that linking a user-defined device function against a precompiled binary (library global function) pretty much puts you in handcuffs. I have no idea how this:

“i guess the neat solution would be to keep the callbacks in a register agnostic ptx form and then jit them as soon as the register constraints of the calling kernel are known”

could possibly help. The “register constraints of the calling kernel” are indeed the handcuffs. I’m not sure how or why holding the code in ptx form vs. C source code form would get you out of those handcuffs.