I have been working with cuFFT callbacks lately, and I found that there are a number of issues related to ABI, or to say it differently, to the fact that callbacks cannot be inlined. I have already explained in this topic that callbacks must use a maximum register total defined by some ABI, so spilling is likely. I have also found two other issues, which I’ll report here FYI, cause I don’t believe many know about them.
- If a callback function takes a pointer to shared memory (such is the case of cuFFT callbacks), it will be treated as a generic pointer, leading to the emission of generic memory access instructions. The problem is that there there is no way to make the compiler believe that a pointer belongs to some address space: you can't cast to (__shared__ mytype*) even if you really know you are handling __shared__ memory, because the __shared__ attribute will be ignored (and nvcc warns about it). I have found however a silly workaround:
/* * This function takes a pointer and forces the compiler to believe * that it resides in __shared__ space. */ template<typename P> __device__ P* force_shared_ptr(P* assumed_shmem){ extern __shared__ char shmem[]; auto ptrdiff = (char*)assumed_shmem - shmem; return (P*)(shmem + ptrdiff); }
- Callbacks can declare static __shared__ storage, but then this extra __shared__ memory must be allocated even if the function is not used when you run a kernel in the same CUmodule. If you have two callbacks with two different static __shared__ allocations and a kernel that calls only one of them, both the __shared__ buffers will be statically allocated. On the bright side, I have found that nvcc is smart enough to avoid this if a kernel does not use callbacks at all.