callback ABI issues/caveats

I have been working with cuFFT callbacks lately, and I found that there are a number of issues related to ABI, or to say it differently, to the fact that callbacks cannot be inlined. I have already explained in this topic that callbacks must use a maximum register total defined by some ABI, so spilling is likely. I have also found two other issues, which I’ll report here FYI, cause I don’t believe many know about them.

  1. If a callback function takes a pointer to shared memory (such is the case of cuFFT callbacks), it will be treated as a generic pointer, leading to the emission of generic memory access instructions. The problem is that there there is no way to make the compiler believe that a pointer belongs to some address space: you can't cast to (__shared__ mytype*) even if you really know you are handling __shared__ memory, because the __shared__ attribute will be ignored (and nvcc warns about it). I have found however a silly workaround:
      * This function takes a pointer and forces the compiler to believe  
      * that it resides in __shared__ space.  
     template<typename P> __device__ P* force_shared_ptr(P* assumed_shmem){  
         extern __shared__ char shmem[];  
         auto ptrdiff = (char*)assumed_shmem - shmem;  
         return (P*)(shmem + ptrdiff);  
  2. Callbacks can declare static __shared__ storage, but then this extra __shared__ memory must be allocated even if the function is not used when you run a kernel in the same CUmodule. If you have two callbacks with two different static __shared__ allocations and a kernel that calls only one of them, both the __shared__ buffers will be statically allocated. On the bright side, I have found that nvcc is smart enough to avoid this if a kernel does not use callbacks at all.

And sample code:

Please review the above documentation and sample code, and see if this is what you want.