Cufft store callback broken

I’m trying to integrate a post-FFT 1D Shift via a cufft store callback to improve performance over a current implementation that performs the 1D Shift as a separate kernel after executing cufft. The 128x256 C2C 2D FFTs are executed in a batch of 192, with separate contiguous blocks of cuda-managed memory for input & output. The shift operations swaps rows 0-63 with 64-127 and the callback looks like:

__device__ void CB_Shift1D(void *dataOut, size_t offset, cufftComplex value, void *info, void *ptr)
{
    int delta = 64 * 256;
    if(((offset >> 8) & 0x7F) >= 64)
        delta *= -1;
    ((cufftComplex *) dataOut)[offset + delta] = value;
}

This seems to work for the most part but when comparing against the reference results there are between 0 and ~100 entire rows (of size 256) that differ (out of 24576 rows across all batches). The number of and particular rows are different every time the code is run with the same input data.

Curiously, if I add some printfs to the callback to aid debugging the issue disappears entirely, i.e. new results match reference results perfectly, but resurfaces when the printfs are removed. Although this suggests a memory/timing related issue/race-condition I am unsure if my handling off “offset” in the callback is “safe” as it assumes the “dataOut” pointer is valid for an entire 128x256 array to perform the shift/swap… If this assumption is invalid, is there an alternative viable approach?

I’m using CUDA 11.7 running on a GeForce RTX 3070 Ti Laptop GPU (compute 8.6 - ampere).

Let’s take a look carefully at the text here including the sub-section. Excerpting:

When more than one kernel are used to implement a transform, cuFFT alternates using the workspace and the output buffer to write intermediate results.

Note that there are no guarantees on the relative order of execution of blocks within a grid. As such, callbacks should not rely on any particular ordering within a kernel. For instance, reordering data (such as an FFT-shift) could rely on the order of execution of the blocks. Results in this case would be undefined.

So it is possible that the output buffer location you are writing to, is being used by CUFFT for temporary storage for computation of other output points. If that were true, the only way your code could be correct is if it were guaranteed that by the time the very first store callback is called, that all computations are finished, and cufft does not guarantee that.

One possibility (I think) is to use a completely independent output buffer, known only to the callback. This obviously implies that you are willing to increase the memory required by a non-inplace C2C or Z2Z callback by 50% (effectively allocating a 3rd buffer, used as a “2nd” output buffer).

I haven’t worked thru all the details (nor have you provided a complete test case) but it seems to me that by doubling the size of the output buffer (CUFFT effectively only “knowing” about the first half of this space) and having your callback offset each output index by the size of the first half of the buffer, you should be able to fairly easily create an independent output space. It’s not obvious to me that this should have a significant performance impact on the transform (the overall number of loads and stores should be the same, with approximately the same relative access pattern and coalescing, except that the additional space increases the cache footprint), but I haven’t tried it.

I don’t have any solution ideas that work entirely within the process you have sketched out (i.e. using callback, no additional memory, no performance impact, no extra kernel, etc.)

Hi Robert,

Thanks for bringing that excerpt to my attention! That was the missing piece to solving my problem. In this case I didn’t need to double the output buffer as I just made the cufft call in-place so that it wasn’t aware of the output buffer, which gets passed to the callback via the callerInfo pointer (and used for storing instead of dataOut). So the working callback code now looks like this:

__device__ void CB_Shift1D(void *dataOut, size_t offset, cufftComplex value, void *info, void *ptr)
{
    int delta = 64 * 256;
    if(((offset >> 14) & 0x1) == 0x1)
        delta *= -1;
    ((cufftComplex *) info)[offset + delta] = value;
}

Regards,
Hugh