[bug?] Weird kernel call when cuFFT is enabled

I’m writing an Ising-like simulation in 2D on the GPU. I compute correlation lengths in the system through FFTs, using cuFFT. The main program loop looks like this:

for(auto i = 0; i < nStatistics; i++){
    for(auto j = 0; j < nCorr; j++){
        cudaSwapTypes(swapArgs);
        cudaSwapParticles(swapParticleArgs);
    }
    if(statistics){
        cudaDeviceSynchronize();
        cudaComputeEnergy(energyArgs);
        cufftExecD2Z(forward, ArrayA, ArrayB);   // forward FFT: double real -> double complex
        cudaSquareArrayB(ArrayB);
        cufftExecZ2D(backward, ArrayB, ArrayC);  // inverse FFT: double complex -> double real
        cudaDeviceSynchronize();
    }
}

The FFT plans are set up elsewhere. The code runs without reporting any errors. However, with memcheck enabled in cuda-gdb, it reports a Warp Out-of-range Address in dpRadix0004A::kernel3MemBluestein. If I set a breakpoint on every kernel launch, dpRadix0004A::kernel3MemBluestein appears to be called between SwapTypes and SwapParticles:

[Switching focus to CUDA kernel 8, grid 86, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 1, lane 0]
kernelSwapTypes<true, float, float, 8u><<<(64,1,1),(512,1,1)>>> (states=0x2aaaee200000, particles=0x2aaaeea00000, size=128, logSize=7, 
    rngState=0x2aaaf0000000, levelSpacing=0x2aaaee000000, chemicalPotentials=0x0, maxState=0) at SystemGPU.cu:55
55	    unsigned int idx = threadIdx.x;// + blockDim.x * blockIdx.x;
(cuda-gdb) c
Continuing.
0x0000000002020680 in void dpRadix0004A::kernel3MemBluestein<unsigned int, double, (fftDirection_t)1, 128u, 6u, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_blues_mem_radix3_t, unsigned int, double>)<<<(64,1,1),(512,1,1)>>> ()

This then produces the out-of-range address error:

Thread 1 "a.out" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
0x00000000025c3bb0 in void dpRadix0004A::kernel3MemBluestein<unsigned int, double, (fftDirection_t)1, 128u, 6u, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_blues_mem_radix3_t, unsigned int, double>)<<<(64,1,1),(512,1,1)>>> ()

There appears to be no calling location for that kernel. This still happens if I remove the cufftExec calls, but goes away if I remove the cufftPlanMany call (made elsewhere). Also, the FFT dimensions are powers of 2, so shouldn’t Cooley-Tukey be used here instead of Bluestein?
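
To rule out a mistake on my side about the sizes, a small host-side check like the one below could confirm that every dimension handed to the plan really is a power of 2. This is only a sketch: the helper names isPowerOfTwo and allPowersOfTwo are hypothetical, and it assumes the plan sizes are available as a std::vector<int>.

```cpp
#include <cstddef>
#include <vector>

// True when n is a positive power of two (exactly one bit set).
constexpr bool isPowerOfTwo(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// True when every transform dimension is a positive power of two.
// `dims` is a hypothetical copy of the sizes passed to the plan.
inline bool allPowersOfTwo(const std::vector<int>& dims) {
    for (int d : dims) {
        if (d <= 0 || !isPowerOfTwo(static_cast<std::size_t>(d))) {
            return false;
        }
    }
    return true;
}
```

Calling allPowersOfTwo on the size array just before plan creation would at least tell me whether the sizes cuFFT receives match what I expect.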