CUDA memcheck finds an illegal instruction in cuFFT

When I enable memcheck in cuda-gdb, I get the following error:
“Received signal CUDA_EXCEPTION_$, warp Illegal instruction”
When I run this code without CUDA memcheck turned on, it works without error. When I try to print the stack, I only get a single line, inside a cuFFT kernel.

The function it points to is void dpRadix0025B::kernel3MemBluestein(), which is part of cuFFT. I was not expecting this: my FFT size is 2048, so I am surprised that it is using the Bluestein algorithm at all. I thought that with a power-of-2 FFT size it used Cooley-Tukey.
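One way to sanity-check the size that actually reaches the plan is to factor it on the host. The sketch below is only illustrative: the helper names are hypothetical, and the threshold of 127 is an assumption based on my reading of the cuFFT documentation, which describes Bluestein as the fallback when the length has a prime factor too large for the Cooley-Tukey building blocks. A kernel name containing "Bluestein" may not by itself prove that path was chosen for the size.

```cpp
#include <cassert>
#include <cstdint>

// Largest prime factor of n (n >= 2), found by trial division.
std::uint64_t largest_prime_factor(std::uint64_t n) {
    assert(n >= 2);
    std::uint64_t largest = 1;
    for (std::uint64_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {
            largest = p;
            n /= p;
        }
    }
    if (n > 1) {
        largest = n;  // whatever remains is a prime larger than any factor found so far
    }
    return largest;
}

// ASSUMPTION: 127 is taken from the cuFFT documentation's description of
// its radix building blocks; this is a heuristic, not a cuFFT API.
bool likely_needs_bluestein(std::uint64_t fft_size) {
    return largest_prime_factor(fft_size) > 127;
}
```

For a size of 2048 this reports a largest prime factor of 2, so if the value logged just before the plan is created gives a different answer, the size being passed is probably not the one intended.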

Any advice on how to dig deeper to find the root cause of this error is appreciated. This is my first CUDA project and I have hit the limit of my knowledge.

Can you provide some sample code?

I am trying to pull the code out into a smaller single file, but unfortunately the error changes when I do. Below I describe how the error changes, along with my system settings, in case that helps.
When I run my full code through cuda-gdb with memcheck on, I get the error described above, pointing to the Bluestein function.
When I run it through cuda-memcheck, I get the error "an illegal instruction was encountered" in one of my own kernels, which shuffles data around a buffer using shared memory. I also get the message “Internal memcheck Error: Initialization failed” and a printed backtrace that mentions cufftPlanMany.
When I split it into a smaller program, the error disappears in cuda-gdb, but I still get an error with cuda-memcheck.

Ubuntu 20.04.2 LTS
g++ 9.3.0
nvcc 10.1
NVIDIA Tesla T4

I have included the code that uses cuFFT. It is an FFT-based FIR filter that uses overlap-save to filter the multi-channel input data.
I can provide more if it will help, but the code in question references quite a few different classes.

Unfortunately, I’m not able to run or test. If the issue is in cufftExecC2R or cufftExecR2C, it’s possible the array sizes aren’t lining up, or aren’t large enough, and the FFT kernels are trying to access data out of bounds. Providing a small reproducer would be very helpful.
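As a rough illustration of the size check suggested above, here is a minimal host-side sketch of the element counts a batched 1D real-to-complex / complex-to-real transform needs. It assumes out-of-place transforms with the default contiguous layout (no custom strides or embed arrays); `R2CSizes` and `r2c_min_sizes` are hypothetical names for this example, not part of cuFFT.

```cpp
#include <cassert>
#include <cstddef>

// Minimum buffer sizes, in elements, for one batched 1D R2C/C2R transform.
struct R2CSizes {
    std::size_t real_elems;     // count of real (float) samples
    std::size_t complex_elems;  // count of complex (float2-sized) bins
};

// ASSUMPTION: out-of-place, contiguous batches, default strides.
// A real signal of length n has n/2 + 1 non-redundant complex bins.
R2CSizes r2c_min_sizes(std::size_t n, std::size_t batch) {
    assert(n > 0 && batch > 0);
    return R2CSizes{ n * batch, (n / 2 + 1) * batch };
}
```

With n = 2048 and, say, 4 channels, this gives 8192 real samples and 4100 complex bins (1025 per channel, not 1024). Comparing these counts against the sizes passed to cudaMalloc for each buffer is a quick way to rule out the off-by-one-bin case.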