I have a tricky (for me) signal processing problem. I’ve sketched it below, with a proposed solution. I’d appreciate any feedback on my assumptions, and general direction.
Say we have a long array, A = [a_0, …, a_(N-1)], and a short array B = [b_0, …, b_127]. I need to compute FFT([a_k, …, a_(k+127)]*B) for all k<N-128 (where “*B” means element-wise multiplication with B).
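To pin down the notation, here is the same quantity written out as one formula. It assumes the usual forward-transform sign convention (negative exponent) and keeps the index range given above; the symbol X_k[m] is introduced here only for illustration.

```latex
X_k[m] \;=\; \sum_{n=0}^{127} a_{k+n}\, b_n\, e^{-2\pi i\, m n / 128},
\qquad m = 0, \dots, 127, \quad 0 \le k < N - 128 .
```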
I believe this is an ideal application for cuFFT callbacks (?).
Unfortunately, I can’t use them. From the docs: “NOTE:The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems.”
In that case, I need to write a custom kernel, something like the “before” case in [link]. The problem is memory. If N is large (say 2^20), there are roughly N of these windows [a_k, …, a_(k+127)], each 128 elements long, so materializing all of them takes about 128 times the memory of A itself. The only way forward I can see is to process the windows in batches.
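To put a rough number on it: assuming single-precision complex data (8 bytes per element, which is my assumption, not something stated above), 2^20 windows × 128 elements × 8 bytes is 2^30 bytes, i.e. about 1 GiB for the windowed input alone, and the same again for the output if the transform is not done in place. Below is a minimal CPU-side sketch of the batching idea in plain Python. It is not GPU code: the names `fft` and `windowed_ffts_batched` are hypothetical, and the small radix-2 `fft` only stands in for a batched FFT call.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT (forward, negative exponent).

    Stand-in for a library FFT; len(x) must be a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * m / n) * odd[m]
        out[m] = even[m] + t
        out[m + n // 2] = even[m] - t
    return out


def windowed_ffts_batched(a, b, batch_size):
    """Yield (k_start, spectra) one batch at a time.

    spectra[j] is FFT([a[k], ..., a[k + len(b) - 1]] * b) for k = k_start + j.
    Only batch_size windowed frames exist in memory at any moment.
    """
    w = len(b)
    num_k = len(a) - w  # same bound as in the post: k < N - 128
    for k_start in range(0, num_k, batch_size):
        k_stop = min(k_start + batch_size, num_k)
        # Pre-processing step (the role of the custom kernel):
        # gather each window from a and multiply element-wise by b.
        frames = [[a[k + n] * b[n] for n in range(w)]
                  for k in range(k_start, k_stop)]
        # Stand-in for a single batched FFT over all frames in this batch.
        yield k_start, [fft(frame) for frame in frames]
```

On the GPU, the gather-and-multiply step would presumably be the custom kernel writing into a buffer of batch_size × 128 elements, and the per-frame loop would become one batched plan (for example via `cufftPlanMany` with its `batch` argument) reused for every batch. The batch size then trades memory footprint against the number of kernel launches.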
Does that make sense?