Hello,

I would like to use Volkov’s FFT code in my own kernels, so I am trying to understand how it works. For example, I assume that FFT8( float2 *work, int batch ) computes the FFT of each of #batch vectors of length 8, and that all vectors are independent of each other. As I understand it, every thread is responsible for computing the FFT of exactly one vector of length 8. So I would expect the kernel to look like:

[codebox]__global__ void FFT8_device( float2 *work )

{

```
work += (blockIdx.y*gridDim.x+blockIdx.x) * 64 * 8 // offset due to blocks of 64 threads each, every thread working on a vector of size 8
      + threadIdx.x * 8; // offset per thread, each working on a vector of size 8
float2 a[8];
load<8>( a, work, 1);
FFT8( a );
store<8>( a, work, 1);
```

}[/codebox]

In other words, the first two lines compute the offset to the first element of the vector the thread is responsible for. The thread then loads 8 consecutive values starting at the new work pointer, computes FFT8 on its local vector, and stores the result back. The original code, however, does this:

[codebox]__global__ void FFT8_device( float2 *work )

{

```
work += (blockIdx.y*gridDim.x+blockIdx.x) * 64 * 8 // offset due to blocks of 64 threads each, every thread working on a vector of size 8
      + threadIdx.x; // offset is the thread index???
float2 a[8];
load<8>( a, work, 64 );// space of 64 between???
FFT8( a );
store<8>( a, work, 64 );// space of 64 between???
```

}[/codebox]

The first line does what I would expect. The second line, however, adds only the thread index, so the pointer ends up somewhere within the first 8 vectors of the block's chunk, e.g. at element 3 of the 4th vector. The load then reads 8 values spaced 64 elements apart, i.e. element 3 of 8 different vectors. What is the purpose of this?
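To make the comparison concrete, here is a minimal host-side C++ sketch of how I read the two kernels. It only computes which element indices of work a given thread would touch; it does not run any FFT. It rests on an assumption I have not confirmed in the original source: that load<8>( a, work, stride ) reads work[i * stride] for i = 0..7. The names kThreads, kN, expected_index, original_index and print_indices are my own and purely illustrative.

```cpp
#include <cassert>
#include <cstdio>

// Assumption: each block of 64 threads owns a chunk of 64 * 8 elements.
constexpr int kThreads = 64;  // threads per block
constexpr int kN = 8;         // FFT length handled by one thread

// Index of a[i] for thread t of (linearized) block b in the version I expected:
// base offset b*64*8 + t*8, then load<8>( a, work, 1 ).
int expected_index(int b, int t, int i) {
    return b * kThreads * kN + t * kN + i * 1;
}

// Index of a[i] for thread t of block b in the original version:
// base offset b*64*8 + t, then load<8>( a, work, 64 ).
// The stride-64 read pattern is my assumption about load<8>.
int original_index(int b, int t, int i) {
    return b * kThreads * kN + t + i * kThreads;
}

// Print the 8 indices one thread touches under both schemes.
void print_indices(int b, int t) {
    assert(t >= 0 && t < kThreads);
    std::printf("block %d, thread %d\n", b, t);
    for (int i = 0; i < kN; ++i) {
        std::printf("  a[%d]: expected %4d   original %4d\n",
                    i, expected_index(b, t, i), original_index(b, t, i));
    }
}
```

Calling print_indices(0, 27) would show thread 27 touching indices 216 to 223 in my version, and 27, 91, 155, ..., 475 in the original. Under the same assumption, for a fixed i two neighbouring threads read adjacent elements in the original version but elements 8 apart in mine, and in both versions a block still covers exactly its own 512 elements.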

Either there is an error in the code or, more likely, I am missing something essential about the algorithm.

I would appreciate an explanation that helps me understand this.

Kind regards,

peter