I am going to implement an efficient FFT on a GPU using CUDA.
Before implementing my FFT algorithm, I would like to understand the storage units of the GPU.
I have a few basic questions:
1. How many storage units (register files) are there in each core of G80, GT200, and Fermi?
2. How many clock cycles does it take to access the L1 cache (24 KB??) and shared memory (?? KB) in GT200?
3. How many clock cycles does it take to access the L1/shared memory and the L2 cache in Fermi?
4. How does each core access global memory?
If possible, could you point me to documents that answer the questions above? Thanks for taking the time to help.