Basic questions about GPU Architectures.


I am going to implement an efficient FFT on GPU using CUDA.

Before implementing my FFT algorithm, I would like to understand the storage units of the GPU (register files, caches, and shared memory).

I have a few basic questions:

1. How many storage units (register files) are there in each core of G80, GT200, and Fermi?
2. How many clock cycles does it take to access the L1 cache (24 KB??) and shared memory (?? KB) in GT200?
3. How many clock cycles does it take to access the L1/shared memory and the L2 cache in Fermi?
4. How does each core access global memory?

If possible, could you point me to documents that answer these questions?

Thanks for your valuable time.

(rewriting this post because of ^W)

You should have a look at the NVIDIA OpenCL Programming Guide. Appendix A gives an overview of recent graphics cards and their compute capabilities, and Appendix C.1 lists the features available for each compute capability.
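Since you are using CUDA, you can also ask the card itself for some of these numbers. Here is a minimal sketch, assuming the CUDA runtime API and a toolkit recent enough to expose `l2CacheSize`. The helper name `printDeviceLimits` is my own, not something from the guide. Note that it reports per-block limits and sizes, not access latencies.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a few storage-related limits for every CUDA device.
// Returns the number of devices found, or -1 on a CUDA error.
// (Hypothetical helper name, for illustration only.)
int printDeviceLimits()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        if (cudaGetDeviceProperties(&p, dev) != cudaSuccess) return -1;

        printf("%s (compute capability %d.%d)\n", p.name, p.major, p.minor);
        printf("  multiprocessors:          %d\n", p.multiProcessorCount);
        printf("  32-bit registers / block: %d\n", p.regsPerBlock);
        printf("  shared memory / block:    %zu bytes\n", p.sharedMemPerBlock);
        printf("  constant memory:          %zu bytes\n", p.totalConstMem);
        printf("  L2 cache:                 %d bytes\n", p.l2CacheSize);
    }
    return count;
}

int main()
{
    return printDeviceLimits() < 0 ? 1 : 0;
}
```

Compile it with `nvcc` and compare the compute capability it prints against the tables in the appendices mentioned above.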

As far as I can remember, accessing global memory is really painful in terms of clock cycles (roughly 500 cycles per access). Unfortunately, I have forgotten where I got this figure from.
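Since I cannot give you a source for that figure, you may prefer to measure it on your own card. A common way is a pointer chase: a single thread performs a chain of loads in which each address depends on the previous result, so the loads cannot overlap. The sketch below is only an illustration. The names `chase` and `cyclesPerLoad`, the array size, and the stride are my own assumptions, the result includes some loop overhead, and on Fermi it will depend on whether the loads hit L1/L2. I believe `clock64()` needs a Fermi-class device or newer; on older cards such as GT200 you would probably have to use `clock()` instead.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread follows a chain of indices through global memory.
__global__ void chase(const unsigned int *next, int steps,
                      unsigned int *sink, long long *cycles)
{
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i)
        idx = next[idx];          // each load depends on the previous one
    long long stop = clock64();
    *sink = idx;                  // store the result so the loop is not optimized away
    *cycles = stop - start;
}

// Returns the average cycles per dependent load, or -1.0 on a CUDA error.
// n, stride and steps are illustrative parameters, not tuned values.
double cyclesPerLoad(int n, int stride, int steps)
{
    // Element i points to element (i + stride) mod n.
    std::vector<unsigned int> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = (unsigned int)((i + stride) % n);

    unsigned int *dNext = 0, *dSink = 0;
    long long *dCycles = 0;
    long long cycles = 0;
    size_t bytes = n * sizeof(unsigned int);

    bool ok = cudaMalloc(&dNext, bytes) == cudaSuccess
           && cudaMalloc(&dSink, sizeof(unsigned int)) == cudaSuccess
           && cudaMalloc(&dCycles, sizeof(long long)) == cudaSuccess
           && cudaMemcpy(dNext, h.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        chase<<<1, 1>>>(dNext, steps, dSink, dCycles);
        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(&cycles, dCycles, sizeof(long long),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dNext);
    cudaFree(dSink);
    cudaFree(dCycles);
    return ok ? (double)cycles / steps : -1.0;
}

int main()
{
    // 16 MB of indices, with jumps of about 16 KB between consecutive loads.
    double c = cyclesPerLoad(1 << 22, 4099, 10000);
    if (c < 0.0) return 1;
    printf("about %.0f cycles per dependent global load\n", c);
    return 0;
}
```

Treat the printed value as a rough estimate for your particular card and clock settings rather than an architectural constant.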

Constant memory, on the other hand, is cached, so repeated reads from it are much cheaper.
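To illustrate the constant memory path, here is a small sketch in which every thread reads the same two coefficients from a `__constant__` array. The names `coeff`, `scaleOffset`, and `runScaleOffset` are made up for this example and have nothing to do with FFT specifically.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical coefficients (scale and offset) read by all threads.
__constant__ float coeff[2];

__global__ void scaleOffset(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeff[0] + coeff[1];  // same constant addresses for every thread
}

// Copies the coefficients to constant memory, runs the kernel on n values,
// and copies the results back. Returns true on success.
bool runScaleOffset(const float *hIn, float *hOut, int n,
                    float scale, float offset)
{
    float hCoeff[2] = { scale, offset };
    float *dIn = 0, *dOut = 0;
    size_t bytes = n * sizeof(float);

    bool ok = cudaMemcpyToSymbol(coeff, hCoeff, sizeof(hCoeff)) == cudaSuccess
           && cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scaleOffset<<<blocks, threads>>>(dIn, dOut, n);
        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}

int main()
{
    float in[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
    float out[4];
    if (!runScaleOffset(in, out, 4, 2.0f, 1.0f)) return 1;
    for (int i = 0; i < 4; ++i) printf("%g ", out[i]);
    printf("\n");
    return 0;
}
```

One caveat for an FFT: the constant cache works best when all threads of a warp read the same address. If each thread reads a different twiddle factor, the reads are serialized, so a table in shared memory may turn out to be the better choice there.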