Secondly, I have data that runs into gigabytes … what strategy should I apply to solve my problem? If I solve it in pieces, I am afraid the memory transfer between CPU and GPU will take too much time.
There are only two physical kinds of memory (if you don’t count registers) on current GPUs:
global memory (off chip), sometimes called “device” memory
shared memory (on chip)
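To make the two spaces concrete, here is a minimal kernel sketch (the names and the 256-element tile size are just illustrative): each block stages a tile of off-chip global memory into on-chip shared memory and then reuses it.

```
// Minimal sketch: stage a tile from off-chip global memory into on-chip
// shared memory, sync, then reuse it. Launch with a 256-thread block and
// n a multiple of 256, e.g.  stageIntoShared<<<n / 256, 256>>>(d_in, d_out);
__global__ void stageIntoShared(const float *g_in, float *g_out)
{
    __shared__ float tile[256];                 // on chip, one tile per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[i];                // off-chip DRAM -> on-chip
    __syncthreads();                            // tile now visible block-wide

    // Re-reads of neighbouring elements hit on-chip memory instead of DRAM.
    int left  = (threadIdx.x + 255) % 256;
    int right = (threadIdx.x + 1)   % 256;
    g_out[i] = (tile[left] + tile[threadIdx.x] + tile[right]) / 3.0f;
}
```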
However, you can access global memory in a few different ways:
direct access with pointers (this is usually what people mean when they say “global” or “device” memory in the context of a kernel, rather than talking about hardware)
through the texture cache, which is what you do when you are accessing “texture memory”
thread-local storage (“local” memory), which the compiler uses for register spills and for per-thread arrays it cannot keep in registers
So “local”, “global”, and “texture” memory are physically the same thing, and you can use the 384 MB on the card (minus some overhead for CUDA) for any combination of the above. (There are some dimensionality limits on textures, but no explicit size restrictions.) Shared memory is separate, and not counted here because it is a tiny contribution anyway: the 8800 GS has 12 multiprocessors with 16 kB each, for a total of 192 kB of shared memory.
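In code, the three views of the same off-chip memory look something like this sketch. The kernel and variable names are made up, and the texture path uses the texture reference API (cudaBindTexture / tex1Dfetch), which is what the toolkits of this generation offer:

```
#include <cuda_runtime.h>

// Legacy 1D texture reference, bound below to the same allocation the
// kernel also reads through a plain pointer.
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void samePhysicalMemory(const float *g_in, float *g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = g_in[i];               // "global": direct pointer access
    float b = tex1Dfetch(texRef, i); // "texture": same DRAM, via texture cache

    float scratch[8];                // "local": a dynamically indexed
    for (int k = 0; k < 8; ++k)      // per-thread array like this is placed
        scratch[k] = a * k + b;      // in local memory, again the same
    g_out[i] = scratch[i % 8];       // off-chip DRAM, addressed per thread
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // One buffer, two views: raw pointer and texture reference.
    cudaBindTexture(0, texRef, d_in, n * sizeof(float));

    samePhysicalMemory<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texRef);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```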
As for dealing with large datasets, I'm not sure what the best strategy is here. I don't know whether the 8800 GS supports asynchronous memory transfers overlapped with kernel calls. Depending on how much calculation you have to do per chunk, you could double buffer: transfer one block of data to/from the card while the other block is being used in a calculation.
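Something along these lines, assuming the device reports copy/kernel overlap (cudaDeviceProp::deviceOverlap tells you). The process kernel is a placeholder for your real computation, and the chunk sizes are arbitrary:

```
#include <cuda_runtime.h>
#include <stdio.h>

// Placeholder kernel, standing in for whatever work you do per chunk.
__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    // Does this device overlap copies with kernel execution?
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.deviceOverlap)
        printf("no copy/kernel overlap on this device\n");

    const int chunk  = 1 << 20;   // floats per chunk (4 MB), arbitrary
    const int chunks = 16;        // total dataset = 64 MB here

    // Pinned host memory is required for the copies to be truly async.
    float *h_data;
    cudaMallocHost(&h_data, (size_t)chunks * chunk * sizeof(float));

    float *d_buf[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;   // alternate buffers/streams: while one chunk is
                         // being processed, the other is in flight on PCIe
        float *h_chunk = h_data + (size_t)c * chunk;

        cudaMemcpyAsync(d_buf[b], h_chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        process<<<(chunk + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], chunk);
        cudaMemcpyAsync(h_chunk, d_buf[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) {
        cudaFree(d_buf[b]);
        cudaStreamDestroy(stream[b]);
    }
    cudaFreeHost(h_data);
    return 0;
}
```

If deviceOverlap is 0, the copies and kernels simply serialize, so you lose nothing but the bookkeeping.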
PCI-Express 2.0 can move data at 3-6 GB/sec, depending on how good your motherboard and RAM are, and whether you use pinned memory on the host.
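To see what your own board actually delivers, here is a quick timing sketch comparing a pageable malloc buffer against a pinned cudaMallocHost buffer (the 64 MB size is arbitrary):

```
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Time one host-to-device copy with CUDA events; returns milliseconds.
static float timeCopy(float *dst, const float *src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t bytes = 64 << 20;   // 64 MB
    float *d, *h_pageable, *h_pinned;
    cudaMalloc(&d, bytes);
    h_pageable = (float *)malloc(bytes);
    cudaMallocHost(&h_pinned, bytes);   // page-locked (pinned) allocation

    float t1 = timeCopy(d, h_pageable, bytes);
    float t2 = timeCopy(d, h_pinned,   bytes);
    printf("pageable: %.2f GB/s\n", bytes / (t1 * 1e6));
    printf("pinned:   %.2f GB/s\n", bytes / (t2 * 1e6));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d);
    return 0;
}
```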
Oh, and I forgot constant memory. This is limited to 64 kB, but it's not clear to me where it lives physically. I assume it is also stored in global memory, but there is also a constant cache on each multiprocessor, 6-8 kB in size.
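For what it's worth, usage is straightforward: data is declared __constant__ and filled from the host with cudaMemcpyToSymbol. The polynomial example below is made up, but it shows the broadcast pattern (all threads reading the same address) that the constant cache serves best:

```
#include <cuda_runtime.h>

// Up to 64 kB total can be declared __constant__; reads where every thread
// in a warp hits the same address are the fast path the cache is built for.
__constant__ float c_coeffs[16];

__global__ void evalPoly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;                    // Horner evaluation: every thread
    for (int k = 0; k < 16; ++k)         // reads the same coefficient, a
        acc = acc * x[i] + c_coeffs[k];  // broadcast served from the
    y[i] = acc;                          // per-multiprocessor constant cache
}

int main()
{
    float h_coeffs[16];
    for (int k = 0; k < 16; ++k) h_coeffs[k] = 1.0f / (k + 1);

    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));

    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    evalPoly<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```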