Basic Question (about CUDA concepts)

Hi CUDA Community,

This is my first post, so obviously I am new here, though I have been reading through many threads for the past few days. I recently accepted a job as an intern, and they are going to have me do a great deal of work on a Tesla S1070 system, though I think only 2 of its 4 GPUs (C1060-class) are operating at the moment. That can be rectified later if the need for all 4 GPUs becomes evident. I believe it all runs on a Windows XP system.

I am sure I will be asking everyone a lot of questions, so please be patient with me. I have been dropped into the world of parallel programming on a whim, without any kind of training. I have read through a great deal of the programming guide and many other resources (including this forum!).

First question: over the coming weeks, I will be processing a lot of data, running mathematical calculations on it, and then swapping the processed data out for new data. So I expect latency problems from copying between host memory and the GPU’s global memory in both directions. Is it right that streams are used to transfer data to and from host memory while a kernel executes concurrently? I have done a great deal of reading about memory management and code optimization, but when it comes to processing gigabytes of data for hours at a time, where should I get started?
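To make the question concrete, here is a minimal sketch of the pattern I think the programming guide is describing: the data is cut into chunks, and two streams take turns so that the copy for one chunk can overlap the kernel for another. Everything named here is my own placeholder, not something from a real project: `scale_kernel` stands in for the actual math, and `process_in_chunks`, `chunk`, and `factor` are hypothetical. As far as I understand, `cudaMemcpyAsync` only overlaps with kernels when the host buffers are page-locked (hence `cudaMallocHost`) and the device supports concurrent copy and execution, so the sketch assumes both.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Return the first CUDA error to the caller. A sketch only: on an early
// return the buffers allocated so far are not released.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) return e_; } while (0)

// Hypothetical stand-in for the real computation: out[i] = in[i] * factor.
__global__ void scale_kernel(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Processes `total` floats from src into dst, `chunk` floats at a time.
// Two streams alternate, each with its own pinned host and device buffers.
// Assumes chunk is small enough for a single 1-D grid launch on the device.
cudaError_t process_in_chunks(const float* src, float* dst, size_t total,
                              size_t chunk, float factor)
{
    if (chunk == 0) return cudaErrorInvalidValue;

    const int kSlots = 2;
    const size_t bytes = chunk * sizeof(float);
    cudaStream_t stream[kSlots];
    float *h_in[kSlots], *h_out[kSlots], *d_in[kSlots], *d_out[kSlots];
    size_t pending_off[kSlots] = {0, 0};  // where each slot's result belongs
    size_t pending_n[kSlots]   = {0, 0};  // how many floats are in flight

    for (int s = 0; s < kSlots; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaMallocHost((void**)&h_in[s], bytes));   // page-locked host memory
        CHECK(cudaMallocHost((void**)&h_out[s], bytes));
        CHECK(cudaMalloc((void**)&d_in[s], bytes));
        CHECK(cudaMalloc((void**)&d_out[s], bytes));
    }

    size_t off = 0;
    for (int c = 0; off < total; ++c, off += chunk) {
        const int s = c % kSlots;
        const size_t n = (total - off < chunk) ? (total - off) : chunk;

        // Wait for this slot's previous chunk, then collect its result.
        CHECK(cudaStreamSynchronize(stream[s]));
        if (pending_n[s] > 0)
            memcpy(dst + pending_off[s], h_out[s], pending_n[s] * sizeof(float));

        // Stage the next chunk and queue copy-in, kernel, copy-out on the
        // same stream. The other stream's work can overlap with this one.
        memcpy(h_in[s], src + off, n * sizeof(float));
        CHECK(cudaMemcpyAsync(d_in[s], h_in[s], n * sizeof(float),
                              cudaMemcpyHostToDevice, stream[s]));
        const int threads = 256;
        const int blocks = (int)((n + threads - 1) / threads);
        scale_kernel<<<blocks, threads, 0, stream[s]>>>(d_in[s], d_out[s], (int)n, factor);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out[s], d_out[s], n * sizeof(float),
                              cudaMemcpyDeviceToHost, stream[s]));
        pending_off[s] = off;
        pending_n[s] = n;
    }

    // Drain both streams, collect the last results, and release everything.
    for (int s = 0; s < kSlots; ++s) {
        CHECK(cudaStreamSynchronize(stream[s]));
        if (pending_n[s] > 0)
            memcpy(dst + pending_off[s], h_out[s], pending_n[s] * sizeof(float));
        CHECK(cudaFree(d_in[s]));
        CHECK(cudaFree(d_out[s]));
        CHECK(cudaFreeHost(h_in[s]));
        CHECK(cudaFreeHost(h_out[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    return cudaSuccess;
}
```

A call would look something like `process_in_chunks(src, dst, total, 1 << 20, 2.0f)`, where the chunk size is a guess that would need tuning. Is this roughly the right structure for long-running jobs, or is there a better starting point?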

Sorry for sounding like such a newbie, but that’s honestly what I am. I’m getting there, though!