large data to device memory

I’m a newbie in CUDA programming, so sorry if the question looks silly.

I’m going to process (preprocessing and digital filtering) more than 100 million samples of a whole-day EEG signal, so there will be a lot of data transfer to and from device memory. I use a 9800 GTX+; it has 16 multiprocessors. If I have 256 threads per block and each thread needs 4 registers, then 8192 registers / (4 × 256) = 8, so can I run 8 blocks per multiprocessor, and therefore 8 × 16 = 128 blocks across the 16 MPs of the whole GPU?

Next I try to compute: 128 blocks × 256 threads is 32768 numbers, or samples of the EEG signal. But since I have millions of samples, I need a loop around the kernel to refill it with new data from device memory. Is this correct?
for (int j = 0; j < n; j++)
{
    kernel<<<dimGrid, dimBlock>>>(A_device, B_device, size);
    cudaMemcpy(A_device, C_device, size * sizeof(float), cudaMemcpyDeviceToDevice);
}
But cudaMemcpy copies size * 4 bytes starting from the zeroth element of the source (B_device[0]). My question is: how can I copy data starting from another index of B_device, not from its beginning? Is that possible? Does it work through a pointer change? I’ll be very thankful for any advice.

You can launch many more thread blocks than can run concurrently; the hardware/driver will schedule them. So it’s probably simplest to have as many threads as output elements/samples. Once you have that working, it may be interesting to have each thread process several output elements. This could potentially improve performance, because you could reuse some computations between the elements, rather than recomputing them in another thread.

paulius