I'm new to CUDA programming, so sorry if the question looks silly.
I'm going to process (preprocessing and digital filtering) more than 100 million samples of a whole-day EEG signal,
so there will be a lot of data transfer to and from device memory. I use a 9800 GTX+, which has 16 multiprocessors.
If I have 256 threads per block and need 4 registers per thread, then 8192 registers / (4 * 256) = 8, so can I use 8 blocks
per multiprocessor, and so for 16 multiprocessors 8 * 16 = 128 blocks for the whole GPU?
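One way to check this arithmetic is to compare the register bound with the resident-thread bound reported by the device, since registers are only one of several limits on how many blocks fit on a multiprocessor. The following is a minimal sketch, not a full occupancy calculation: the helper names `blocks_by_registers` and `blocks_by_threads` are hypothetical, the figure of 8192 registers is taken from the calculation above rather than queried, and other limits (shared memory, the hardware cap on resident blocks) are not checked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: blocks per multiprocessor allowed by registers alone.
int blocks_by_registers(int regs_per_mp, int regs_per_thread, int threads_per_block)
{
    return regs_per_mp / (regs_per_thread * threads_per_block);
}

// Hypothetical helper: blocks per multiprocessor allowed by the thread limit alone.
int blocks_by_threads(int max_threads_per_mp, int threads_per_block)
{
    return max_threads_per_mp / threads_per_block;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    const int regs_per_mp = 8192;       // assumption: figure used in the post
    const int regs_per_thread = 4;      // from the post
    const int threads_per_block = 256;  // from the post

    int by_regs = blocks_by_registers(regs_per_mp, regs_per_thread, threads_per_block);
    int by_threads = blocks_by_threads(prop.maxThreadsPerMultiProcessor, threads_per_block);
    int per_mp = (by_regs < by_threads) ? by_regs : by_threads;

    printf("multiprocessors:            %d\n", prop.multiProcessorCount);
    printf("blocks/MP by registers:     %d\n", by_regs);
    printf("blocks/MP by thread limit:  %d\n", by_threads);
    printf("resident blocks (estimate): %d\n", per_mp * prop.multiProcessorCount);
    return 0;
}
```

The smaller of the two bounds is what matters, so the estimate of 8 blocks per multiprocessor only holds if the thread limit also allows 8 * 256 resident threads. A grid may still contain more blocks than can be resident at once; the extra blocks are simply scheduled as earlier ones finish.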
Next I computed: 128 blocks * 256 threads is 32768 numbers, i.e. samples of the EEG signal. But since I have millions of samples,
I need a loop around the kernel to refill it with new data from device memory. Is this correct?
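for (int j = 0; j < n; j++)
{
    // process the current chunk of samples
    kernel<<<dimGrid, dimBlock>>>(A_device, B_device, size);
    // refill the input buffer with the next data, device to device
    cudaMemcpy(A_device, C_device, size * sizeof(float), cudaMemcpyDeviceToDevice);
}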
But cudaMemcpy copies size * 4 bytes starting from index zero of the data (B_device[0]). My question is: how can I copy data starting from another index of B_device, not from its beginning? Is that possible?
Can it be done by changing the pointer? I'd be very thankful for any advice.
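For illustration, here is a minimal sketch of the pointer-offset idea asked about above. A device pointer is an ordinary typed pointer on the host side, so `B_device + offset` refers to element `offset` and can be passed to `cudaMemcpy` or to a kernel. The kernel `scale`, the helper `chunk_count`, and the values of `total`, `start` and `len` are hypothetical stand-ins, not the real filter or data.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real filter kernel: doubles each sample.
__global__ void scale(const float *in, float *out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        out[i] = 2.0f * in[i];
}

// Number of chunks needed to cover `total` samples, rounding up.
int chunk_count(int total, int chunk)
{
    return (total + chunk - 1) / chunk;
}

int main()
{
    const int total = 100000;     // assumed sample count, for illustration only
    const int chunk = 128 * 256;  // 32768 samples per launch, as in the post
    const int threads = 256;

    float *host = (float *)malloc(total * sizeof(float));
    for (int i = 0; i < total; i++)
        host[i] = (float)i;

    float *A_device, *B_device;
    cudaMalloc((void **)&A_device, total * sizeof(float));
    cudaMalloc((void **)&B_device, total * sizeof(float));
    cudaMemcpy(A_device, host, total * sizeof(float), cudaMemcpyHostToDevice);

    // Each launch works on its own chunk by offsetting both pointers.
    for (int j = 0; j < chunk_count(total, chunk); j++) {
        int offset = j * chunk;
        int count = (total - offset < chunk) ? (total - offset) : chunk;
        int blocks = (count + threads - 1) / threads;
        scale<<<blocks, threads>>>(A_device + offset, B_device + offset, count);
    }

    // Copy `len` floats starting at element `start` of B_device, not at element 0.
    const int start = 1000, len = 500;
    cudaMemcpy(A_device, B_device + start, len * sizeof(float), cudaMemcpyDeviceToDevice);

    // Read back the first copied element to inspect it on the host.
    float first;
    cudaMemcpy(&first, A_device, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first copied value: %f\n", first);

    cudaFree(A_device);
    cudaFree(B_device);
    free(host);
    return 0;
}
```

With this approach the device-to-device copy inside the loop may not be needed at all, because each launch can read its chunk directly through the offset pointer. Note that the offset is counted in elements for pointer arithmetic, while the size argument of `cudaMemcpy` is counted in bytes.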