# Fast sequential access in GPU memory, please advise me a method!

Hi,

I have the following program:

```cuda
const int N = 1000000;

cublasAlloc(N, sizeof(float), &A);

for (Iter = 0; Iter < MaxIter; Iter++)
{
    Kernel1<<<g,t>>>(..., A, ...);
    Kernel2<<<g,t>>>(N, A);
    Kernel3<<<g,t>>>(..., A, ...);
}
```

Kernels 1 and 3 show very good performance (about 100 GFlop/s), and Kernel2 looks like the following:

```cuda
__global__ void Kernel2(int N, int *A)
{
    int i;
    for (i = 0; i < N - 1; i++)
        A[i + 1] += A[i];
}
```

So it needs access to N*sizeof(int) bytes of memory, performs only N operations, and is a completely non-parallel job…

For the total timing of the project it is enough if this kernel runs at the memory bandwidth of the GPU. In my example, if the kernel takes just 2*sizeof(int)*N/7e10 seconds, it will be perfect! (7e10, i.e. 70 GByte/s, is the bandwidth of the GPU's main memory.)

If I run this algorithm on just one thread, it is almost 100 times slower.

Sincerely

Ilghiz Ibraghimov

What you are trying to do is known as a prefix sum (or scan), and there is a parallel implementation.

Take a look at the CUDPP library; it has an implementation of scan.

Hi,

As I remember from the old Cray C90/T3D implementations of scan, it requires at least 3 accesses to every array element, but you are right, it is completely parallel. So the parallel implementation will provide a guaranteed 1/3 of the possible peak performance.

Actually, the main question for me was exactly fast sequential access to global memory. If instead of

`A[i+1] += A[i]`

I write

`A[i+1] = f(A[i+1], A[i])`

and the function f is not associative, I cannot use algorithms like scan.

PS: please excuse me for not pointing everything out in my first post!

Sincerely

Ilghiz

No matter what the speed will be, if you have no parallelism at all it is completely pointless to do it on the GPU, at least if it is more than a few elements. Either find a way to parallelize the algorithm or copy the data from the GPU, process it with the CPU and copy it back.

But back to your question: as I understand the documentation, unless you have a GTX 2xx card you will get at most 1/16th of the overall bandwidth this way, so you should at least use a whole half-warp (16 threads) to copy the data from global to shared memory and back again.

Hi,