Hi,

I have the following program:

```
const int N = 1000000;
cublasAlloc(N, sizeof(float), (void**)&A);
for (Iter = 0; Iter < MaxIter; Iter++)
{
    Kernel1<<<g, t>>>(..., A, ...);
    Kernel2<<<g, t>>>(N, A);
    Kernel3<<<g, t>>>(..., A, ...);
}
```

Kernels 1 and 3 show very good performance (about 100 GFlop/s), and Kernel2 looks like this:

```
__global__ void Kernel2(int N, int *A)
{
    // Sequential running sum: each element depends on the previous one.
    for (int i = 0; i < N - 1; i++)
        A[i + 1] += A[i];
}
```

So it needs to access N*sizeof(int) bytes of memory and performs only N operations, and it is a completely non-parallel job…

For the overall run time of the project, it would be enough for Kernel2 to be limited only by the bandwidth of the GPU's main memory. So, in my example, if the kernel ran in just 2*sizeof(int)*N/7e10 seconds, that would be perfect! (7e10 is 70 GByte/s, the bandwidth of the GPU's main memory.)

If I run this algorithm on just one thread, it is almost 100 times slower.

Please advise me how to access main memory from a single thread at peak speed!

Sincerely,

Ilghiz Ibraghimov