Fast sequential access in GPU memory, please advise me a method!


I have the following program:

const int N=1000000;

cublasAlloc(N, sizeof(float), &A);

for(Iter=0; Iter<MaxIter; Iter++)
{
   Kernel1<<<g,t>>>(...,A,...);
   Kernel2<<<g,t>>>(N, A);
   Kernel3<<<g,t>>>(...,A,...);
}

Kernels 1 and 3 show very good performance (about 100 GFlop/s), and Kernel2 looks like the following:

Kernel2(int N, int *A)
{
   int i;
   for(i=0; i<N-1; i++)
      A[i+1] += A[i];
}
So it needs access to N*sizeof(int) bytes of memory while performing only N operations, and it is a completely non-parallel job…

For the total timing of the project it is enough for this kernel's runtime to be limited only by the main-memory bandwidth of the GPU; so, in my example, if the kernel runs in just 2*sizeof(int)*N/7e10 seconds, it will be perfect! (7e10, i.e. 70 GByte/s, is the main-memory bandwidth of the GPU.)

If I run this algorithm on just one thread, it is almost 100 times slower.

Please advise me how to access main memory from one thread at peak speed!


Ilghiz Ibraghimov

What you are trying to do is known as a prefix sum (or scan), and there is a parallel implementation.

Take a look at the CUDPP library; it has an implementation of scan.


Thank you for your kind answer!

As I remember from the old Cray C90/T3D implementations of scan, it requires at least 3 accesses to every array element, but you are right, it is completely parallel. So the parallel implementation will provide a guaranteed 1/3 of the possible peak performance.

Actually, the main question for me was exactly fast sequential access to global memory. If instead of

A[i+1] += A[i];

I write

A[i+1] = f(A[i+1], A[i]);

and the function f is not associative, I cannot use algorithms like scan.

PS: please excuse me for not pointing everything out in my first topic!



No matter what the speed is, if you have no parallelism at all it is completely pointless to do it on the GPU, at least if it is more than a few elements. Either find a way to parallelize the algorithm, or copy the data from the GPU, process it with the CPU, and copy it back.

But back to your question: as I understand the documentation, unless you have a GTX 2xx card you will get at most 1/16th of the overall bandwidth this way, so you should at least use a whole half-warp (16 threads) to copy the data from global to shared memory and back again.


Thank you for your answer!

I do not need to parallelize it; the main performance and the parallelism lie outside this block. The only point of this question is how to do this with the full power of the hardware, i.e. the GPU's memory bandwidth. I think you can imagine how slow it would be to copy all the data from GPU to CPU, do the computation on the CPU, and copy it back :)

It seems we have arrived at the same solution. I tried running 16 threads on 1 multiprocessor just to fetch data from memory, store it locally in shared memory, and, after thread synchronization, do the computation. It is reasonably faster, but still not at the peak memory bandwidth.



Of course not; with only one warp you cannot hide the memory latency. You may still be able to hide it a bit by using tex1Dfetch, or by doing the memory loading in a second warp (loading the block you will compute on in the following step).

I.e. thread 0 computes the sum of block 1 while threads 32-47 load block 2 and store block 0. It's not going to be beautiful, nor particularly fast.