Calculating the sum of array parts whose size is a large prime number

Hi everyone;

I have a problem with my CUDA array-summation algorithm.

Suppose we have an array of 617 x 4 = 2468 elements (617 is a prime number), and I want to calculate the sums of the parts [0-616], [617-1233], [1234-1850] and [1851-2467].

A thread block has at most 512 threads, and the size of each part is not an exact multiple of any block size. So, how can I update the following CUDA code for this purpose?

[codebox]__global__ void SumArray(float *a, float *b, float *c, const unsigned int N)
{
    unsigned int tx = threadIdx.x;
    unsigned int bx = blockIdx.x;

    unsigned int aBegin = blockDim.x * bx;
    unsigned int aEnd   = N;
    unsigned int aStep  = 512;

    for (unsigned int k = 0; k < 4; ++k)
    {
        float total = 0;

        for (unsigned int i = aBegin; i < aEnd; i += aStep)
        {
            __shared__ float aS[512];
            aS[tx] = a[i + tx];
            __syncthreads();   // wait until the whole tile is loaded

            for (unsigned int j = 0; j < 512; ++j)
                total += aS[j];
            __syncthreads();   // wait before the tile is overwritten
        }

        c[k] = total;
    }
}[/codebox]
Many thanks…

There is a good reference in the NVIDIA SDK: the reduction sample.

Thanks for helping, but my main problem is not the summation of an array itself.
The reduction algorithm only helps me with the inner loop over the variable "j".

All array-summation algorithms, including reduction, use every element of the
shared-memory tile. Let me explain my problem.

I want to calculate the sum of the first part of the array, indices 0 to 616. The maximum
size of a block is 512 threads. Assume that we use 512 threads per block.
The first thread block is not a problem because the whole block lies inside that part of the array.
But the second block and the following ones are the problem. The first 105 (617 - 512 = 105)
elements of the second block are used for the calculation of the first part of the array, while the
others are used for the calculation of the second part, indices 617 to 1233.

My problem is that all the algorithms I have seen use the whole block for one calculation, but I have
to split the blocks between the calculations of the different array parts.

I am a newbie with CUDA, and maybe reduction can help me. If the problem can be solved with
the reduction algorithm, how?

Thanks for helping…

As far as I understand, your code tries to calculate the sum of each row?

  1. You tried to load data from global memory to shared memory, but you ran into the problem that you cannot cooperate between two blocks:
    assuming that the first block (index = 0) loads elements 0-511 and the second block (index = 1, tx = 0 to tx = 104) loads elements 512-616 from global memory to shared memory, you can calculate the sum of each easily (call them sum0 and sum1), but you cannot calculate rowSum = sum0 + sum1?
  2. You tried to use tx = 105 to tx = 511 to calculate the sum of the second row?

If my understanding is correct, I will give you 2 suggestions to solve this issue:

  1. You can use a texture to calculate it easily (take a look at cudaMallocPitch). Assuming that you have 512 threads per block, you need 2 blocks to calculate one row, so you need 8 blocks in total.
  2. 1 thread processes 4 consecutive elements of each row (to get high performance), so you need only 1 block of about 256 threads (some of them will do nothing) to load a whole row into shared memory (shared memory is limited to 16 KB per block) and calculate it. So you need only 4 blocks to do it.

I hope someone else will give you a good suggestion as well. Good luck.

Thanks Quoc Vinh;

I didn't know about the function cudaMallocPitch(); it looks like it will help me.

But the second method you suggested does not work for me. The number 617 was
only an example. Sometimes a row may have millions of elements, and I cannot fit
a whole row into shared memory.

Thanks for both…

I know of two approaches here.

  1. Work out the number of elements assigned to each thread: in your case, 1 element per thread (617 / 512 = 1 in integer maths). Then work out the number of unprocessed elements left at the end. If there are any, add 1 to the number of elements assigned to each thread, i.e. each thread will now do 2 elements. Before processing an element, check whether its index is inside your range [0-616]; if not, don't process it. This one's easy to code but probably horribly inefficient, as many threads will have no work to do, and all the logic instructions needed to implement it will also reduce speed.

  2. Alternatively, run 3 kernels. The first sums over the largest multiple of 512 elements that fits within your range. The second kernel sums up the remaining elements (105 in your case). Both of these kernels write their partial sums to memory, so each section of your array is represented by two partial sums. The final kernel then adds each pair, giving you the total of each section.

I think no. 2 will be the better one to use.