Calculating the sum of array parts that have a large prime number of elements

I know of two approaches here.

  1. Work out the number of elements assigned to each thread; in your case that is 1 element per thread (617 / 512 = 1 in integer maths). Then work out how many elements are left unprocessed at the end (617 - 512 = 105). If there are any, add 1 to the number of elements assigned to each thread, i.e. each thread will now do 2 elements. Before processing an element, check that its index is inside your range [0, 616]; if not, don’t process it. This one’s easy to code but probably inefficient: 512 threads at 2 elements each cover 1024 slots for only 617 elements, so many threads will have no work to do, and the extra bounds checks will also cost some speed.

  2. Alternatively, run 3 kernels. The first does the summation over the largest multiple of 512 elements that fits within your range (512 in your case). The second kernel call sums up the remaining elements (105 in your case). Both of these kernels write their partial sums to memory, so each section of your array is represented by two partial sums. The final kernel call then adds each pair together, giving you the total for each section.

I think no. 2 will be the better one to use.
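
To make the index arithmetic of both approaches concrete, here is a rough CPU-side sketch in plain Python. It is not GPU code: the loop over `t` only stands in for the thread index, the built-in `sum` stands in for whatever parallel reduction the real kernels would use, and the names `sum_padded`, `sum_split` and `THREADS` are made up for illustration. It also handles a single 617-element section, whereas your real code would repeat this for every section.

```python
THREADS = 512  # threads per block, as assumed in the answer above


def sum_padded(section, threads=THREADS):
    """Approach 1: give every thread the same number of slots, rounded up."""
    n = len(section)
    per_thread = n // threads          # 617 // 512 == 1
    if n % threads != 0:               # there are leftover elements at the end
        per_thread += 1                # so each thread now covers 2 slots
    partial = [0] * threads            # one partial sum per simulated thread
    idle = 0                           # threads whose whole range is out of bounds
    for t in range(threads):           # t stands in for the thread index
        start = t * per_thread
        if start >= n:
            idle += 1
        for i in range(start, start + per_thread):
            if i < n:                  # bounds check: skip indices past the end
                partial[t] += section[i]
    return sum(partial), per_thread, idle


def sum_split(section, threads=THREADS):
    """Approach 2: full-width pass, remainder pass, then a combine step."""
    n = len(section)
    full = (n // threads) * threads    # largest multiple of `threads` that fits
    first = sum(section[:full])        # "kernel 1": the multiple-of-512 part
    second = sum(section[full:])       # "kernel 2": the remaining elements
    total = first + second             # "kernel 3": add the pair of partial sums
    return total, full, n - full


if __name__ == "__main__":
    data = list(range(617))            # dummy section with 617 elements
    print(sum_padded(data))
    print(sum_split(data))
```

Both functions return the same total as a plain `sum(data)`. The extra values they return show the trade-off: for 617 elements, `sum_padded` ends up with 2 slots per thread and a large share of threads with nothing to do, while `sum_split` works on a 512-element part and a 105-element remainder with no bounds checks in the first pass.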