sequential sum within a kernel.

Hi All,

Is it possible to do a sequential sum within a kernel?
For example, I have the following type of kernel

int  i = blockIdx.x*blockDim.x+threadIdx.x;

		if(i<=n) r_d[i] = phi1d_d[i+(kvol*(na-1))];	

		for(j_d = 1; j_d <= n; j_d++) {

		if(i<=n) p_d[i]=r_d[i];

If I run this in device emulation mode I find that the sum is done for every kernel that I launch.
What I want is to do i=256,512,65K (or whatever my choice is) then sum over those completed threads from within the kernel. Is that possible? am I making any sense???



look at the reduction example on how to sum.
You are missing a __syncthreads() by the way.

Thanks Denis,

I’ve got to the point now where I have…

sum[tid] = betan_d[i];


for(unsigned int s=blockDim.x/2; s>0; s>>=1) {

if (tid < s) {

		sum[tid] += sum[tid + s];




// write result for this block to global mem

if (tid == 0) {

	betanTot_d = sum[0];


Both sum and betanTot_d are defined outside main() using device (so I guess static at runtime).

My problem begins when I have more than one block. The last block overwrites the previous one.

Does anyone know how I can sum these blocks on the device???

I really don’t want to take them off and do it on the cpu.



Have the first pass write out one sum per block and run the kernel again reading this array from device memory. After enough passes, you will end up running a 1-block kernel that computes a single value for the sum.

Thanks for the reply.

Unfortunately I only want to run the kernel once. That is the reason for me wanting to do the sum on the device. I’m trying to avoid the latency of memcpy and launching kernels.

The sum value is used afterwards to multiply some vectors, again on the device.

Is there a way to do that? sum the results of different blocks on the device and make that result visible to all blocks???



once you have written a single value from all blocks, you can sum those up using the reduction example. You do not have to copy back & forth, just call a kernel. There is as far as I know no atomic add function for floating point, so there is no way to do what you want.

I have a similar question, is it possible to sum across one dimension of an array? I want to take a 2d array (512x64) and sum across one dimension and endup with a 1-d (512) array. I am new to CUDA and the reduction examples aren’t very straight forward to me. Can someone help?

global void sum_joh(float2 *input, float2 *output){
extern shared float2 subdata;

unsigned int tid=threadIdx.x;
unsigned int i=blockIdx.x*blockDim.x+threadIdx.x;



//do the reduction in shared mem
for(unsigned int s=blockDim.x/2;s>0;s>>1){

//write result for this block to global mem
if(tid==0) {



Sorry, the only way is to make at least 2 kernel calls.

There is no latency in memcpy becuase there is no mempcy call, the 2nd kernel call can read the memory written from the first call directly. And calling the 2nd pass kernel will only cost you ~20 microseconds assuming the final amount of sums to be done is very small.

Ooohh :(

Can I call a kernel from my original kernel?

Or is there a way to call a function on the host to do the final sum from within my current kernel???

I really need it to be as quick as possible because the current kernel is called 1000 times from within a for loop, so any additional calls costing 20microseconds hit me hard <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

I wonder if I could just replace the final sum value in my kernel with

sum[tid] = betan_d[i];

for(unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s) {
sum[tid] += sum[tid + s];

// write result for this block to global mem
if (tid == 0) {
betanTot_d[blockIdx.x] = sum[0];

then instead of having a total value I just use betanTot_d[0]+betanTot_d[1]+betanTot_d[2]
Obviously I need to know how many blocks I’ll have to begin with, but that is possible because I know the number of threads per block and the total number of threads I need to run???

Thanks for all the help!


P.S. johsimr, I have a feeling that you might be able to index your 2d array as a id, then call the kernel with the relivent inidex and set your block size to 512??? just guessing though, I’m no expert!
I’d put your question in a new thread, that way more people might look at it and be able to help.

what about using atomics in global memory for summing up block results?


Thanks Christian,

I’ve had a look at the atomics and it seems the atomic add will only work with int results or a int and a float.

I need to add two or more floats :(

I think I’ll try writing it into the code at compile time, that’s about all I can think of doing <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

Also I believe our tesla doesn’t support atomics anyhow :wacko:

Thanks again,


atomics are indeed not supported in 1.0 hardware. Also atomics are slow when every thread (and I believe even every block) write an output value. atomics are more useful when some random thread adds a value to a global value and it also has to be very sparse (not a lot of threads writing)

1000 times 20 microseconds is 20 milliseconds. If that is really significant, CUDA might just not be the answer (until you start performing this all on more data)

on the other question: yes it is possible to add only in 1 direction and end up with a vector when the input is a matrix. It is however needed to understand the reduction example. Actually for me it was the first parallel algorithm that I understood, and I use some form of it in about 50% of all my kernels. So google for parallel reduction and CUDA, you can find a pdf by NVIDIA, that takes you through all the steps from the most simple form, to the latest best performing version.

Nice thread!

I wonder if I can steal some suggestions as well. :-)

I’m trying to implement some very basic element-to-element array functions on the GPU, which is a G92 (8800 GTS 512MB).

Something like:

for (i=0; i < N_ELEMENTS; i++)

    S[i] = A[i] + B[i];

and so on with other operators (arrays are of type “cufftComplex” in my case).

I’m facing extremely heavy performance hits because of uncoalesced memory access in my (admittely very naive) implementation: to the point where the CPU is much faster.

I cared about using dimensions which are multiple of 128, I selected optimal block size and grid size (after test runs) but nonetheless, my GPU functions are very very slow (2 ms for a 1024x1024-elements S=A+B sum! That’s 0.5 GFlop/s!) .

Problem is, I’m new to parallel programming so it’s not that easy for me to follow the “reduction” example, which was written to address a more complex problem than my basic array functions.

I’m studying the “matrix multiply” example, but again, being a matrix-level operation (select a submatrix, load it into shared memory, accumulate the cross product, write it back to device memory and so) it’s quite more complex than a simple element-level sum, so it’s not so straightforward for me to adapt the algorithm for my needs.

Any simple suggestion?

Thanks a lot!


Post your kernel code and people might be able to help.


This is my simple kernel (I’m going by memory at the moment, please forgive little errors):

GPUArraySum (float *A, float *B, float *Res, int ArraySize, int SegmentLen)


  Â int Start, End, i;

 Â Start = (blockIdx.x*blockDim.x+threadIdx.x) * SegmentLen;

  End  = Start + SegmentLen;

  Â if (End < ArraySize)

  Â {

  Â for (i = Start; i < End; i++)

  Â Res[i] = A[i] + B[i];



SegmentLen is computed by the caller as

SegmentLen = ArraySize / (BLOCKSIZE * GRIDSIZE);

I use



dim3 GridDim (GRIDSIZE,1,1);

dim3 BlockDim (BLOCKSIZE,1,1);

In my tests, ArraySize is either 512512, 10241024 or 2048*2048.

Thanks! :)


Your memory reads & writes are not coalesced. If you take away your for loop, and just make your grid size depend on your array size (256 threads per block is a very good size) you will have coalesced memory accesses and be much faster.


Not everyone uses CUDA for the same things. If you want to have an interactive visualization of your CUDA program, then ideally you want to have at least 20 FPS. That’s only 50ms to do everything including rendering. Losing 20ms of that to overhead would indeed be “significant”.

Oh yeah, I forgot about atomics: but they are only good if you are summing integers. As long as you aren’t summing floats, you can avoid the 2nd kernel call by using atomics. I’ve never played with atomics much, though, so I don’t know if the performance penalty from using them will outweigh the ~20 microsecond kernel call or not.

Assuming the 1st pass of the kernel reduction is of a reasonable size and takes, ~milliseconds already, paying the extra 20us for the 2nd pass is nothing. If you are making 1000’s of kernel calls each only being ~20-100us, then you may find it faster just to keep your algorithm on the CPU. Ahmdal’s law just bites sometimes, and on GPUs you can run into Ahmahl’s law when your algorithm just isn’t “parallel enough”.

This is very interesting, but please, could you give me a few more hints?

3 questions:

  1. Why my reads&writes are uncoalesced? I know they are, it’s just that I can’t figure out why in this case (a reference to the CUDA programming guide is OK, I did read it but may have missed something)

  2. In my case ArraySize goes from 512512=262’144 to 20482048=4’194’304.

Since BLOCKSIZE can’t be more than 512, can I really set GRIDSIZE up to 8192?

Isn’t it too much?

With 16 SMs (G92 GTS), it would mean 512 blocks per SM (I was under the impression that we can’t have more than 24 blocks per SM?).

  1. Is there a way to exploit those 16KB of shared memory per SM, in my case? I don’t think so since it’s a per-element operation, but…

Thank you very much.


If you read good, that would suggest he would need 1000 iterations before doing a screenupdate. That would mean that the total time available for one iteration would be 50 microseconds.

So it might just be impossible nowadays (as I don’t think you will be able to make this work on CPU either. Maybe in Cg)

On the other hand, if he would want to update the screen after each iteration, he would only have 20 microseconds overhead of those 50 ms…