Hello,

I have a `__global__` function that sums up the squares of the values in an array (each value is multiplied by itself before it is added):

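```
__global__ void function(const float *array, float *result)
{
    float sum = 0.0f;
    // thread t sums the squares of row t (elements 1..255; index 0 is skipped)
    for (int j = 1; j < 256; j++) {
        float v = array[j + 256 * threadIdx.x];
        sum += v * v;
    }
    atomicAdd(result, sum); // one global update per thread; a plain += here would race
}
```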

I call the function like this: `function<<<1, 256>>>(arraypointer, resultpointer);`
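For context, a minimal sketch of the host code around that launch, assuming a 256 × 256 `float` array (256 rows of 256 values, one row per thread). The helper `runFunction` and the name `h_array` are hypothetical. The kernel repeats the loop from the question but uses `atomicAdd`, because 256 threads doing `result += sum` on the same global address would race; since it accumulates into `*result`, that value has to be zeroed before the launch.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// The kernel from the question; atomicAdd makes the 256 updates of *result safe.
__global__ void function(const float *array, float *result)
{
    float sum = 0.0f;
    for (int j = 1; j < 256; j++) {          // index 0 of each row is skipped, as in the question
        float v = array[j + 256 * threadIdx.x];
        sum += v * v;
    }
    atomicAdd(result, sum);                  // still 256 global-memory updates
}

// Hypothetical helper: h_array holds 256 rows of 256 floats (one row per thread).
float runFunction(const float *h_array)
{
    const size_t n = 256 * 256;
    float *arraypointer = nullptr, *resultpointer = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&arraypointer, n * sizeof(float));
    cudaMalloc(&resultpointer, sizeof(float));
    cudaMemcpy(arraypointer, h_array, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(resultpointer, 0, sizeof(float));       // kernel accumulates, so start from 0.0f

    function<<<1, 256>>>(arraypointer, resultpointer); // 1 block, 256 threads

    // Blocking copy: waits for the kernel and can also report earlier launch errors
    cudaError_t err = cudaMemcpy(&h_result, resultpointer, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);                        // debug-only check
    cudaFree(arraypointer);
    cudaFree(resultpointer);
    return h_result;
}
```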

I want to run this code on only one multiprocessor, so I launch the kernel with a single thread block.

The variable `result` lives in global memory. Is there another way to solve this so that global memory is written only once instead of 256 times (since global memory is slow)?
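One common way to get down to a single global write, sketched here under the assumption of exactly one block of 256 threads: each thread stores its partial sum in a `__shared__` array, the block adds those 256 values pairwise in a tree (128, 64, ..., 1 active threads, with `__syncthreads()` between steps), and only thread 0 writes the final value to `*result`. The names `sumSquaresShared`, `THREADS` and the helper `runSumSquaresShared` are illustrative, not from the question.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define THREADS 256  // one block of 256 threads, as in the question (must be a power of two)

__global__ void sumSquaresShared(const float *array, float *result)
{
    __shared__ float partial[THREADS];       // one slot per thread, in on-chip shared memory

    float sum = 0.0f;
    for (int j = 1; j < 256; j++) {          // same per-thread loop as in the question
        float v = array[j + 256 * threadIdx.x];
        sum += v * v;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();                         // all partial sums stored before anyone reads them

    // Tree reduction: each step halves the number of active threads
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                     // outside the if, so every thread reaches it
    }

    if (threadIdx.x == 0)
        *result = partial[0];                // the single global-memory write
}

// Hypothetical helper: uploads 256 x 256 floats, launches one block, returns the total.
float runSumSquaresShared(const float *h_array)
{
    const size_t n = 256 * 256;
    float *d_array = nullptr, *d_result = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&d_array, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_array, h_array, n * sizeof(float), cudaMemcpyHostToDevice);

    sumSquaresShared<<<1, THREADS>>>(d_array, d_result);

    cudaError_t err = cudaMemcpy(&h_result, d_result, sizeof(float),
                                 cudaMemcpyDeviceToHost);  // waits for the kernel
    assert(err == cudaSuccess);                            // debug-only check
    cudaFree(d_array);
    cudaFree(d_result);
    return h_result;
}
```

Because `*result` is overwritten rather than accumulated, it does not need to be zeroed beforehand, and the additions happen in a fixed order, so the total is reproducible from run to run.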

Can I do something like this?

```
float sum;
initialvalue sum = 0; // pseudocode: only the very first thread to get here sets sum to 0, so later threads can build on the results of earlier ones
for (int j = 1; j < 256; j++) {
    sum += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];
}
if (all threads are through) { result = sum; } // pseudocode: a single write to global memory
```

-> For this to work, I need to detect the very first run of the function and the very last one.

Is there any way to do this?
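Threads in a block do not run in a defined order, so rather than detecting the first and last thread, the pseudocode can be sketched with a fixed thread (here `threadIdx.x == 0`) plus block-wide barriers: thread 0 zeroes a `__shared__` accumulator, `__syncthreads()` makes everyone wait for that, each thread adds its partial sum with a shared-memory `atomicAdd`, a second `__syncthreads()` plays the role of "all threads are through", and thread 0 does the one global write. This assumes a single block (`__syncthreads()` only synchronizes the threads of one block) and a GPU that supports `atomicAdd` on `float` in shared memory (compute capability 2.0 or later); `sumSquaresOnce` and `runSumSquaresOnce` are hypothetical names.

```cuda
#include <cassert>
#include <cuda_runtime.h>

__global__ void sumSquaresOnce(const float *array, float *result)
{
    __shared__ float blockSum;           // one accumulator for the whole block

    if (threadIdx.x == 0)
        blockSum = 0.0f;                 // "initial value": set by exactly one thread
    __syncthreads();                     // nobody adds before it has been zeroed

    float sum = 0.0f;                    // per-thread partial sum in a register
    for (int j = 1; j < 256; j++) {
        float v = array[j + 256 * threadIdx.x];
        sum += v * v;
    }
    atomicAdd(&blockSum, sum);           // shared-memory atomic, no global traffic
    __syncthreads();                     // "all threads are through"

    if (threadIdx.x == 0)
        *result = blockSum;              // single global-memory write
}

// Hypothetical helper: uploads 256 x 256 floats, launches one block, returns the total.
float runSumSquaresOnce(const float *h_array)
{
    const size_t n = 256 * 256;
    float *d_array = nullptr, *d_result = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&d_array, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_array, h_array, n * sizeof(float), cudaMemcpyHostToDevice);

    sumSquaresOnce<<<1, 256>>>(d_array, d_result);

    cudaError_t err = cudaMemcpy(&h_result, d_result, sizeof(float),
                                 cudaMemcpyDeviceToHost);  // waits for the kernel
    assert(err == cudaSuccess);                            // debug-only check
    cudaFree(d_array);
    cudaFree(d_result);
    return h_result;
}
```

Floating-point atomics complete in an unspecified order, so with non-integer data the last bits of the total can vary between runs; a fixed-order tree reduction avoids that. The 256 atomics on one shared address also serialize, but they stay in on-chip shared memory instead of going to global memory.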

Thanks a lot!

burnie