Variable Initialisation on Device Routine


I have a global function (kernel) that sums up all values after they have been multiplied.

 float sum = 0;

   for (int j = 1; j < 256; j++){

    sum += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];

   }

  result += sum;

I call the function like this: function<<< 1, 256 >>>(arraypointer, resultpointer);

I want to run this code on only one multiprocessor, so I call the function with only one thread block.

The variable “result” is in global memory. Is there another way to solve this problem so that I write to global memory only once, instead of 256 times? (Global memory is slow.)

Can I do something like this?

 float sum;

  initialvalue sum = 0; // only the very first thread that goes through the routine sets sum to 0, so the threads after it can work with the results of the threads before

   for (int j = 1; j < 256; j++){

    sum += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];

   }

  if (all threads are through) {result = sum;}

To work like this, I need to detect the very first run of the function and the last run.

Is there any possibility to do this?

Thanks a lot!


Just a quick guess: don’t you need to synchronize access to global memory, since all threads access the same variable?
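One possible way to synchronize those accesses, as a minimal sketch under assumptions: each thread adds its partial sum with atomicAdd. The kernel name sumSquaresAtomic, the host helper runSumSquaresAtomic, and the loop starting at 0 (the loop in the question starts at 1) are assumptions made for illustration. The float overload of atomicAdd requires compute capability 2.0 or higher. This removes the race on “result”, but it still writes to global memory once per thread.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // threads in the single block, as in the question
#define PER_THREAD 256   // elements handled by each thread (assumption)

// Hypothetical kernel: every thread sums the squares of its own chunk,
// then adds that partial sum to *result atomically.
__global__ void sumSquaresAtomic(const float *array, float *result)
{
    float sum = 0.0f;
    for (int j = 0; j < PER_THREAD; j++) {   // starts at 0 (assumption)
        float v = array[j + PER_THREAD * threadIdx.x];
        sum += v * v;
    }
    // Serialized read-modify-write on global memory: safe, but 256 writes.
    atomicAdd(result, sum);
}

// Hypothetical host helper: copies the input, runs one block, returns the sum.
float runSumSquaresAtomic(const float *h_array)
{
    const size_t n = (size_t)BLOCK_SIZE * PER_THREAD;
    float *d_array = nullptr, *d_result = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&d_array, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_array, h_array, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));   // result must start at 0

    sumSquaresAtomic<<<1, BLOCK_SIZE>>>(d_array, d_result);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    cudaFree(d_result);
    return h_result;
}
```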

A solution would be to write the values to shared memory and do a reduction there. After that, the thread with the highest value writes the result out.
So every thread reads shmem[i] for i up to iDim-1. Since all threads read the same address at the same time, the value is broadcast and there are no bank conflicts. The thread with the highest value gets to write the result out.


You are running the loop 256 times inside the kernel. Every thread within the block (1 block, 256 threads) is going to execute that loop.

Is that what you intended to do?

Or is 256 the total payload of the array?

Also, your sum variable is local to each thread; it is not visible to the other threads. Here is a small change to your code:

__shared__ float sum[BLOCK_SIZE]; // BLOCK_SIZE is 256 in this case; shared memory is visible to all threads of the block

sum[threadIdx.x] = 0;

 for (int j = 1; j < 256; j++)
 {
   sum[threadIdx.x] += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];
 }

  __syncthreads(); // wait until all threads are through

// Just for the sake of simplicity this does not do a reduction; the master thread sums everything up.
// That performs badly, so use a reduction here.

if (threadIdx.x == 0)
{
 float fTemp = 0;

 for (int j = 0; j < BLOCK_SIZE; j++)
 {
   fTemp += sum[j];
 }

 result = fTemp;
}
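The reduction mentioned in the comment could look roughly like the following sketch of a shared-memory tree reduction. It is one possible version, not necessarily what the poster had in mind. The names sumOfSquares and runSumOfSquares, the pointer parameter for the result, and the loop starting at 0 instead of 1 are assumptions. Each thread keeps its running sum in a register, and only thread 0 writes to global memory, once.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // threads in the single block, as in the thread
#define PER_THREAD 256   // elements handled by each thread (assumption)

// Hypothetical kernel: per-thread sum of squares, then a tree reduction.
__global__ void sumOfSquares(const float *array, float *result)
{
    __shared__ float partial[BLOCK_SIZE];

    // Accumulate in a register; touch shared memory only once per thread.
    float acc = 0.0f;
    for (int j = 0; j < PER_THREAD; j++) {   // starts at 0 (assumption)
        float v = array[j + PER_THREAD * threadIdx.x];
        acc += v * v;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction: the number of active threads halves every step,
    // so 256 values are combined in 8 steps instead of 256 serial adds.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();   // all threads reach this, active or not
    }

    // Single write to global memory.
    if (threadIdx.x == 0)
        *result = partial[0];
}

// Hypothetical host helper: copies the input, runs one block, returns the sum.
float runSumOfSquares(const float *h_array)
{
    const size_t n = (size_t)BLOCK_SIZE * PER_THREAD;
    float *d_array = nullptr, *d_result = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&d_array, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_array, h_array, n * sizeof(float), cudaMemcpyHostToDevice);

    sumOfSquares<<<1, BLOCK_SIZE>>>(d_array, d_result);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    cudaFree(d_result);
    return h_result;
}
```

The __syncthreads() inside the loop sits outside the if, so every thread of the block reaches it in every step.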


Thanks a lot for your help! Now I see a bit more clearly how to work with these things! With your solution, all 256 threads work together to compute the products. But after the __syncthreads() there is only one thread that adds all the products together.

Am I right that this blocks the whole multiprocessor and takes a “long” time, since only one stream processor of the multiprocessor is working? Is there no other way to sum up after the multiplication? Would it be possible to let 2*8 threads work in parallel to match the 8 stream processors, save their results to a new array sharedmem[256/(8*2)], and after that sum up the resulting 16 values to the final result?

Would this be faster than letting only one thread sum up the 256 values at the end? Are there any other ideas for speeding up such structures?
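The staged summation described above could be sketched as follows, assuming the 256 products are already available as an array. The names twoStageSum, runTwoStageSum, stage1 and stage2 are made up for this illustration. Sixteen threads each add 16 values, then one thread adds the 16 partial sums, so the longest serial chain is 16 + 16 additions instead of 256. Whether that is actually faster on a given GPU is something to measure rather than assume.

```cuda
#include <cuda_runtime.h>

#define N_VALUES 256   // number of products to add up
#define N_GROUPS 16    // 2*8 threads doing the first stage

// Hypothetical kernel: launch as twoStageSum<<<1, N_VALUES>>>(d_values, d_result).
__global__ void twoStageSum(const float *values, float *result)
{
    __shared__ float stage1[N_VALUES];
    __shared__ float stage2[N_GROUPS];

    // Every thread brings one value into shared memory.
    stage1[threadIdx.x] = values[threadIdx.x];
    __syncthreads();

    // Stage 1: 16 threads each add 16 values. Thread t reads t, t+16, t+32, ...
    // so neighbouring threads read neighbouring addresses.
    if (threadIdx.x < N_GROUPS) {
        float acc = 0.0f;
        for (int k = 0; k < N_VALUES / N_GROUPS; k++)
            acc += stage1[threadIdx.x + N_GROUPS * k];
        stage2[threadIdx.x] = acc;
    }
    __syncthreads();

    // Stage 2: one thread adds the 16 partial sums and writes once.
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int g = 0; g < N_GROUPS; g++)
            total += stage2[g];
        *result = total;
    }
}

// Hypothetical host helper: copies 256 values, runs one block, returns the sum.
float runTwoStageSum(const float *h_values)
{
    float *d_values = nullptr, *d_result = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&d_values, N_VALUES * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_values, h_values, N_VALUES * sizeof(float), cudaMemcpyHostToDevice);

    twoStageSum<<<1, N_VALUES>>>(d_values, d_result);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_result);
    return h_result;
}
```

The interleaved indexing in stage 1 is a deliberate choice: letting thread t read the contiguous range 16*t to 16*t+15 would make the threads hit the same shared-memory banks and cause bank conflicts.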

I am also thinking about the use of global memory. Would there be a speedup if I used textures instead of global memory to provide the data from the CPU to the GPU, even if I read each cell only once?

Greets burnie

Can anyone help me on this?

Thanks in advance!