Variable Initialisation on Device Routine

burnie · May 19, 2008, 4:31pm

Hello,

I have a global function that sums up all values (after they have been multiplied).

 float sum = 0;
 for (int j = 1; j < 256; j++) {
     sum += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];
 }
 result += sum;

I call the function like this: function<<< 1, 256 >>>(arraypointer, resultpointer)

I want to run this code on only one multiprocessor, therefore I call the function with only one thread block.

The variable “result” is in the global memory. Is it possible to solve this problem in another way to only write once to the global memory, instead of 256 times? (because global memory is slow).

Can I do something like this?

 float sum;
 initialvalue sum = 0; // only the very first thread that goes through the routine sets sum to 0, so the threads after it can work with the results of the threads before
 for (int j = 1; j < 256; j++) {
     sum += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];
 }
 if (all threads are through) {result = sum;}

→ For this to work, I need to detect the very first run of the function and the last run.

Is there any possibility to do this?

Thanks a lot!

burnie

JHHPC · May 19, 2008, 4:45pm

Just a quick guess: don't you need to synchronize access to the global memory, since all threads access the same variable?

A solution would be to write the values to shared memory and do a reduction there. After that, the thread with the highest index writes the result out.
So every thread reads shmem[i] up to iDim-1. As all threads read the same value, there are no conflicts. The thread with the highest index gets to write the result out.
Johannes
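
To make the synchronization concern concrete: `result += sum` executed by 256 threads is an unsynchronized read-modify-write, so updates can be lost. One possible fix, sketched below, is an atomic add. This is not what was posted in the thread; it assumes a GPU with compute capability 2.0 or higher (needed for `atomicAdd` on `float`), assumes `result` is a pointer to a zero-initialized float, and starts the loop at j = 0 instead of 1. The names `sumSquaresAtomic` and `runSumSquaresAtomic` are made up for illustration. Note that it still performs 256 global updates, so it only fixes correctness, not the traffic.

```cuda
#include <cuda_runtime.h>

#define CHUNK 256  // elements handled by each thread, as in the posts above

// Each thread sums the squares of its own chunk in a register, then adds the
// partial sum to *result atomically so that no concurrent update is lost.
__global__ void sumSquaresAtomic(const float *array, float *result)
{
    float sum = 0.0f;
    for (int j = 0; j < CHUNK; j++) {      // assumption: start at 0, not 1
        float v = array[j + CHUNK * threadIdx.x];
        sum += v * v;
    }
    atomicAdd(result, sum);                // float version needs CC 2.0+
}

// Hypothetical host helper: fills the input with ones, launches one block of
// 256 threads, and returns the value the kernel left in global memory.
float runSumSquaresAtomic()
{
    const int threads = 256;
    const int n = threads * CHUNK;
    float *hostArray = new float[n];
    for (int i = 0; i < n; i++) hostArray[i] = 1.0f;

    float *devArray = nullptr, *devResult = nullptr;
    cudaMalloc(&devArray, n * sizeof(float));
    cudaMalloc(&devResult, sizeof(float));
    cudaMemcpy(devArray, hostArray, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(devResult, 0, sizeof(float));   // result must start at zero

    sumSquaresAtomic<<<1, threads>>>(devArray, devResult);

    float total = 0.0f;
    cudaMemcpy(&total, devResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devArray);
    cudaFree(devResult);
    delete[] hostArray;
    return total;
}
```

Calling `runSumSquaresAtomic()` from a host `main` returns the accumulated sum. The shared-memory approach suggested above avoids the repeated global updates entirely.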

Neeraj · May 19, 2008, 5:18pm

Burnie,

You are running the loop 256 times inside a kernel. Every thread within the block (1B,256T) is going to execute that loop.

Is that what you intended to do?

Or is 256 the total payload of the array?

Also, your sum variable is local to each thread; it is not visible to the other threads. Here is a small change to your code:

__shared__ float sum[BLOCK_SIZE]; // BLOCK_SIZE is 256 here; shared memory is visible to all threads of the block

sum[threadIdx.x] = 0;
for (int j = 1; j < 256; j++)
{
    sum[threadIdx.x] += array[j + 256*threadIdx.x] * array[j + 256*threadIdx.x];
}

__syncthreads(); // plays the role of "if (all threads are through)"

// For simplicity no reduction is done here: the master thread sums everything up.
// This performs badly; a parallel reduction should be used instead.
if (threadIdx.x == 0)
{
    float fTemp = 0;
    for (int j = 0; j < BLOCK_SIZE; j++)
    {
        fTemp += sum[j];
    }
    result = fTemp; // single write to global memory
}
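
The "use reduction here" remark could look roughly like the following. This is only a sketch of a standard shared-memory tree reduction, not code from the thread: it assumes `BLOCK_SIZE` is a power of two, that `result` is a pointer to a single float, and that the loop covers the whole chunk starting at j = 0. `sumSquaresReduce` and `runSumSquaresReduce` are hypothetical names.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // threads per block; must be a power of two here
#define CHUNK 256       // elements handled by each thread

__global__ void sumSquaresReduce(const float *array, float *result)
{
    __shared__ float partial[BLOCK_SIZE];

    // Accumulate in a register first, then publish once to shared memory.
    float sum = 0.0f;
    for (int j = 0; j < CHUNK; j++) {      // assumption: start at 0, not 1
        float v = array[j + CHUNK * threadIdx.x];
        sum += v * v;
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction: the number of active threads halves in every step,
    // so 256 values are combined in log2(256) = 8 steps.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();                   // reached by all threads of the block
    }

    if (threadIdx.x == 0)
        *result = partial[0];              // the only write to global memory
}

// Hypothetical host helper: input values cycle through 0, 1, 2, 3.
float runSumSquaresReduce()
{
    const int n = BLOCK_SIZE * CHUNK;
    float *hostArray = new float[n];
    for (int i = 0; i < n; i++) hostArray[i] = (float)(i % 4);

    float *devArray = nullptr, *devResult = nullptr;
    cudaMalloc(&devArray, n * sizeof(float));
    cudaMalloc(&devResult, sizeof(float));
    cudaMemcpy(devArray, hostArray, n * sizeof(float), cudaMemcpyHostToDevice);

    sumSquaresReduce<<<1, BLOCK_SIZE>>>(devArray, devResult);

    float total = 0.0f;
    cudaMemcpy(&total, devResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devArray);
    cudaFree(devResult);
    delete[] hostArray;
    return total;
}
```

Because the kernel overwrites `*result` instead of accumulating into it, the output does not need to be zeroed before the launch.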

burnie · May 21, 2008, 6:45pm

Thanks a lot for your help! Now I see a bit more clearly how to work with these things. With your solution, all 256 threads work together to compute the products. But after the __syncthreads() there is only one thread that adds all the products together.

Am I right that this blocks the whole multiprocessor and takes a “long” time, since only one stream processor of the multiprocessor is working? Is there no other way to sum up after the multiplication? Would it be possible to let 2*8 threads work in parallel to match the 8 stream processors, save their results to a new array sharedmem[256/(8*2)], and after this sum up the resulting 16 values to the final result?

Would this be faster than letting a single thread sum up the 256 values at the end? Are there any other ideas for speeding up such structures?

I am also thinking about the use of global memory. Would there be a speedup if I used textures instead of global memory to provide the data from the CPU to the GPU, even when I read each cell only once?

Greets burnie
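
For illustration, the two-stage idea described in this post might be written as below. This is a sketch under assumptions, not a measured recommendation: `result` is taken to be a pointer to one float, the loop starts at j = 0, and the names `sumSquaresTwoStage`, `runSumSquaresTwoStage`, `GROUPS` and `GROUP_SIZE` are invented here. In terms of dependent additions, a single thread needs about 256, this scheme about 16 + 16 = 32, and a halving tree reduction log2(256) = 8 steps; whether that translates into a real speedup has to be timed on the target hardware.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256
#define CHUNK 256
#define GROUPS 16                          // 2*8 threads take part in stage one
#define GROUP_SIZE (BLOCK_SIZE / GROUPS)   // 16 values per stage-one thread

__global__ void sumSquaresTwoStage(const float *array, float *result)
{
    __shared__ float products[BLOCK_SIZE]; // one partial sum per thread
    __shared__ float stage[GROUPS];        // sharedmem[256/(8*2)] in the post

    float sum = 0.0f;
    for (int j = 0; j < CHUNK; j++) {      // assumption: start at 0, not 1
        float v = array[j + CHUNK * threadIdx.x];
        sum += v * v;
    }
    products[threadIdx.x] = sum;
    __syncthreads();

    // Stage one: 16 threads each add up 16 consecutive partial sums.
    if (threadIdx.x < GROUPS) {
        float s = 0.0f;
        for (int k = 0; k < GROUP_SIZE; k++)
            s += products[threadIdx.x * GROUP_SIZE + k];
        stage[threadIdx.x] = s;
    }
    __syncthreads();                       // outside the if, so all threads reach it

    // Stage two: one thread adds the 16 intermediate values and writes once.
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int k = 0; k < GROUPS; k++)
            total += stage[k];
        *result = total;
    }
}

// Hypothetical host helper: input values cycle through 0, 1, 2, 3.
float runSumSquaresTwoStage()
{
    const int n = BLOCK_SIZE * CHUNK;
    float *hostArray = new float[n];
    for (int i = 0; i < n; i++) hostArray[i] = (float)(i % 4);

    float *devArray = nullptr, *devResult = nullptr;
    cudaMalloc(&devArray, n * sizeof(float));
    cudaMalloc(&devResult, sizeof(float));
    cudaMemcpy(devArray, hostArray, n * sizeof(float), cudaMemcpyHostToDevice);

    sumSquaresTwoStage<<<1, BLOCK_SIZE>>>(devArray, devResult);

    float total = 0.0f;
    cudaMemcpy(&total, devResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devArray);
    cudaFree(devResult);
    delete[] hostArray;
    return total;
}
```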

burnie · May 24, 2008, 9:55am

Can anyone help me on this?

Thanks in advance!

burnie
