I know this kind of goes against the purpose of GPU devices, but is it possible, inside each thread block, to have a section of code executed by only one thread?
Example:
[codebox]///// main /////
kernel<<<16,32>>>(params);

///// kernel /////
__global__ void kernel(params)
{
    // do this code section with only one thread
    ...
    ...
    // now every thread executes the following code
    ...
    ...
}[/codebox]
Is this in any way possible with some kind of mutex/lock system? The reason I ask is that I have to set the row pointers of my double **ptr to the correct locations inside the contiguous block of memory I allocated for double *array. When I copy the matrix from the host to the GPU, the memory locations for double **ptr change, so I have to reset them once I'm inside the kernel. However, every thread in the kernel runs the same code, even though the pointers really only need to be set once.
I guess in my head I was thinking that __syncthreads() would generate too much latency, but I suppose it would be mostly negligible. So simple, haha. Thanks!
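In case it helps anyone searching later, the pattern ends up looking roughly like this (a sketch only; rows and cols stand in for my real dimensions, and I haven't benchmarked it):
[codebox]__global__ void kernel(double **ptr, double *array, int rows, int cols)
{
    // One thread per block rebuilds the row pointers so they point
    // into the device copy of the contiguous array.
    if (threadIdx.x == 0) {
        for (int r = 0; r < rows; ++r)
            ptr[r] = array + r * cols;
    }
    __syncthreads(); // everyone waits here until the pointers are valid

    // ... every thread executes the rest of the kernel ...
}[/codebox]
Since __syncthreads() only synchronizes within a block, every block repeats the fixup; they all write identical values, so that should be harmless in practice, but a tiny one-block setup kernel launched beforehand would also work.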
Yes… your approach works against the purpose of using GPUs. With this pattern you introduce an if/else branch, and that will definitely slow things down. Another issue is that using a double pointer leads to non-coalesced memory access in the kernel, which degrades performance. You will also end up doing multiple copies to device memory, since cudaMemcpy cannot transfer a double-pointer structure in one call. Overall this method will not perform well. You can get somewhat better performance by binding the input data in device memory to a texture, which avoids the coalescing issue.
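For reference, texture binding with the old texture-reference API looks roughly like this (a sketch, not tested; doubles have to be fetched as int2 and reassembled, and d_array/n are placeholder names):
[codebox]// Legacy texture-reference API (CUDA 2.x era), sketch only.
texture<int2, 1, cudaReadModeElementType> tex;

__global__ void kernel(double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int2 v = tex1Dfetch(tex, i);            // fetch 8 bytes as int2
        double d = __hiloint2double(v.y, v.x);  // reassemble the double
        out[i] = 2.0 * d;                       // ... use the value ...
    }
}

// Host side, before the launch:
//   cudaBindTexture(0, tex, d_array, n * sizeof(double));
[/codebox]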
True, but all the other threads are doing nothing anyway while 'The One' runs the code. Sometimes it's quicker for all threads to do the same calculation than to have one thread write the result to shared memory, synchronize, and then have every thread read it back. I'm not sure in your case with the double pointer; how long the calculation takes and the number of warps per block will also influence the optimum.
(The manual also says that sometimes it's better to recalculate the same thing than to tie up registers, local, or shared memory.)
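A rough illustration of what I mean (a made-up example, not your code):
[codebox]__global__ void kernel(const double *in, double *out, int n)
{
    // Every thread recomputes the same per-block value. A couple of
    // identical ALU instructions are often cheaper than one thread
    // writing to shared memory, a __syncthreads(), and a re-read.
    double scale = 1.0 / (double)(blockIdx.x + 1);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale;
}[/codebox]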
Cheers,
kbam
PS Sometimes I forget the 80:20 rule (get 80% of the benefit for just 20% of the effort) and find that I have put far too much time into trying to make a perfect design.
I'm allocating each of my input arrays as its own contiguous 1D array, then setting the pointers in the corresponding double ** variable to the correct starting locations inside each contiguous array. Also, during my large-size testing, my input will be 400MB+ of data for a single day. Based on the calculations I have to do, I need 4-5 matrices of these large arrays to hold the input, which means a single day's worth of data will occupy over half of one GPU device's memory at a given moment during the kernel call.
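For context, the host-side layout looks roughly like this (a sketch with made-up sizes; error checking omitted):
[codebox]#include <cuda_runtime.h>
#include <stdlib.h>

// Sketch: flatten one rows x cols matrix into a single contiguous
// allocation so it can move to the device with one cudaMemcpy.
void setup(int rows, int cols)
{
    double *h_array = (double *)malloc(rows * cols * sizeof(double));
    double **h_ptr  = (double **)malloc(rows * sizeof(double *));
    for (int r = 0; r < rows; ++r)
        h_ptr[r] = h_array + r * cols;   // host-side row pointers

    double *d_array;
    double **d_ptr;
    cudaMalloc((void **)&d_array, rows * cols * sizeof(double));
    cudaMalloc((void **)&d_ptr, rows * sizeof(double *));

    // One contiguous copy for the data. The pointer table is NOT
    // copied, because host addresses are meaningless on the GPU;
    // it gets rebuilt inside the kernel (see the earlier sketch).
    cudaMemcpy(d_array, h_array, rows * cols * sizeof(double),
               cudaMemcpyHostToDevice);
}[/codebox]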
How large is texture memory? I'm using the Tesla C1060 devices. Sorry, I don't know much about texture memory since I've never programmed with it.
Oh, and using the if/else statement barely increases my performance. It wasn't really worth the trouble, but my kernel calls do finish slightly (< 0.1%) faster now.
The idea that you never want to have a single thread doing something while every other thread in the block waits at a barrier is completely wrong. This is often the best way to accomplish something (e.g., loading a single variable into shared memory).
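For example, the classic one-thread-loads pattern (a generic sketch, not your actual code):
[codebox]__global__ void kernel(const double *g_scale, double *data, int n)
{
    __shared__ double scale;

    // A single thread loads the value into shared memory...
    if (threadIdx.x == 0)
        scale = g_scale[blockIdx.x];
    __syncthreads(); // ...and the barrier makes it visible to all

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= scale;
}[/codebox]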