only-once-instruction in kernel

ribbldibbl · May 12, 2008, 10:17pm

Hello,
What I’ve got right now is a multithreaded kernel and another one that executes only a single instruction X in a single thread. I’d like to merge them, but something like

if ( condition that is only true in one of the threads )
X;

is leading to strange results. The performance guidelines discourage control flow instructions, but in this case I would prefer them over using a second kernel. I wonder why it doesn’t work.

Another (related) question: I’m not sure what exactly happens if I do

__shared__ int c;
c++;

in a kernel with n threads. How much will c be increased?

I would be grateful for any explanations or reading suggestions; I didn’t find an exact specification of this in the programming guide.

Regards,
R.

MisterAnderson42 · May 12, 2008, 10:46pm

There is nothing wrong with using if statements and branches, even performance-wise. Optimizing for divergent warps is the lowest-priority optimization and rarely (in my experience) improves performance.

There shouldn’t be anything wrong with your if statement unless it depends on data written to global memory in another one of the threads.

In:
__shared__ int c;
c++;

c will be incremented an undefined number of times, because all threads read and write it concurrently without any synchronization. This is invalid code. I thought the programming guide made this clear, but it’s been a while since I read it. The histogram sample in the SDK may be of interest to you.
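
For a well-defined count, the increment has to be atomic. Here is a minimal sketch, assuming a device that supports atomicAdd on shared memory (compute capability 1.2 or later). The kernel name countThreads and the host helper runCount are made up for illustration and do not come from the SDK:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread of the block bumps a shared counter once.
__global__ void countThreads(int *out)
{
    __shared__ int c;

    if (threadIdx.x == 0)
        c = 0;               // exactly one thread initializes the counter
    __syncthreads();         // make the initial value visible to the whole block

    atomicAdd(&c, 1);        // indivisible read-modify-write, no lost updates
    __syncthreads();         // wait until every thread has incremented

    if (threadIdx.x == 0)
        out[blockIdx.x] = c; // c now equals blockDim.x
}

// Hypothetical host helper: launches one block of n threads and returns
// the final counter value, or -1 if a CUDA call fails.
int runCount(int n)
{
    int *d_out = 0;
    int h_out = -1;
    if (cudaMalloc((void **)&d_out, sizeof(int)) != cudaSuccess)
        return -1;
    countThreads<<<1, n>>>(d_out);
    if (cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        h_out = -1;
    cudaFree(d_out);
    return h_out;
}
```

With one block of n threads, the value copied back should be n. Replacing atomicAdd(&c, 1) with a plain c++ brings back the race described above.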

ribbldibbl · May 12, 2008, 11:35pm

Thank you for your quick answer.
Hmm, that’s what I thought.
Here is what I want to do:
c[i*n+j]= (i==j) ? sqrt(sum) : sum;
where i is a parameter to the kernel and j is the unique block-index.
Now I wonder why the result is different (i.e. correct) when I instead write
c[i*n+j]=sum;
and subsequently call a single-thread kernel that does
c[i*n+i]=sqrt(c[i*n+i]);
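
The thread does not show how sum is computed, so the following is only a sketch of how the two kernels could be merged under one assumption: that sum is the result of a per-block reduction in shared memory. Under that assumption, the write has to happen after the reduction has finished, and only one thread per block should perform it. The names sumAndStore, in, m, THREADS and runRow are hypothetical:

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

#define THREADS 64  // assumed block size; the reduction needs a power of two

// Hypothetical merged kernel: block j sums m values of `in`, then a single
// thread writes c[i*n+j], taking the square root only on the diagonal.
__global__ void sumAndStore(const float *in, float *c, int i, int n, int m)
{
    __shared__ float partial[THREADS];
    int j = blockIdx.x;

    // Each thread accumulates a strided slice of this block's m inputs.
    float local = 0.0f;
    for (int k = threadIdx.x; k < m; k += blockDim.x)
        local += in[j * m + k];
    partial[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction in shared memory; the barrier sits outside the if.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // The only-once part: thread 0 sees the finished sum and writes it.
    if (threadIdx.x == 0) {
        float sum = partial[0];
        c[i * n + j] = (i == j) ? sqrtf(sum) : sum;
    }
}

// Hypothetical host helper: fills `in` with ones so every block sum is m,
// runs the kernel for row i, and returns c[i*n+col].
float runRow(int i, int col, int n, int m)
{
    size_t inBytes = (size_t)n * m * sizeof(float);
    size_t cBytes = (size_t)n * n * sizeof(float);
    float *h_in = (float *)malloc(inBytes);
    for (size_t k = 0; k < (size_t)n * m; ++k)
        h_in[k] = 1.0f;

    float *d_in = 0, *d_c = 0;
    float result = -1.0f;
    cudaMalloc((void **)&d_in, inBytes);
    cudaMalloc((void **)&d_c, cBytes);
    cudaMemcpy(d_in, h_in, inBytes, cudaMemcpyHostToDevice);
    cudaMemset(d_c, 0, cBytes);

    sumAndStore<<<n, THREADS>>>(d_in, d_c, i, n, m);
    cudaMemcpy(&result, d_c + i * n + col, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_c);
    free(h_in);
    return result;
}
```

If every thread of the block executed the final assignment, or if it ran before the last __syncthreads(), different threads could store different partial values to the same element. That is one possible source of "strange results", though not necessarily the one in the original kernel.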

wumpus · May 13, 2008, 10:18am

If you add volatile and execute this with a block size of at most 32 threads (the warp size), c will be increased by exactly 1. As all threads execute in lockstep, they will all read c and write c+1.

In case of multiple warps you can never be sure.
