Odd divergent branching phenomenon attributed to shared memory variables and the powf() function

I am observing an odd divergent-branching phenomenon that seems to be caused by shared memory variables and the powf() function. I have narrowed the problem down to two code blocks. I need your help finding out what causes it, since I couldn’t find an answer on the forum or anywhere else on the net.

The algorithm simulates biological processes. Each process has a fixed number of input variables (varIn) and output variables (varOut), denoted by maxIn and maxOut. varIn and varOut change over time according to the following state equations:


varIn[n] = varIn[n-1] - x

varOut[n] = varOut[n-1] + x

where “x” is the change factor, which is calculated as


x = x * powf(varIn[iii],beta[iii])

varIn is an element of R0+ (the non-negative reals)

beta is an element of R (the reals)

Each thread simulates one process and maxIn and maxOut are the same among all threads. The variables are loaded into shared memory. Here are the two code blocks causing the branching problem:


__global__ void genericprocess(…)
{
    // Initialize variables, set up shared memory space and load
    // variables from global memory to shared memory

    float x = 1;

    // Block 1 - Calculate process’ change factor
    for(int iii = 0; iii < maxIn; iii++){
        x = x * powf(s_varIn[iii+offsetInShared], s_beta[iii+offsetInShared]);
    }

    // Block 2 - Update process’ input and output variables
    for(int iii = 0; iii < maxIn; iii++){
        s_varIn[iii+offsetInShared] = s_varIn[iii+offsetInShared] - x;
    }

    for(int iii = 0; iii < maxOut; iii++){
        s_varOut[iii+offsetOutShared] = s_varOut[iii+offsetOutShared] + x;
    }


Analyzing the program with the profiler reveals maxIn divergent branches ascribed to the “change factor” for-loop. If I comment out either of the blocks, no divergent branching occurs. The same is true if the calculation of x depends on neither s_varIn nor s_beta.

Any ideas what is causing this behaviour?

Thanks in advance!


It’s hard to give an answer without seeing the full code. But if the maxIn and maxOut variables are stored in registers, then different threads will repeat the loops different numbers of times, leading to divergence… some threads will uselessly idle.

But going beyond that point, it’s unclear the code you posted is doing what you want it to. Since the full code isn’t posted I can’t be sure, but there are two big issues which you may need to think about.

First, your code may have race conditions. You’re reading s_varIn[iii+offsetInShared] in Block 1, but what if some threads have already moved on to Block 2 and are writing into values that other threads still expect to be unchanged? [This may not be an issue if you’ve set up non-overlapping regions per thread with appropriate per-thread pointers, which is why I say you need to post more of your initialization code.]

Second, if maxIn is fairly large, you’re doing a lot of loops, and perhaps you’re doing them inefficiently with serial work per thread instead of in parallel. Your posted code again isn’t enough to tell, but those tight loops with no syncs in between them just smell like inefficient data handling.

Here is the entire kernel function. Hopefully it will be of help.

#ifndef genericprocess_KERNEL_H

#define genericprocess_KERNEL_H

__constant__ int c_MarkingVectorSize, c_maxIn, c_maxOut;

__constant__ float c_SI;

__global__ void genericprocess(float *g_markings, float *g_thresholds, float *g_alpha, float *g_beta, float *g_posIn, float *g_posOut, float *g_PID, float *g_ProcessType, float *g_delayTime, float *g_waitingTime, float *g_waitingState, float *g_C1, float *g_C2, float *g_minValues, float *g_maxValues){

// Thread ID - idx

int idx = blockIdx.x*blockDim.x + threadIdx.x;

// Process ID - idp

int idp = g_PID[idx];

// Put constant values to local registers

int MarkingVectorSize	= c_MarkingVectorSize;

int maxIn				= c_maxIn;

int maxOut				= c_maxOut;

float SI				= c_SI;

// Declare Shared Memory

extern __shared__ float s_data[];

// Variables residing in Shared Memory

float *s_markingsIn, *s_posIn, *s_thresholds, *s_alpha, *s_beta, *s_markingsOut, *s_posOut, *s_minValuesIn, *s_maxValuesIn, *s_minValuesOut, *s_maxValuesOut;

// Set up and organize shared memory space

s_markingsIn   = s_data;

s_posIn        = &s_data[    maxIn * blockDim.x];

s_thresholds   = &s_data[2 * maxIn * blockDim.x];

s_alpha		   = &s_data[3 * maxIn * blockDim.x];

s_beta		   = &s_data[4 * maxIn * blockDim.x];

s_minValuesIn  = &s_data[5 * maxIn * blockDim.x];

s_maxValuesIn  = &s_data[6 * maxIn * blockDim.x];

s_markingsOut  = &s_data[7 * maxIn * blockDim.x];

s_posOut       = &s_data[(7 * maxIn * blockDim.x) + (    maxOut * blockDim.x)];

s_minValuesOut = &s_data[(7 * maxIn * blockDim.x) + (2 * maxOut * blockDim.x)];

s_maxValuesOut = &s_data[(7 * maxIn * blockDim.x) + (3 * maxOut * blockDim.x)];

// Process variables

float x=1;

// Temporary variables

int   itmp1=0;

float ftmp1=0;

// Initialize offset values

int offsetInGlobal  = maxIn  * idp;

int offsetOutGlobal = maxOut * idp;

int offsetInShared  = maxIn  * idx;

int offsetOutShared = maxOut * idx;

// Copy values from global memory to shared memory

for(int iii=0;iii<maxIn;iii++){

	s_posIn[iii+offsetInShared]      = g_posIn[iii+offsetInGlobal];

	s_markingsIn[iii+offsetInShared] = g_markings[(int)s_posIn[iii+offsetInShared]];

	s_thresholds[iii+offsetInShared] = g_thresholds[iii+offsetInGlobal];

	s_alpha[iii+offsetInShared]      = g_alpha[iii+offsetInGlobal];

	s_beta[iii+offsetInShared]       = g_beta[iii+offsetInGlobal];

	s_minValuesIn[iii+offsetInShared]= g_minValues[(int)s_posIn[iii+offsetInShared]];

	s_maxValuesIn[iii+offsetInShared]= g_maxValues[(int)s_posIn[iii+offsetInShared]];

}

for(int iii=0;iii<maxOut;iii++){

	s_posOut[iii+offsetOutShared]      = g_posOut[iii+offsetOutGlobal];

	s_markingsOut[iii+offsetOutShared] = g_markings[(int)s_posOut[iii+offsetOutShared]];

	s_minValuesOut[iii+offsetOutShared]= g_minValues[(int)s_posOut[iii+offsetOutShared]];

	s_maxValuesOut[iii+offsetOutShared]= g_maxValues[(int)s_posOut[iii+offsetOutShared]];

}

// Generic Process


for(int iii=0;iii<maxIn;iii++){

	x = x * powf(s_markingsIn[iii+offsetInShared],s_beta[iii+offsetInShared]);

}

// Update Input Markings

for(int iii=0;iii<maxIn;iii++){

	s_markingsIn[iii+offsetInShared] = s_markingsIn[iii+offsetInShared] - x;

}

// Update Output Markings

for(int iii=0;iii<maxOut;iii++){

	s_markingsOut[iii+offsetOutShared] = s_markingsOut[iii+offsetOutShared] + x;

}

// Write results back to global memory

for(int iii=0;iii<maxIn;iii++){

	g_markings[(int)s_posIn[iii+offsetInShared]] = s_markingsIn[iii+offsetInShared];

}

for(int iii=0;iii<maxOut;iii++){

	g_markings[(int)s_posOut[iii+offsetOutShared]] = s_markingsOut[iii+offsetOutShared];

}

}// End generic process function

#endif // #ifndef genericprocess_KERNEL_H


You don’t need to worry about divergence.

You’re currently very limited by memory access speeds. I think you know this which is why you’re doing a lot of copying to shared memory.
But this isn’t buying you very much, since much of what you copy is a single value per-thread, which you then access once… meaning there was no need to copy it at all.
In fact you copy some values over that you don’t use at all, like the entire g_alpha and g_thresholds arrays… these are pure waste.

But perhaps you have extended code which does use these… even so you have some severe inefficiencies because of the arbitrary indirections you have.
CUDA supports them, but they can slow memory access down 16x if all the threads indirect to different areas of your source array.
So for example all of your indirections like g_markings[(int)s_posIn[iii+offsetInShared]] may be very inefficient if your threads point to widely scattered values.
If you reorganized your data to be properly packed, then 16 threads could read 16 words in one memory transaction. Currently, you may be using 16 transactions.

Some other little details: you can clean up the code a lot by realizing you’re storing array pointers per thread anyway, so there’s no need to complicate them with extra offsets… just fold those offsets into the pointers.
I.e., you have:

s_posIn = &s_data[ maxIn * blockDim.x];


s_posIn[iii+offsetInShared] = g_posIn[iii+offsetInGlobal];

but you could write like:

s_posIn = &s_data[ maxIn * blockDim.x+offsetInShared];
s_posIn[iii] = g_posIn[iii];

Stepping back to the algorithm as a whole, you must first realize that the computation is all free… your kernel is entirely memory dominated. If you’re optimizing for speed, your entire focus should be on organizing your data for efficient memory access. It may be that you can get rid of those extra indirections by interleaving your data… it depends a bit on what you know about the data ahead of time… those indirections are ugly but perhaps unnecessary if you know something about your g_PID values.

Also, thinking about it, your code doesn’t do what you described in your first post.
You mention a state update function of

varIn[n] = varIn[n-1] - x

but what you coded is:

varIn[n] = varIn[n] - x

Thank you for your answer! It was really helpful. You are right: currently the memory access pattern is very inefficient. Furthermore, the indexing scheme carries the potential for bank conflicts.

Yes, other parts of the code use values like g_thresholds, but I decided to omit those parts because they would have made my post unnecessarily complex. Now I need to reorganize the data for efficient access and implement a bank-conflict-free indexing scheme. I think scan_best_kernel.cu from the SDK might be a good starting point. Hopefully this will improve the performance, because the kernel truly is memory dominated.

I’ll implement these proposals and report my findings.

Sorry for the ambiguity.

varIn[n] = varIn[n-1] - x

describes the temporal development of one particular variable, where n is the point in time. Since each process alters many variables, these variables are stored in an array and indexed. The index iii runs from 0 to maxIn-1. Therefore (omitting the offset value),

s_varIn[iii] = s_varIn[iii] - x

updates all the process’ variables according to the state equation by looping through the array, where the left-hand side (lhs) is the new value at time n, emerging from the right-hand side (rhs) at time n-1. The outer loop responsible for the temporal development of the data was not included in the posted code snippet, as it resides in host code which calls the kernel.