Experiencing weird slowdown - please explain the change in global memory throughput caused by a simple code change

I am writing a persistent-threads-based implementation with a managed worklist.

BlockSize - 32
GPU - Tesla 2070

I am experiencing a weird speedup when I make a very small change in my code. Below, I have tried to explain my situation.

Rough workflow -
-Each thread runs in a continuous while loop (a persistent thread).
-There is a global worklist. On each threadblock iteration, one thread of the block takes a global lock and fetches work for all the threads of the block.
-I maintain two shared flag arrays, each of size BLOCK_SIZE, which tell whether a thread in the block has some “work” to do in the current iteration of the while loop.
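For readers unfamiliar with the pattern, the workflow above can be sketched roughly as follows. This is a minimal sketch and not the poster's kernel: the `Worklist` struct, `claim_count`, `persistent_kernel`, the `job` array and the `out` buffer are all hypothetical names, only one of the two flag arrays is modelled, and the kernel assumes it is launched with exactly BLOCK_SIZE threads per block.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

// Hypothetical worklist: items[head .. count) have not been claimed yet.
struct Worklist {
    const int *items;  // work item ids (device memory)
    int count;         // total number of items
    int head;          // index of the next unclaimed item
    int lock;          // 0 = free, 1 = held
};

// How many items a block may claim: at most `want`, never more than are left.
__host__ __device__ inline int claim_count(int head, int count, int want)
{
    int left = count - head;
    if (left < 0) left = 0;
    return left < want ? left : want;
}

__global__ void persistent_kernel(Worklist *wl, int *out)
{
    __shared__ int will_get_job[BLOCK_SIZE];  // 1 if this thread got an item
    __shared__ int job[BLOCK_SIZE];           // the item each thread received
    __shared__ int claimed;                   // items fetched this iteration

    while (true) {
        if (threadIdx.x == 0) {
            // volatile so the read of head is not served from a stale cache
            volatile int *head = &wl->head;

            // One thread per block takes the global lock (simple spin lock).
            while (atomicCAS(&wl->lock, 0, 1) != 0) { }

            int first = *head;
            claimed = claim_count(first, wl->count, BLOCK_SIZE);
            for (int i = 0; i < BLOCK_SIZE; ++i) {
                will_get_job[i] = (i < claimed);
                if (i < claimed) job[i] = wl->items[first + i];
            }
            *head = first + claimed;

            __threadfence();            // publish head before releasing
            atomicExch(&wl->lock, 0);   // release the lock
        }
        __syncthreads();

        if (claimed == 0) break;        // uniform across the block

        if (will_get_job[threadIdx.x] == 1)
            out[job[threadIdx.x]] = 1;  // stand-in for the real work

        __syncthreads();                // flags are rewritten next iteration
    }
}
```

The exit test reads a single shared value, so every thread of the block leaves the loop in the same iteration and no thread is left waiting at a barrier.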

Note - kernel is very heavy - here is the spill info just in case - 568 bytes stack frame, 1020 bytes spill stores, 1128 bytes spill loads.

To enter the Work region I have a simple if condition of following type -

if (will_get_job[threadIdx.x] == 1 || shared_have_job[threadIdx.x] == 0)
{
    // DO tons of work
}

will_get_job and shared_have_job are two shared arrays of size 32 (BLOCK_SIZE).

My kernel takes about 7.2 seconds to complete

Now, if I make another local variable - say - isJobThere and do the following -

int isJobThere = 0;   // reset on every iteration of the while loop
if (will_get_job[threadIdx.x] == 1)
{
    // get work
    isJobThere = 1;
}
if (shared_have_job[threadIdx.x] == 0)
{
    isJobThere = 1;
}

if (isJobThere == 1)
{
    // Do tons of work
}

Note that I only rearranged a very simple if condition, and my kernel now finishes in 3.7 seconds!
Does that mean I get almost a 2X speedup just by doing this?

I profiled both executions with the Visual Profiler (NVVP).
The numbers are almost the same for everything other than the global memory throughput.
Metric (bytes/sec)          2nd version (fast)   1st version (slow)
Global Store Throughput     8065255190           4215765049
Global Load Throughput      53137914415          27009967686
DRAM Write Throughput       70907875980          36590962667
DRAM Read Throughput        66918869189          33718067502

Can anyone explain this change in global memory throughput due to such a small change?

Sid

I would really appreciate any help or comments on this.
Should I reframe or present my question in a better way?

The change in memory throughput is trivially explained - the same number of transactions are performed in half the time, so throughput doubles.
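That explanation can be checked against the profiler numbers posted above: total bytes moved is throughput multiplied by run time, and it should come out about the same for both versions. The host-only sketch below does that arithmetic. The helper names are hypothetical, and the run times are the rounded 7.2 s and 3.7 s figures from the question, so the result is only approximate.

```cuda
#include <cmath>

// Run times quoted in the question (rounded, so results are approximate).
const double T_FAST = 3.7;  // seconds, 2nd version
const double T_SLOW = 7.2;  // seconds, 1st version

// Total bytes moved = average throughput (bytes/sec) * run time (sec).
inline double total_bytes(double bytes_per_sec, double seconds)
{
    return bytes_per_sec * seconds;
}

// Relative difference between the totals of the fast and the slow run,
// as a fraction of the larger total.
inline double relative_gap(double fast_bytes_per_sec, double slow_bytes_per_sec)
{
    double fast = total_bytes(fast_bytes_per_sec, T_FAST);
    double slow = total_bytes(slow_bytes_per_sec, T_SLOW);
    return std::fabs(fast - slow) / std::fmax(fast, slow);
}
```

Calling `relative_gap` with each pair of numbers from the question (for example `relative_gap(8065255190.0, 4215765049.0)` for global stores) gives a gap of roughly 2% or less for all four metrics, which is consistent with the same amount of traffic being spread over about half the time.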

The question remains though why your second version is so much faster. I could imagine it has something to do with suboptimal selection of reconvergence points by ptxas, but there is not enough code in your post to tell. What other constructs are there impacting control flow (particularly break and early return statements)?


I seem to have figured out the reason -

The problem was with the way I constructed the if statement before -

Basically, the if statement tests whether there is a job to be done for each thread. Either of two flags can indicate the presence of a job, and the two tests were joined by a logical OR (||).

Probably the logical OR leads to divergence: first the threads satisfying the first condition do the work, and then the threads satisfying the second condition.

This divergence causes the slowdown, which goes away when I combine the two conditions into one flag and enter the “Big Job region” on just that one condition.
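A minimal sketch of the two ways of forming the predicate is shown below. The helper names, `work_kernel`, and the input and output buffers are hypothetical. The two helpers always return the same value, so any difference in run time would come from the code the compiler generates, which this thread does not confirm; inspecting the machine code with `cuobjdump -sass` would be one way to check. The kernel assumes a launch with exactly BLOCK_SIZE threads per block.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

// Short-circuit form: the second test may sit behind a branch on the first.
__host__ __device__ inline bool has_job_short_circuit(int will_get, int have)
{
    return will_get == 1 || have == 0;
}

// Flag form: both tests are always evaluated and merged with a bitwise OR,
// so the work region is guarded by one precomputed value.
__host__ __device__ inline bool has_job_flag(int will_get, int have)
{
    return (will_get == 1) | (have == 0);
}

__global__ void work_kernel(const int *will_get_in, const int *have_in, int *out)
{
    __shared__ int will_get_job[BLOCK_SIZE];
    __shared__ int shared_have_job[BLOCK_SIZE];

    const int gid = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    will_get_job[threadIdx.x]    = will_get_in[gid];
    shared_have_job[threadIdx.x] = have_in[gid];
    __syncthreads();

    // Evaluate the predicate once, then enter the work region on it alone.
    const bool isJobThere = has_job_flag(will_get_job[threadIdx.x],
                                         shared_have_job[threadIdx.x]);
    if (isJobThere) {
        out[gid] = 1;  // stand-in for the "Big Job region"
    }
}
```

Because both helpers are also callable on the host, their equivalence over the four possible flag combinations can be checked without a GPU.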

A valuable lesson learned!

Sid