Experiencing weird slowdown - Please explain Change in global memory throughput due to simple code

sidxavier · April 4, 2012, 7:22pm

I am building a persistent-threads implementation that uses a managed worklist.

BlockSize - 32
GPU - Tesla 2070

I am experiencing a weird speedup when I make a very small change in my code. Below, I have tried to explain my situation.

Rough workflow -
- Each thread runs in a continuous while loop (persistent thread).
- There is a global worklist. On each iteration of a thread block, one thread of the block acquires a global lock and fetches work for all the threads of the block.
- I maintain two shared flag arrays of size BLOCK_SIZE, which tell whether each thread in the block has some “work” to do in the current iteration of the while loop.
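
The workflow above can be sketched on the host as follows. This is a sequential C++ emulation of what the one fetching thread of a block does, not the poster's kernel: `Worklist`, `BlockState`, `fetch_for_block`, the `job` array and the spin lock built on `std::atomic_flag` are all assumed names and choices for illustration. On a GPU the lock would typically be taken with an atomic operation on a global variable, and the flag arrays would be declared `__shared__`.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>

constexpr std::size_t BLOCK_SIZE = 32;  // block size stated in the post

// Hypothetical global worklist guarded by a single spin lock.
struct Worklist {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    std::deque<int> items;  // hypothetical job ids
};

// Per-block state; on the GPU these arrays would live in shared memory.
struct BlockState {
    std::array<int, BLOCK_SIZE> will_get_job{};  // 1 = lane received new work
    std::array<int, BLOCK_SIZE> job{};           // job id handed to each lane
};

// Emulates the single thread per block that takes the global lock and
// fetches up to one job per lane. Returns the number of jobs handed out.
std::size_t fetch_for_block(Worklist& wl, BlockState& bs) {
    // Spin until the lock is acquired.
    while (wl.lock.test_and_set(std::memory_order_acquire)) {
    }

    std::size_t handed_out = 0;
    for (std::size_t lane = 0; lane < BLOCK_SIZE; ++lane) {
        bs.will_get_job[lane] = 0;
        if (!wl.items.empty()) {
            bs.job[lane] = wl.items.front();
            wl.items.pop_front();
            bs.will_get_job[lane] = 1;
            ++handed_out;
        }
    }

    wl.lock.clear(std::memory_order_release);  // release the global lock
    assert(handed_out <= BLOCK_SIZE);
    return handed_out;
}
```

Only one thread per block contends for the lock in this scheme, so the cost of the lock is paid once per block iteration rather than once per thread.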

Note - kernel is very heavy - here is the spill info just in case - 568 bytes stack frame, 1020 bytes spill stores, 1128 bytes spill loads.

To enter the work region I have a simple if condition of the following type -

if (will_get_job[threadIdx.x] == 1 || shared_have_job[threadIdx.x] == 0)
{
    // Do tons of work
}

will_get_job and shared_have_job are two shared arrays of size 32 (the block size).

My kernel takes about 7.2 seconds to complete

Now, if I make another local variable - say - isJobThere and do the following -

int isJobThere = 0;  // start with "no job"

if (will_get_job[threadIdx.x] == 1)
{
    // get work
    isJobThere = 1;
}
if (shared_have_job[threadIdx.x] == 0)
{
    isJobThere = 1;
}

if (isJobThere == 1)
{
    // Do tons of work
}

Notice that I just rearranged a very simple if condition, and my kernel now finishes in 3.7 seconds!
Does that mean I get almost a 2X speedup just by doing this?
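
As a sanity check, the two forms select exactly the same threads, so any difference in run time has to come from how the code is compiled and scheduled, not from the logic. A minimal host-side sketch, assuming the flags only ever hold 0 or 1; the function names and the third, bitwise variant are illustrative and not from the original kernel:

```cpp
#include <cassert>

// First form from the post: a single condition using short-circuit ||.
bool enters_work_or(int will_get_job, int shared_have_job) {
    return will_get_job == 1 || shared_have_job == 0;
}

// Second form from the post: accumulate into a local flag, test it once.
bool enters_work_flag(int will_get_job, int shared_have_job) {
    int isJobThere = 0;  // must start at 0 for the final test to be valid
    if (will_get_job == 1) {
        isJobThere = 1;
    }
    if (shared_have_job == 0) {
        isJobThere = 1;
    }
    return isJobThere == 1;
}

// Hypothetical third form: bitwise | evaluates both comparisons
// unconditionally, so the source contains no short-circuit branch.
bool enters_work_bitwise(int will_get_job, int shared_have_job) {
    return ((will_get_job == 1) | (shared_have_job == 0)) != 0;
}

// Checks that all three forms agree for every combination of 0/1 flags.
void check_forms_agree() {
    for (int w = 0; w <= 1; ++w) {
        for (int s = 0; s <= 1; ++s) {
            assert(enters_work_or(w, s) == enters_work_flag(w, s));
            assert(enters_work_or(w, s) == enters_work_bitwise(w, s));
        }
    }
}
```

Whether the bitwise variant changes the code that ptxas generates for the real kernel is not something this thread establishes; it would have to be measured.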

I tried to profile both executions on visual profiler –
NVVP numbers are almost same for everything other than the Global memory throughput.
Metric                                2nd version (fast)   1st version (slow)
Global Store Throughput (bytes/sec)           8065255190           4215765049
Global Load Throughput (bytes/sec)           53137914415          27009967686
DRAM Write Throughput (bytes/sec)            70907875980          36590962667
DRAM Read Throughput (bytes/sec)             66918869189          33718067502

Can anyone explain this change in global memory throughput due to such a small change?

Sid

sidxavier · April 6, 2012, 1:17am

I would really appreciate any help or comments on this.
Should I reframe or present my question in a better way?

tera · April 6, 2012, 10:59am

The change in memory throughput is trivially explained - the same number of transactions are performed in half the time, so throughput doubles.
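
This can be checked against the numbers in the first post: multiplying each reported throughput by the corresponding run time gives the total traffic, which should be about the same for both versions. A small sketch, assuming the reported 3.7 s and 7.2 s are the kernel times the profiler throughput refers to; the function names are illustrative:

```cpp
#include <cassert>

// Total bytes moved = reported throughput (bytes/sec) * run time (sec).
double bytes_moved(double bytes_per_sec, double seconds) {
    assert(seconds > 0.0);
    return bytes_per_sec * seconds;
}

// Ratio of total traffic in the fast version to the slow version,
// using the run times quoted in the post.
double traffic_ratio(double fast_bytes_per_sec, double slow_bytes_per_sec) {
    const double fast_seconds = 3.7;  // 2nd version
    const double slow_seconds = 7.2;  // 1st version
    return bytes_moved(fast_bytes_per_sec, fast_seconds) /
           bytes_moved(slow_bytes_per_sec, slow_seconds);
}
```

For example, `traffic_ratio(53137914415.0, 27009967686.0)` for the global load throughput comes out at roughly 1.01, and the other three metrics also land within a few percent of 1, which is consistent with the same amount of data being moved in about half the time.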

The question remains though why your second version is so much faster. I could imagine it has something to do with suboptimal selection of reconvergence points by ptxas, but there is not enough code in your post to tell. What other constructs are there impacting control flow (particularly break and early return statements)?

EDIT: typo

sidxavier · April 6, 2012, 4:54pm

I seem to have figured out the reason -

The problem was with the way I constructed the if statement before -

Basically, the if statement decides whether there is a job to be done for each thread. Two flags can indicate the presence of a job, and they were combined with a conditional OR.

Probably the conditional OR leads to divergence: first the threads satisfying the first condition do their work, and then the threads satisfying the second condition.

This divergence leads to the slowdown, which goes away when I combine the two conditions into one flag and enter the “Big Job region” with just one condition.

A valuable lesson learned!

Sid
