I am building a persistent-threads implementation with a managed worklist.
Block size: 32
GPU: Tesla 2070
I am experiencing a weird speedup when I make a very small change in my code. Below, I have tried to explain my situation.
Rough workflow:
- Each thread runs in a continuous while loop (persistent thread).
- There is a global worklist. On each iteration of a thread block, one thread of the block acquires a global lock and fetches work for all the threads of the block.
- I maintain two shared flag arrays of size BLOCK_SIZE, which tell whether each thread in the block has "work" to do in the current iteration of the while loop.
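The workflow above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual kernel: the names persistent_kernel, run_worklist_demo, lock, next, and shared_base are hypothetical, the worklist is assumed to be a flat array handed out in chunks of BLOCK_SIZE, and the "work" is a stand-in that doubles each item.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

// Hypothetical persistent-thread kernel: every block loops until the worklist is empty.
// lock: global spin lock (0 = free). next: index of the first unclaimed item.
__global__ void persistent_kernel(const int* items, int num_items, int* out,
                                  int* lock, volatile int* next)
{
    __shared__ int shared_base;  // first item claimed for this block, -1 if none left

    while (true) {
        if (threadIdx.x == 0) {
            // Only one thread per block competes for the global lock.
            while (atomicCAS(lock, 0, 1) != 0) { }
            __threadfence();
            int base = *next;                      // volatile read, not served from L1
            if (base < num_items) *next = base + BLOCK_SIZE;
            __threadfence();                       // publish the update before unlocking
            atomicExch(lock, 0);
            shared_base = (base < num_items) ? base : -1;
        }
        __syncthreads();
        int base = shared_base;
        __syncthreads();         // all threads have read shared_base before it is rewritten
        if (base < 0) break;     // same decision for the whole block

        int idx = base + threadIdx.x;
        if (idx < num_items) out[idx] = 2 * items[idx];  // stand-in for the heavy work
    }
}

// Host helper (hypothetical): runs the kernel and checks that every item was processed.
bool run_worklist_demo(int num_items, int num_blocks)
{
    std::vector<int> h_in(num_items), h_out(num_items, 0);
    for (int i = 0; i < num_items; ++i) h_in[i] = i;

    int *d_in, *d_out, *d_state;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, num_items * sizeof(int));
    cudaMalloc(&d_state, 2 * sizeof(int));         // d_state[0] = lock, d_state[1] = next
    cudaMemset(d_state, 0, 2 * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);

    persistent_kernel<<<num_blocks, BLOCK_SIZE>>>(d_in, num_items, d_out,
                                                  d_state, d_state + 1);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, num_items * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_state);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < num_items; ++i)
        if (h_out[i] != 2 * h_in[i]) return false;
    return true;
}
```

Calling run_worklist_demo(1000, 4) launches four persistent blocks over 1000 items. The per-thread flag arrays from the last bullet are left out here to keep the sketch short.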
Note: the kernel is very heavy. Here is the spill info, just in case: 568 bytes stack frame, 1020 bytes spill stores, 1128 bytes spill loads.
To enter the work region I have a simple if condition of the following type:
    if (will_get_job[threadIdx.x] == 1 || shared_have_job[threadIdx.x] == 0)
    {
        // Do tons of work
    }
will_get_job and shared_have_job are two shared arrays of size 32 (the block size).
With this version, my kernel takes about 7.2 seconds to complete.
Now, if I introduce another local variable, say isJobThere, and do the following:
    if (will_get_job[threadIdx.x] == 1)
    {
        // get work
        isJobThere = 1;
    }
    if (shared_have_job[threadIdx.x] == 0)
    {
        isJobThere = 1;
    }
    if (isJobThere == 1)
    {
        // Do tons of work
    }
Notice that I only rearranged a very simple if condition, and my kernel now finishes in 3.7 seconds!
Does that mean I get almost a 2X speedup just by doing this?
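One thing worth ruling out is that the two forms select different threads. The sketch below evaluates both conditions for every combination of flag values and counts disagreements. It assumes that isJobThere starts at 0, that the flags only take the values 0 and 1, and that the "// get work" step does not modify shared_have_job; the names compare_conditions and count_condition_mismatches are hypothetical.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

// Evaluates both forms of the condition per thread and counts disagreements.
__global__ void compare_conditions(const int* will_in, const int* have_in, int* mismatches)
{
    __shared__ int will_get_job[BLOCK_SIZE];
    __shared__ int shared_have_job[BLOCK_SIZE];
    will_get_job[threadIdx.x] = will_in[threadIdx.x];
    shared_have_job[threadIdx.x] = have_in[threadIdx.x];
    __syncthreads();

    // Version 1: at the source level, || only evaluates the second test
    // when the first one is false.
    int v1 = (will_get_job[threadIdx.x] == 1 || shared_have_job[threadIdx.x] == 0) ? 1 : 0;

    // Version 2: both tests are always evaluated and merged through a local flag.
    int isJobThere = 0;  // assumption: the flag starts at 0
    if (will_get_job[threadIdx.x] == 1) isJobThere = 1;
    if (shared_have_job[threadIdx.x] == 0) isJobThere = 1;

    if (v1 != isJobThere) atomicAdd(mismatches, 1);
}

// Host helper (hypothetical): returns the number of threads where the two forms
// disagree, or -1 if the result could not be copied back.
int count_condition_mismatches()
{
    int h_will[BLOCK_SIZE], h_have[BLOCK_SIZE];
    for (int t = 0; t < BLOCK_SIZE; ++t) {
        h_will[t] = t & 1;          // bit 0 drives will_get_job
        h_have[t] = (t >> 1) & 1;   // bit 1 drives shared_have_job
    }

    int *d_will, *d_have, *d_mismatches;
    cudaMalloc(&d_will, sizeof(h_will));
    cudaMalloc(&d_have, sizeof(h_have));
    cudaMalloc(&d_mismatches, sizeof(int));
    cudaMemcpy(d_will, h_will, sizeof(h_will), cudaMemcpyHostToDevice);
    cudaMemcpy(d_have, h_have, sizeof(h_have), cudaMemcpyHostToDevice);
    cudaMemset(d_mismatches, 0, sizeof(int));

    compare_conditions<<<1, BLOCK_SIZE>>>(d_will, d_have, d_mismatches);

    int h_mismatches = -1;
    cudaMemcpy(&h_mismatches, d_mismatches, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_will); cudaFree(d_have); cudaFree(d_mismatches);
    return h_mismatches;
}
```

If the count is zero, the two forms agree logically under these assumptions, so any timing difference would have to come from how the compiler generates code for each form rather than from a different set of threads entering the work region.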
I profiled both versions with the NVIDIA Visual Profiler (NVVP).
The numbers are almost the same for everything except the global memory throughput:
Metric (bytes/sec)           2nd version (fast)   1st version (slow)
Global Store Throughput      8065255190           4215765049
Global Load Throughput       53137914415          27009967686
DRAM Write Throughput        70907875980          36590962667
DRAM Read Throughput         66918869189          33718067502
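Since throughput is bytes divided by time, a quick sanity check is to multiply each figure by the runtime of its version and compare the totals. The host-only sketch below does that with the approximate runtimes quoted earlier (7.2 s and 3.7 s); the helper name traffic_ratio is hypothetical, and the result is only as precise as those rounded runtimes.

```cuda
#include <cstdio>

// Approximate kernel runtimes quoted in the post.
static const double kFastSeconds = 3.7;
static const double kSlowSeconds = 7.2;

// Ratio of estimated total bytes moved: fast version over slow version.
double traffic_ratio(double fast_bytes_per_sec, double slow_bytes_per_sec)
{
    double fast_total = fast_bytes_per_sec * kFastSeconds;
    double slow_total = slow_bytes_per_sec * kSlowSeconds;
    return fast_total / slow_total;
}

void print_traffic_ratios()
{
    const char* names[4] = { "Global Store", "Global Load", "DRAM Write", "DRAM Read" };
    const double fast[4] = { 8065255190.0, 53137914415.0, 70907875980.0, 66918869189.0 };
    const double slow[4] = { 4215765049.0, 27009967686.0, 36590962667.0, 33718067502.0 };
    for (int i = 0; i < 4; ++i)
        printf("%-12s fast/slow total traffic: %.3f\n", names[i],
               traffic_ratio(fast[i], slow[i]));
}
```

All four ratios come out within a few percent of 1, which suggests both versions move roughly the same number of bytes and the higher throughput mostly reflects the shorter runtime.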
Can anyone explain this change in global memory throughput from such a small change?
Sid