Efficiently loading into smem with divergent branches - interesting problem, possible solution?

Hey all,

I’m currently implementing a cascading image algorithm: an algorithm is run on every single pixel in an image (independently), passing through various ‘stages’ or ‘levels’ in the cascade. The idea is that each stage/level in the cascade can early-exit, no longer requiring computation of the later (and generally more computationally intensive) stages.
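To make the structure concrete, the kernel looks roughly like this (a minimal sketch - stage_passes, NUM_STAGES and write_result are placeholder names, not my actual code):

[codebox]__global__ void cascade_kernel(const float *image, float *results,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    for (int stage = 0; stage < NUM_STAGES; ++stage)
        if (!stage_passes(image, x, y, stage))
            return; // early exit - this thread stops processing entirely

    write_result(results, x, y); // only survivors of every stage write
}[/codebox]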

My problem is, to reduce register counts - and in other cases just do things more efficiently - I need to load data into smem. However, with 1 thread mapped to each pixel, once I’m at, say, stage X out of Y (where X isn’t 0), there’s no guarantee ALL threads in the block have made it past all cascade stages up to X - so I may have a bunch of threads which have since stopped processing entirely, and thus I don’t have a contiguous set of threads with which I can efficiently load data into smem (eg: each consecutive thread loads consecutive gmem words into consecutive smem words). Similarly, I have no guarantee that thread 0 (the thread I generally use for block-level loading of trivial smem data) is still active.

Is there a solution to this fragmented block / shared memory problem?

At present, to get reliable results I have to use a single thread for loading all smem data (inefficient), and worse - I have to modify the algorithm so that particular thread iterates through all stages in the cascade, even if it never passed the previous stages (thus adding additional divergent branches, and logic complexity, to an already computation-bound kernel).

Any advice/ideas would be greatly appreciated.

My current performance measurements: down to 16 registers (plus 8 bytes of lmem :( which is essentially 2 registers) and 2.2 million instructions in 9.1ms on a 560x600 image. 300,000 branches, 11,000 divergent (however my branches are as minimalistic as can be, short of introducing exponentially more instructions as a result of removing branches), no warp divergence, 256 threads per block, 0.667 occupancy on an 8800 (compute capability 1.1). 150 uncoalesced stores (impossible to coalesce), 2805 coalesced reads.

So I’ve spent the past 16 working hours literally trying to reduce register count / lmem usage / instruction counts - with little luck, my only hope appears to be a solution to the above problem.

This is kind of a data dump… could you re-explain why you need to use one thread to load shared memory data? A better algorithm description would help more than profiler counters.

Once again: certain data is best kept in smem, with processing done on it by a single thread (because nvcc seems to generate better, less register-intensive code in this case) - and data in general is best loaded into smem using the entire block (again, thread x loading word x from gmem, writing into bank x%16 in smem).

An example where I’d use 1 thread for doing stuff in smem is where I have a single counter or some kind of accumulator in smem (in smem to reduce register counts, because nvcc is … dumb) - example:

[codebox]__shared__ unsigned int counter;

if (tid == 0) counter = 0;
__syncthreads();

for (... iterate over some range ...) {

    // code, relies on counter

    __syncthreads();
    if (tid == 0)
        counter += something_relating_to_previous_iteration;
    __syncthreads();
}[/codebox] (Example 1 - primarily used to reduce register counts)

An example where I’d use as much of a block as possible for handling smem is where I’m loading data from gmem into smem - eg: a struct of X words (X*4 bytes):

[codebox]__shared__ unsigned int word_buffer[X];

for (unsigned int i = tid; i < X; i += blockDim.x * blockDim.y)
    word_buffer[i] = gmem[i];
__syncthreads();[/codebox] <b>(Example 2 - primarily used for loading data from gmem to smem)</b>

My problem is, as my outer-most iteration continues, more and more threads will stop processing completely (which threads terminate is arbitrary, data-dependent) - thus I don’t know which single thread is still around to perform Example 1 for simple tricks to reduce register counts by using smem - and I have no guarantee of consecutive threads to stream from gmem to smem using code like Example 2.

Worse, to enforce the guarantees required to execute code like in Examples 1 & 2, I have to keep threads iterating through the algorithm - but not necessarily computing data (because I’m computation bound), while still keeping each thread’s old result - which adds the additional branches and instructions required to keep these threads in the iteration, so they can keep managing smem appropriately for the block.
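Roughly, the ‘keep threads iterating’ version replaces the early return with an active flag, so every thread still reaches the cooperative loads and barriers (sketch only - stage_buffer, stage_gmem, STAGE_WORDS and stage_passes are placeholder names):

[codebox]bool active = true;

for (int stage = 0; stage < NUM_STAGES; ++stage) {
    // every thread helps load this stage's data, alive or not
    for (unsigned int i = tid; i < STAGE_WORDS; i += blockDim.x * blockDim.y)
        stage_buffer[i] = stage_gmem[i];
    __syncthreads();

    if (active && !stage_passes(stage)) // only live threads compute...
        active = false;                 // ...but dead ones keep looping
    __syncthreads();
}[/codebox] This is exactly the extra branching and bookkeeping I’d rather avoid.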

(I’m quite sure I explained this in my post above, minus the code examples with pretty formatting.)

Yes, the code examples help. I have definitely used both constructs. I haven’t actually ever benchmarked the first to be faster (the only measurement I took in which it made a difference, it was slower), so I decided to go with the simplest for loop (assuming the compiler can, or will sometime optimize more naive code).

You should consider remapping threads with a prefix sum. This may prevent half-warps from diverging, and allow for coalesced loading. Since you have a thread mapped to a pixel, let the finished threads take new pixels. The scan takes log2 steps, but with a fixed number of threads the overall algorithm remains linear (or bounded by the more complex computation).
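Something like this (a sketch of a naive Hillis-Steele scan in smem - BLOCK_SIZE and the per-thread active flag are placeholders; a work-efficient scan would do fewer operations):

[codebox]__shared__ unsigned int scan[BLOCK_SIZE];

scan[tid] = active ? 1u : 0u;
__syncthreads();

for (unsigned int offset = 1; offset < BLOCK_SIZE; offset *= 2) {
    unsigned int v = (tid >= offset) ? scan[tid - offset] : 0u;
    __syncthreads();
    scan[tid] += v;
    __syncthreads();
}

// exclusive result: the k-th live thread gets compact index k, so live
// threads (or any thread with tid < num_active) become contiguous again
unsigned int compact_idx = scan[tid] - (active ? 1u : 0u);
unsigned int num_active  = scan[BLOCK_SIZE - 1];[/codebox]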

“keep threads iterating through this algorithm” - That’s what I’d do; are you sure this is so much slower? When a good proportion of the threads finish, consider restarting the block. The inactive threads will consume register/shared space whether they’re sitting on a syncthreads or iterating through for your loop constructs.

Please understand that I don’t know everything you do; therefore, second explanations help.

Also, I have some kernels which ended up with quite a bit of lmem, and it wasn’t killer. The stat is for entire compilation, not the common case. If you’re reloading a lmem word 10 times during iteration, surely it can’t be slower than loading so much data?

The problem there is, I’m heavily compute bound (or I appear to be - certainly not bound by memory bandwidth nor texture lookup latency, so it’s either warp divergence (branches) or pure instruction count) - so the logic (if (thread_hasnt_exited_early) { do stuff }) just adds more instructions. For me this ended up a slower implementation compared to having threads exit early (yes, it used fewer registers, because I could do the smem tricks - but it still ended up slower than the naive version which makes threads exit early, which uses 16 registers, plus 2 registers’ worth of local memory).

The reason I say I “appear” to be computation bound and not memory bound, is because the majority of my memory reads are via textures (due to random access to some small arrays, and un-coalesceable access to a 2D image for many cases in 1.0/1.1 hardware) - and because I’m using textures, I have no real count of memory transactions on 1.0/1.1 hardware. The only transactions I have besides those texture reads, are fully coalesced loads when the kernel first starts - and very very few uncoalesced writes at the end of the kernel (only the threads that get through the entire cascade write their results).

Another reason I say I appear to be computation bound is that my execution time almost linearly scales with the instruction count (and in fact, this itself almost certainly points to me being computation bound) - which comes back to why I want to be able to store data in smem using blocks that can’t guarantee every single thread is still executing - because this will reduce register count (remember I’m using 8 bytes of lmem, because my kernel really needs 18 registers, not 16), reducing local loads/stores, and reducing instruction count by about 20% if I remember correctly (I’m at home now, and don’t have the profiler in front of me).

Hmm, looks like I’m only computation bound up to a certain point (I’ve halved my instruction count now, and I’m down to 7ms instead of 9ms - and now appear to be limited by bandwidth again).

Only 4 bytes of lmem being used now - so I only need to get rid of 1 more required register - and this is without using these smem tricks I want to use… so it seems I may not need a solution to this problem immediately (however to get 100% occupancy, which will likely be required on newer hardware for maximum performance, I’ll still need to figure out a solution to the problem at hand).

“because my kernel really needs 18 registers, not 16” - I take it you’re using the maxrregcount flag? Also, you can try lowering the ptxas optimization level (-Xptxas -O) to get “less optimized” code with fewer registers (I haven’t done it myself, to be honest).

Have you counted the bandwidth? This could be a good indicator if you’re instruction or memory limited.

Yes, I’m using the maxrregcount flag (set to 16) - I haven’t tried telling ptxas not to optimize my code though; that could potentially be interesting.

Hmm, I guess I could add some debugging variables to the kernel that count how many texture fetches I’m doing - which would give me the ‘most’ transactions I’m making (excluding caching) - I’ll give that a shot if I don’t make any progress - but for the time being I’m somewhat certain I’m bandwidth limited.
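Something like this should do for the counting (sketch only - fetch_count is a hypothetical debug counter, zeroed before launch; global 32-bit atomicAdd needs compute 1.1, and the atomic itself will skew timings):

[codebox]__device__ unsigned int fetch_count;

// ...inside the kernel, wrap each texture fetch:
atomicAdd(&fetch_count, 1u);       // compute 1.1+, global 32-bit words only
float v = tex2D(image_tex, x, y);  // image_tex: the existing texture ref[/codebox]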

You can use the new profiler to easily check the bandwidth you’re using. Of course you can only reach the maximum bandwidth if you use only 128-byte reads; with 64-byte reads you will only reach half.