Efficiently loading into smem with divergent branches - interesting problem, possible solution?

Hey all,

I’m currently implementing a cascading image algorithm: an algorithm is run on every single pixel in an image (independently), passing through various ‘stages’ or ‘levels’ in the cascade. The idea is that each stage/level in the cascade can early-exit, no longer requiring computation of the later (and generally more computationally intensive) stages.
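To make the structure concrete, the kernel looks roughly like this (a minimal sketch - stage_passes, NUM_STAGES and write_result are placeholder names, not my actual code):

[codebox]__global__ void cascade_kernel(const float *image, float *results,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    for (int stage = 0; stage < NUM_STAGES; ++stage)
        if (!stage_passes(image, x, y, stage))
            return; // early exit - this thread stops processing entirely

    write_result(results, x, y); // only survivors of every stage write
}[/codebox]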

My problem is, to reduce register counts - and in other cases just do things more efficiently - I need to load data into smem. However, with 1 thread mapped to each pixel, once I’m at, say, stage X out of Y (where X isn’t 0), there’s no guarantee ALL threads in the block have made it past all cascade stages up to X - so I may have a bunch of threads which have since stopped processing entirely, and thus I don’t have a contiguous set of threads with which I can efficiently load data into smem (eg: each consecutive thread loads consecutive gmem words into consecutive smem words). Similarly, I have no guarantee that thread 0 (the thread I generally use for block-level loading of trivial smem data) is still active.

Is there a solution to this fragmented block / shared memory problem?

At present, to get reliable results I have to use a single thread for loading all smem data (inefficient), and worse - I have to modify the algorithm so that particular thread iterates through all stages in the cascade, even if it never passed the previous stages (thus adding additional divergent branches, and logic complexity, to an already computation-bound kernel).

Any advice/ideas would be greatly appreciated.

My current performance measurements: down to 16 registers (plus 8 bytes of lmem :( which is essentially 2 registers) and 2.2 million instructions in 9.1ms on a 560x600 image. 300,000 branches, 11,000 divergent (however my branches are as minimalistic as can be, short of introducing exponentially more instructions as a result of removing branches), no warp divergence, 256 threads per block, 0.667 occupancy on an 8800 (compute capability 1.1). 150 uncoalesced stores (impossible to coalesce), 2805 coalesced reads.

So I’ve spent the past 16 working hours literally trying to reduce register count / lmem usage / instruction counts - with little luck, my only hope appears to be a solution to the above problem.

This is kind of a data dump… could you re-explain why you need to use one thread to load shared memory data? A better algorithm description would help more than profiler counters.

Once again: certain data is best kept in smem, with processing done on it by a single thread (because nvcc seems to generate better, less register-intensive code in this case) - and data in general is best loaded into smem using the entire block (again, thread x loading word x from gmem, writing into bank x%16 in smem).

An example where I’d use 1 thread for doing stuff in smem is where I have a single counter or some kind of accumulator in smem (in smem to reduce register counts, because nvcc is … dumb) - example:

[codebox]__shared__ unsigned int counter;

if (tid == 0) counter = 0;
__syncthreads();

for (... iterate over some range ...) {

    // code, relies on counter

    __syncthreads();
    if (tid == 0)
        counter += something_relating_to_previous_iteration;
    __syncthreads();
}[/codebox] (Example 1 - primarily used to reduce register counts)

An example where I’d use as much of a block as possible for handling smem is where I’m loading data from gmem into smem - eg: a struct of X words (X*4 bytes):

[codebox]__shared__ unsigned int word_buffer[X];

for (unsigned int i = tid; i < X; i += blockDim.x * blockDim.y)
    word_buffer[i] = gmem[i];
__syncthreads();[/codebox] <b>(Example 2 - primarily used for loading data from gmem to smem)</b>

My problem is, as my outer-most iteration continues, more and more threads will stop processing completely (which threads terminate is arbitrary, data-dependent) - thus I don’t know which single thread is still around to perform Example 1 for simple tricks to reduce register counts by using smem - and I have no guarantee of consecutive threads to stream from gmem to smem using code like Example 2.

Worse, to enforce the guarantees required to execute code like in Examples 1 & 2, I have to keep threads iterating through the algorithm - but not necessarily computing data (because I’m computation bound), while still keeping each thread’s old result - which adds the additional branches and instructions required to keep these threads in the iteration, so they can keep managing smem appropriately for the block.
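Roughly, the ‘keep threads iterating’ version replaces the early return with an active flag, so every thread still reaches the cooperative loads and barriers (sketch only - stage_buffer, stage_gmem, STAGE_WORDS and stage_passes are placeholder names):

[codebox]bool active = true;

for (int stage = 0; stage < NUM_STAGES; ++stage) {
    // every thread helps load this stage's data, alive or not
    for (unsigned int i = tid; i < STAGE_WORDS; i += blockDim.x * blockDim.y)
        stage_buffer[i] = stage_gmem[i];
    __syncthreads();

    if (active && !stage_passes(stage)) // only live threads compute...
        active = false;                 // ...but dead ones keep looping
    __syncthreads();
}[/codebox] This is exactly the extra branching and bookkeeping I’d rather avoid.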

(I’m quite sure I explained this in my post above, minus the code examples with pretty formatting.)

Yes, the code examples help. I have definitely used both constructs. I haven’t actually ever benchmarked the first to be faster (the only measurement I took in which it made a difference, it was slower), so I decided to go with the simplest for loop (assuming the compiler can, or will sometime optimize more naive code).

You should consider remapping threads with a prefix sum. This may prevent half-warps from diverging, and allow for coalesced loading. Since you have a thread mapped to a pixel, let the finished threads take new pixels. The scan takes log2 steps, but with a fixed number of threads the overall algorithm remains linear (or bounded by the more complex computation).
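Something like this (a sketch of a naive Hillis-Steele scan in smem - BLOCK_SIZE and the per-thread active flag are placeholders; a work-efficient scan would do fewer operations):

[codebox]__shared__ unsigned int scan[BLOCK_SIZE];

scan[tid] = active ? 1u : 0u;
__syncthreads();

for (unsigned int offset = 1; offset < BLOCK_SIZE; offset *= 2) {
    unsigned int v = (tid >= offset) ? scan[tid - offset] : 0u;
    __syncthreads();
    scan[tid] += v;
    __syncthreads();
}

// exclusive result: the k-th live thread gets compact index k, so live
// threads (or any thread with tid < num_active) become contiguous again
unsigned int compact_idx = scan[tid] - (active ? 1u : 0u);
unsigned int num_active  = scan[BLOCK_SIZE - 1];[/codebox]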

“keep threads iterating through this algorithm” - That’s what I’d do; are you sure this is so much slower? When a good proportion of the threads finish, consider restarting the block. The inactive threads will consume register/shared space whether they’re sitting on a syncthreads or iterating through for your loop constructs.

Please understand that I don’t know everything you do; therefore, second explanations help.

Also, I have some kernels which ended up with quite a bit of lmem, and it wasn’t killer. The stat is for entire compilation, not the common case. If you’re reloading a lmem word 10 times during iteration, surely it can’t be slower than loading so much data?

The problem there is, I’m heavily compute bound (or I appear to be - certainly not bound by memory bandwidth nor texture lookup latency, so it’s either warp divergence (branches) or pure instruction count) - so the logic (if (thread_hasnt_exited_early) { do stuff }) just adds more instructions. For me this ended up a slower implementation compared to having threads exit early (yes, it used fewer registers, because I could do the smem tricks - but it still ended up slower than the naive version which makes threads exit early, which uses 16 registers, plus 2 registers’ worth of local memory).

The reason I say I “appear” to be computation bound and not memory bound, is because the majority of my memory reads are via textures (due to random access to some small arrays, and un-coalesceable access to a 2D image for many cases in 1.0/1.1 hardware) - and because I’m using textures, I have no real count of memory transactions on 1.0/1.1 hardware. The only transactions I have besides those texture reads, are fully coalesced loads when the kernel first starts - and very very few uncoalesced writes at the end of the kernel (only the threads that get through the entire cascade write their results).

Another reason I say I appear to be computation bound is that my execution time almost linearly scales with the instruction count (and in fact, this itself almost certainly points to me being computation bound) - which comes back to why I want to be able to store data in smem using blocks that can’t guarantee every single thread is still executing - because this will reduce register count (remember I’m using 8 bytes of lmem, because my kernel really needs 18 registers, not 16), reducing local loads/stores, and reducing instruction count by about 20% if I remember correctly (I’m at home now, and don’t have the profiler in front of me).

Hmm, looks like I’m only computation bound up to a certain point (I’ve halved my instruction count now, and I’m down to 7ms instead of 9ms - and now appear to be limited by bandwidth again).

Only 4 bytes of lmem being used now - so I only need to get rid of 1 more required register - and this is without using these smem tricks I want to use… so it seems I may not need a solution to this problem immediately (however to get 100% occupancy, which will likely be required on newer hardware for maximum performance, I’ll still need to figure out a solution to the problem at hand).

“because my kernel really needs 18 registers, not 16” - I take it you’re using the maxrregcount flag? Also, you can try lowering the ptxas optimization level (-Xptxas -O) to get “less optimized” code with fewer registers (I haven’t done it myself, to be honest).

Have you counted the bandwidth? This could be a good indicator if you’re instruction or memory limited.

Yes, I’m using the maxrregcount flag (set to 16) - I haven’t tried telling ptxas not to optimize my code though; that could potentially be interesting.

Hmm, I guess I could add some debugging variables to the kernel that count how many texture fetches I’m doing - which would give me the ‘most’ transactions I’m making (excluding caching) - I’ll give that a shot if I don’t make any progress - but for the time being I’m somewhat certain I’m bandwidth limited.
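Something like this should do for the counting (sketch only - fetch_count is a hypothetical debug counter, zeroed before launch; global 32-bit atomicAdd needs compute 1.1, and the atomic itself will skew timings):

[codebox]__device__ unsigned int fetch_count;

// ...inside the kernel, wrap each texture fetch:
atomicAdd(&fetch_count, 1u);       // compute 1.1+, global 32-bit words only
float v = tex2D(image_tex, x, y);  // image_tex: the existing texture ref[/codebox]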

You can use the new profiler to easily check the bandwidth you’re using. Of course you can only reach the maximum bandwidth if you use only 128-byte reads; with 64-byte reads you will only reach half.