Conditional Statement Divergence

I am implementing a blur compute shader that reads texture values into shared memory. My thread group is 256x1x1, and each thread reads one texel into shared memory. An ATI presentation recommends that the boundary threads (how wide the boundary is depends on the blur radius) each read an additional texel, since blurring 256 pixels requires 256 + 2*BlurRadius samples.

So I have code like: if (localThreadID.x < gBlurRadius) { read extra sample }, and similar code for the right boundary.

With 32 threads per warp, 256 threads means there are 8 warps. So only 2 warps (one at each boundary) out of the 8 should be divergent, correct?
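
One way to sanity-check that count is a small host-side model, sketched below in C++. It is not shader code: the function name `countDivergentWarps` and its parameters are made up for illustration, and it assumes the two boundary tests are `tid < blurRadius` and `tid >= groupSize - blurRadius`, with a warp counted as divergent when only some of its threads take either branch.

```cpp
#include <cassert>

// Hypothetical helper: counts warps in which a boundary branch is taken by
// some, but not all, of the warp's threads.
// Assumed branch conditions (modeled on the pseudocode in the question):
//   left:  tid <  blurRadius
//   right: tid >= groupSize - blurRadius
int countDivergentWarps(int groupSize, int warpSize, int blurRadius) {
    assert(groupSize > 0 && warpSize > 0 && groupSize % warpSize == 0);
    assert(blurRadius >= 0 && 2 * blurRadius <= groupSize);

    int divergent = 0;
    for (int warp = 0; warp < groupSize / warpSize; ++warp) {
        int leftTaken = 0;   // threads in this warp taking the left branch
        int rightTaken = 0;  // threads in this warp taking the right branch
        for (int lane = 0; lane < warpSize; ++lane) {
            int tid = warp * warpSize + lane;
            if (tid < blurRadius) ++leftTaken;
            if (tid >= groupSize - blurRadius) ++rightTaken;
        }
        // A branch diverges only when the warp is split on it.
        bool leftSplit = leftTaken != 0 && leftTaken != warpSize;
        bool rightSplit = rightTaken != 0 && rightTaken != warpSize;
        if (leftSplit || rightSplit) ++divergent;
    }
    return divergent;
}
```

Under these assumptions, `countDivergentWarps(256, 32, 5)` gives 2, matching the estimate above. A radius that is an exact multiple of the warp size, such as 32, gives 0, because the boundary warps then take the branch as a whole.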

Is this something I should worry about? Can it be improved? Someone told me the gain from loading into shared memory outweighs the divergence cost, so I should not worry about it.

For small to moderate blur radii, this operation will be entirely memory bound. For large blur radii, the convolution in shared memory will dominate the time spent. In both cases you don’t need to care about the cost of divergence.

You might achieve small gains by using a texture to read the data, or by having the same warp read values at both boundaries.
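
A rough host-side sketch of the second idea, in C++ rather than shader code: the first 2*BlurRadius threads of the group load both halos, so the extra loads fall into one contiguous thread range. The names `haloSlotForThread` and `countDivergentWarpsPacked` are hypothetical, and the shared-memory layout is an assumption: `groupSize + 2*blurRadius` slots, with thread `tid` storing its own texel at slot `tid + blurRadius`.

```cpp
#include <cassert>

// Hypothetical mapping: returns the shared-memory slot that thread `tid`
// fills as its extra (halo) load, or -1 if it performs no extra load.
// Assumed layout: [left halo | groupSize center texels | right halo].
int haloSlotForThread(int tid, int groupSize, int blurRadius) {
    assert(tid >= 0 && tid < groupSize);
    assert(blurRadius >= 0 && 2 * blurRadius <= groupSize);
    if (tid < blurRadius) return tid;                  // left halo slot
    if (tid < 2 * blurRadius) return groupSize + tid;  // right halo slot
    return -1;                                         // no extra load
}

// Counts warps that are split on the single condition `tid < 2 * blurRadius`.
int countDivergentWarpsPacked(int groupSize, int warpSize, int blurRadius) {
    assert(groupSize > 0 && warpSize > 0 && groupSize % warpSize == 0);
    int divergent = 0;
    for (int warp = 0; warp < groupSize / warpSize; ++warp) {
        int taken = 0;
        for (int lane = 0; lane < warpSize; ++lane) {
            int tid = warp * warpSize + lane;
            if (haloSlotForThread(tid, groupSize, blurRadius) >= 0) ++taken;
        }
        if (taken != 0 && taken != warpSize) ++divergent;
    }
    return divergent;
}
```

With a group of 256, a warp size of 32 and a radius of 5, threads 0 to 4 fill slots 0 to 4 and threads 5 to 9 fill slots 261 to 265, so at most one warp is split on the condition. The real shader would still have to clamp the texel coordinates at the image edges, and whether this layout actually helps depends on the hardware, so treat it as something to measure rather than a guaranteed win.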