Help needed in understanding "SobelFilter"!

I’m a student trying to use CUDA for image-processing algorithms; I only started learning it recently. I just read through the Programming Guide and am now trying to understand some of the sample code, particularly “SobelFilter” at the moment. Everything made sense until I reached the “SOBELSHARED” mode, which uses shared memory.

The code I have questions about (part 1), in
void SobelFilter(…):

dim3 threads(16,4);
int BlockWidth = 80; // must be divisible by 16 for coalescing
dim3 blocks = dim3(iw/(4*BlockWidth)+(0!=iw%(4*BlockWidth)),
                   ih/threads.y+(0!=ih%threads.y));
int SharedPitch = ~0x3f&(4*(BlockWidth+2*Radius)+0x3f);
int sharedMem = SharedPitch*(threads.y+2*Radius);

// for the shared kernel, width must be divisible by 4
iw &= ~3;


What’s the meaning of “BlockWidth”?

Why is “blocks” calculated like that? I understand the meaning of (0!=iw%(4*BlockWidth)), but why iw/(4*BlockWidth)? “iw” refers to the width of the image in pixels, right? Why divide by BlockWidth rather than threads.x? And why the factor of 4?

I guess SharedPitch is the width in bytes of the shared memory a block owns. Again, why is it calculated like that? Why the factor of 4? Is the “0x3f” stuff some kind of alignment?

I guess I’ll understand sharedMem once I understand SharedPitch. It looks right, because “threads.y+2*Radius” is the number of rows of shared memory for one block.

If “iw” is not a multiple of 4 in practice, do we just ignore the remaining pixels?

I have more questions about the kernel code, but I’m a little tired now. Hopefully many of them will disappear once the ones above are answered. Any help would be appreciated.

The complexities in sobelShared that you have identified are an artifact of how applications must interact with shared memory for best performance. To avoid shared memory bank conflicts, each thread in a half-warp (16 threads) must access a different 32-bit word in shared memory. This is still true of apps that traffic in smaller-than-32-bit words: if each thread accesses every fourth byte, you get the best shared memory performance.

For sobelFilter, the inner loop is unrolled by 4 to do exactly that: it accesses every fourth byte in shared memory to avoid conflicts. The unrolled loop also reduces the number of reads from shared memory: the vertical strip of the 3 leftmost pixels is replaced by the incoming vertical strip of pixels.

When computing SharedPitch, the add and mask of 0x3f align the shared memory pitch to the next-largest multiple of 64. As a result, shared memory conflicts are avoided as the app accesses different rows of pixel data. (Not unlike padding rows of pixels to 16-byte alignment for SSE2 pixel processing.)

BlockWidth specifies the width of the block of output pixels computed by each thread block. Since the code loops using BlockWidth, and shared memory is instanced per thread block, this number may be different from the number of threads.

Each thread block emits a block of pixels BlockWidth*4 wide. By enabling BlockWidth to be varied at runtime (subject to certain constraints) or fixed at compile time (FIXED_BLOCKWIDTH), we were able to optimize the thread block size and amount of shared memory used (derived from BlockWidth) by empirically measuring performance of different configurations.

Thanks a lot! Although I finally figured it out just before seeing your reply (I’ve been thinking about it for a whole day…), you put it in precise words. Thanks again.