Block indexing

Hi Guys,

This is Armstrong, a newbie in CUDA programming, and this is my first post.

I am trying to learn the convolutionSeparable code in the CUDA SDK.

It's been a week since I started trying to understand it, but no real breakthrough.

I find coalesced access very difficult to understand.

How does this code segment facilitate coalesced access?

const int apronStartAligned = tileStart - KERNEL_RADIUS_ALIGNED;

const int loadPos = apronStartAligned + threadIdx.x;


Further, where can I find simpler programming tutorials than the hard-to-crack ones in the SDK?

David Armstrong

Haven't spent time on this one in particular, but take a look at the transpose example.
The naive kernel is easy to understand, and the coalesced kernel will show you how to tackle the job of coalescing something that isn't naturally coalesced.


This is from the convolutionSeparable example, right?

The rule is this:

(globalArrayIndex modulo 16) == (threadIdx.x modulo 16) -> coalesced access

tileStart is always a multiple of 16. KERNEL_RADIUS_ALIGNED is always a multiple of 16 too: it is the kernel radius rounded up to the next multiple of 16. The difference of two multiples of 16 is itself a multiple of 16, hence apronStartAligned is always a multiple of 16.
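To make the alignment arithmetic concrete, here is a minimal host-side sketch (plain C++, which should also compile under nvcc). The names HALF_WARP, roundUp and the function apronStartAligned, as well as the sample values 128 and 8, are my own assumptions for illustration and are not taken from the SDK source.

```cpp
#include <cassert>

// Assumed group size used by the rule above (illustrative name).
constexpr int HALF_WARP = 16;

// Round n up to the next multiple of `multiple` (assumes n >= 0, multiple > 0).
constexpr int roundUp(int n, int multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// Start of the aligned apron: tile start minus the padded kernel radius.
// Mirrors "tileStart - KERNEL_RADIUS_ALIGNED" from the question.
constexpr int apronStartAligned(int tileStart, int kernelRadius) {
    return tileStart - roundUp(kernelRadius, HALF_WARP);
}

// Hypothetical sample values: a radius of 8 is padded to 16, so a tile
// starting at 128 gets an apron start of 112, which is 7 * 16.
static_assert(roundUp(8, HALF_WARP) == 16, "radius 8 pads to 16");
static_assert(apronStartAligned(128, 8) == 112, "128 - 16");
static_assert(apronStartAligned(128, 8) % HALF_WARP == 0, "still aligned");
```

Padding the radius wastes a few loads on the left edge of the apron, but it keeps the first load of every tile on a 16-element boundary.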

So, Position (A):

(apronStartAligned + 0) modulo 16 -> 0.

Then, threadIdx.x must be 0 when reading that position.

Similarly, let’s take a look at position (B),

(apronStartAligned + 1) modulo 16 -> 1.

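threadIdx.x must be 1 when reading position (B), and that is exactly what loadPos = apronStartAligned + threadIdx.x gives you: thread k reads offset k from an aligned start.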
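If it helps, the same check can be run for every thread rather than just positions (A) and (B). This is only a host-side sketch of the rule as stated in this post, not code from the SDK; followsRule, allThreadsFollowRule and HALF_WARP_SIZE are hypothetical names, and base is assumed to be non-negative.

```cpp
#include <cassert>

// Assumed group size from the rule above (illustrative name).
constexpr int HALF_WARP_SIZE = 16;

// The rule from this post: the index read by a thread must have the same
// remainder modulo 16 as that thread's threadIdx.x.
bool followsRule(int globalArrayIndex, int threadIdxX) {
    return (globalArrayIndex % HALF_WARP_SIZE) == (threadIdxX % HALF_WARP_SIZE);
}

// Simulate loadPos = base + threadIdx.x for every thread in a block
// and report whether all of them satisfy the rule (assumes base >= 0).
bool allThreadsFollowRule(int base, int blockSize) {
    for (int tx = 0; tx < blockSize; ++tx) {
        const int loadPos = base + tx;  // same form as the line in the question
        if (!followsRule(loadPos, tx)) {
            return false;
        }
    }
    return true;
}
```

With an aligned base such as 112, allThreadsFollowRule(112, 64) holds for all 64 threads. With an unaligned base such as 120, thread 0 would read an index with remainder 8, so the check fails.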

Hope this explanation works. Keep at it, it WILL click into place.