Coalesced Loads Question

Will this type of load be automatically coalesced?

Thread  1 -> Address 132
Thread  2 -> Address 136
Thread  3 -> Address 140
Thread  4 -> Address 144
Thread  5 -> Address 148
Thread  6 -> Address 152
Thread  7 -> Address 156
Thread  8 -> Address 160
Thread  9 -> Address 164
Thread 10 -> Address 168
Thread 11 -> Address 172
Thread 12 -> Address 176
Thread 13 -> Address 180
Thread 14 -> Address 184
Thread 15 -> Address 188
Thread  0 -> Address 192

I know that if Thread 0 accessed Address 128, it would be. However, if I read from global memory in this pattern, would it be coalesced into 2 loads? Right now I'm thinking that unless I manually split the accesses, it will be 16 loads.

Thanks.

I'd say this is OK, at least going by the diagrams in the programming manual.

I've done some further testing with simple kernels, and here's what I found.

__global__ void kernel1(float* memory) {
        // threadIdx.x is unsigned, so for thread 0 the subtraction wraps
        // around and i becomes blockDim.x - 1, giving the rotated pattern
        // from the first post.
        int i = (threadIdx.x - 1) % blockDim.x;
        memory[i] = 2 * memory[i];
}

__global__ void kernel2(float* memory) {
        int i = (threadIdx.x - 1) % blockDim.x;
        // Same work as kernel1, but split into two branches: the first
        // covers threads 1..15 (rotated index did not wrap), the second
        // covers thread 0 (index wrapped to blockDim.x - 1).
        if (i <= threadIdx.x)
                memory[i] = 2 * memory[i];
        if (i > threadIdx.x)
                memory[i] = 2 * memory[i];
}

int main() {
        float *memory;
        cudaMalloc((void**)&memory, 30 * sizeof(float));
        kernel1<<< dim3(1,1,1), dim3(16,1,1) >>>(memory + 1);
        kernel2<<< dim3(1,1,1), dim3(16,1,1) >>>(memory + 1);
        cudaFree(memory);
        return 0;
}

Both of these are run on <<< dim3(1,1,1), dim3(16,1,1) >>>.

The first one uses 16 loads and 32 stores.

The second uses 2 loads and 8 stores.

Now, I'm not sure at all why there are more stores than loads, but it would be nice if nvcc realized that kernel1 and kernel2 do exactly the same thing.

Edit: note that I am replying to your original post.

What hardware do you plan on running on?

If you will only run on compute 1.2 or newer hardware, and those are byte addresses (I'm assuming you are accessing 4-byte words), you will be fine: all of the addresses fall within the single 128-byte segment covering bytes 128 to 255.

If you plan to run on compute 1.1 or 1.0 hardware, this pattern will not be coalesced and will instead be split into 16 separate accesses.
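To make the compute 1.2 rule concrete, here is a quick host-side sketch of the segment math as I read it from the programming guide (illustrative only, not code from this thread):

#include <stdio.h>

/* Sketch of the compute 1.2 coalescing check as I read the programming
   guide: for 4-byte words, a half-warp's accesses collapse into one
   transaction per distinct 128-byte segment touched. */
int main(void)
{
    unsigned addresses[16];
    for (int t = 1; t < 16; ++t)
        addresses[t] = 128 + 4 * t;   /* threads 1..15 -> 132..188 */
    addresses[0] = 192;               /* thread 0 wraps to 192 */

    unsigned first = addresses[0] / 128, last = first;
    for (int t = 0; t < 16; ++t) {
        unsigned seg = addresses[t] / 128;  /* 128-byte segment index */
        if (seg < first) first = seg;
        if (seg > last)  last  = seg;
    }
    /* Prints "segments touched: 1..1": every address sits in the
       segment [128, 256), so the half-warp needs one transaction. */
    printf("segments touched: %u..%u\n", first, last);
    return 0;
}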

I currently only have access to compute 1.1 and 1.0 hardware. I am using floats (4-byte words).

What I currently do is as follows:

// Rotate the thread index within the block by blockIdx.x; the unsigned
// wraparound is only exact when blockDim.x is a power of two.
unsigned const tidx = (threadIdx.x - blockIdx.x) % blockDim.x;
// Consecutive blocks overlap by one element in each dimension.
unsigned const xIndex = (blockDim.x - 1) * blockIdx.x + tidx;
unsigned const yIndex = (blockDim.y - 1) * blockIdx.y + threadIdx.y;
unsigned const xyIndex = xIndex + yIndex * n1;  // n1: elements per row

Then I access based on xyIndex. Would that work on compute 1.2 (or 1.3)?
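For concreteness, here is a small host-side sketch (my own illustration, using a hypothetical blockDim.x of 4 and three blocks) of the indices this scheme produces:

#include <stdio.h>

/* Illustrative only: emulate the index math on the host with a
   hypothetical blockDim.x of 4 to show the per-block rotation. */
int main(void)
{
    unsigned const blockDimX = 4;
    for (unsigned b = 0; b < 3; ++b) {             /* blockIdx.x  */
        for (unsigned t = 0; t < blockDimX; ++t) { /* threadIdx.x */
            unsigned tidx   = (t - b) % blockDimX; /* unsigned wraparound */
            unsigned xIndex = (blockDimX - 1) * b + tidx;
            printf("block %u, thread %u -> tidx %u, xIndex %u\n",
                   b, t, tidx, xIndex);
        }
    }
    return 0;
}

Block 0 produces xIndex 0,1,2,3; block 1 produces 6,3,4,5; block 2 produces 8,9,6,7. So each block's threads are rotated by blockIdx.x and consecutive blocks overlap by one element, which is exactly the shifted pattern from the first post.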