Will this type of load be automatically coalesced?
Thread 1 -> Address 132
Thread 2 -> Address 136
Thread 3 -> Address 140
Thread 4 -> Address 144
Thread 5 -> Address 148
Thread 6 -> Address 152
Thread 7 -> Address 156
Thread 8 -> Address 160
Thread 9 -> Address 164
Thread 10 -> Address 168
Thread 11 -> Address 172
Thread 12 -> Address 176
Thread 13 -> Address 180
Thread 14 -> Address 184
Thread 15 -> Address 188
Thread 0 -> Address 192
I know that if Thread 0 accessed Address 128, it would be coalesced. However, if I read global memory in this pattern, would it be coalesced into two loads? Right now I’m thinking that unless I split the accesses manually, it will be 16 loads.
Thanks.
I’d say this is OK, at least going by the diagrams in the programming manual.
I’ve done some further testing with simple kernels, and here’s what I found.
__global__ void kernel1(float* memory) {
    // threadIdx.x is unsigned, so thread 0 wraps around to blockDim.x - 1.
    int i = (threadIdx.x - 1) % blockDim.x;
    memory[i] = 2 * memory[i];
}

__global__ void kernel2(float* memory) {
    int i = (threadIdx.x - 1) % blockDim.x;
    // Same work as kernel1, just split across two complementary branches.
    if (i <= threadIdx.x)
        memory[i] = 2 * memory[i];
    if (i > threadIdx.x)
        memory[i] = 2 * memory[i];
}

int main() {
    float *memory;
    cudaMalloc((void**)&memory, 30 * sizeof(float)); // room for 30 floats
    kernel1<<< dim3(1,1,1), dim3(16,1,1) >>>(memory + 1); // +1 word misaligns the accesses
    kernel2<<< dim3(1,1,1), dim3(16,1,1) >>>(memory + 1);
    cudaFree(memory);
    return 0;
}
Both of these are run on <<< dim3(1,1,1), dim3(16,1,1) >>>.
The first one uses 16 loads and 32 stores.
The second uses 2 loads and 8 stores.
I’m not sure why there are more stores than loads, but it would be nice if nvcc realized that kernel1 and kernel2 do exactly the same thing.
Edit: note, I am replying to your original post.
What hardware do you plan on running on?
If you will only run on compute 1.2 and newer hardware, and you are listing byte addresses (I’m assuming you are accessing 4-byte words), you will be fine, as all of the addresses fall within the single 128-byte segment spanning addresses 128 to 255.
If you plan to run on compute 1.1 or 1.0 hardware, this pattern will not be coalesced and will be split into 16 accesses.
I currently only have access to compute 1.1 and 1.0. I am using floats (4 byte words).
What I currently do is as follows:
unsigned const tidx = (threadIdx.x - blockIdx.x) % blockDim.x; // unsigned wraparound rotates the threads by blockIdx.x
unsigned const xIndex = (blockDim.x - 1) * blockIdx.x + tidx;  // blocks advance by blockDim.x - 1, so neighbours overlap by one element
unsigned const yIndex = (blockDim.y - 1) * blockIdx.y + threadIdx.y;
unsigned const xyIndex = xIndex + yIndex * n1;
Then I access based on xyIndex. Would that be coalesced on compute 1.2 (or 1.3)?