Coalesced reads of short integers in CUDA

Say I want to load an array of shorts from global memory into shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.

If I understand it correctly, when I break my data into 32-byte (16 shorts) segments, do threads 0, 16, 32, … have to access the first element of each segment? Do I need to consider 64-byte or 128-byte alignment as well? I have a GTS 250, so I guess this is important. Any advice is welcome. Thanks.
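
To make the quoted rule concrete, here is a minimal host-side sketch of the address arithmetic. It assumes the rule exactly as worded above (segment size = 16 times the element size). The helper names `segment_bytes`, `is_segment_aligned`, and `thread_offset_bytes` are hypothetical and not part of the CUDA API.

```cuda
#include <cstddef>
#include <cstdint>

// Segment size implied by the quoted rule: one element per thread of a
// 16-thread half warp. (Hypothetical helper, not a CUDA API call.)
inline std::size_t segment_bytes(std::size_t elem_size) {
    return 16 * elem_size;
}

// True if `ptr` lies on a segment boundary for this element size.
inline bool is_segment_aligned(const void* ptr, std::size_t elem_size) {
    return reinterpret_cast<std::uintptr_t>(ptr) % segment_bytes(elem_size) == 0;
}

// Byte offset, measured from the start of its segment, that thread k of a
// half warp should read under the rule.
inline std::size_t thread_offset_bytes(unsigned k, std::size_t elem_size) {
    return (k % 16) * elem_size;
}
```

With 2-byte shorts this gives 32-byte segments, and with 4-byte ints it gives 64-byte segments. Pointers returned by `cudaMalloc` are aligned to at least 256 bytes, so the start of an allocation satisfies either requirement.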

There’s no need to read and write as shorts. Copy the N shorts into shared memory as N/2 full 32-bit words; you get perfectly coalesced reads that way. Once they’re in shared memory you can access them as shorts: just cast the shared int pointer to a short pointer.
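
A minimal sketch of that idea, under some assumptions: the tile size `N_SHORTS` and the names `copy_shorts_as_ints` and `run_copy_demo` are made up for illustration, the input starts at a `cudaMalloc`'d address, and the element count is a multiple of the tile size. As far as I recall from the programming guide, compute capability 1.0/1.1 devices only coalesce 4-, 8-, or 16-byte words, which is why the load goes through an `int` pointer.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile size: number of shorts each block stages (must be even).
#define N_SHORTS 512

// Each block copies N_SHORTS shorts from global to shared memory as
// N_SHORTS/2 32-bit words, then reads them back as shorts.
// Assumes `in` is at least 4-byte aligned and the total element count is a
// multiple of N_SHORTS.
__global__ void copy_shorts_as_ints(const short* in, short* out) {
    __shared__ int tile[N_SHORTS / 2];

    // Reinterpret this block's slice of the short array as 32-bit words.
    const int* in_words =
        reinterpret_cast<const int*>(in + blockIdx.x * N_SHORTS);

    // Consecutive threads read consecutive 4-byte words.
    for (int i = threadIdx.x; i < N_SHORTS / 2; i += blockDim.x) {
        tile[i] = in_words[i];
    }
    __syncthreads();

    // View the shared words as shorts again and write them out unchanged.
    const short* tile_shorts = reinterpret_cast<const short*>(tile);
    for (int i = threadIdx.x; i < N_SHORTS; i += blockDim.x) {
        out[blockIdx.x * N_SHORTS + i] = tile_shorts[i];
    }
}

// Host-side check (hypothetical helper): round-trips 4 tiles through the kernel.
bool run_copy_demo() {
    const int blocks = 4, n = blocks * N_SHORTS;
    std::vector<short> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<short>(i - 1000);

    short* d_in = nullptr;
    short* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(short)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(short)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(short), cudaMemcpyHostToDevice);

    copy_shorts_as_ints<<<blocks, 128>>>(d_in, d_out);
    bool ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(short),
                         cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok && h_in == h_out;
}
```

The write back to `out` is done as shorts only to keep the demo short; in a real kernel you would do your work on `tile_shorts` and, if the result is also an array of shorts, pack it into ints before storing.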

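Do you mean something like the following?

[indent]short* global_mem;  // N shorts in global memory, N even

Inside the kernel:

[indent]__shared__ int share_mem[N / 2];

... ...

// cast the pointer itself, not its address
const int* cast_int_ptr = (const int*) global_mem;

// each thread copies one 32-bit word (two shorts)
if (threadIdx.x < N / 2) share_mem[threadIdx.x] = cast_int_ptr[threadIdx.x];
__syncthreads();

... ...

short* cast_ptr = (short*) share_mem;[/indent]

... ...[/indent]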
