Say I want to load an array of short from global memory into shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.
If I understand it correctly, if I break my data into 32-byte (16 shorts) segments, do threads 0, 16, 32, … have to access the first element of each segment? Do I need to consider 64-byte or 128-byte alignment as well? I have a GTS 250, so I guess this is important. Any advice is welcome. Thanks.
There’s no need to read and write as shorts. Copy the N shorts into shared memory as N/2 full words… you can do perfect coalesced reads that way. After they’re in shared you can access them as shorts… just cast the shared int pointer into a short pointer.
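A minimal sketch of that idea, under some assumptions: the element count `N_SHORTS` is a hypothetical compile-time constant and is even, the global buffers come from `cudaMalloc` (so they are aligned well beyond 4 bytes), and a single block does the copy. The names `copy_shorts_as_words` and `run_copy_demo` are made up for illustration. Each thread loads one 32-bit word (two shorts), so thread k of a half warp reads word k of an aligned segment, which is the pattern the coalescing rule for compute capability 1.0/1.1 describes.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_SHORTS 512  // hypothetical element count; assumed even

// Loads N_SHORTS shorts as N_SHORTS / 2 32-bit words, then reads them
// back from shared memory as shorts and writes (value + 1) to g_out.
__global__ void copy_shorts_as_words(const short* g_in, short* g_out)
{
    __shared__ int s_words[N_SHORTS / 2];

    // Assumes g_in is at least 4-byte aligned (true for cudaMalloc pointers).
    const int* g_words = reinterpret_cast<const int*>(g_in);

    // Consecutive threads read consecutive 32-bit words.
    for (int i = threadIdx.x; i < N_SHORTS / 2; i += blockDim.x)
        s_words[i] = g_words[i];
    __syncthreads();

    // View the same shared buffer as an array of shorts.
    const short* s_shorts = reinterpret_cast<const short*>(s_words);

    // Pack two results per thread so the store is also one 32-bit word.
    short2* g_out_pairs = reinterpret_cast<short2*>(g_out);
    for (int i = threadIdx.x; i < N_SHORTS / 2; i += blockDim.x) {
        short2 v;
        v.x = static_cast<short>(s_shorts[2 * i] + 1);
        v.y = static_cast<short>(s_shorts[2 * i + 1] + 1);
        g_out_pairs[i] = v;
    }
}

// Host-side check: returns the number of mismatches, or -1 on a CUDA error.
int run_copy_demo()
{
    std::vector<short> h_in(N_SHORTS), h_out(N_SHORTS, 0);
    for (int i = 0; i < N_SHORTS; ++i)
        h_in[i] = static_cast<short>(i - 100);

    const size_t bytes = N_SHORTS * sizeof(short);
    short* d_in = nullptr;
    short* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    copy_shorts_as_words<<<1, 128>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < N_SHORTS; ++i)
        if (h_out[i] != static_cast<short>(h_in[i] + 1)) ++mismatches;
    return mismatches;
}
```

Calling `run_copy_demo()` from `main` should return 0 on a working device. If the short array starts at an odd element offset inside a larger buffer, the `int` cast is no longer aligned and this sketch does not apply as written.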
Do you mean something like the following?
    short* global_mem;
inside the kernel:
    __shared__ int share_mem[N / 2];   // N shorts stored as N/2 words
    ... ...
    int* cast_int_ptr = (int*) global_mem;   // cast the pointer itself, not its address
    share_mem[i] = cast_int_ptr[i];   // i = this thread's word index, i < N / 2
    ... ...
    short* cast_ptr = (short*) share_mem;
    ... ...