coalesced read short integer cuda

small_potato · October 19, 2010, 7:25am

Say I want to load an array of shorts from global memory into shared memory. I am not sure how coalescing works here. The Best Practices Guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.

If I understand it correctly, when I break my data into 32-byte (16 shorts) segments, threads 0, 16, 32, … have to access the first element of each segment? Do I need to consider 64-byte or 128-byte alignment as well? I have a GTS 250, so I guess this is important. Advice is welcome. Thanks.
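
To make the quoted rule concrete, here is a small sketch of the arithmetic it implies: with 16 threads per half warp, the segment is 16 times the element size, so 32 bytes for short and 64 bytes for int. The helper names `segment_bytes` and `is_segment_aligned` are made up for illustration. This only checks the alignment part of the rule; it says nothing about whether a particular device actually coalesces accesses of that element size.

```cuda
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: segment size implied by the quoted rule,
// i.e. 16 threads per half warp times the size of one element.
template <typename T>
constexpr std::size_t segment_bytes() {
    return 16 * sizeof(T);
}

// Hypothetical helper: true if p sits on a segment boundary for type T.
template <typename T>
bool is_segment_aligned(const T* p) {
    return reinterpret_cast<std::uintptr_t>(p) % segment_bytes<T>() == 0;
}

int main() {
    short* d_data = nullptr;
    if (cudaMalloc(&d_data, 1024 * sizeof(short)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }

    std::printf("segment for short: %zu bytes\n", segment_bytes<short>());
    std::printf("segment for int:   %zu bytes\n", segment_bytes<int>());

    // Only the addresses are inspected on the host; nothing is dereferenced.
    std::printf("base aligned:             %d\n", is_segment_aligned(d_data));
    std::printf("base + 16 shorts aligned: %d\n", is_segment_aligned(d_data + 16));
    std::printf("base + 1 short aligned:   %d\n", is_segment_aligned(d_data + 1));

    cudaFree(d_data);
    return 0;
}
```

So under this reading, each half warp (starting at thread 0, 16, 32, …) would begin at an address that is a multiple of 32 bytes when reading shorts.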

SPWorley · October 19, 2010, 7:52am

There’s no need to read and write as shorts. Copy the N shorts into shared memory as N/2 full words… you can do perfect coalesced reads that way. After they’re in shared you can access them as shorts… just cast the shared int pointer into a short pointer.
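
A minimal sketch of that idea, with some assumptions that are not in the post: `N` is a compile-time constant and even, the kernel name `sum_shorts` and the summing step are only there to show the shorts being used, and the global pointer is at least 4-byte aligned (which holds for memory returned by `cudaMalloc`).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define N 512  // number of shorts handled by the block; assumed even

__global__ void sum_shorts(const short* global_mem, int* out) {
    // N shorts occupy N/2 32-bit words.
    __shared__ int share_mem[N / 2];

    // View the global array as 32-bit words (needs 4-byte alignment).
    const int* words = reinterpret_cast<const int*>(global_mem);

    // Consecutive threads read consecutive words.
    for (int i = threadIdx.x; i < N / 2; i += blockDim.x) {
        share_mem[i] = words[i];
    }
    __syncthreads();  // all words must be in shared memory before use

    // View the shared words as shorts again.
    const short* s = reinterpret_cast<const short*>(share_mem);

    // Illustrative use only: thread 0 sums all N shorts.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; ++i) sum += s[i];
        *out = sum;
    }
}

int main() {
    std::vector<short> h(N);
    int expected = 0;
    for (int i = 0; i < N; ++i) {
        h[i] = static_cast<short>(i % 7);
        expected += h[i];
    }

    short* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(short));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h.data(), N * sizeof(short), cudaMemcpyHostToDevice);

    sum_shorts<<<1, 128>>>(d_in, d_out);

    int result = 0;
    cudaError_t err = cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s\n", result == expected ? "match" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return result == expected ? 0 : 1;
}
```

The point of the cast is that every global load is a 32-bit word, so the 16-bit element size never shows up in the global memory access pattern; the shorts only reappear once the data is in shared memory.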

small_potato · October 21, 2010, 6:59am

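Do you mean something like the following?

    short* global_mem;   // N shorts in global memory, N even

Inside the kernel:

    __shared__ int share_mem[N / 2];   // N shorts stored as N/2 words

    ... ...

    // cast the pointer itself, not its address
    const int* cast_int_ptr = (const int*) global_mem;

    // each thread copies one 32-bit word (assumes N/2 threads per block)
    share_mem[threadIdx.x] = cast_int_ptr[threadIdx.x];
    __syncthreads();

    ... ...

    short* cast_ptr = (short*) share_mem;

    ... ...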
