Moving data from device global memory to device shared memory

Miki · February 11, 2009, 3:35pm

Hi,

What is the fastest way to move an array from device global memory to device shared memory per one thread?

Thanks

Miki

cbuchner1 · February 11, 2009, 5:38pm

Are you sure you wanna do this with ONE thread?

Because you cannot get coalesced memory access with a single thread, so you’re demanding the impossible.

Quoc_Vinh · February 12, 2009, 1:18pm

The fastest way is to have each thread copy one element from global memory to shared memory. Be careful of bank conflicts; whether they occur depends on sizeof(type).

cbuchner1 · February 12, 2009, 1:30pm

I don’t think so (assuming the OP meant to use one thread to copy the entire array)

Copy float4 types, which lets you copy 16 bytes at once. This creates the largest single memory transaction that one thread can generate. It requires some pointer magic to work - and you need to compute how many float4 elements are enough to cover the size of your original data. Use a for loop to copy the required number of float4 elements. If the size is known in advance, unroll the loop.
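
A minimal sketch of the single-thread float4 copy described above, not a tuned implementation. The names `MAX_VEC`, `float4_count` and `copy_single_thread` are made up for illustration, and it assumes the source pointer is 16-byte aligned and the allocation is padded to a multiple of 16 bytes:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical capacity of the shared buffer, in float4 elements (an assumption).
constexpr int MAX_VEC = 256;   // 256 * 16 bytes = 4 KB of shared memory

// Number of 16-byte float4 elements needed to cover n_bytes of data.
__host__ __device__ inline int float4_count(size_t n_bytes)
{
    return static_cast<int>((n_bytes + 15) / 16);
}

// Thread 0 alone copies the array into shared memory, 16 bytes per load.
// Assumes src is 16-byte aligned, padded to a multiple of 16 bytes,
// and n_vec <= MAX_VEC.
__global__ void copy_single_thread(const float4* src, float4* out, int n_vec)
{
    __shared__ float4 tile[MAX_VEC];

    if (threadIdx.x == 0) {
        #pragma unroll 4
        for (int i = 0; i < n_vec; ++i)
            tile[i] = src[i];
    }
    __syncthreads();   // make the tile visible to the rest of the block

    // Write the tile back out so the copy can be checked from the host.
    for (int i = threadIdx.x; i < n_vec; i += blockDim.x)
        out[i] = tile[i];
}
```

The "pointer magic" is on the caller's side: an array of any tightly packed type would be passed as `reinterpret_cast<const float4*>(d_array)` together with `float4_count(size_in_bytes)`.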

Christian

cbuchner1 · February 12, 2009, 1:56pm

Bank conflicts do not seem possible when using a single thread only.

However, if the original poster meant to use multiple threads but needs instructions on what each single thread should do, then my answer would be: copy elements of size 4 bytes (float, int) per thread, regardless of the type in your array. Just make sure to use the right number of threads in total to cover your array size - and not many more. Allow for up to 3 bytes of padding at the end of the array to cover any excess reads. And make sure your array is tightly packed (no filler bytes in between) so as to avoid copying useless padding bytes between elements.

Reasons why I suggest this:

a ) 64 byte coalesced memory access is fastest, according to the programming guide. 128 byte access is a little slower, 256 byte access is about half the speed of a 64 byte access (this is with compute capability 1.1 - not sure about 1.3). 64 byte coalesced access is achieved by having each thread of one half warp access 4 bytes - where the starting address is a multiple of 64 bytes (important!)

b ) copying one float or int per thread in a consecutive way avoids bank conflicts altogether.
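
A rough sketch of the multi-thread variant, assuming a single block and an array that fits in one shared tile. `TILE_WORDS`, `words_to_cover` and `copy_cooperative` are hypothetical names, and the source buffer is assumed to be padded to a multiple of 4 bytes and to start at a 64-byte-aligned address:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical capacity of the shared tile, in 4-byte words (an assumption).
constexpr int TILE_WORDS = 512;   // 2 KB of shared memory

// Number of 4-byte words needed to cover n_bytes of data.
__host__ __device__ inline int words_to_cover(size_t n_bytes)
{
    return static_cast<int>((n_bytes + 3) / 4);
}

// Each thread copies 4-byte words, regardless of the array's element type.
// Consecutive threads read consecutive words, so a half warp touches one
// contiguous 64-byte segment, and consecutive shared words fall into
// distinct banks. Assumes n_words <= TILE_WORDS.
__global__ void copy_cooperative(const unsigned int* src, unsigned int* out,
                                 int n_words)
{
    __shared__ unsigned int tile[TILE_WORDS];

    // Strided loop: also works when there are fewer threads than words.
    for (int i = threadIdx.x; i < n_words; i += blockDim.x)
        tile[i] = src[i];
    __syncthreads();   // all words must be in place before anyone reads them

    // Write the tile back out so the copy can be checked from the host.
    for (int i = threadIdx.x; i < n_words; i += blockDim.x)
        out[i] = tile[i];
}
```

A launch could look like `copy_cooperative<<<1, 64>>>(d_src, d_out, words_to_cover(size_in_bytes));`, where `d_src`, `d_out` and `size_in_bytes` are placeholders for your own buffers and size.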

Christian

Miki · February 12, 2009, 3:17pm

Thanks

It is very helpful.

Could you please elaborate on this:
"Allow for up to 3 bytes of padding at the end of the array to cover any excess reads"

Thanks
Miki

cbuchner1 · February 12, 2009, 3:37pm

Assume the memory allocated for your array is 189 bytes (say 63 elements of 3 bytes each), but your copy method uses 48 threads, each copying 4 bytes, so you read 192 bytes in total. With this, you might be reading past the end of your allocated global memory segment (assuming you alloc'ed 189 bytes). In the worst case you'd see the kernel crash.

So allocate your global memory with enough padding space to allow for reading/writing in 4 byte blocks.

alloc_size = ((true_array_size + 3) / 4) * 4

I am not sure what granularity CUDA uses for memory allocations, so with the above formula you should be on the safe side of things.
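
The formula above could be wrapped on the host side roughly like this. `padded_alloc_size` and `alloc_padded` are made-up helper names; only `cudaMalloc` is an actual CUDA runtime call:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Round the true size in bytes up to the next multiple of 4,
// using integer division exactly as in the formula above.
inline size_t padded_alloc_size(size_t true_array_size)
{
    return ((true_array_size + 3) / 4) * 4;
}

// Hypothetical wrapper: allocate global memory with room for 4-byte block reads.
inline cudaError_t alloc_padded(void** dev_ptr, size_t true_array_size)
{
    return cudaMalloc(dev_ptr, padded_alloc_size(true_array_size));
}
```

For the 189-byte example this gives 192 bytes, which matches the 48 threads times 4 bytes that the copy reads.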

Christian

Miki · February 12, 2009, 3:41pm

Thank you Christian.

You helped a lot.

Thanks

Miki
