Moving data from device global memory to device shared memory

Hi,

What is the fastest way to move an array from device global memory to device shared memory per one thread?

Thanks

Miki

Are you sure you wanna do this with ONE thread?

You cannot get coalesced memory access with a single thread, so you’re demanding the impossible.

The fastest way is to have each thread copy one element from global memory to shared memory. Be careful of bank conflicts; whether they occur depends on sizeof(type).

I don’t think so (assuming the OP meant to use one thread to copy the entire array).

Copy float4 types, which allows copying 16 bytes at once. This creates the largest single memory transaction that one thread can generate. It requires some pointer magic to work, and you need to compute how many float4 values are enough to cover the size of your original data. Use a for loop to copy the required number of float4 values. If the size is known in advance, unroll the loop.
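A rough sketch of what that could look like, not a tuned implementation. The kernel name, the helper checkSingleThreadCopy and the size N_FLOATS are made up for illustration, and it assumes the array length is a multiple of 4 floats and that the source pointer is 16-byte aligned (pointers returned by cudaMalloc are).

```cuda
#include <cuda_runtime.h>

// Hypothetical size for illustration: 256 floats = 64 float4 values.
#define N_FLOATS 256
#define N_VEC (N_FLOATS / 4)

__global__ void singleThreadCopy(const float *src, float *dst)
{
    // Declaring the tile as float4 keeps the shared array 16-byte aligned.
    __shared__ float4 tile[N_VEC];

    if (threadIdx.x == 0) {
        // The "pointer magic": view the float array as float4 values.
        // Assumes src is 16-byte aligned.
        const float4 *src4 = reinterpret_cast<const float4 *>(src);
        #pragma unroll
        for (int i = 0; i < N_VEC; ++i)
            tile[i] = src4[i];   // 16 bytes per iteration
    }
    __syncthreads();             // make the tile visible to the whole block

    // Write the tile back out so the host can verify the copy.
    const float *tileF = reinterpret_cast<const float *>(tile);
    for (int i = threadIdx.x; i < N_FLOATS; i += blockDim.x)
        dst[i] = tileF[i];
}

// Host-side check: returns true if the data survived the round trip.
bool checkSingleThreadCopy()
{
    float h_in[N_FLOATS], h_out[N_FLOATS];
    for (int i = 0; i < N_FLOATS; ++i) { h_in[i] = (float)i; h_out[i] = -1.0f; }

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    singleThreadCopy<<<1, 64>>>(d_in, d_out);
    bool ok = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < N_FLOATS; ++i)
        if (h_out[i] != h_in[i]) ok = false;
    return ok;
}
```

Only thread 0 touches global memory in the load phase, so the other threads of the block sit idle at the __syncthreads() barrier until it is done.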

Christian

Bank conflicts do not seem possible when using a single thread only.

However, if the original poster meant to use multiple threads but needs instructions on what to do in each single thread, then my answer would be: copy elements of size 4 bytes (float, int) per thread, regardless of the type in your array. Just make sure to use the right number of threads in total to cover your array size, and not many more. Allow for up to 3 bytes of padding at the end of the array to cover any excess reads. And make sure your array is tightly packed (no filler bytes in between) so as to avoid copying useless padding bytes between elements.

Reasons why I suggest this:

a) 64-byte coalesced memory access is fastest, according to the programming guide. 128-byte access is a little slower, and 256-byte access is about half the speed of a 64-byte access (this is with compute capability 1.1; not sure about 1.3). 64-byte coalesced access is achieved by having each thread of one half-warp access 4 bytes, where the starting address is a multiple of 64 bytes (important!).

b) Copying one float or int per thread in a consecutive way avoids bank conflicts altogether.
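The multi-thread variant could be sketched like this. The names wordCopy and checkWordCopy and the sizes are invented for the example, and it assumes the buffer has already been padded to a whole number of 4-byte words. Thread k copies word k, so neighbouring threads touch neighbouring addresses in global memory and neighbouring banks in shared memory.

```cuda
#include <cuda_runtime.h>

// Copies nWords 4-byte words into shared memory, one word per thread per step,
// then writes them back in reverse order so the result depends on the tile.
__global__ void wordCopy(const unsigned int *src, unsigned int *dst, int nWords)
{
    // Size is supplied as the third launch parameter.
    extern __shared__ unsigned int tile[];

    // Consecutive threads read consecutive 4-byte words.
    for (int i = threadIdx.x; i < nWords; i += blockDim.x)
        tile[i] = src[i];
    __syncthreads();   // all words must be loaded before anyone reads the tile

    for (int i = threadIdx.x; i < nWords; i += blockDim.x)
        dst[i] = tile[nWords - 1 - i];
}

// Host-side check: returns true if dst holds src reversed.
bool checkWordCopy()
{
    const int nWords = 256;    // hypothetical size
    const int nThreads = 64;   // hypothetical block size
    unsigned int h_in[nWords], h_out[nWords];
    for (int i = 0; i < nWords; ++i) { h_in[i] = 1000u + i; h_out[i] = 0u; }

    unsigned int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    wordCopy<<<1, nThreads, nWords * sizeof(unsigned int)>>>(d_in, d_out, nWords);
    bool ok = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < nWords; ++i)
        if (h_out[i] != h_in[nWords - 1 - i]) ok = false;
    return ok;
}
```

The element type of the original array does not matter here; the kernel only ever sees the buffer as a run of 4-byte words.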

Christian

Thanks

It is very helpful.

Could you please elaborate on this:
"Allow for up to 3 bytes of padding at the end of the array to cover any excess reads"

Thanks
Miki

Suppose the memory allocated for your array is 189 bytes (say 63 elements of 3 bytes each), but your copy method uses 48 threads, each copying 4 bytes. You then read 192 bytes in total. With this, you might be reading past the end of your allocated global memory segment (assuming you alloc’ed 189 bytes). In the worst case you’d see the kernel crash.

So allocate your global memory with enough padding space to allow for reading/writing in 4 byte blocks.

alloc_size = ((true_array_size + 3) / 4) * 4

I am not sure what granularity CUDA uses for memory allocations, so with the above formula you should be on the safe side of things.
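For what it's worth, here is how the rounding could be wrapped up on the host side. Both function names are hypothetical; zeroing the buffer is an extra step not mentioned above, added only so the pad bytes hold a defined value.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Round a byte count up to the next multiple of 4.
size_t paddedSize(size_t trueBytes)
{
    return ((trueBytes + 3) / 4) * 4;
}

// Hypothetical helper: allocate a padded device buffer and zero it.
// Returns nullptr on failure.
unsigned char *allocPadded(size_t trueBytes)
{
    unsigned char *d_buf = nullptr;
    const size_t bytes = paddedSize(trueBytes);
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess)
        return nullptr;
    if (cudaMemset(d_buf, 0, bytes) != cudaSuccess) {
        cudaFree(d_buf);
        return nullptr;
    }
    return d_buf;
}
```

With the numbers from the example, paddedSize(189) gives 192, which is exactly the 48 threads times 4 bytes that the copy reads. A size that is already a multiple of 4 is left unchanged.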

Christian

Thank you Christian.

You helped a lot.

Thanks

Miki