I’m looking to get maximum performance when copying memory from global to register or from global to shared memory.
As far as I know, we have the following limitations (CUDA 2.3):
- we cannot use C structs in register memory or shared memory, so copying data as a whole struct, as in C, is not possible.
- an array with a variable index is not possible in register memory (but it is OK in shared memory)
- the largest built-in data type is 16 bytes (float4), which I guess can be copied at once.
- global memory has high latency on each memory read (and thus on each copy to register or shared memory).
- I did not find any CUDA function like the C host function strcpy() that can copy memory from global to register or from global to shared.
So, I’m trying to copy the largest possible amount of memory at once from global to register (best) or from global to shared.
Is there any possibility to copy more than 16 bytes at once from global memory to register memory (with float4, for instance)?
I’ve read somewhere that the G200 memory controller is 512 bits wide, so I hope I can copy 64 bytes at once.
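
To make the question concrete, here is a minimal sketch of the 16-byte copy I have in mind: each thread loads one float4 from global memory into a register, stages it in shared memory, and writes it back so the result can be checked on the host. The names copy_float4, copy_mismatches and THREADS are made up for this example, and the block size of 128 is an arbitrary choice, not a tuned value. I have not measured whether this is the fastest way to do it.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 128  // threads per block (arbitrary choice for this sketch)

// Each thread moves one float4 (16 bytes): global -> register -> shared -> global.
__global__ void copy_float4(const float4 *in, float4 *out, int n)
{
    __shared__ float4 tile[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];            // global -> register, one 16-byte load
        tile[threadIdx.x] = v;       // register -> shared
        out[i] = tile[threadIdx.x];  // shared -> global (each thread reads its own slot)
    }
}

// Hypothetical host helper: copies n float4 values through the kernel and
// returns the number of mismatched elements, or -1 if a CUDA call fails.
int copy_mismatches(int n)
{
    size_t bytes = (size_t)n * sizeof(float4);
    float4 *h_in = (float4 *)malloc(bytes);
    float4 *h_out = (float4 *)malloc(bytes);
    if (!h_in || !h_out) { free(h_in); free(h_out); return -1; }

    for (int i = 0; i < n; ++i)
        h_in[i] = make_float4(4.0f * i, 4.0f * i + 1.0f, 4.0f * i + 2.0f, 4.0f * i + 3.0f);

    float4 *d_in = NULL, *d_out = NULL;
    int bad = -1;
    if (cudaMalloc((void **)&d_in, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_out, bytes) == cudaSuccess &&
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        int blocks = (n + THREADS - 1) / THREADS;  // round up to cover all n elements
        copy_float4<<<blocks, THREADS>>>(d_in, d_out, n);
        if (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            bad = 0;
            for (int i = 0; i < n; ++i)
                if (h_in[i].x != h_out[i].x || h_in[i].y != h_out[i].y ||
                    h_in[i].z != h_out[i].z || h_in[i].w != h_out[i].w)
                    ++bad;
        }
    }
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return bad;
}
```

Calling copy_mismatches(1 << 20) from main() and checking that it returns 0 would confirm the copy is correct. Consecutive threads read consecutive float4 elements here, so each thread still moves only 16 bytes per load, which is exactly the limit I am asking about.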