I’m looking to get maximum performance when copying memory from global to registers or from global to shared memory.
As far as I know, we have the following limitations (CUDA 2.3):
- we cannot use C structs in register memory or shared memory, so copying data as a struct, the way we would in C, is not possible.
- arrays with variable indices are not possible in register memory (but are OK in shared memory).
- the largest built-in data type is 16 bytes (float4), which I guess can be copied at once.
- global memory has high latency for each read (and thus for each copy to registers or shared memory).
- I did not find any CUDA function, like the C host function strcpy(), able to copy memory from global to registers or from global to shared.
So, I’m trying to copy the largest possible amount of memory from global to registers (best) or from global to shared at once.
Is there any possibility to copy more than 16 bytes at once from global memory to register memory (with float4, for instance)?
I’ve read somewhere that the G200 memory controller is 512 bits wide, so I hope I can copy 64 bytes at once.
I believe the best way is to copy 4 bytes per thread. If each thread in a warp reads consecutive
addresses from an aligned segment of global memory, the hardware will coalesce them into memory
reads with the full memory bus width, and give you maximum performance.
E.g.: copy n bytes from offset m of G in global memory to S in shared memory by:
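A minimal sketch of such a copy loop (the names G, S, n and m come from the sentence above; the kernel name, the float granularity, and the alignment assumptions are mine):

```cuda
// Hypothetical sketch: copy n bytes starting at byte offset m of global
// array G into shared array S, 4 bytes (one float) per thread.
// Assumes m and n are multiples of sizeof(float) and the segment is aligned.
__global__ void copyToShared(const float *G, size_t m, size_t n)
{
    extern __shared__ float S[];
    const float *src = G + m / sizeof(float);
    size_t nFloats = n / sizeof(float);

    // Consecutive threads read consecutive addresses -> coalesced reads.
    for (size_t i = threadIdx.x; i < nFloats; i += blockDim.x)
        S[i] = src[i];

    __syncthreads();  // make the copied data visible to the whole block
    // ... use S here ...
}
```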
I think your suggestion of three arrays instead of an array of structs is the way to go.
This is also the recommended way in the CUDA Programming Manual.
Reading a float4 from each thread might be slower than reading a float. I don’t know if that
is still the case on the G200 chip. Try!
If you do your loads per warp, you don’t have to synchronize: within a warp, all threads are always
synchronized.
E.g. if your struct has 4 elements (so that you end up having four arrays), you can read one element
at a time in each thread of a warp; do this four times, so you have read 4*32 elements, which means one
struct per thread.
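A sketch of that struct-of-arrays scheme (the array names a0..a3, the kernel name, and the stand-in computation are my assumptions):

```cuda
// Hypothetical sketch: the 4-element struct is split into four arrays.
// Each thread issues four coalesced float loads, one per array, so a
// warp performs 4*32 loads and each thread ends up with one full struct.
__global__ void loadStructOfArrays(const float *a0, const float *a1,
                                   const float *a2, const float *a3,
                                   float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // No __syncthreads() needed: each thread touches only its own data,
    // and threads within a warp execute in lockstep on this hardware.
    float e0 = a0[i];
    float e1 = a1[i];
    float e2 = a2[i];
    float e3 = a3[i];

    out[i] = e0 + e1 + e2 + e3;  // stand-in for real work on the struct
}
```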
OK, I’ve tested which is faster for loading and storing, 1 float4 vs. 4 floats (I didn’t test loading and storing separately, since that doesn’t make sense in a real situation), on a GTX 260 216SP (G200 chip).
And the winner is… float4!
To be more accurate, a float4 costs the same as 3 floats, so there is an interesting 25% gain on G200,
but I’m not sure whether it will be the same on G80.
Here is the code snippet tested:

// global memory declarations
float4 PtestAllFrom[BIG_NUMBER];
float4 PtestAllTo[BIG_NUMBER];
float PtestSingleFrom[4][BIG_NUMBER];
float PtestSingleTo[4][BIG_NUMBER];

// kernel sample
data->PtestSingleTo[0][i] = data->PtestSingleFrom[0][i];
data->PtestSingleTo[1][i] = data->PtestSingleFrom[1][i];
data->PtestSingleTo[2][i] = data->PtestSingleFrom[2][i];
data->PtestSingleTo[3][i] = data->PtestSingleFrom[3][i];
data->PtestAllTo[i] = data->PtestAllFrom[i];
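For reference, the snippet could be fleshed out into two complete kernels roughly like this (the Data struct wrapper, kernel names, and index/bounds handling are my assumptions, not the exact benchmark code):

```cuda
// Hypothetical complete versions of the two copies being timed.
struct Data {
    float4 PtestAllFrom[BIG_NUMBER], PtestAllTo[BIG_NUMBER];
    float  PtestSingleFrom[4][BIG_NUMBER], PtestSingleTo[4][BIG_NUMBER];
};

__global__ void copySingle(Data *data)   // four separate float copies
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= BIG_NUMBER) return;
    data->PtestSingleTo[0][i] = data->PtestSingleFrom[0][i];
    data->PtestSingleTo[1][i] = data->PtestSingleFrom[1][i];
    data->PtestSingleTo[2][i] = data->PtestSingleFrom[2][i];
    data->PtestSingleTo[3][i] = data->PtestSingleFrom[3][i];
}

__global__ void copyAll(Data *data)      // one float4 copy per thread
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= BIG_NUMBER) return;
    data->PtestAllTo[i] = data->PtestAllFrom[i];
}
```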
My conclusion remains unchanged:
it’s really a shame there is no CUDA built-in type bigger than float4 :)
This is the correct advice… but on a slightly more advanced level, it may be that copying 64 bits per thread is slightly more efficient on G200 hardware.
From Billeter’s paper: he gives GTX 280 bandwidths of 77.8 GB/sec for 32 bits per thread, 102.5 GB/sec for 64 bits per thread, and 73.4 GB/sec for 128 bits per thread.
These are pass-through bandwidths (reading and immediately rewriting). It would be interesting to benchmark other memory access methods, and also try G80/G92 hardware.
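A 64-bits-per-thread pass-through kernel in the spirit of those benchmarks could be sketched like this (kernel and parameter names are mine; float2 is one way to express an 8-byte access):

```cuda
// Hypothetical sketch: read and immediately rewrite global memory,
// 64 bits (one float2) per thread, as in the pass-through measurements.
__global__ void passThrough64(const float2 *in, float2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // 8-byte load + 8-byte store, coalesced per warp
}
```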