Register / shared memory question: maximum-performance memory copy

Hi everyone,

I’m looking to get maximum performance when copying memory from global to register or from global to shared.
As far as I know we have the following limitations (CUDA 2.3):

  • we cannot use C structs in register or shared memory, so copying data struct-by-struct as in C is not possible;
  • an array with a variable index is not possible in register memory (but it is OK in shared memory);
  • the largest built-in data type is 16 bytes (float4), which I guess is the most that can be copied at once;
  • global memory has a big latency for each read (and thus for each copy to registers or shared memory).

  • I did not find any CUDA function, like the C host function memcpy(), able to copy memory from global to register or from global to shared.

So, I’m trying to copy the largest possible amount of memory from global to registers (best) or from global to shared at once.

Is there any possibility of copying more than 16 bytes (one float4) at once from global memory to register memory?

I’ve read somewhere that the G200 memory controller is 512 bits wide, so I hope I can copy 64 bytes at once.

I believe the best way to copy is 4 bytes per thread. If each thread in a warp reads consecutive
addresses from an aligned segment of global memory, the hardware will coalesce them into memory
reads that use the full memory bus width, giving you maximum performance.

E.g.: copy n elements starting at offset m of G in global memory to S in shared memory with:

const int tid = threadIdx.x, nt = blockDim.x;

for (int i = tid; i < n; i += nt)
    S[i] = G[m + i];
__syncthreads();

If m is not aligned (assuming G[0] is) and you have a G80 card (which doesn’t issue coalesced accesses
for reads that aren’t 64-byte aligned), you can do:

for (int i = m - (m & 15) + tid; i < m + n; i += nt)
    if (i >= m) S[i - m] = G[i];
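
To make this concrete, here is a minimal sketch of the aligned variant as a complete kernel, assuming 4-byte float elements and hypothetical names for the buffers and the follow-up work:

// Hypothetical kernel: each block stages n floats starting at G[m] into
// shared memory with coalesced reads (one 4-byte element per thread per
// iteration), then uses the staged data. The write-back loop is a placeholder.
__global__ void stage_to_shared(const float *G, float *out, int m, int n)
{
    extern __shared__ float S[];          // n floats, sized at launch time

    const int tid = threadIdx.x, nt = blockDim.x;

    for (int i = tid; i < n; i += nt)     // consecutive threads hit consecutive addresses
        S[i] = G[m + i];
    __syncthreads();                      // staged data now visible to the whole block

    for (int i = tid; i < n; i += nt)     // placeholder: write the staged data back out
        out[i] = S[i];
}

// Launch with the shared-memory size as the third configuration parameter:
// stage_to_shared<<<1, 256, n * sizeof(float)>>>(G, out, m, n);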

Thanks for your response. I totally agree with this analysis, but actually my problem is that each thread needs to work on (and thus needs to load from global memory) about 60-80 bytes.

Let’s choose the first of your examples, since I use a G200 board.

The for loop runs within a single thread, so I guess the copy instruction will incur a big latency at each iteration, won’t it?

Is the solution to my problem to split my 60-byte data structure into arrays of 4-byte elements and load them as suggested, 4 bytes per thread, so that I get coalesced reads from global memory and thus the full bandwidth?

In that case I have to read the arrays one after another in each thread.

So instead of having:

struct St {
    float f1;
    int   i1;
    float f2;
};

St ar[bignumber];

as data structure and

int i = threadIdx.x + blockIdx.x * blockDim.x;

Rf1 = ar[i].f1;
Ri1 = ar[i].i1;
Rf2 = ar[i].f2;
__syncthreads();

as kernel code (assuming Rf1, Ri1 and Rf2 are held in registers),

I should use:

float arf1[bignumber];
int   ari1[bignumber];
float arf2[bignumber];

as data structure and

int i = threadIdx.x + blockIdx.x * blockDim.x;

Rf1 = arf1[i];
__syncthreads();
Ri1 = ari1[i];
__syncthreads();
Rf2 = arf2[i];
__syncthreads();

as kernel code.

Is this the best way to get the maximum bandwidth from global memory in this case?

Maybe I could load float4 instead of float for more convenient use and to reduce the number of __syncthreads() calls?

I think your suggestion of three arrays instead of an array of structs is the way to go.
This is also the recommended way in the CUDA Programming Manual.

Reading a float4 from each thread might be slower than reading a float. I don’t know if that
is still the case on the G200 chip. Try!

If you do your loads per warp, you don’t have to synchronize: within a warp all threads are always
synchronized.
E.g. if your struct has 4 elements (so that you end up having four arrays), you can read one element
at a time in each thread of a warp and do this four times, so you have read 4*32 elements, which means
one full struct per thread of the warp.
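
For example, with separate arrays the loads could look like the sketch below (hypothetical names; each thread ends up with one full struct’s worth of fields in registers, and no __syncthreads() is needed because no thread consumes data loaded by another thread):

// Structure-of-arrays layout: consecutive threads read consecutive elements
// of each array, so every load is coalesced. Each thread keeps its own
// f1/i1/f2 in registers; the final computation is just a placeholder.
__global__ void load_soa(const float *arf1, const int *ari1, const float *arf2,
                         float *result, int count)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    float Rf1 = arf1[i];   // coalesced: thread k of a warp reads element base + k
    int   Ri1 = ari1[i];
    float Rf2 = arf2[i];

    result[i] = Rf1 + Ri1 * Rf2;   // placeholder per-thread work
}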

I only have a G200 (GTX 260+ with 216 SPs), and yes, I’ll give it a try.

But I wonder why reading a float4 might be slower than a float (on the G80 architecture); do you have a clue about that?

Mmmm, it sounds great if I can avoid synchronizing (precious cycles saved). Your description is really close to my program. With this, I guess you mean that I will have one full struct per thread.

Since each struct can be processed independently of all the others, I think I can go this way.

Thanks for your responses!

OK, I’ve tested which is faster for loading and storing, 1 float4 versus 4 floats (I don’t test loading and storing separately, since that doesn’t make sense in a real situation), on a GTX 260 216SP (G200 chip).

And the winner is… float4!

To be more accurate, a float4 costs the same as 3 floats, so there is an interesting 25% gain on G200,
but I’m not sure if it will be the same on G80.

Here is the code snippet tested:

// global memory declarations
float4 PtestAllFrom[BIG_NUMBER];
float4 PtestAllTo[BIG_NUMBER];
float  PtestSingleFrom[4][BIG_NUMBER];
float  PtestSingleTo[4][BIG_NUMBER];

// kernel sample
data->PtestSingleTo[0][i] = data->PtestSingleFrom[0][i];
data->PtestSingleTo[1][i] = data->PtestSingleFrom[1][i];
data->PtestSingleTo[2][i] = data->PtestSingleFrom[2][i];
data->PtestSingleTo[3][i] = data->PtestSingleFrom[3][i];

data->PtestAllTo[i] = data->PtestAllFrom[i];
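
For completeness, the two paths can be written as self-contained kernels along these lines (hypothetical names; the timing and launch code are omitted):

// Variant 1: four separate 4-byte copies per element.
__global__ void copy_four_floats(const float *from0, const float *from1,
                                 const float *from2, const float *from3,
                                 float *to0, float *to1, float *to2, float *to3,
                                 int count)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        to0[i] = from0[i];
        to1[i] = from1[i];
        to2[i] = from2[i];
        to3[i] = from3[i];
    }
}

// Variant 2: one 16-byte copy per element via float4.
__global__ void copy_float4(const float4 *from, float4 *to, int count)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        to[i] = from[i];   // single 16-byte load and 16-byte store per thread
}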

My conclusion remains unchanged:
it’s really a shame there is no bigger CUDA structure than float4 :)

This is the correct advice… but on a slightly more advanced level, it may be that copying 64 bits per thread is slightly more efficient on G200 hardware.

From Billeter’s paper, he gives GTX 280 bandwidths of 77.8 GB/s for 32 bits per thread, 102.5 GB/s for 64 bits per thread, and 73.4 GB/s for 128 bits per thread.

These are pass-through bandwidths (reading and immediately rewriting). It would be interesting to benchmark other memory access methods, and also try G80/G92 hardware.
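
For reference, a pass-through copy at 64 bits per thread could be sketched like this (hypothetical names; the buffers are simply viewed as float2 so each thread moves 8 bytes per iteration):

// 64-bit-per-thread pass-through: each thread reads one float2 (8 bytes)
// and immediately writes it back out. A grid-stride loop covers buffers of
// any size with whatever grid is launched.
__global__ void passthrough_float2(const float2 *in, float2 *out, int count)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < count;
         i += gridDim.x * blockDim.x)
        out[i] = in[i];
}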