global memory latency

Does a global memory read/write really take 400-600 cycles even when it's coalesced? Same question for local device memory.

And another question… is this coalesced?

__device__ float4 d[2048];

__global__ void myKernel()
{
    const unsigned int tidx(blockDim.x * blockIdx.x + threadIdx.x);

    d[tidx] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
}


or, to be coalesced, does it need to be

__device__ float d[2048];

i.e., 32-bit elements instead of 128-bit?


Yes, but the system transfers more than 4 bytes per latency period.

Haven’t worked with e.g. float4 yet, sorry. But try profiling with the CUDA Visual Profiler and look at the coherent/incoherent counters. If float4 works like a normal float, then you should have a coalesced load, since thread 0 of block 0 handles element 0, and so on.


Yep, this should work with float4s. float3 cannot be coalesced (well, not out of the box anyway), but float4s are no problem since each thread will be fetching/writing 128-bit words.

I figured it out… any aligned type should be coalesced, thanks!

And… is local memory faster than global memory, or is it exactly the same?

PS: the CUDA profiler does not work with my app.

See the relevant section of the CUDA Programming Guide (beta2v2).