global memory latency

santyhyammer · June 21, 2008, 7:09pm

Does a global memory read/write operation really take 400-600 cycles even when it’s coalesced? Same question for local device memory.

Another question… is this coalesced?

__device__ float4 d[2048];

__global__ void myKernel()
{
    const unsigned int tidx(blockDim.x*blockIdx.x+threadIdx.x);
    d[tidx] = make_float4(1.0f,2.0f,3.0f,4.0f);
}

Or, to be coalesced, does it need to be

__device__ float d[2048];

with 32-bit elements rather than 128-bit ones?

thx

JHHPC · June 22, 2008, 10:36am

Yes, but the system transfers more than 4 bytes per latency period, so the cost is paid once per memory transaction rather than once per 4-byte word.


On the float4 question: I haven’t worked with float4 yet, sorry. But try profiling with the CUDA Visual Profiler and look at the coherent/incoherent counts. If float4 behaves like plain float, then you should get coalesced accesses, since thread 0 of block 0 handles element 0, and so on.

Johannes
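For the latency half of the question, one common way to get a rough number on your own card is a dependent pointer chase: a single thread follows a chain of indices through global memory, so every load has to wait for the previous one. The sketch below is only an illustration. The names chaseKernel and measureLatencyCycles, the array size, the stride and the step count are all made up for this example, and clock64() assumes a much newer GPU than the ones this thread was written for. On current hardware, caches and TLB effects will also colour the result, so treat the output as a ballpark figure.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread walks a chain of indices. Each load depends on the previous
// one, so the loads cannot overlap and the elapsed cycles per step give a
// rough latency estimate. (The compiler may still schedule instructions
// around clock64(); inspect the generated code if the numbers look off.)
__global__ void chaseKernel(const unsigned int* next, int steps,
                            unsigned int* endIndex, long long* cycles)
{
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i)
        idx = next[idx];
    long long stop = clock64();
    *endIndex = idx;            // storing idx keeps the loop from being removed
    *cycles = stop - start;
}

// Hypothetical helper: returns average cycles per dependent load,
// or -1.0 if a CUDA call fails or the chain ends at the wrong index.
double measureLatencyCycles(unsigned int n, unsigned int stride, int steps)
{
    std::vector<unsigned int> host(n);
    for (unsigned int i = 0; i < n; ++i)
        host[i] = (i + stride) % n;     // element i points stride elements ahead

    unsigned int* dNext = nullptr;
    unsigned int* dEnd = nullptr;
    long long* dCycles = nullptr;
    bool ok = cudaMalloc(&dNext, n * sizeof(unsigned int)) == cudaSuccess
           && cudaMalloc(&dEnd, sizeof(unsigned int)) == cudaSuccess
           && cudaMalloc(&dCycles, sizeof(long long)) == cudaSuccess
           && cudaMemcpy(dNext, host.data(), n * sizeof(unsigned int),
                         cudaMemcpyHostToDevice) == cudaSuccess;

    unsigned int endIndex = 0;
    long long cycles = 0;
    if (ok) {
        chaseKernel<<<1, 1>>>(dNext, steps, dEnd, dCycles);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(&endIndex, dEnd, sizeof(endIndex),
                        cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(&cycles, dCycles, sizeof(cycles),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dNext);
    cudaFree(dEnd);
    cudaFree(dCycles);

    // The chain visits (k * stride) % n at step k, so the end index is known.
    unsigned int expected =
        (unsigned int)(((unsigned long long)steps * stride) % n);
    if (!ok || endIndex != expected)
        return -1.0;
    return (double)cycles / steps;
}

int main()
{
    // 16 MB of indices, odd stride so no element is revisited within 10000 steps.
    double c = measureLatencyCycles(1u << 22, 4099u, 10000);
    if (c < 0.0) { printf("measurement failed\n"); return 1; }
    printf("about %.0f cycles per dependent global load\n", c);
    return 0;
}
```

The figure this prints is the latency seen by one isolated thread. With enough warps in flight the hardware overlaps these waits, which is why the cost per element in a real coalesced kernel is far lower.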

Ailleur · June 22, 2008, 2:58pm


Yep, this should work with float4s. float3 cannot be coalesced (well, not out of the box anyway), but float4s are no problem since each thread will be fetching/writing 128-bit words.
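For reference, here is the kernel from the first post filled out into something that compiles and checks its own result. This is a sketch, not the original poster's full program: the launch configuration (8 blocks of 256 threads), the bounds check and the helper name runFloat4Store are assumptions added for illustration. It only verifies that the stores land correctly. Whether they are coalesced still has to be confirmed with a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 2048;

__device__ float4 d[N];

// Thread k writes element k, so consecutive threads touch consecutive
// 16-byte words of the array.
__global__ void myKernel()
{
    const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    if (tidx < N)
        d[tidx] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
}

// Hypothetical helper: launches the kernel and checks every element on the host.
bool runFloat4Store()
{
    static float4 host[N];
    myKernel<<<N / 256, 256>>>();               // assumed launch configuration
    if (cudaDeviceSynchronize() != cudaSuccess)
        return false;
    if (cudaMemcpyFromSymbol(host, d, sizeof(host)) != cudaSuccess)
        return false;
    for (int i = 0; i < N; ++i) {
        if (host[i].x != 1.0f || host[i].y != 2.0f ||
            host[i].z != 3.0f || host[i].w != 4.0f)
            return false;
    }
    return true;
}

int main()
{
    printf("%s\n", runFloat4Store() ? "all elements written" : "check failed");
    return 0;
}
```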

santyhyammer · June 22, 2008, 3:02pm

I figured it out… any aligned type should be coalesced, thx!
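A small sketch of what "aligned type" means in practice. The built-in float2 and float4 carry 8- and 16-byte alignment, while float3 is just three packed floats with 4-byte alignment, which fits the float3 remark above. As far as I recall, the programming guide of that era also required each thread to access a 4-, 8- or 16-byte word, so "any aligned type" is best read with that size limit in mind. Vec3Padded, fillKernel and runPaddedStore are hypothetical names invented for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Size and alignment of the built-in vector types.
static_assert(sizeof(float)  == 4  && alignof(float)  == 4,  "float is a 4-byte word");
static_assert(sizeof(float2) == 8  && alignof(float2) == 8,  "float2 is an 8-byte word");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 is a 16-byte word");
static_assert(sizeof(float3) == 12 && alignof(float3) == 4,  "float3 is 12 bytes, 4-byte aligned");

// Hypothetical user type: three floats padded and aligned to 16 bytes,
// so each element has the same footprint as a float4.
struct __align__(16) Vec3Padded { float x, y, z, pad; };
static_assert(sizeof(Vec3Padded) == 16 && alignof(Vec3Padded) == 16,
              "padded struct is one 16-byte word");

// Thread k writes element k of the padded array.
__global__ void fillKernel(Vec3Padded* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        Vec3Padded v;
        v.x = 1.0f; v.y = 2.0f; v.z = 3.0f; v.pad = 0.0f;
        out[i] = v;
    }
}

// Hypothetical helper: fills n elements on the device and checks them on the host.
bool runPaddedStore()
{
    const int n = 2048;
    static Vec3Padded host[n];
    Vec3Padded* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(Vec3Padded)) != cudaSuccess)
        return false;
    fillKernel<<<n / 256, 256>>>(dev, n);
    bool ok = cudaDeviceSynchronize() == cudaSuccess
           && cudaMemcpy(host, dev, n * sizeof(Vec3Padded),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    for (int i = 0; ok && i < n; ++i)
        ok = host[i].x == 1.0f && host[i].y == 2.0f && host[i].z == 3.0f;
    return ok;
}

int main()
{
    printf("%s\n", runPaddedStore() ? "padded store ok" : "check failed");
    return 0;
}
```

The padding costs 4 bytes per element. The other usual workaround for float3 data is to stage it through shared memory as plain floats, which is not shown here.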

And… is local memory faster than global memory, or is it exactly the same?

PS: the CUDA profiler does not work with my app.

Ailleur · June 22, 2008, 3:16pm

Section 5.1.2.2 of the programming guide. (beta2v2)
