Does a global memory read/write operation really take 400-600 cycles even when it's coalesced? Same question for local device memory.
And another question… is this coalesced?
__device__ float4 d[2048];

__global__ void myKernel()
{
    const unsigned int tidx(blockDim.x * blockIdx.x + threadIdx.x);
    d[tidx] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
}
or, to be coalesced, does it need to be
__device__ float d[2048];
i.e. 32-bit elements rather than 128-bit?
thx
JHHPC
June 22, 2008, 10:36am
Yes, but the hardware transfers far more than 4 bytes per latency period, so the latency is amortized across the whole coalesced transaction.
I haven't worked with e.g. float4 yet, sorry. But try profiling with the CUDA Visual Profiler and look at the coherent/incoherent counts. If float4 works like plain float, then you should have a coalesced load, since thread 0 of block 0 handles element 0, and so on.
Johannes
Yep, this should work with float4s. float3 cannot be coalesced (well, not out of the box anyway), but float4s are no problem since each thread will be fetching/writing 128-bit words.
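To make the contrast concrete, here is a minimal sketch of the two cases, assuming G80-era coalescing rules (each thread of a half-warp must access a 32-, 64-, or 128-bit word at consecutive, naturally aligned addresses); the kernel names are illustrative:

```cuda
#include <cuda_runtime.h>

__device__ float4 d4[2048];   // 16-byte elements, naturally aligned
__device__ float3 d3[2048];   // 12-byte elements, not a power-of-two size

// Thread t writes d4[t]: consecutive, 16-byte-aligned 128-bit words,
// so the half-warp's stores satisfy the coalescing rules.
__global__ void writeFloat4()
{
    const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    d4[tidx] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
}

// Thread t writes d3[t]: each element is 12 bytes, so most elements do
// not start on a 4/8/16-byte-aligned word boundary and the stores
// cannot coalesce out of the box.
__global__ void writeFloat3()
{
    const unsigned int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    d3[tidx] = make_float3(1.0f, 2.0f, 3.0f);
}
```

A common workaround for float3 data is to store it as three separate float arrays (structure-of-arrays) or pad it to float4.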
I figured it out… any aligned type should be coalesced, thanks!
And… is local memory faster than global memory, or is it exactly the same?
PS: the CUDA profiler does not work with my app.
See Section 5.1.2.2 of the Programming Guide (beta2 v2): local memory physically resides in device memory, so it has the same high latency as global memory.
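As a hedged illustration of when "local" memory even comes into play (the kernel name is made up): the compiler places per-thread arrays with non-constant indexing, and register spills, into local memory, which is just per-thread storage carved out of device DRAM.

```cuda
// Sketch: scratch[] is indexed with a value unknown at compile time,
// so the compiler will likely place it in local memory rather than
// registers. Accessing it then costs the same latency as global
// memory (though per-thread layout keeps the accesses coalesced).
__global__ void usesLocalMemory(const int *idx, float *out)
{
    float scratch[64];                 // candidate for local memory
    for (int i = 0; i < 64; ++i)
        scratch[i] = (float)i;
    out[threadIdx.x] = scratch[idx[threadIdx.x] & 63];  // dynamic index
}
```

Compiling with `-ptxas-options=-v` reports the local memory (`lmem`) usage per thread, which is an easy way to check whether such spills are happening.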