Hi
I am new to CUDA and have a question about coalesced memory reads.
I’m working on a sum reduction kernel as described in:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
On page 18 the code is optimized so that each thread reads 2 elements from global memory. As I understand it, each read is coalesced, so each half warp makes 2 requests of 64 bytes.
Would it be possible to coalesce it all so that each half warp makes only one 128-byte request instead?
As I understand it, this would increase performance, since only one memory read would be necessary.
Thanks
Peter
Is your device a Fermi one?
If so, the L1 cache line is always 128 bytes.
If not, then for a line like this:
double d = doubleSource[threadIdx.x];
there would be two 128-byte accesses.
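The arithmetic behind these sizes can be sketched as follows, assuming a compute capability 1.x device, where coalescing is evaluated per half warp of 16 threads, and assuming each thread reads one consecutive, properly aligned element. The helper half_warp_bytes is a hypothetical name used only for illustration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>   // provides the built-in float2 vector type

// Threads per half warp (assumption: compute capability 1.x coalescing rules).
constexpr std::size_t kHalfWarp = 16;

// Hypothetical helper: bytes requested by one half warp when every thread
// loads one consecutive element of type T.
template <typename T>
constexpr std::size_t half_warp_bytes()
{
    return kHalfWarp * sizeof(T);
}

// One float per thread: 16 * 4 = 64 bytes per half warp.
static_assert(half_warp_bytes<float>() == 64, "float: 64 bytes");
// One double per thread: 16 * 8 = 128 bytes per half warp.
static_assert(half_warp_bytes<double>() == 128, "double: 128 bytes");
// One float2 per thread: 16 * 8 = 128 bytes per half warp.
static_assert(half_warp_bytes<float2>() == 128, "float2: 128 bytes");
```

This only counts the bytes a half warp asks for. How many transactions the hardware actually issues also depends on alignment and on the exact compute capability.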
No, I’m working on a Quadro FX 5800, so it’s Tesla.
Both reads are of type float:
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
shared_data[tid] = global_data[i] + global_data[i+blockDim.x];
Would it be possible to change the access pattern so that each thread reads from global memory only once?
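One possible approach, as a sketch rather than a measured result: if each thread loads a single float2 instead of two separate floats, the 16 threads of a half warp read 128 contiguous bytes, which a compute capability 1.2 or 1.3 device such as the Quadro FX 5800 should be able to serve as one 128-byte transaction when the data is aligned. The names reduce_float2, block_sums, and reduce_sum_host are hypothetical. The sketch assumes a power-of-two block size and an input length that is a multiple of twice the block size.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread issues ONE 8-byte float2 load instead of
// two 4-byte float loads, then the block reduces in shared memory.
__global__ void reduce_float2(const float2 *global_data, float *block_sums)
{
    extern __shared__ float shared_data[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;  // index in float2 units

    float2 v = global_data[i];       // single global read per thread
    shared_data[tid] = v.x + v.y;    // first add happens during the load
    __syncthreads();

    // Tree reduction with sequential addressing (blockDim.x is a power of 2).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            shared_data[tid] += shared_data[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = shared_data[0];
}

// Hypothetical host helper: sums h on the GPU, finishing the per-block
// partial sums on the CPU. Assumes h.size() is a multiple of 2 * threads.
float reduce_sum_host(const std::vector<float> &h, int threads)
{
    int n = static_cast<int>(h.size());
    int blocks = n / (2 * threads);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));   // cudaMalloc memory is aligned for float2
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reduce_float2<<<blocks, threads, threads * sizeof(float)>>>(
        reinterpret_cast<const float2 *>(d_in), d_out);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial)
        total += p;
    return total;
}
```

Each thread now adds two neighbouring elements rather than elements blockDim.x apart, so the total is the same up to floating-point rounding. Whether one 128-byte read per half warp is actually faster than two 64-byte reads on this card would need to be measured.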