Coalesced double memory read.

Hi

I am new to CUDA and have a question about coalesced memory reads.

I’m working on a sum reduction kernel as described in:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

On page 18 the code is optimized so that each thread reads 2 elements from global memory. As I understand it, each request is coalesced, so each half-warp makes 2 requests of 64 bytes.

Would it be possible to coalesce everything so that each half-warp makes only one 128-byte request instead?

As I understand it, this would increase performance, since only one memory read would be necessary.

Thanks
Peter

Is your device a Fermi one?

If yes, then the L1 cache line is always 128 bytes.

If not, then for a line like this:

double d = doubleSource[threadIdx.x];

there would be two 128-byte accesses.
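
The sizes in this thread follow from simple arithmetic. As a rough sketch, assuming 16 threads per half-warp (as on compute capability 1.x devices) and that thread k reads element k of an aligned array, the bytes touched by one half-warp are 16 times the element size. The helper name `bytesPerHalfWarp` is made up for illustration and is not part of CUDA.

```cuda
#include <cstddef>
#include <cuda_runtime.h>  // for the built-in float2 type

// Threads per half-warp on compute capability 1.x devices.
constexpr std::size_t kHalfWarp = 16;

// Hypothetical helper: bytes requested by one half-warp when thread k
// reads element k of type T (consecutive, aligned elements assumed).
template <typename T>
constexpr std::size_t bytesPerHalfWarp()
{
    return kHalfWarp * sizeof(T);
}

// float:  16 * 4 bytes =  64 bytes per half-warp
// double: 16 * 8 bytes = 128 bytes per half-warp
// float2: 16 * 8 bytes = 128 bytes per half-warp
static_assert(bytesPerHalfWarp<float>() == 64, "float half-warp span");
static_assert(bytesPerHalfWarp<double>() == 128, "double half-warp span");
static_assert(bytesPerHalfWarp<float2>() == 128, "float2 half-warp span");
```

If the accesses are counted per warp (two half-warps), a double read like the one above would come to two 128-byte accesses, which seems to be what is meant here.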

No, I’m working on a Quadro FX 5800, so it’s Tesla.

The 2 requests are of type float:

unsigned int tid = threadIdx.x;  // thread index within the block
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
shared_data[tid] = global_data[i] + global_data[i+blockDim.x];

Would it be possible to change the access pattern so that each thread reads from global memory only once?
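
One way this might be done is to have each thread load a single `float2` (8 bytes) instead of two separate floats, so that a half-warp of 16 threads covers 128 contiguous bytes in one request. The following is only a sketch under assumptions: the names `reduceFloat2` and `sumWithFloat2` are hypothetical, the input is non-empty with an even number of floats, the buffer comes from `cudaMalloc` (so it is suitably aligned), and the block size is a power of two. Whether it is actually faster than the two-read version would need to be measured on the device.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: first reduction step done with one float2 load per
// thread, followed by a plain tree reduction in shared memory.
// n2 is the number of float2 elements (half the number of floats).
__global__ void reduceFloat2(const float2* in, float* out, unsigned int n2)
{
    extern __shared__ float shared_data[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // One 8-byte load per thread; out-of-range threads contribute zero.
    float2 v = (i < n2) ? in[i] : make_float2(0.0f, 0.0f);
    shared_data[tid] = v.x + v.y;
    __syncthreads();

    // Tree reduction; assumes blockDim.x is a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            shared_data[tid] += shared_data[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {
        out[blockIdx.x] = shared_data[0];
    }
}

// Hypothetical host helper: sums `data` on the GPU, finishing the
// per-block partial sums on the CPU. Error checking omitted for brevity.
float sumWithFloat2(const std::vector<float>& data)
{
    const unsigned int threads = 128;
    const unsigned int n2 = static_cast<unsigned int>(data.size() / 2);
    const unsigned int blocks = (n2 + threads - 1) / threads;

    float2* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n2 * sizeof(float2));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n2 * sizeof(float2), cudaMemcpyHostToDevice);

    reduceFloat2<<<blocks, threads, threads * sizeof(float)>>>(dIn, dOut, n2);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) {
        total += p;
    }
    return total;
}
```

Note that this pairs adjacent elements (2i and 2i+1) rather than elements i and i+blockDim.x as in the original snippet, so the total is the same up to floating-point rounding but the order of additions differs.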
