Shared Memory/Large Strides

I’m trying to write a kernel that performs a decimation on a 32k set of Complex data. The input size is fixed at 32k elements. The output size depends on dec_times and dec_size, which for testing purposes are 32 and 1024 respectively, but dec_size could range from 512 to 4096, which would change dec_times = 32k/dec_size.

decimate<<<32,32>>>(…) is the best launch configuration I have found for my base case.

// if idx = 0
OutData[0] = InData[0] + InData[1024] + InData[2048] + …   // example of what the decimation does
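For clarity, here is a minimal host-side reference of what the decimation computes (assuming Complex is the float2 typedef shown with the kernel below; the name decimate_cpu is just illustrative):

// Host reference for verification only (assumes Complex == float2).
void decimate_cpu(const Complex* in, Complex* out, int dec_times, int dec_size)
{
    for (int idx = 0; idx < dec_size; ++idx) {
        Complex val = {0.0f, 0.0f};
        for (int i = 0; i < dec_times; ++i) {
            val.x += in[i * dec_size + idx].x;   // stride of dec_size through the 32k input
            val.y += in[i * dec_size + idx].y;
        }
        out[idx] = val;
    }
}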

I’ve been looking into using shared memory to try to speed up my kernel, but the problem I’m finding is that after reading global memory into shared memory, the reduction on the shared data is too small to hide the latency, and I take a hit in performance.

I just want to ask people with more experience for any tips to optimize my kernel. I’ve looked at reduction.cu in the SDK, the matrix transpose whitepaper, and the related CUDA guides for ideas on how to use shared memory.

  1. Since my stride is 1024 and the sample is only 32k, is there really a benefit to using shared memory in my case?

  2. I’m using a GTX 480 (compute capability 2.0), so does my GPU always read from global memory in 128-bit (16-byte) transactions? If so, each time a thread reads a single Complex element (8 bytes), am I wasting half of each transaction?

This is my current kernel, not using shared memory.

Total time (initial) for 10 iterations: 0.390752 ms | Avg. 0.0390752 ms (ms = milliseconds) // does NOT include memory copies

//typedef float2 Complex;

__global__ void decimate(Complex* InData, Complex* OutData, int dec_times, int dec_size)
{
    // One thread per output element.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    Complex val;
    val.x = 0.0f;
    val.y = 0.0f;

    // Sum dec_times strided input elements into one output element.
    for (int i = 0; i < dec_times; ++i) {
        val.x += InData[i * dec_size + idx].x;
        val.y += InData[i * dec_size + idx].y;
    }

    OutData[idx].x = val.x;
    OutData[idx].y = val.y;
}
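Regarding question 2, one variation worth trying is loading each element with a single float2 read instead of two separate float reads, so the compiler is more likely to emit one 8-byte load per element. A minimal sketch (the name decimate_vec is just illustrative; same launch configuration):

__global__ void decimate_vec(const Complex* InData, Complex* OutData, int dec_times, int dec_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    Complex val = make_float2(0.0f, 0.0f);

    for (int i = 0; i < dec_times; ++i) {
        // Read the whole float2 at once (one 8-byte load per element).
        Complex c = InData[i * dec_size + idx];
        val.x += c.x;
        val.y += c.y;
    }

    OutData[idx] = val;   // one 8-byte store
}

Whether this actually changes the generated loads depends on the compiler; checking the PTX/SASS would confirm it.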

Any help/tips are appreciated.

It looks like you may be able to get a speedup if you look at the reduction kernel in the SDK and the accompanying slides. I may be mistaken, but it looks like your current kernel implementation is already coalesced. Anyone else want to chime in?
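For what it’s worth, here is a rough sketch of one way to adapt the SDK reduction pattern to this problem while keeping the loads coalesced: use a 2D block where threadIdx.x picks the output element (so consecutive threads still read consecutive addresses) and threadIdx.y splits the dec_times partial sums across several threads, with a shared-memory tree reduction combining them. The kernel name decimate_2d, the 32x8 block shape, and the launch configuration are all illustrative, and whether this beats the simple kernel at these sizes would need to be measured.

// Launch (illustrative): dim3 block(32, 8);
//   decimate_2d<<<dec_size / 32, block, 32 * 8 * sizeof(Complex)>>>(d_in, d_out, dec_times, dec_size);
__global__ void decimate_2d(const Complex* InData, Complex* OutData, int dec_times, int dec_size)
{
    extern __shared__ Complex sdata[];

    int out = blockIdx.x * blockDim.x + threadIdx.x;   // output index, consecutive across threadIdx.x
    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // flat index into shared memory

    // Each thread accumulates a partial sum over a subset of the strided inputs.
    Complex val = make_float2(0.0f, 0.0f);
    for (int i = threadIdx.y; i < dec_times; i += blockDim.y) {
        val.x += InData[i * dec_size + out].x;          // coalesced: warp reads consecutive addresses
        val.y += InData[i * dec_size + out].y;
    }
    sdata[tid] = val;
    __syncthreads();

    // Tree reduction over threadIdx.y (assumes blockDim.y is a power of two).
    for (int s = blockDim.y / 2; s > 0; s >>= 1) {
        if (threadIdx.y < s) {
            sdata[tid].x += sdata[tid + s * blockDim.x].x;
            sdata[tid].y += sdata[tid + s * blockDim.x].y;
        }
        __syncthreads();
    }

    if (threadIdx.y == 0)
        OutData[out] = sdata[threadIdx.x];
}

The main potential win here is not data reuse (there is none) but launching more threads: the original <<<32,32>>> configuration only puts 1024 threads on the GPU, while this variant runs blockDim.y times as many, which may help hide memory latency.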