Shared Memory/Large Strides

I’m trying to write a kernel that performs a decimation on a 32k set of Complex data. The input size is fixed at 32k elements. The output size depends on dec_times and dec_size, which for testing purposes are 32 and 1024 respectively, but dec_size could range from 512 to 4096, which would change dec_times = 32k/dec_size.

decimate<<<32,32>>>(…) is the best launch configuration I have found for my base case.

// if idx = 0
OutData[0] = InData[0] + InData[1024] + InData[2048] + …   // example of what the decimation does
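For clarity, here is a minimal host-side reference of what the decimation computes (assuming Complex is the float2 typedef shown with the kernel below; the name decimate_cpu is just illustrative):

// Host reference for verification only (assumes Complex == float2).
void decimate_cpu(const Complex* in, Complex* out, int dec_times, int dec_size)
{
    for (int idx = 0; idx < dec_size; ++idx) {
        Complex val = {0.0f, 0.0f};
        for (int i = 0; i < dec_times; ++i) {
            val.x += in[i * dec_size + idx].x;   // stride of dec_size through the 32k input
            val.y += in[i * dec_size + idx].y;
        }
        out[idx] = val;
    }
}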

I’ve been looking into using shared memory to try to speed up my kernel, but the problem I’m finding is that after reading global memory into shared memory, the reduction on the shared data is too small to hide the latency, and I take a hit in performance.

I just want to ask people with more experience for any tips to optimize my kernel. I’ve looked at reduction.cu in the SDK, the matrix transpose whitepaper, and the related CUDA guides for ideas on how to use shared memory.

  1. Since my stride is 1024 and the sample is only 32k, is there really a benefit to using shared memory in my case?

  2. I’m using a GTX 480 (compute capability 2.0), so does my GPU always read from global memory in 128-bit (16-byte) transactions? If so, each time a thread reads a single Complex element (8 bytes), am I wasting half of each transaction?

This is my current kernel, not using shared memory.

Total time (initial) for 10 iterations: 0.390752 ms | Avg. 0.0390752 ms (ms = milliseconds) // does NOT include memory copies

//typedef float2 Complex;

__global__ void decimate(Complex* InData, Complex* OutData, int dec_times, int dec_size)
{
    // One thread per output element.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    Complex val;
    val.x = 0.0f;
    val.y = 0.0f;

    // Sum dec_times strided input elements into one output element.
    for (int i = 0; i < dec_times; ++i) {
        val.x += InData[i * dec_size + idx].x;
        val.y += InData[i * dec_size + idx].y;
    }

    OutData[idx].x = val.x;
    OutData[idx].y = val.y;
}
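Regarding question 2, one variation worth trying is loading each element with a single float2 read instead of two separate float reads, so the compiler is more likely to emit one 8-byte load per element. A minimal sketch (the name decimate_vec is just illustrative; same launch configuration):

__global__ void decimate_vec(const Complex* InData, Complex* OutData, int dec_times, int dec_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    Complex val = make_float2(0.0f, 0.0f);

    for (int i = 0; i < dec_times; ++i) {
        // Read the whole float2 at once (one 8-byte load per element).
        Complex c = InData[i * dec_size + idx];
        val.x += c.x;
        val.y += c.y;
    }

    OutData[idx] = val;   // one 8-byte store
}

Whether this actually changes the generated loads depends on the compiler; checking the PTX/SASS would confirm it.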

Any help/tips are appreciated.

It looks like you may be able to get a speedup if you look at the reduction kernel in the SDK and the accompanying slides. I may be mistaken, but it looks like your current kernel implementation is already coalesced. Anyone else want to chime in?
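For what it’s worth, here is a rough sketch of one way to adapt the SDK reduction pattern to this problem while keeping the loads coalesced: use a 2D block where threadIdx.x picks the output element (so consecutive threads still read consecutive addresses) and threadIdx.y splits the dec_times partial sums across several threads, with a shared-memory tree reduction combining them. The kernel name decimate_2d, the 32x8 block shape, and the launch configuration are all illustrative, and whether this beats the simple kernel at these sizes would need to be measured.

// Launch (illustrative): dim3 block(32, 8);
//   decimate_2d<<<dec_size / 32, block, 32 * 8 * sizeof(Complex)>>>(d_in, d_out, dec_times, dec_size);
__global__ void decimate_2d(const Complex* InData, Complex* OutData, int dec_times, int dec_size)
{
    extern __shared__ Complex sdata[];

    int out = blockIdx.x * blockDim.x + threadIdx.x;   // output index, consecutive across threadIdx.x
    int tid = threadIdx.y * blockDim.x + threadIdx.x;  // flat index into shared memory

    // Each thread accumulates a partial sum over a subset of the strided inputs.
    Complex val = make_float2(0.0f, 0.0f);
    for (int i = threadIdx.y; i < dec_times; i += blockDim.y) {
        val.x += InData[i * dec_size + out].x;          // coalesced: warp reads consecutive addresses
        val.y += InData[i * dec_size + out].y;
    }
    sdata[tid] = val;
    __syncthreads();

    // Tree reduction over threadIdx.y (assumes blockDim.y is a power of two).
    for (int s = blockDim.y / 2; s > 0; s >>= 1) {
        if (threadIdx.y < s) {
            sdata[tid].x += sdata[tid + s * blockDim.x].x;
            sdata[tid].y += sdata[tid + s * blockDim.x].y;
        }
        __syncthreads();
    }

    if (threadIdx.y == 0)
        OutData[out] = sdata[threadIdx.x];
}

The main potential win here is not data reuse (there is none) but launching more threads: the original <<<32,32>>> configuration only puts 1024 threads on the GPU, while this variant runs blockDim.y times as many, which may help hide memory latency.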