CUDA Reduction Using Register

andyzhangcr7 · July 22, 2011, 2:39am

This is my first time posting a question here. I need to calculate the mean values of N signals using reduction. The input is a 1D array of size M*N, where M is the length of each signal. Originally I used additional shared memory to first copy the data and then do the reduction on each signal. However, the original data gets corrupted. So I was wondering how I can use registers to do a reduction sum on N signals. (I know how to do it with sequential addition, using multi-threaded programming and one register.) The reason is that I want to keep the shared memory I declare to a minimum, for later use.

Could anyone give me some hints? Thanks for your help!

andyzhangcr7 · July 22, 2011, 2:50am

Uh, I should explain this more clearly. I have N threads and a shared memory array (float) s_m[N*M], where elements 0…M-1 are the first signal, and so on. Do I need N registers (or one) to store the mean values of the N different signals? The next step I want to do is subtract from every input value its corresponding signal’s mean. That’s why I don’t want additional shared memory. Hope that explains it a little more. Thanks

Skybuck · July 22, 2011, 6:51am

I think I understand what you want to do, but some pseudocode or real code of what you are trying to do would make it crystal clear. Maybe then others or I can help you better.

tera · July 22, 2011, 10:53am

You need additional shared memory because (apart from slow global memory) that is the only way different threads can exchange data.

Did you use __syncthreads() in the proper places when you first tried the reduction in shared memory?
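For reference, here is a minimal sketch of a shared-memory tree reduction with the barriers in the usual places. It is only an illustration: the kernel name `blockSum`, the host helper `sumOnDevice`, the single-block launch, and the requirement that the block size is a power of two are assumptions, not something taken from the original code. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel: a single block sums blockDim.x floats.
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float *in, float *out)
{
    extern __shared__ float s_x[];
    unsigned int tid = threadIdx.x;

    s_x[tid] = in[tid];
    __syncthreads();          // all loads must finish before any thread reads a neighbour

    for (unsigned int i = blockDim.x / 2; i > 0; i >>= 1) {
        if (tid < i)
            s_x[tid] += s_x[tid + i];
        __syncthreads();      // outside the if, so every thread reaches the barrier
    }

    if (tid == 0)
        *out = s_x[0];
}

// Hypothetical host helper: sums v on the device.
// v.size() must be a power of two and at most 1024 (one block).
float sumOnDevice(const std::vector<float> &v)
{
    const unsigned int n = static_cast<unsigned int>(v.size());
    assert(n > 0 && n <= 1024 && (n & (n - 1)) == 0);

    float *d_in = nullptr, *d_out = nullptr;
    float result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, v.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    blockSum<<<1, n, n * sizeof(float)>>>(d_in, d_out);

    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The two things that most often go wrong are a missing barrier after the initial load and a `__syncthreads()` placed inside the `if (tid < i)` branch, where only some threads would reach it.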

andyzhangcr7 · July 23, 2011, 2:29am

So suppose I’m doing reduction with shared memory.

I did something like this:

tid = threadIdx.x;

for (i = blockDim.x / 2; i > 0; i >>= 1)
{
    if (tid < i)
        s_x[tid] += s_x[tid + i];
    __syncthreads();
}

if (tid == 0)
    mean = s_x[tid] / M;

for (i = 0; i < M; i++)
    s_x[tid + i * N] = dev_x[tid * M + i];

(I copied it this way to avoid bank conflicts.)

for (i = 0; i < M; i++)
    mean += s_x[tid + i * N];

mean /= M;

So the register “mean” will store the value, but I’m wondering if there is another way to do the summation that is not sequential (in a reduction style).
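One possible way to do the per-signal sum in a reduction style while keeping shared memory small is sketched below. It is an illustration under stated assumptions, not a definitive answer: it assumes one block per signal, a power-of-two block size, and the same dev_x layout as above (signal n occupies dev_x[n * M] through dev_x[n * M + M - 1]). The names `subtractMean`, `removeMeans`, and `s_part` are hypothetical. Each thread first accumulates a strided partial sum in a register, so the block needs only blockDim.x floats of shared memory rather than N*M.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical kernel: block b handles signal b (M contiguous samples).
// Assumes blockDim.x is a power of two.
__global__ void subtractMean(float *dev_x, int M)
{
    extern __shared__ float s_part[];      // blockDim.x floats, not N*M
    float *signal = dev_x + (size_t)blockIdx.x * M;
    unsigned int tid = threadIdx.x;

    // Each thread sums a strided slice of the signal into a register.
    float partial = 0.0f;
    for (int i = tid; i < M; i += blockDim.x)
        partial += signal[i];

    s_part[tid] = partial;
    __syncthreads();

    // Tree reduction over the per-thread partial sums.
    for (unsigned int i = blockDim.x / 2; i > 0; i >>= 1) {
        if (tid < i)
            s_part[tid] += s_part[tid + i];
        __syncthreads();
    }

    // Every thread reads the block total and keeps the mean in a register.
    float mean = s_part[0] / M;

    // Subtract the mean from the signal in place.
    for (int i = tid; i < M; i += blockDim.x)
        signal[i] -= mean;
}

// Hypothetical host helper: x holds N signals of length M, back to back.
void removeMeans(std::vector<float> &x, int N, int M, unsigned int threads = 128)
{
    assert((int)x.size() == N * M);
    assert(threads > 0 && threads <= 1024 && (threads & (threads - 1)) == 0);

    float *d_x = nullptr;
    const size_t bytes = x.size() * sizeof(float);
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);

    // One block per signal; shared memory holds one partial sum per thread.
    subtractMean<<<N, threads, threads * sizeof(float)>>>(d_x, M);

    cudaMemcpy(x.data(), d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
}
```

With this layout the strided reads of a signal are contiguous across the threads of a block, and the mean ends up in a register in every thread, ready for the subtraction step. Whether one block per signal is the right mapping depends on N and M, which the thread does not state.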

andyzhangcr7 · July 23, 2011, 2:31am

Yes, I think so.