# Double loop, sum on j for each i. atomicAdd()?

Simply:

I have been trying this for days, and I am thankful for those who helped. However I have not yet figure this out. I am trying to implement a double loop with outside loop over j and inside loop over i. I want for each i, all value of j will be summed then put into this i. However, I could not get this kernel right. I think the most reasonable program structure for what I want to do is:

global

void afc_example

(

``````    double *d1,*d2;

double *afr,

double *afl,

const double *x,

const double *g,

const int Ns,

const int Nt
``````

)

{ /* j: outside loop; i: inside loop */

``````    int j = blockIdx.x*blockDim.x + threadIdx.x;

int i;

double art1,art2;

double phs;

double zr, zi;
``````

while(j < Ns)

``````    {   /* initialize */
``````

for(i = Nt-1;i >= 0;i–)

``````            {

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

d1[i] += art1;

d2[i] += art2;
``````

} /* i */

j += blockDim.x*gridDim.x;

``````    } /* j */
``````

}

``````__device__ inline void atomicAdd(double *address, double value) {

unsigned long long oldval, newval, readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}
``````

It is basically a double loop. I choose doing parallel computing on j, because some variables are updated with each iteration of i. So each ith iteration is related to its previous, (i-1)th, iteration. j is good on parallel computing because j computing is only related to jth location. In j dimension, processing is independent. It is like there is a 2D matrix that I want to fill in. The 2D matrix has size of Ns X Nt. I can fill(process) in each column at same time, which is equivalent to process all j simultaneously for a fixed i. Then I move to next i, do the same thing, processing all j again at the same time. Then go on, after all i are processed, meaning, hitting Nt.

After getting this Ns X Nt matrix, I am adding along j dimension, which is along column direction, from top to bottom. By doing so, finally I will have a 1 X Nt vector, which is d[i]

I think I still need to use atomicAdd(). Just have not figured it out yet.

I highly appreciate any thoughts or pointers. Thanks a lot.

Hello,

Firs let me see if I understood your problem correctly. From this code:

``````_global__ void afc_example( double *d1,*d2;double *afr, double *afl, const double *x, const double *g, const int Ns, const int Nt )

// .....

while(j < Ns)

{ /* initialize */

for(i = Nt-1;i >= 0;i--)

{

phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;

d1[i] += art1;

d2[i] += art2;

} /* i */

j += blockDim.x*gridDim.x;

} /* j */

}
``````

From your code I understand that you have a problem of NtxNs with two arrays of g[Nt] and g[Ns] which do not depend on d1 and d2 and you are trying to get d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)+afi[j]*sin(x[j]*g[i]/2.0) and d1[i]=sum_{over j}(afr[j]*cos(x[j]*g[i]/2.0)-afi[j]*sin(x[j]*g[i]/2.0)

First) My guess is that you want to do a Fourier transform of an array af=complex(afr,afi). The cufft library is very good at that. If I got it wrong, please put the math formula you are trying to calculate.

If this does not satisfy you then, next. This has a double loop anf maybe you can rewrite the problem a matrix vector multiplication in which d=M*af, where d=complex(d1,d2), af=complex(afr,afi) and M[i][j]=complex(cos(x[j]*g[i]/2.0),sin(x[j]*g[i]/2.0)).

Now if none of this apply then here is the advice. In practice you need to perform NtxNs operations. You can submit NtxNs threads which would compute just this:

``````phs = x[j]*g[i];

zr = cos(phs/2.0);

zi = sin(phs/2.0);

art1 = afr[j]*zr + afl[j]*zi;

art2 = afr[j]*zi - afl[j]*zr;
``````

Now the problem would be to do the summation over j. So you define 2 shared array:

shared double tmpd1[blocksize];

shared double tmpd2[blocksize];

You use those to store the intermediate results;

so that you have now:

no you syncronize the threads in a block with :

Now that you have the values you need to sum them up on each block, I use a reduction algorithm

``````__syncthreads();

for(int offset=bsl;offset>=1;offset=offset/2) //here I assume blockDim.x=2*bsl

{

if(tid<offset)

{

tempene[tid]=tempene[tid]+tempene[tid+offset]

;

}

}

if(tid==0)

{

// the resut is stored in tempene and you can use atomic add to complete the sum to the

}

}

}

}
``````

The saving from last you can do it with atomic line. Now you submit NtxNs threads grouped in blocks with bsize threads per block.

hi pasoleatis:

Thank you so much for carefully reviewing my questions.

I am not doing FFT to some 2D signal, although there is phase accrue involved in each step. Forgive me I would stick with Ns X Nt, not Nt X Ns, the basic idea is (in CPU coding), for each j, the i is going to change from Nt-1 to 0; I changed the code a little for the sake of simplicity. However, the some phase terms are changing after d1 and d2 are changed, in the inner loop.

I think the reduction algorithm is absolutely worth trying. I was thinking the atomicAdd() was simpler, so I tried that one first.

I think I will definitely have new questions, and I will post it here, and hopefully you can give some pointers too.

Thank you so much.

BTW: when you declare

``````__shared__ double tmpd1[blocksize];

__shared__ double tmpd2[blocksize];
``````

What is the blocksize? Or do I just put “blocksize” there?

blocksize is the numebr of threads per block

You calculation look to me a lot as a 1D transform FFT, or if you have enough memory you could put he problem as a matrix -vector multiplication.

But if those do not work, my idea is the following. You have NsxNt (or NtxNs) operations and then summation so that you are left with Ns numbers. In m algorithm you ubmit NsxNt threads which perform the operations. After that each block will make a partial summation, so in the end we are left with NsxNt/blocksize numebr which have to be reduced by summation to Ns numbers. so you submit NsxNt threads which get all the numbers, then the partial summation is made on each block, then you use atomic add to add the partial value to the d1[i] and d2[i].

This is not he smartest possible algorithm but it should work. In principle instead of atomic add something smarter can be done to reduce from NsxNt/blocksize to Ns numbers.

Ns and Nt are input arguments to my kernel, and they are 78400 and 28374, respectively.

For GPU programming, initially, my Ns X Nt matrix is empty, or zeroed. Then for each iteration of i (the inner loop variable), one column (j from 0 to Ns-1) inside the matrix is filled. After all i iterations (i from Nt-1 to 0) are done, the entire matrix are filled. However, int the way from right to left.

So the parameter blocksize can be defined arbitrarily by me? I dont think I can declare a matrix something like :

``````__shared__ double tmpd1[Ns*Nt];

__shared__ double tmpd2[Ns*Nt];
``````

I dont this is going to work. So what value is exactly for blocksize?

I kept getting error at

``````if(tid<offset)
``````

I am really confused why if(tid<offset) would give me so much trouble…

I assume

``````tid = threadIdx.x;
``````

and

``````offset = blockDim.x/2;
``````

Hello,

You have to do the initialization in a different function.

blocksize=blockDim.x;

The reduction is just adding the numbers in pairs.

OK. So if I have a column vector with length of 38271, what value my blocksize should be?

This depends on how much memory you have and how fast you code runs with different blocksizes. You can use if you have the cc 2.0 up to 1024. You can make a call for each i and have a very small memory usage. or if you have enough memory you can make a call for all i. this will results in a mtrix of Nsx32. If Ns is below 4096 it should fir in 3 or 6 GB.

I am doing an example and right now I have successfully tested summing a vector that has 16 elements. Code is like this:

``````#include<iostream>

#include<stdio.h>

#define BLOCKSIZE 16

unsigned long long oldval, newval, readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

__global__ void agpuhelloworld(double* g_output, double* g_input)

{

__shared__ double s_data[BLOCKSIZE];

unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

// compute sum for the threadblock

for ( int dist = BLOCKSIZE/2; dist > 0; dist /= 2 )

{

if ( threadIdx.x < dist )

}

printf("delete successful");

if ( threadIdx.x == 0 )

{

}

}
``````

What I really want is to extend this program to use more than one block, with each block having more threads, regardless how much memory I have. I have no idea how to code it. I define the block size to be 16, and right now I think I only have 1 block.

If I sum over a long vector, it will use more blocks? and the length of the vector does not have to be the power of 2. That is something I don’t know how to solve.

The size of a warp is 32. With 16 threads per block you can not even fill one warp. I recommend something between 128 and 1024.

Your code is good except of atomicAdd(&g_output[idx], s_data ); you should have anther index idxout=(blockIdx.x * blockDim.x + threadIdx.x)/BLOCKSIZE;

This way all the blocks with index blockIdx.x between 0 and BLOCKSIZE-1 will compute the partial sums for i=0, then all blocks with blockIdx.x betwen BLOCKSIZE and 2*BLOCKSIZE-1 will compute the value i=1 and so on.

In main code your submission should be
agpuhelloworld<<<Ns*Nt/BLOCKSIZE>>>(g_output, g_input);

I assume g_output of size Ns and g_input of size Nt.

This code will work but it will be slow for large Nt;

So, if I have a vector of length 78654, the program will be like?

``````#include<iostream>

#include<stdio.h>

#define BLOCKSIZE 78654

unsigned long long oldval, newval, readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

__global__ void agpuhelloworld(double* g_output, double* g_input)

{

__shared__ double s_data[BLOCKSIZE];

unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

// compute sum for the threadblock

for ( int dist = BLOCKSIZE/2; dist > 0; dist /= 2 )

{

if ( threadIdx.x < dist )

}

printf("delete successful");

if ( threadIdx.x == 0 )

{

}

}
``````

Well. Try to compile the code. See if it works. You defined BLOCKSIZE, but you did not put the line which you use to call the function, the one one in the main function. From this line if

I assume that you expect for a block to run with 78654 threads, that will not happen with any of the current nvidia gpu.

Thanks for reply alot. I think 70000+ threads in a block is too much, and I changed my code to the following, using warp you mentioned:

``````#define blockSize 256

#include <stdio.h>

unsigned long long oldval, newval, readback;

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {

newval = __double_as_longlong(__longlong_as_double(oldval) + value);

}

}

//template <unsigned int blockSize>

__global__ void myreduce(double *g_odata, double *g_idata, unsigned int n)

{

__shared__ double sdata[blockSize];

unsigned int i = blockIdx.x*(blockSize*2) + tid;

unsigned int gridSize = blockSize*2*gridDim.x;

sdata[tid] = 0;

while (i < n)

{

sdata[tid] += g_idata[i] + g_idata[i+blockSize];

i += gridSize;

}

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }

if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64]; }  __syncthreads(); }

if (tid < 32)

{

if (blockSize >= 64) sdata[tid] += sdata[tid + 32];

if (blockSize >= 32) sdata[tid] += sdata[tid + 16];

if (blockSize >= 16) sdata[tid] += sdata[tid + 8];

if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];

if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];

if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];

}

//if (tid == 0) g_odata[blockIdx.x] = sdata;

if (tid == 0) atomicAdd(&g_odata, sdata);

}
``````

The function that calls this kernel is a MATLAB script:

``````function out= amyreduce_cuda(b)

persistent a;

N = length(b);

disp 'Initializing GPU drf calculation...';

tmp = gpuDevice;

gpuadd.GridSize = [min(1024,tmp.MaxGridSize(1)) 1]; % min(1024, maxblocks)

a = gpuArray(zeros(5,1));

end;

out = gather(a);
``````

The input, b, to MATLAB function, which is equivalent to the input to kernel, g_idata, is created by MATLAB code:

``````b = ones(78765,1)';
``````

which is basically a column vector with 78765 entries. And all the entries are 1.

So I figure if the whole things goes right , then what comes out should be the sum of the vector, 78765 of 1s, 1+1+1…+1 = 78765; But I got wierd result : 3.1913*10^31

although the .cu file can be compiled and the kernel can be created without any problem. Where I messed it up? Any pointers?

Thanks a lot.

I am sorry, I can not follow. Just start from scratch and check every time that ever works.

I apologize for keeping evolving my code.

But one thing I’d appreciate if you could explain a little bit more: If I have a large size vector, how do I process it? Do I split the vector into different blocks? Say my vector has 76543 entries, if my block size is 256, does that mean I only process 256 threads in my code?

Very confused.

Let take the example of vector addition

c[i]=a[i]+b[i]; with i<Nv.

Now we assume that each thread does one addition. This means you need to ask for the function to be executed with Nv threads. When the code is executed the job is sent to the gpu in blocks. Each block will execute some blocksize. blocksize must be smaller than 1024 or even less. The blocks are independent of each other. Now we have defined that each block will execute blocksize threads, we still have to provide the numebr of blocks in a such a way that all valeus i from 0 to Nv-1 are done.

So in order to do this simple vector addition we have the following kernel:

``````__global__ void vecadd(int *a,int *b,int *c,const int Nv)

{

if(i<Nv) //make sure it does not try to access outofbounds

c[i]=a[i]+b[i];

}

}
``````

Now we have the kernel, but we also have to specifiy the way it is going to be executed

in the main code in cu I would do something like this:

This line instructs the gpu to launch (Nv/blocksize+1) blocks with blocksize threads in each block. Totally you will get Nv threads doing Nv additions . I hope this answers your question.

when I do something new I start with simple stuff and go more complex only after I know it works for sure.

I suggest you to start first with this vector addition and see how it works, then add more complexity to your code. This way you learn while you program. You are missing many things. Read the cuda programming guide.

Do not do complicated things if you can not do the simple ones.

The problem you put in the first post of this topic, I have a general idea how to do it in CUDA C (assuming I got it right).

Late edit: I just realized the code in the previous post has some errors. If you decide to implement something similar I can give more details, but you have to have some basics in order to understand the code I put.

Yes, that helps.

But I am using MATLAB, so I guess I have to dig into MATLAB to see how block size is defined.

Thanks a lot!

``````gpuadd.GridSize = [min(1024,tmp.MaxGridSize(1)) 1]; % min(1024, maxblocks)