I have two nested loops, with i on the outside and j on the inside; the inside loop is multithreaded in CUDA. Basically I want to sum over j, then put the sum at the one location indexed by i. Here is the code (I deleted some code for the sake of simplicity):

__global__
void afc_example
(
double *d,
double *afr,
double *afl,
const double *x,
const double *g,
const int Ns,
const int Nt
)
{ /* j: inside loop; i: outside loop */
    int j;
    int i;
    double art;
    double phs;
    double zr, zi;
    for (i = Nt-1; i >= 0; i--)
    { /* initialize */
        j = blockIdx.x*blockDim.x + threadIdx.x; /* restart the grid-stride sweep for each i */
        while (j < Ns)
        {
            phs = x[j]*g[i];
            zr = cos(phs/2.0);
            zi = sin(phs/2.0);
            art = afr[j]*zr + afl[j]*zi;
            d[i] += art; // multiple threads writing to d[i] to sum up
            j += blockDim.x*gridDim.x;
        } /* j */
    } /* i */
}

I tested the code and the result is not right, so I suspect the following line causes the problem:

d[i] += art;

It seems that, at this step, many art values from all the threads are being added onto d[i] at the same time. Is that OK? Will they be added up into d[i] as I expect? The array d is an input to this global function, from an outside variable created by MATLAB. It is basically an Nt x 1 zero vector created by the MATLAB line:

gpuArray(zeros(Nt, 1))

If I cannot do that, then how do I sum over j and then put the sum into d[i]?

Indeed, the line you pointed out triggers a race condition. What you need is a (cleverly placed) atomic operation, but since your variables are double, atomicAdd is not supported by the hardware (yet?). You can define an atomicAdd function yourself, based on atomicCAS, this way:

Disclaimer: I found this function about 2 years ago in one of these forums, and I have used it ever since (only when necessary). It works, but it is about 20x slower than what could be expected from proper hardware support of atomicAdd for double. However, in our case, that shouldn’t be too big a deal.

__device__ inline void atomicAdd(double *address, double value) {
    unsigned long long oldval, newval, readback;
    oldval = __double_as_longlong(*address);
    newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {
        oldval = readback;
        newval = __double_as_longlong(__longlong_as_double(oldval) + value);
    }
}

Then, with this function defined, here is how your code should be modified:

__global__
void afc_example
(
double *d,
double *afr,
double *afl,
const double *x,
const double *g,
const int Ns,
const int Nt
)
{ /* j: inside loop; i: outside loop */
    __shared__ double s_d; // for per-block accumulation
    int j;
    int i;
    double art;
    double phs;
    double zr, zi, l_d; // for local accumulation (over a fixed i)
    if (threadIdx.x == 0) s_d = 0;
    __syncthreads();
    for (i = Nt-1; i >= 0; i--)
    { /* initialize */
        j = blockIdx.x*blockDim.x + threadIdx.x; // restart the grid-stride sweep for each i
        l_d = 0;
        while (j < Ns)
        {
            phs = x[j]*g[i];
            zr = cos(phs/2.0);
            zi = sin(phs/2.0);
            art = afr[j]*zr + afl[j]*zi;
            l_d += art; // local accumulation of art
            j += blockDim.x*gridDim.x;
        } /* j */
        // local reduction of l_d into the shared variable
        atomicAdd(&s_d, l_d);
        __syncthreads();
        // global accumulation of the shared variable into d[i]
        if (threadIdx.x == 0) atomicAdd(&d[i], s_d); // only one thread per block does the update
        __syncthreads();
    } /* i */
}

I haven’t tested the code, so it might need some slight adjustments. The idea should work, however.

    atomicAdd(&s_d1, l_d1);
    atomicAdd(&s_d2, l_d2);
    __syncthreads();
    // global accumulation of the shared variables into d[i]
    if (threadIdx.x == 0)
    {
        atomicAdd(&d1[i], s_d1);
        atomicAdd(&d2[i], s_d2);
    } // only one thread per block does the update
    __syncthreads();

Well, if I understand you, what you need is two shared variables s_d1 and s_d2 and two local variables l_d1 and l_d2; you just copy/paste the lines with s_d and l_d in the code I gave you, so that you have one line for “_d1” and one line for “_d2” each time… No need to add more __syncthreads(). Simply take care to add braces “{ }” to delimit the block of the “if (threadIdx.x == 0)”, since you now have two statements to encompass instead of one as in my example.
And I hope that will do.

EDIT (since you completed your post in-between): yes, you’ve got it right.

Hmmm… I have not got the right result yet, but it could be something wrong with my MATLAB code. I need to take some time to double-check it. I will keep you updated.
I really appreciate your help.

Sorry, I just realised I forgot to reset the s_d variables to 0 right after the accumulation into the global variables.

This could translate into:

    if (threadIdx.x == 0) {
        atomicAdd(&d1[i], s_d1);
        atomicAdd(&d2[i], s_d2);
        s_d1 = s_d2 = 0;
    } // only one thread per block does the update
    __syncthreads();

You can also put the reset at the beginning of the i loop body, but doing it this way spares you one __syncthreads() per iteration.

Hi again,
I suddenly realized that what you are doing is really a generic reduction, not a histogram as the “[i]” index led me to think at the beginning.
So the solution I gave you, albeit valid, is far from the most effective one. You can just apply a “standard” reduction algorithm to your variables, such as the ones described in the reduction directory of the SDK samples. That should be much more effective, since the reduction is done in parallel whereas the use of atomics serialises the process. So a good solution is to do a partial reduction at the block level using shared memory, and then to do the inter-block reduction using atomics (since you cannot enforce inter-block synchronisation).
I don’t have time right now to give you the code, but if you want I’ll do it on Monday (unless somebody else wants to contribute).

An error occurred when my MATLAB code called the kernel:

Error using parallel.gpu.CUDAKernel
An error occurred during PTX compilation of <image>.
The information log was:
The error log was:
ptxas application ptx input, line 1563; error : Global state space expected for
instruction 'atom'
ptxas application ptx input, line 1575; error : Global state space expected for
instruction 'atom'
ptxas application ptx input, line 1589; error : Global state space expected for
instruction 'atom'
ptxas application ptx input, line 1601; error : Global state space expected for
instruction 'atom'
ptxas fatal : Ptx assembly aborted due to error
The CUDA error code was: CUDA_ERROR_NO_BINARY_FOR_GPU.
Error in derivab_ij_optcont_cuda (line 16)
gpuderivab = parallel.gpu.CUDAKernel('afc_example.ptx','afc_example.cu','afc_example')

Shouldn’t we use atomicAdd() here, since results from all j are coming into l_d? It seems to me a race condition is happening; isn’t this a case for using atomicAdd()? Something like:

The trouble comes from the compute capability you are targeting (I guess), since the atomicAdd function I had you define uses atomicCAS on long long, which is only available here from compute capability 2.0 onwards. If you’ve got a card with such a compute capability (any Fermi-based card is one), replace “-arch=compute_13” with “-arch=compute_20” and that should do.

No. Here “l_d” is local, i.e. private to each thread and very likely stored in a register. So there is no risk of a race condition, since there is no sharing of data at this level.

Now, I have revisited your code entirely, and provided your Nt is large enough, I guess that parallelising the outer loop will be much simpler than parallelising the inner one. This way, you don’t have any tricky reduction over doubles to deal with. And then, if you want or need to speed up the code starting from that, you can consider tiling it at the j index level, to cache afr, afl and x either automatically or manually.

What would the following code give you?

__global__
void afc_example(double *d, double *afr, double *afl, const double *x,
                 const double *g, const int Ns, const int Nt) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    double art;
    double phs;
    double zr, zi;
    for (int i = tid; i < Nt; i += blockDim.x*gridDim.x) {
        art = 0;
        for (int j = 0; j < Ns; j++) {
            phs = 0.5*x[j]*g[i];
            zr = cos(phs);
            zi = sin(phs);
            art += afr[j]*zr + afl[j]*zi;
        } /* j */
        d[i] += art;
    } /* i */
}

I was not able to go through all the posts, but this is a problem which can be solved using shared memory. First store the art variable in shared memory, then do a reduction to compute the sum locally, and at the end use an atomic add to add the result of each block to the total sum. The atomicAdd function does not support double, but you can use the example from the programming guide if you have cc 2.0.

Here is the function from the guide:

__device__ double atomicdoubleAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

The reduction algorithm on a block works for block sizes which are a power of 2:

// first store each thread's partial value into a shared array tempene[2*bsl]
// (the block size in my case is blockDim.x = 2*bsl)
__syncthreads();
for (int offset = bsl; offset >= 1; offset = offset/2)
{
    if (tid < offset)
    {
        tempene[tid] = tempene[tid] + tempene[tid+offset];
    }
    __syncthreads();
}
if (tid == 0)
{
    // the result is stored in tempene[0]; use atomic add to complete the sum
}

I used arch=compute_20 and the PTX error seems to be gone. I got a new error, even though I used the atomicdoubleAdd() which you can see right after your last post: I now get the error CUDA_ERROR_LAUNCH_TIMEOUT. I think there is still something wrong in my code. I am posting it here:

__device__ double atomicdoubleAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
__global__
void afc_example
(
double *d1,
double *d2,
double *afr,
double *afl,
const double *x,
const double *g,
const int Ns,
const int Nt
)
{ /* j: inside loop; i: outside loop */
    __shared__ double s_d1; // for per-block accumulation
    __shared__ double s_d2;
    int j;
    int i;
    double art1, art2;
    double phs;
    double zr, zi, l_d1, l_d2; // for local accumulation (over a fixed i)
    if (threadIdx.x == 0) { s_d1 = 0; s_d2 = 0; }
    __syncthreads();
    for (i = Nt-1; i >= 0; i--)
    { /* initialize */
        j = blockIdx.x*blockDim.x + threadIdx.x; // restart the grid-stride sweep for each i
        l_d1 = 0; l_d2 = 0;
        while (j < Ns)
        {
            phs = x[j]*g[i];
            zr = cos(phs/2.0);
            zi = sin(phs/2.0);
            art1 = afr[j]*zr + afl[j]*zi;
            art2 = afr[j]*zi - afl[j]*zr;
            l_d1 += art1; // local accumulation of art
            l_d2 += art2;
            j += blockDim.x*gridDim.x;
        } /* j */
        // local reduction of l_d into the shared variables
        atomicdoubleAdd(&s_d1, l_d1);
        atomicdoubleAdd(&s_d2, l_d2);
        __syncthreads();
        // global accumulation of the shared variables into d[i]
        if (threadIdx.x == 0)
        {
            atomicdoubleAdd(&d1[i], s_d1); // only one thread per block does the update
            atomicdoubleAdd(&d2[i], s_d2);
            s_d1 = 0; s_d2 = 0;
        }
        __syncthreads();
    } /* i */
}

I am new to CUDA, but the code makes sense to me. It uses a local l_d to save the result from each thread, then puts them into the shared memory variable s_d.

But something somewhere went wrong. My outside loop runs from Nt-1 down to 0, so it is decreasing, not increasing. Does that matter?

I have not tried your new code yet. Perhaps we should get one thing to work, then move to the other? Thanks a lot; I truly appreciate your help.

I don’t think parallel computing over i is a good idea, because some variables are updated with each iteration of i, so each i-th iteration depends on its previous, (i-1)-th, iteration. j is good for parallel computing because the j-th computation only depends on the j-th location; in the j dimension, processing is independent. It is like a 2D matrix that I want to fill in. The 2D matrix has size Ns x Nt. I can fill in (process) each column at the same time, which is equivalent to processing all j simultaneously for a fixed i. Then I move to the next i and do the same thing, processing all j again at the same time, and so on until all i are processed, i.e. until hitting Nt.

After getting the Ns x Nt matrix, I am going to add along the j direction, which is along the columns from top to bottom. By doing so, I will finally have a 1 x Nt vector, d[i].

I think I still need to use atomicAdd(); I just have not figured it out yet.