Hi,
I am trying to write a program for a C1060 (Fedora 10) and I am having problems with (what else?) __threadfence().
The problem is: I have to iterate a huge number of globally coupled maps, that is, maps whose update needs a value that depends on all the other maps.
Something like:
Mn(t+1)= some function + A(t)
where Mn is the nth map and A(t) is the sum of all Mn(t).
The problem is that a single thread can iterate only about 100 maps (memory problems…), so I need several blocks of nt threads to accommodate all the maps.
I tried something like the kernel below; lots of stuff is missing, but the idea is there.
…
“res” is for the results and “avg” is where I store the average of each thread; both are defined in main().
__device__ double ta;
__global__ void calc(double *res,double *avg)
{
double x[100],tavg;
int nt = blockDim.x; // How many Threads in Block
int nb = gridDim.x; // How many Blocks in Grid
int i = threadIdx.x; // My thread
int b = blockIdx.x; // My block
int m; // My index
int n,t;
double tl; // accumulator for the total average (double, since avg[] holds doubles)
// my index
m=b*nt+i;
// initialize x for the thread and calculate local average
tavg=0.;
for(n=0;n<100;n++)
{ x[n]=(something randomized);
tavg+=x[n];
}
tavg=tavg/100;
///
// store in avg[m], sync, and build ta, the total average
avg[m]=tavg;
__threadfence(); // wait until all threads sync and avg is filled
if (m==0) //one of the threads does the total average
{ tl=0.;
for(n=0;n<nb*nt;n++)
tl+=avg[n];
ta=tl/(nb*nt);
res[0]=ta;
}
__threadfence(); // wait until all threads sync and ta is filled
// ... then start the loop, where more or less the same is done:
//   iterate the maps using “ta” as the total average;
//   each thread “m” calculates its local average and stores it in avg[m];
//   __threadfence(), total average, __threadfence();
//   save it in res[t] and loop.
Right now I am not interested in speed, just in getting a result (which I am not…). If __threadfence() is supposed to make sure that all the writes to global memory, i.e. to avg, are done, why doesn’t this work? Or maybe it does work, the problem is elsewhere, and I am too stupid to see it?
I cannot see what is happening with the debugger, as it fails on this program (a different problem).
Is there a different approach for this?
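The only alternative I can think of is sketched below, and it rests on my own assumptions rather than on anything I have got working. As far as I understand, __threadfence() only orders the calling thread's own memory writes and does not make the other blocks wait. So the sketch ends the kernel at every point where all blocks must agree, and relies on consecutive launches in the same stream running one after the other, which gives a grid-wide barrier. The names iterate, total_average, map_fn, MAPS and STEPS, the placeholder map and the deterministic start values are all made up for illustration. x moves to global memory because a per-thread local array would not survive between launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAPS  100   // maps per thread, as in the post
#define STEPS 4     // hypothetical number of time steps

// Hypothetical placeholder for "some function"; the real map is not shown here.
__device__ double map_fn(double x) { return 0.5 * x; }

// Phase 1: each thread initializes (t == 0) or iterates its maps,
// then stores its local average in avg[m].
__global__ void iterate(double *x, double *avg, const double *ta, int t)
{
    int nthreads = gridDim.x * blockDim.x;           // threads in the whole grid
    int m = blockIdx.x * blockDim.x + threadIdx.x;   // my index
    double tavg = 0.0;
    for (int n = 0; n < MAPS; n++) {
        int k = n * nthreads + m;                    // map n of thread m
        if (t == 0) x[k] = (k % 10) / 10.0;          // placeholder for the randomized start
        else        x[k] = map_fn(x[k]) + *ta;       // ta was written by the previous launch
        tavg += x[k];
    }
    avg[m] = tavg / MAPS;
}

// Phase 2: a single thread builds the total average (slow but simple).
__global__ void total_average(const double *avg, int count,
                              double *ta, double *res, int t)
{
    double tl = 0.0;
    for (int n = 0; n < count; n++)
        tl += avg[n];
    *ta = tl / count;
    res[t] = *ta;
}

int main()
{
    const int nb = 4, nt = 64, count = nb * nt;      // hypothetical grid size
    double *x, *avg, *ta, *res, h_res[STEPS];
    cudaMalloc(&x,   sizeof(double) * count * MAPS);
    cudaMalloc(&avg, sizeof(double) * count);
    cudaMalloc(&ta,  sizeof(double));
    cudaMalloc(&res, sizeof(double) * STEPS);

    for (int t = 0; t < STEPS; t++) {
        // Each launch starts only after the previous one has finished,
        // so avg[] is complete before it is summed and ta before it is read.
        iterate<<<nb, nt>>>(x, avg, ta, t);
        total_average<<<1, 1>>>(avg, count, ta, res, t);
    }

    cudaMemcpy(h_res, res, sizeof(h_res), cudaMemcpyDeviceToHost);
    for (int t = 0; t < STEPS; t++)
        printf("res[%d] = %f\n", t, h_res[t]);

    cudaFree(x); cudaFree(avg); cudaFree(ta); cudaFree(res);
    return 0;
}
```

On the C1060 this would need nvcc -arch=sm_13 for the doubles. Error checking is left out to keep it short, and the single-thread sum is only there because I do not care about speed yet.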
Thanks for any help.
GiP