 Hi,

I am trying to write a program for a C1060 (Fedora 10) and I am having problems with (what else?) __threadfence().

The problem is: I have to iterate a huge number of globally coupled maps, that is, maps whose next value depends on all the other maps.
Something like:
Mn(t+1)= some function + A(t)
where Mn is the nth map and A(t) is the sum of all Mn(t).

The problem is that a single thread can iterate only about 100 maps (memory problems…), so I need several blocks of nt threads to accommodate all the maps.

I tried with this kernel. Lots of stuff is missing, but the idea is there.

“res” is for the results and “avg” is where I store the average of each thread; both are defined in main().

__device__ double ta;       // total average, visible to every thread

__global__ void calc(double *res, double *avg)
{
    double x[100], tavg;    // the 100 maps owned by this thread
    int nt = blockDim.x;    // How many Threads in Block
    int nb = gridDim.x;     // How many Blocks in Grid
    int b = blockIdx.x;     // My block
    int i = threadIdx.x;    // My thread within the block
    int m;                  // My index
    int n, t;
    double tl;

    // my index
    m = b*nt + i;

    // initialize x for the thread and calculate local average

    tavg = 0.;
    for (n = 0; n < 100; n++)
    {
        x[n] = (something randomized);
        tavg += x[n];
    }
    tavg = tavg/100;

    ///
    // store in avg[m], sync, and build ta, the total average
    avg[m] = tavg;

    if (m == 0) // one of the threads does the total average
    {
        tl = 0.;
        for (n = 0; n < nb*nt; n++)
            tl += avg[n];
        ta = tl/(nb*nt);
        res[0] = ta;
    }

Then start the loop, where more or less the same is done:

iterate the maps using “ta” as the total average;
each thread “m” calculates its local average and stores it in avg[m];
save the total average in res[t] and loop.

Right now I am not interested in speed, just in getting a result (which I am not…). If __threadfence() is supposed to make sure that all the writes to global memory, i.e. to avg, are done, why doesn’t this work? Or maybe it does work, the problem is elsewhere, and I am too stupid to see it?

I cannot see what is happening with the debugger, as it fails on this program (a different problem).

Is there a different approach for this?

Thanks for any help.

GiP

__threadfence() doesn’t sync. All it does is ensure that the memory writes issued by the calling thread before the call are visible to the other threads before that thread continues. It covers only the calling thread’s memory instructions, not those of all threads, and it does not make any other thread wait. I found that the thread fence reduction example in the SDK (threadFenceReduction) explains it well.
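To make that concrete, here is a minimal sketch of the fence-plus-counter pattern, loosely modelled on the SDK example rather than copied from it. The names `blocksDone`, `totalAverage`, `blockAvg` and `runTotalAverage` are made up for illustration, and all blocks are assumed to have the same size. Each block publishes one partial average, calls __threadfence(), then takes a ticket from an atomic counter; whichever block draws the last ticket knows every partial result is visible and combines them. Note that only that last block ever sees the total: the other blocks cannot wait for it inside the same launch.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int blocksDone = 0;  // hypothetical counter of blocks that have published

// localAvg holds one average per thread; blockAvg gets one entry per block.
// blockAvg is volatile so the last block really re-reads global memory.
__global__ void totalAverage(const double *localAvg, volatile double *blockAvg,
                             double *total)
{
    __shared__ bool isLast;

    if (threadIdx.x == 0) {
        // Thread 0 sums its block's per-thread averages (slow but simple).
        double s = 0.0;
        for (unsigned int i = 0; i < blockDim.x; ++i)
            s += localAvg[blockIdx.x * blockDim.x + i];
        blockAvg[blockIdx.x] = s / blockDim.x;

        // Make the write above visible device-wide BEFORE taking a ticket.
        __threadfence();

        // atomicInc returns the old value; the last block gets gridDim.x - 1.
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        isLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (isLast && threadIdx.x == 0) {
        // Every other block has already published and fenced its partial result.
        double s = 0.0;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            s += blockAvg[b];
        *total = s / gridDim.x;
        blocksDone = 0;  // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: averages nb*nt values using nb blocks of nt threads.
double runTotalAverage(const double *hostAvg, int nb, int nt)
{
    double *dAvg, *dBlock, *dTotal, result = 0.0;
    cudaMalloc(&dAvg, nb * nt * sizeof(double));
    cudaMalloc(&dBlock, nb * sizeof(double));
    cudaMalloc(&dTotal, sizeof(double));
    cudaMemcpy(dAvg, hostAvg, nb * nt * sizeof(double), cudaMemcpyHostToDevice);
    totalAverage<<<nb, nt>>>(dAvg, dBlock, dTotal);
    cudaMemcpy(&result, dTotal, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dAvg); cudaFree(dBlock); cudaFree(dTotal);
    return result;
}
```

In the kernel from the question, thread 0 of block 0 reads avg[n] without anything telling it that the other threads have written their entries yet, which is the part the fence alone cannot fix.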

I don’t get it. What is the point of being sure that writes “in a thread” have been completed? I thought that __threadfence_block() worked at block level and __threadfence() above that, and that as long as all writes to the global array were done I was fine.

I looked at the reduction example, but I saw that it uses separate kernel invocations, and I don’t think I can do that. I need the values of the maps in the thread to calculate the next iteration. Unless there is a way of declaring a variable that is global to all the functions in a thread (a single thread actually calls some other functions) and that maintains its values between invocations, I don’t see how I can do this.
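For reference, memory obtained with cudaMalloc and variables declared __device__ do keep their contents between kernel launches, so the separate-invocation route could look roughly like the sketch below. It is only an illustration under assumptions: `f` is a placeholder for “some function”, and `iterate`, `combine`, `run` and `MAPS_PER_THREAD` are invented names. The maps live in a global array instead of a per-thread local array, and the boundary between two launches acts as the grid-wide sync.

```cuda
#include <cuda_runtime.h>

#define MAPS_PER_THREAD 100   // about 100 maps per thread, as in the post

__device__ double ta;  // total average; keeps its value between launches

// Hypothetical stand-in for "some function" in the post.
__device__ double f(double x) { return 0.5 * x; }

// One time step: each thread advances its own maps with the current ta
// (skipped when advance == 0) and stores its local average in avg[m].
__global__ void iterate(double *x, double *avg, int advance)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    double A = ta, s = 0.0;
    for (int n = 0; n < MAPS_PER_THREAD; ++n) {
        double *p = &x[m * MAPS_PER_THREAD + n];
        if (advance)
            *p = f(*p) + A;
        s += *p;
    }
    avg[m] = s / MAPS_PER_THREAD;
}

// Launched after iterate has finished for the whole grid, so every avg[m] is final.
__global__ void combine(const double *avg, int count, double *res, int t)
{
    double s = 0.0;
    for (int n = 0; n < count; ++n)
        s += avg[n];
    ta = s / count;
    res[t] = ta;
}

// Hypothetical host driver: hostRes[t] receives the total average at step t.
void run(const double *hostX, double *hostRes, int nb, int nt, int steps)
{
    int threads = nb * nt;
    size_t xBytes = (size_t)threads * MAPS_PER_THREAD * sizeof(double);
    double *x, *avg, *res;
    cudaMalloc(&x, xBytes);
    cudaMalloc(&avg, threads * sizeof(double));
    cudaMalloc(&res, steps * sizeof(double));
    cudaMemcpy(x, hostX, xBytes, cudaMemcpyHostToDevice);
    for (int t = 0; t < steps; ++t) {
        iterate<<<nb, nt>>>(x, avg, t > 0);   // step 0 only measures the initial state
        combine<<<1, 1>>>(avg, threads, res, t);
    }
    cudaMemcpy(hostRes, res, steps * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(x); cudaFree(avg); cudaFree(res);
}
```

Kernels issued to the same stream run in order, so combine never starts before every block of iterate has finished. The single-thread combine and the strided access to x are slow, but that matches the “result first, speed later” goal.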

Thanks for the help!

GiP