Problems with __threadfence

GiP · November 11, 2009, 2:38pm

Hi,

I am trying to write a prog for a C1060 (Fedora 10) and I am having problems with (what else?) __threadfence.

The problem is: I have to iterate a huge number of globally coupled maps, that is maps that need a value that depends on all other maps.
Something like:
Mn(t+1)= some function + A(t)
where Mn is the nth map and A(t) is the sum of all Mn(t).

Problem is that a single thread can iterate about 100 Maps (memory problems…) so I need several blocks of nt threads to accommodate all the maps.

I tried with this kernel, sort of, lots of stuff is missing but the idea is there.
…
“res” is for the results and “avg” is where I store the averages of each thread, both are defined in the main().

__device__ double ta;

__global__ void calc(double *res,double *avg)
{
double x[100],tavg;
int nt = blockDim.x; // How many Threads in Block
int nb = gridDim.x; // How many Blocks in Grid
int i = threadIdx.x; // My thread
int b = blockIdx.x; // My block
int m; // My index
int n,t;
int tl;

// my index
m=b*nt+i;

// initialize x for the thread and calculate local average

tavg=0.;
for(n=0;n<100;n++)
{ x[n]=(something randomized);
tavg+=x[n];
}
tavg=tavg/100;

///
// store in avg[m], sync, and build ta, the total average
avg[m]=tavg;

__threadfence(); // wait until all threads sync and avg is filled

if (m==0) //one of the threads does the total average
{ tl=0.;
for(n=0;n<nb*nt;n++)
tl+=avg[n];
ta=tl/(nb*nt);
res[0]=ta;
}
__threadfence(); // wait until all threads sync and ta is filled

start the loop where sort of the same is done

iterate the maps using “ta” as the total average.
each thread “m” calculate the local average and stores it in avg[m].
__threadfence() , total average, __threadfence()
save it in res[t] and loop.

Right now I am not interested in speed just in getting a result (I am not…). If __threadfence() is supposed to make sure that all the writes to global memory i.e. avg are done why doesn’t this work? Or maybe it does work and the problem is elsewhere and I am too stupid to see it?

I cannot see what is happening with the debugger as it fails on this prog (a different problem).

Is there a different approach for this?

Thanks for any help.

GiP

Tigga · November 11, 2009, 2:45pm

__threadfence() doesn’t sync. All it does is ensure that all memory instructions before the call have been completed before continuing. It’s not all memory instructions in all threads, only those in the calling thread. I found that the thread fencing reduction example in the SDK explains it well.
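
For readers following along, here is a minimal sketch of the pattern that SDK example is built around, as far as I understand it. It is not the SDK code itself: the names `sum_last_block`, `blocks_done`, `device_sum` and the fixed block size of 256 are assumptions made for illustration. Each block writes its partial sum, fences, then takes a ticket from an atomic counter. The fence only guarantees that the partial sum is visible before the ticket is taken; no block ever waits for another.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // assumed block size (must be a power of two for the tree reduction)

// Hypothetical counter of blocks that have published their partial sum.
__device__ unsigned int blocks_done = 0;

// Each block reduces its share of `in` into partial[blockIdx.x].
// The block that draws the last ticket adds up all partial sums.
__global__ void sum_last_block(const double *in, volatile double *partial,
                               double *total, int n)
{
    __shared__ double s[THREADS];
    __shared__ bool is_last;
    int tid = threadIdx.x;

    // Grid-stride loop: every thread accumulates its own elements.
    double acc = 0.0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        acc += in[i];
    s[tid] = acc;
    __syncthreads();  // barrier inside the block only

    // Shared-memory tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) {
        partial[blockIdx.x] = s[0];
        __threadfence();  // order the write above before the ticket below; this is NOT a barrier
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the last block to arrive is certain that every partial sum is visible.
    if (is_last && tid == 0) {
        double t = 0.0;
        for (unsigned int b = 0; b < gridDim.x; ++b) t += partial[b];
        *total = t;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host helper (hypothetical): copies data over, launches 4 blocks, returns the sum.
double device_sum(const double *h_in, int n)
{
    const int blocks = 4;
    double *d_in, *d_partial, *d_total, h_total = 0.0;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_partial, blocks * sizeof(double));
    cudaMalloc(&d_total, sizeof(double));
    cudaMemcpy(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);
    sum_last_block<<<blocks, THREADS>>>(d_in, d_partial, d_total, n);
    cudaMemcpy(&h_total, d_total, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_partial); cudaFree(d_total);
    return h_total;
}
```

Note that the total is only available to the last block, and to the host after the kernel finishes. The other blocks have already moved on, which is why this pattern alone cannot feed the total back to every thread inside the same kernel. On a C1060 the doubles need `-arch=sm_13`.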

GiP · November 11, 2009, 3:16pm

I don’t get it. What is the point of being sure that writes “in a thread” have been completed? I thought that __threadfence_block worked at block level and __threadfence above that, and that as long as all writes to the global array are done I was fine.

I looked at the reduction example but I saw it uses separate kernel invocations and I cannot do that, I think. I need the values of the maps in the thread to calculate the next iteration. Unless there is a way of declaring a global variable for all the functions in a thread (a single thread actually calls some other functions) such that this global variable maintains its values between invocations, I don’t see how I can do this.

Thanks for the help!

GiP
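
One possible way around this, sketched under assumptions: memory allocated with cudaMalloc keeps its contents between kernel launches, so the map values can live in a global array instead of the per-thread `x[100]`, and each time step can be its own launch. Launches issued to the same stream run in order, so the boundary between them acts as the grid-wide sync that __threadfence() does not provide. The names `step`, `reduce_avg`, `iterate_maps` and the map `0.5*x + A` are placeholders, not the real map from the original post.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define MAPS_PER_THREAD 100  // as in the original kernel

// One time step: every thread updates its own maps in global memory,
// using the total average left behind by the previous launch.
__global__ void step(double *x, double *avg, const double *total_avg, int n_threads)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= n_threads) return;
    double a = *total_avg, local = 0.0;
    for (int n = 0; n < MAPS_PER_THREAD; ++n) {
        int idx = n * n_threads + m;   // neighbouring threads touch neighbouring addresses
        double v = 0.5 * x[idx] + a;   // placeholder map (assumption), plus A(t)
        x[idx] = v;                    // state survives until the next launch
        local += v;
    }
    avg[m] = local / MAPS_PER_THREAD;
}

// Single-thread total average, mirroring the `if (m==0)` branch of the original.
__global__ void reduce_avg(const double *avg, double *total_avg, double *res,
                           int t, int n_threads)
{
    double s = 0.0;
    for (int n = 0; n < n_threads; ++n) s += avg[n];
    *total_avg = s / n_threads;
    res[t] = *total_avg;
}

// Host driver (hypothetical): all maps start at x0; h_res receives one average per step.
void iterate_maps(double x0, int blocks, int threads, int steps, double *h_res)
{
    int n_threads = blocks * threads;
    size_t n_maps = (size_t)n_threads * MAPS_PER_THREAD;
    std::vector<double> h_x(n_maps, x0);
    double *d_x, *d_avg, *d_total, *d_res;
    cudaMalloc(&d_x, n_maps * sizeof(double));
    cudaMalloc(&d_avg, n_threads * sizeof(double));
    cudaMalloc(&d_total, sizeof(double));
    cudaMalloc(&d_res, steps * sizeof(double));
    cudaMemcpy(d_x, h_x.data(), n_maps * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_total, &x0, sizeof(double), cudaMemcpyHostToDevice);  // initial average is x0
    for (int t = 0; t < steps; ++t) {
        step<<<blocks, threads>>>(d_x, d_avg, d_total, n_threads);
        reduce_avg<<<1, 1>>>(d_avg, d_total, d_res, t, n_threads);
    }
    cudaMemcpy(h_res, d_res, steps * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_x); cudaFree(d_avg); cudaFree(d_total); cudaFree(d_res);
}
```

The single-thread `reduce_avg` is slow but keeps the sketch close to the original code; it could be swapped for a proper parallel reduction once the results are right.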
