Suppose each thread block has a full 32 warps, so each block has 32 * 32 = 1024 threads.
Each thread references some global memory indexed by its block id (bid) and thread id (tid).
Each thread acts independently of every other thread in its thread block.
The output for each thread is individually indexed by (bid, tid) too.
Each thread must take the unit of work assigned to it through 2000 ThreadStep calls to arrive at the output for that tid.
There are 12 doubles that serve as the input state for ThreadStep, and 12 doubles that are the output state.
There are also multiple thread blocks (probably 64).
__device__ int ThreadStep(int step, int tid, int bid, GlobalStuff *g, double Sin[12], double Sout[12])
{
    … move from Sin to Sout. Read-only access to g is required.
}
pseudoKernel( GlobalStuff *g )
{
    double Sin[12];
    double Sout[12];
    // get my tid and bid numbers
    Load Sin based on (tid, bid);
    for ( int s = 0; s < 2000; s++ )
    {
        ThreadStep(s, tid, bid, g, Sin, Sout);
        memcpy(Sin, Sout, 12 * sizeof(double));
    }
    WriteToGlobalForThisThread(g, tid, bid, Sout);
}
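One way to avoid the per-iteration memcpy is to ping-pong between the two buffers by swapping pointers. A minimal sketch under the assumptions above (ThreadStep, GlobalStuff, and the load/store helpers are placeholders from the pseudocode, not real APIs):

```cuda
__global__ void pseudoKernel(GlobalStuff *g)
{
    int tid = threadIdx.x;   // 0..1023 with a 1024-thread block
    int bid = blockIdx.x;    // 0..63 with 64 blocks

    double bufA[12], bufB[12];
    double *in = bufA, *out = bufB;

    LoadForThisThread(g, tid, bid, in);   // placeholder for the (tid, bid) load

    for (int s = 0; s < 2000; s++)
    {
        ThreadStep(s, tid, bid, g, in, out);
        double *tmp = in; in = out; out = tmp;   // swap pointers instead of memcpy
    }

    // after the final swap, the most recent output is in 'in'
    WriteToGlobalForThisThread(g, tid, bid, in);
}
```

The swap costs two register moves per step instead of copying 96 bytes; the only thing to watch is that after the final swap the last step's result lives in `in`, not `out`.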
The state information is private to each thread, but 12 doubles * 32 * 32 threads begins to add up:
12 * 8 * 32 * 32 = 98,304 bytes per array for each thread block.
There might be 64 thread blocks.
My questions are:
- Will the state information Sin[12], Sout[12] be truly private and fully coherent to the thread that is using it?
- Will there be performance problems passing state information this way from one thread step to the next?
- I assume I don't have to do any thread sync in this kernel because all threads in the thread block work independently.
ThreadStep's only use of g is read-only.
I plan to use zero-copy and mark much of it cudaHostAllocWriteCombined because it is read-only.
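A sketch of that zero-copy setup, assuming the GlobalStuff data is filled once on the host and only read by the device (sizes and the fill step are illustrative):

```cuda
GlobalStuff *hPtr, *dPtr;

// Mapped + write-combined: fast for the host to write sequentially,
// readable by the device over PCIe; host *reads* of this memory are slow.
cudaHostAlloc((void **)&hPtr, sizeof(GlobalStuff),
              cudaHostAllocMapped | cudaHostAllocWriteCombined);

// ... fill *hPtr on the host (write-only, ideally sequentially) ...

// Get the device-side alias of the same memory for the kernel launch.
cudaHostGetDevicePointer((void **)&dPtr, hPtr, 0);

pseudoKernel<<<64, 1024>>>(dPtr);   // 64 blocks of 32 * 32 threads
cudaDeviceSynchronize();
```

Two caveats: the device must report canMapHostMemory (and cudaSetDeviceFlags(cudaDeviceMapHost) may need to be called before any CUDA allocations), and every device read of zero-copy memory crosses PCIe, so data that every thread re-reads each step may be better copied to device memory once.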
Eventually I might add a reduction step for the Sout arrays so that a single Sout is produced across all thread blocks,
but not for now.
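For that eventual step, a hedged sketch of a block-level reduction (assuming the combine operation is a sum, which is not specified above): reduce each of the 12 components across the block in shared memory. Unlike the main loop, this part does require __syncthreads(), since threads read each other's partial sums.

```cuda
// Block-level sum of each thread's Sout; blockResult points at this
// block's own 12-double slot in global memory (naming is hypothetical).
// A second pass (or a second kernel) combines the per-block partials.
__device__ void blockReduceSout(const double Sout[12], double *blockResult)
{
    __shared__ double sdata[1024];   // one double per thread, 8 KB

    for (int c = 0; c < 12; c++)
    {
        sdata[threadIdx.x] = Sout[c];
        __syncthreads();

        // classic tree reduction; blockDim.x is a power of two (1024)
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
        {
            if (threadIdx.x < stride)
                sdata[threadIdx.x] += sdata[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            blockResult[c] = sdata[0];
        __syncthreads();   // before sdata is reused for the next component
    }
}
```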