Passing state information between thread steps

Suppose each thread block has a full 32 warps, so each block has 32 * 32 = 1024 threads.
Each thread references some global memory by block id (bid) and thread id (tid).

Each thread acts independently of every other thread in its thread block.
The output for each thread is individually indexed by (bid, tid) too.

Each thread must take the unit of work assigned to it through 2000 ThreadStep calls to arrive at the output for that (bid, tid).

There are 12 doubles that serve as the input state for ThreadStep, and 12 doubles that are the output state.
There are also multiple thread blocks (64, probably).

__device__ int ThreadStep(int step, int tid, int bid, GlobalStuff *g, double Sin[12], double Sout[12])
{
    // ... move the state from Sin to Sout. Read-only access to g required.
}

__global__ void pseudoKernel(GlobalStuff *g)
{
    double Sin[12];
    double Sout[12];

    // get my tid and bid numbers
    // load Sin from global memory based on (tid, bid)

    for (int s = 0; s < 2000; s++)
    {
        ThreadStep(s, tid, bid, g, Sin, Sout);
        memcpy(Sin, Sout, 12 * sizeof(double));
    }

    WriteToGlobalForThisThread(g, tid, bid, Sout);
}
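For concreteness, here is roughly what the real kernel might look like, assuming a 32 * 32 block layout and, purely for illustration, a GlobalStuff member called state holding 12 consecutive doubles per thread (the field name and layout are made up for this sketch):

__global__ void stateKernel(GlobalStuff *g)
{
    double Sin[12];
    double Sout[12];

    // flatten the 32 * 32 thread index into a single tid; one bid per block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;   // 0 .. 1023
    int bid = blockIdx.x;                                // 0 .. 63

    // hypothetical layout: 12 consecutive doubles per thread, grouped by block
    int base = (bid * blockDim.x * blockDim.y + tid) * 12;
    for (int i = 0; i < 12; i++)
        Sin[i] = g->state[base + i];

    for (int s = 0; s < 2000; s++)
    {
        ThreadStep(s, tid, bid, g, Sin, Sout);
        for (int i = 0; i < 12; i++)
            Sin[i] = Sout[i];
    }

    for (int i = 0; i < 12; i++)
        g->state[base + i] = Sout[i];
}

Storing a thread's 12 doubles contiguously makes these loads and stores strided; a structure-of-arrays layout (one array per state component, indexed by the global thread number) would coalesce better, but that is a separate tuning question.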


The state information is private to each thread, but 12 doubles * 32 * 32 threads begins to add up:
12 * 8 * 32 * 32 = 98,304 bytes for each thread block.
There might be 64 thread blocks.
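Counting both Sin and Sout, that is 24 * 8 = 192 bytes of live state per thread, 192 * 32 * 32 = 196,608 bytes per thread block, and 64 * 196,608 = 12,582,912 bytes (about 12 MB) over all blocks.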
My questions are:

  1. Will the state information Sin[12], Sout[12] be truly private and fully coherent to the thread that is using it?
  2. Will there be performance problems passing state information this way from one thread step to the next?
  3. I assume I don't have to do any thread sync in this kernel because all threads in the thread block work independently.

ThreadStep's only use of g is read-only.
I plan to use zero-copy and mark a lot of it cudaHostAllocWriteCombined because it's read-only.
Eventually I might add a reduction step over the Sout's so only a single Sout is produced across all thread blocks, but not for now.

1. Yes.

2. No, not really. If you always access Sin and Sout with index values determined at compile time, the compiler can just fold them into registers (see the sketch below). Otherwise, it will spill them to local memory (which is still pretty fast, cached in L1). Don't expect to get a full 1024-thread block though; that isn't usually possible even in the best circumstances. If you run into performance problems with these being in local memory, you can also put them in shared memory, but you will then need to carefully limit your thread block size to not overflow it.

3. Of course.
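As a sketch of what "index values determined at compile time" means for answer 2: write the per-step copy as a fixed-count loop and unroll it, so every index into Sin and Sout is a constant. The fragment below is just the 2000-step loop from the pseudo-kernel above with that change:

    for (int s = 0; s < 2000; s++)
    {
        ThreadStep(s, tid, bid, g, Sin, Sout);

        // fully unrolled: indices 0..11 are compile-time constants, so the
        // compiler can keep both 12-double arrays in registers
        #pragma unroll
        for (int i = 0; i < 12; i++)
            Sin[i] = Sout[i];
    }

ThreadStep itself also has to end up inlined (e.g. __forceinline__) and avoid dynamic indexing into Sin/Sout internally; if the arrays' addresses escape into a real function call, they will still be spilled to local memory.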

On the zero-copy plan: zero-copy is generally best recommended for read-once or write-once data. Using it for data that you are going to read over and over again is putting too much trust in the cache / PCI-e hierarchy. Better to just use plain old global memory for this.
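For comparison, a minimal sketch of the plain-global-memory approach. The function name uploadAndRun and the host pointer hostG are placeholders, and this assumes GlobalStuff contains no embedded pointers (any pointed-to arrays would need their own cudaMalloc/cudaMemcpy):

void uploadAndRun(const GlobalStuff *hostG)
{
    GlobalStuff *devG = NULL;

    // one-time upload of the read-only data instead of mapped zero-copy memory
    cudaMalloc((void **)&devG, sizeof(GlobalStuff));
    cudaMemcpy(devG, hostG, sizeof(GlobalStuff), cudaMemcpyHostToDevice);

    // 64 blocks of 32 * 32 threads, the configuration from the question
    pseudoKernel<<<64, dim3(32, 32)>>>(devG);
    cudaDeviceSynchronize();

    cudaFree(devG);
}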