Hi,
I’m looking for some advice on my problem, I’m basically out of ideas as to how to get around this.
Problem description
Each block is running on data independent data chunks. When the #active_blocks / SM > 1 and problem size is big enough errors start being introduced into the output. See for example my log file:
My test card was a GTS250 running on recently downloaded drivers.
I initally suspected this was due to mistakes in syncing which usually causes these kinds of errors that appear and disappear. But after having tested with #blocks == #SMs i didnt see the error anymore which lead me to to start thinking that there was something happening when there were more than one active block.
Fix
For example make the blocks unnecessarily big (large smem usage) so that only one block fits on each SM at a time. Thus i can run with 1000s of blocks but each SM is only allowed to process one thread block at a time. This guarantees a 100% success rate, but it gives me over 30% performance penalty since it obviously becomes harder to hide the off-chip latencies.
Basic kernel description
Below I’ve extrapolated what i think are the interesting parts of my code. The main feature / curse is that the kernel uses a large amount of registers but this shouldn’t introduce random errors because of the context switching being done when there are 2 active blocks.
void main (args)
{
float* in;
float* out;
.........
myKernel<0><<<grid,block>>>(in, out); // do 10 iterations
myKernel<10><<<grid,block>>>(out, out); // do another 10, notice that we pick up were the last kernel wrote its data.
myKernel<20><<<grid,block>>>(out, out);
myKernel<30><<<grid,block>>>(out, out);
.........
}
...............
const int some_constant = ....;
/*
* myKernel - Reads data onto on-chip memory, performs a few iterations, writes the answer back to global memory.
*
*
*/
template<int iteration>
_global__ void myKernel(float* in, float* out)
{
// Store data in on-chip registers
float reg_vals[some_constant];
// read data
#pragma unroll
for(int k = 0; k < some_constant; k++)
{
reg_vals[k] = in[blockIdx.x*blockSize + threadIdx.x + k*blockWidth];
}
// Do, for example 10 iterations, on-chip, use 'iteration' to keep track...
....
.....
// Write back to global
#pragma unroll
for(int k = 0; k < some_constant; k++)
{
out[blockIdx.x*blockSize + threadIdx.x + k*blockWidth] = reg_val[k];
}
}
That should be a decent problem description. My hope is that someone will recognize the issue and tell me what I haven’t thought of. Otherwise i will have to produce a repro case…
Thank you for your time!