Random errors when more than 1 active block per SM. Do you recognize this?


I’m looking for some advice on my problem; I’m basically out of ideas on how to get around this.

Problem description

Each block runs on independent data chunks. When #active_blocks / SM > 1 and the problem size is big enough, errors start being introduced into the output. See for example my log file:

My test card was a GTS250 running on recently downloaded drivers.

I initially suspected this was due to mistakes in syncing, which usually causes these kinds of errors that appear and disappear. But after having tested with #blocks == #SMs I didn’t see the error anymore, which led me to start thinking that something was happening when there was more than one active block.


My current workaround: make the blocks unnecessarily big (large SMEM usage) so that only one block fits on each SM at a time. I can then run with 1000s of blocks, but each SM is only allowed to process one thread block at a time. This guarantees a 100% success rate, but it costs me over 30% in performance, since it obviously becomes harder to hide the off-chip latencies.
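For reference, that kind of occupancy-limiting padding can be sketched as below. This is only an illustration, not my actual code; the array size assumes 16 KB of shared memory per SM on a GTS 250-class (compute 1.1) part, so a >8 KB static allocation caps residency at one block per SM:

```cuda
// Hedged sketch: force one resident block per SM by padding shared memory.
// 2304 floats = 9216 bytes > 8 KB, so two blocks cannot co-reside in the
// 16 KB of per-SM shared memory on pre-Fermi hardware.
__global__ void myKernelPadded(float* in, float* out)
{
    __shared__ float occupancy_pad[2304];          // never really used
    if (threadIdx.x == 0) occupancy_pad[0] = 0.0f; // keep the compiler from eliding it

    // ... original kernel body ...
}
```

The dummy write matters: a shared array that is provably unused may be optimized away, and with it the occupancy limit.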

Basic kernel description

Below I’ve extracted what I think are the interesting parts of my code. The main feature / curse is that the kernel uses a large number of registers, but that shouldn’t introduce random errors due to the context switching that occurs when there are 2 active blocks.

void main(args)
{
    float* in;
    float* out;

    myKernel<0><<<grid, block>>>(in, out);   // do 10 iterations
    myKernel<10><<<grid, block>>>(out, out); // do another 10; notice that we pick up where the last kernel wrote its data
    myKernel<20><<<grid, block>>>(out, out);
    myKernel<30><<<grid, block>>>(out, out);
}

const int some_constant = ....;

/*
 * myKernel - Reads data into on-chip memory, performs a few iterations,
 * writes the answer back to global memory.
 */
template<int iteration>
__global__ void myKernel(float* in, float* out)
{
    // Store data in on-chip registers
    float reg_vals[some_constant];

    // read data
    #pragma unroll
    for (int k = 0; k < some_constant; k++)
    {
        reg_vals[k] = in[blockIdx.x*blockSize + threadIdx.x + k*blockWidth];
    }

    // Do, for example, 10 iterations on-chip; use 'iteration' to keep track...

    // Write back to global
    #pragma unroll
    for (int k = 0; k < some_constant; k++)
    {
        out[blockIdx.x*blockSize + threadIdx.x + k*blockWidth] = reg_vals[k];
    }
}



That should be a decent problem description. My hope is that someone will recognize the issue and tell me what I haven’t thought of. Otherwise I will have to produce a repro case…

Thank you for your time!

You haven’t shown the code that produces the output at the top of the post. What are “errors” in this context? Also, you use the expression “mistakes in syncing” in your problem description, but I see nothing that could be construed as synchronization anywhere in that “code”. Is there synchronization? What level does it operate at?

If you are looking for a quick suggestion, break the input and output dependencies in the kernel launches by “flip-flopping” two pointers between launches. My first reaction is a read-after-write coherence problem when a kernel simultaneously uses the same piece of memory for input and output. If you have a subtle indexing fault or something similar, the character of the “errors” might change.
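The flip-flop suggestion would look roughly like this against the launch sequence in the original post (bufA/bufB are illustrative names, allocated to the full problem size):

```cuda
// Sketch: break the in-place dependency by ping-ponging two device buffers,
// so no single launch both reads and writes the same array.
float *bufA, *bufB;                         // both cudaMalloc'd, bufA holds the input
myKernel<0><<<grid, block>>>(bufA, bufB);   // read A, write B
myKernel<10><<<grid, block>>>(bufB, bufA);  // read B, write A
myKernel<20><<<grid, block>>>(bufA, bufB);
myKernel<30><<<grid, block>>>(bufB, bufA);  // final result ends up in bufA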

Each block works on an independent piece of data. The error is that some block computes an erroneous output, which sometimes happens when the number of active blocks per SM is greater than one.

There is a lot of syncing within the block going on ( __syncthreads() ). I’m not sure there is any point in my trying to extract these details; I think it is better to work on a simple repro case.

Yes, I suspected this was an issue early on and tried flip-flopping them, but to no avail.

An indexing fault would certainly be able to introduce some errors. But remember, the error doesn’t show up if I force #active_blocks / SM == 1.

Perhaps it is pointless posting this “useless code”, since it in no way describes the whole picture, or rather the details that I might have missed.

Thanks for your reply!

Do you use shared memory? You may be spoiling the memory of another block on the same streaming multiprocessor. Actually, I do not know whether shared memory is protected.
And what does block-level sync mean? Is it sync between threads in a block, or between blocks?

Sorry, I think I didn’t express my previous answer clearly. I meant to say that there is a lot of synchronization within the blocks ( “__syncthreads()” ).

Yes, from my understanding, attempting to synchronize across the grid is best avoided, as it can often cause deadlocks, etc.

I’ll reply to you again since you edited and added more questions.

Yes, I use a bit of shared memory for communication. I had a similar idea but couldn’t find any evidence of incorrect addressing. If I were causing some sort of out-of-bounds indexing problem where I accessed the other active block’s SMEM, I guess I would also see the error when #active_blocks == 1, since it would still be out of bounds?

Syncing between threads in a block. I don’t think there are many successful cases of syncing between blocks? I’ve seen people post about it, but I’m doubtful they’ve been very successful.
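For reference, the in-block synchronization pattern being discussed looks like this; a missing barrier between a shared-memory write phase and a read phase is the classic cause of exactly this kind of intermittent error (the tile size and access pattern here are illustrative only):

```cuda
// Sketch of correct intra-block sync around shared memory.
__shared__ float tile[256];

tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
__syncthreads();   // all writes to tile must be visible before any thread reads
float neighbor = tile[(threadIdx.x + 1) % blockDim.x];
__syncthreads();   // needed again before tile is overwritten in a later phase
```

Note that __syncthreads() only orders threads within one block; it says nothing about other blocks, whether they run on the same SM or not.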

“If I were causing some sort of out-of-bounds indexing problem where I accessed the other active block’s SMEM, I guess I would also see the error when #active_blocks == 1, since it would still be out of bounds?”

The error may somehow be hidden. We just need to know whether a block’s shared memory is protected. Also, with wrong indexing, the number of blocks on one SM can cause strange behaviour in global memory accesses. Numbers could get spoiled badly enough to become not-a-number.

My guess is that it’s not protected. Maybe someone here could be kind enough to tell us.

What kind of strange behaviour? Is it some kind of driver / HW bug? Can you please explain to me how the addressing would differ if an SM executed two blocks in series or in parallel?

The addressing scheme has no global read-after-write dependency; it’s all constants, blockIdx.x, threadIdx.x, etc.

Thank you for your input.

On pre-Fermi hardware, it isn’t protected (or if it is, the protection is pretty minimal). On Fermi, it seems to be protected: an out-of-bounds shared memory access will trigger an unspecified launch failure error.
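On any architecture it is worth checking the launch status explicitly after each kernel, since that is where an unspecified launch failure would surface. A minimal sketch (on older toolkits contemporary with this hardware, cudaThreadSynchronize() plays the role of cudaDeviceSynchronize()):

```cuda
#include <cstdio>

// Sketch: catch both launch-configuration errors and errors raised
// asynchronously by the kernel itself.
myKernel<0><<<grid, block>>>(in, out);
cudaError_t err = cudaGetLastError();   // immediate launch errors
if (err == cudaSuccess)
    err = cudaDeviceSynchronize();      // errors from kernel execution
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```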

“Can you please explain to me how the addressing would be different if an SM executed two blocks in series or in parallell ?”

I mean that if two threads write to the same global memory location at once, the result may differ depending on whether they belong to the same SM or to different SMs.
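A minimal illustration of that kind of race (hypothetical, not taken from the code above):

```cuda
// Two threads, possibly in different blocks, storing to the same location:
out[0] = someValue;   // non-deterministic: last writer wins, and the winner
                      // can depend on how blocks are scheduled across SMs

// If the intent is accumulation, an atomic makes the result well-defined:
atomicAdd(&out[0], someValue);  // float atomicAdd needs compute capability
                                // >= 2.0, so it is unavailable on a GTS 250
```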

Yes, that would often cause trouble.

Anyway, I liked your SMEM idea; I’m going to have another paranoid look at it…