random error when more than 1 active block do you recognize this?

Jimmy_Pettersson · April 27, 2010, 9:20am

Hi,

I’m looking for some advice on my problem, I’m basically out of ideas as to how to get around this.

Problem description

Each block is running on data independent data chunks. When the #active_blocks / SM > 1 and problem size is big enough errors start being introduced into the output. See for example my log file:

------- 2010-4-21 : 21:25 --------------

Fail! : nof_blocks 47 num_err_blocks 1

Error block index: 17,

------- 2010-4-21 : 21:26 --------------

Success! : nof_blocks 47 num_high_error_els 0

------- 2010-4-21 : 21:28 --------------

Success! : nof_blocks 47 num_high_error_els 0

------- 2010-4-21 : 21:28 --------------

Fail! : nof_blocks 47 num_err_blocks 1

Error block index: 12,

------- 2010-4-21 : 21:29 --------------

Success! : nof_blocks 47 num_high_error_els 0

Or for bigger problems:

------- 2010-4-25 : 21:14 --------------

Success! : nof_blocks 511 num_high_error_els 0

------- 2010-4-25 : 21:22 --------------

Success! : nof_blocks 511 num_high_error_els 0

------- 2010-4-25 : 21:24 --------------

Success! : nof_blocks 511 num_high_error_els 0

------- 2010-4-25 : 21:49 --------------

Success! : nof_blocks 511 num_high_error_els 0

------- 2010-4-25 : 21:59 --------------

Fail! : nof_blocks 511 num_err_blocks 1

Error block index: 1,

------- 2010-4-25 : 22:02 --------------

Fail! : nof_blocks 511 num_err_blocks 3

Error block index: 42, 170, 359

My test card was a GTS250 running on recently downloaded drivers.

I initally suspected this was due to mistakes in syncing which usually causes these kinds of errors that appear and disappear. But after having tested with #blocks == #SMs i didnt see the error anymore which lead me to to start thinking that there was something happening when there were more than one active block.

Fix

For example make the blocks unnecessarily big (large smem usage) so that only one block fits on each SM at a time. Thus i can run with 1000s of blocks but each SM is only allowed to process one thread block at a time. This guarantees a 100% success rate, but it gives me over 30% performance penalty since it obviously becomes harder to hide the off-chip latencies.

Basic kernel description

Below I’ve extrapolated what i think are the interesting parts of my code. The main feature / curse is that the kernel uses a large amount of registers but this shouldn’t introduce random errors because of the context switching being done when there are 2 active blocks.

void main (args)

{

float* in;

float* out;

.........

myKernel<0><<<grid,block>>>(in, out); // do 10 iterations

myKernel<10><<<grid,block>>>(out, out); // do another 10, notice that we pick up were the last kernel wrote its data.

myKernel<20><<<grid,block>>>(out, out);

myKernel<30><<<grid,block>>>(out, out);

.........

}

...............

const int some_constant = ....;

/*

* myKernel - Reads data onto on-chip memory, performs a few iterations, writes the answer back to global memory.

*

*

*/

template<int iteration>

_global__ void myKernel(float* in, float* out)

{

// Store data in on-chip registers

float reg_vals[some_constant];

// read data

#pragma unroll

for(int k = 0; k < some_constant; k++)

{

   reg_vals[k] = in[blockIdx.x*blockSize + threadIdx.x + k*blockWidth];

}

// Do, for example 10 iterations, on-chip, use 'iteration' to keep track...

....

.....

// Write back to global

#pragma unroll

for(int k = 0; k < some_constant; k++)

{

   out[blockIdx.x*blockSize + threadIdx.x + k*blockWidth] = reg_val[k];

}

}

That should be a decent problem description. My hope is that someone will recognize the issue and tell me what I haven’t thought of. Otherwise i will have to produce a repro case…

Thank you for your time!

avidday · April 27, 2010, 9:39am

You haven’t shown the code that produces the output at the top of the post. What are “errors” in this context? Also you use the expression “mistaking in syncing” in your problem description, but I see nothing that could be construed as synchronization anywhere in that “code”. Is there synchronization? What level does it operate at?

if you are looking for a quick suggestion, break the input and output dependencies in the kernel launches by “flip flopping” two pointers between launches. My first reaction is read after write coherence problems when a kernel is simultaneously using the same piece of memory for input and output. If you have a subtle indexing fault or something similar, the character of the “errors” might change.

Jimmy_Pettersson · April 27, 2010, 9:55am

Each block is working on an independent data piece. The error is when some block computes an erroneous output, which sometimes happens when the number of active blocks per SM is greater than one.

There is a lot of syncing within the block going on ( __syncthreads() ). I’m not sure there is any point in my trying to extrapolate these details, then i think it is better to work on a simple repro case.

Yes I suspected this was an issue early on, tried flip flopping them but to no avail here.

An indexing fault would certainly be able to introduce som erros. But remember the error doesnt show up if I force #active_blocks / SM == 1.

Perhaps it is pointless posting this “useless code” since it in no way describes the whole picture or rather the details that i might have missed.

Thanks for your reply!

Lev · April 27, 2010, 11:06am

Do you use shared memory? You probably may spoil memory of another block in streaming processor. Anctually I do not know, if shared memory protected.
And what means block level sync? Is it sync between threads in block or between blocks?

Jimmy_Pettersson · April 27, 2010, 11:11am

Sorry, I think i didnt express my previous answer clearly. I meant to say that there was a lot of synchronization within the the blocks ( “__syncthreads()” ).

Yes, from my understanding attempting to synchronize across the grid is best avoided as it often can cause dealocks etc,.

Jimmy_Pettersson · April 27, 2010, 12:32pm

Will reply you again since you edited and added more questions.

Yes, i use a bit of shared memory for communication. I had a similar idea but couldn’t find any evidence of incorrect addressing. If i was causing some sort of index out of bounds problem where i was accessing the other active blocks SMEM i guess that i would also see the error when #active_blocks == 1, since it too would be out of bounds?

Syncing between threads in a block. I don’t think there are many successful cases of syncing between blocks? I’ve seen people post about it but I’m doubtful if they’e been very successful.

Lev · April 27, 2010, 1:14pm

“If i was causing some sort of index out of bounds problem where i was accessing the other active blocks SMEM i guess that i would also see the error when #active_blocks == 1, since it too would be out of bounds?”

The error maybe somehow hidden. Just need to know is block shared memory protected. Also number of blocks in one sm can cause strange behaviour with global memory access with wrong indexing. Numbers could be so spoiled that they become not-a-number.

Jimmy_Pettersson · April 27, 2010, 1:25pm

My guess is that it’s not protected. Maybe someone here could be kind enough to tell us.

What kind of strange behaviour? Is it some kind of driver / HW bug ? Can you please explain to me how the addressing would be different if an SM executed two blocks in series or in parallell ?

The adressing scheme has no global read after write dependency, it’s all constants, blockIdx.x, and threadIdx.x etc,.

Thank you for your input.

avidday · April 27, 2010, 1:43pm

On pre-Fermi hardware, it isn’t protected (or if it is the protection is pretty minimal). On Fermi, it seems to be protected - out of bounds shared memory access will trigger an unspecified launch failure error.

Lev · April 27, 2010, 2:39pm

“Can you please explain to me how the addressing would be different if an SM executed two blocks in series or in parallell ?”

I mean that if two threads write on same global memory location at once, result maybe different if they belong to one sm than if they are on different sm.

Jimmy_Pettersson · April 27, 2010, 3:47pm

Yes that would often cause trouble.

Anyways, i liked your smem idea, im gonna have another paranoid look at it…

Topic		Replies	Views
More blocks than SMs may not make sense CUDA Programming and Performance	13	2674	November 11, 2010
Block synchronization issue on Fermi Global memory doesn't seem to by in sync across thread bloc CUDA Programming and Performance	18	3196	April 7, 2011
life span of shared memory CUDA Programming and Performance	15	6953	April 27, 2011
Annoying problems with memory and/or syntax CUDA Programming and Performance	19	4769	April 8, 2008
Using Shared Memory in CUDA C/C++ Technical Blog	36	1991	October 8, 2020
What will be happen in the situation CUDA Programming and Performance	9	6241	December 23, 2008
Concurrently kernels running on one device CUDA Programming and Performance	17	2737	March 2, 2010
why result varied based on different number of threads per block? CUDA Programming and Performance	8	1941	March 1, 2011
help with kernel synchronization? CUDA Programming and Performance	22	13900	August 26, 2010
Shared memory, not being freed! Shared memory not cleard over blocks or runs!! CUDA Programming and Performance	16	3752	September 22, 2009

random error when more than 1 active block do you recognize this?

Related topics