CUDA sync/timing? bug: intermittent misbehaviour on GTX 295, device 1

I get strange intermittent behaviour from a GTX 295.

The first indication was that, every so often, it produced different answers from a Tesla.

Device 0 produces the same answers as the Tesla, while device 1 sometimes produces a difference of 1 (plus or minus), and once a difference of 2.

Currently I am running identical calculations on device 0 and device 1 of the GTX 295 and comparing the results they give.

I think the problem may lie in the code at the end of the outer loop, which adds the values of “fitness” produced by the different threads together and returns the sum to the host Linux PC. This is because if each thread writes its own partial sum to global memory and the partial sums are added together on the host, the problem is never seen.

Similarly, if all the calculations are done by one thread (so there is no reduction phase), the problem is again never seen.

I had been using Harris’ log(n) reduction code, but the following code is simpler (if less efficient?) and shows the problem just as my copy of Harris’ code did.

With a block size of 1 or 32, no errors are seen. With a block size of 4 or 256, every so often device 1 returns an answer 1 bigger or 1 smaller than the answer returned by device 0.

(offs = 14; was added to see if I could provoke the error into being more frequent. No difference is seen.)

[codebox]extern __shared__ unsigned int shared_array[];

static __global__ void evalpopKernel(
int *d_Output,
unsigned char* Pop,
const int prog0,
unsigned int N,
int gen, int ntrain, const unsigned int npar2,
const int order, const int LEN)
{
const int npar = 1<<npar2; //number of parallel fitness cases
const int ntrainhalf = ntrain>>1;
const unsigned int first_pid = (MUL(blockDim.x, blockIdx.x) + threadIdx.x)>>npar2;
const unsigned int threadN   =  MUL(blockDim.x, gridDim.x) >> npar2;
for(unsigned int prog = first_pid; prog < N; prog += threadN){
  int fitness = 0;
  for(unsigned int index = threadIdx.x & (npar-1); index < ntrain; index+=npar){
...lots of access to shared memory...
fitness += stuff...
  }//endfor testcases
  __syncthreads(); //wait until everyone has finished with shared memory
  int* shared = (int*) shared_array;
  const int offs = 14;
  shared[offs+threadIdx.x] = fitness;

  if(threadIdx.x==0) { //assume only one prog per block
    int all = 0;
    for(int i = 0; i<npar; i++) all += shared[offs+i];
    d_Output[prog] = all;
  }

  __syncthreads(); //don't let other threads use shared memory until all ready
}//endfor prog
}

evalpopKernel<<<grid_size, block_size, block_size*15*sizeof(unsigned int)>>>(d_Output, Pop, 0, 262144, gen, 256, npar2, 37, 1024);[/codebox]


npar2 is set to log2(block_size); e.g. for block_size=32, npar2 will be 5.

grid_size is set so that grid_size*block_size = 262144 (subject to the maximum grid size of 65535 blocks).

With a block_size of 32, the kernel takes up to 57 seconds.

The maximum temperatures observed are 67C (device 0) and 70C (device 1).


Has anybody seen anything similar?

On GTX 295? Or other GPU?

Could this be a hardware failure?

Any suggestions for checking the card itself?

Any help or comments would be most welcome.


    Dr. W. B. Langdon,

    Department of Computer Science,

    King's College London,

    Strand, London, WC2R 2LS, UK


FOGA 2011

CIGPU 2010

A Field Guide to Genetic Programming




GP Bibliography