Need help with kernel execution parameters

Hello, I’m having a very strange issue with a simple kernel I’m trying to write.
The idea is basically to follow this pseudo-code:

1. Create Random stuff#1
2. for i = 0; i < 100; i++
3.   create Random stuff#2
4.   executeKernel with stuff#1 and stuff#2
5.   copy stuff#1 back to host
6.   work on stuff#1
7.   copy stuff#1 back to device
8.   end-while
9. write output to file

While it seems very simple to me, something is going wrong: I can execute the loop with absolutely no problem the first time, but from the second iteration onward the kernel seems unable to execute completely, as if it were operating on some other data or hitting a kernel launch failure.
I’ve tried running it with cuda-gdb, but no error comes out (and I’m also checking for all errors).
Here’s a snippet of the code (the part I think is relevant):

//main.cu
#define CUDA_CALL(value) {                                              \
        cudaError_t _m_cudaStat = value;                                \
        if (_m_cudaStat != cudaSuccess) {                               \
                fprintf(stderr, "Error %s at line %d in file %s\n",     \
                        cudaGetErrorString(_m_cudaStat), __LINE__, __FILE__); \
                exit(1);                                                \
        } }

int main(int argc, char *argv[]){
...
  CUDA_CALL( cudaEventCreate(&start) );
  CUDA_CALL( cudaEventCreate(&stop) );
  CUDA_CALL( cudaEventRecord(start,0) );
  //Allocate the memory and set it up
  CUDA_CALL( cudaMalloc((void**)&d_cas,casMem) );
  CUDA_CALL( cudaMemset(d_cas,'0',casMem) );
  createRandomCA<<<512,512>>>(d_cas,time(NULL));
  CUDA_CALL( cudaGetLastError() );
  CUDA_CALL( cudaMalloc((void**)&d_ics,icsMem) );
  CUDA_CALL( cudaMalloc((void**)&d_binIcs,binIcsMem) );
  CUDA_CALL( cudaMalloc((void**)&d_bestCA,bestCAMem) );
  CUDA_CALL( cudaMemset(d_binIcs,'0',icsMem) );
  CUDA_CALL( cudaMemset(d_ics,'0',icsMem) );
  createRandomIC<<<512,512>>>(d_binIcs,time(NULL),MAXBINICS,DIST_BINOMIAL);
  CUDA_CALL( cudaGetLastError() );
  CUDA_CALL( cudaDeviceSynchronize() );
...
  for(int g=0;g<100;g++){
    CUDA_CALL( cudaGetLastError() );
    CUDA_CALL( cudaDeviceSynchronize() );
    createRandomIC<<<512,512>>>(d_ics,time(NULL),MAXICS,DIST_UNIFORM);
    CUDA_CALL( cudaGetLastError() );
    CUDA_CALL( cudaDeviceSynchronize() );
    //Start the CA execution
    dim3 blocks(512,512);
    dim3 threads(32,32);
    runCAs<<<blocks,threads>>>(d_cas,d_ics);
    CUDA_CALL( cudaGetLastError() );
    CUDA_CALL( cudaDeviceSynchronize() );
    CUDA_CALL( cudaEventRecord(stop,0) );
    CUDA_CALL( cudaEventSynchronize(stop) );
    CUDA_CALL( cudaEventElapsedTime(&elapsedTime,start,stop) );
    printf("[*] Execution took %f ms\n",elapsedTime); fflush(stdout);
    CUDA_CALL( cudaDeviceSynchronize());
    CUDA_CALL( cudaEventRecord(start,0) );
    CUDA_CALL( cudaMemcpy(h_cas,d_cas,casMem,cudaMemcpyDeviceToHost) );
    CUDA_CALL( cudaEventRecord(stop,0) );
    CUDA_CALL( cudaEventSynchronize(stop) );
    CUDA_CALL( cudaEventElapsedTime(&elapsedTime,start,stop) );
    printf("[*] Copy took %f ms\n",elapsedTime); fflush(stdout);
    printf("[*] Crossing and mutating..."); fflush(stdout);
    crossOver(h_cas);
    printf("OK (best:%d, %d - %d)\n",h_cas[0].fitness,h_cas[0].lowFitness,h_cas[0].highFitness);
    CUDA_CALL( cudaEventRecord(start,0) );
    CUDA_CALL( cudaMemcpy(d_cas,h_cas,casMem,cudaMemcpyHostToDevice) );
    CUDA_CALL( cudaEventRecord(stop,0) );
    CUDA_CALL( cudaEventSynchronize(stop) );
    CUDA_CALL( cudaEventElapsedTime(&elapsedTime,start,stop) );
    printf("[*] Copy took %f ms\n",elapsedTime); fflush(stdout);
  }
...
  CUDA_CALL( cudaDeviceSynchronize());
  CUDA_CALL( cudaEventRecord(start,0) );
  CUDA_CALL( cudaMemcpy(h_cas,d_cas,casMem,cudaMemcpyDeviceToHost) );
  CUDA_CALL( cudaEventRecord(stop,0) );
  CUDA_CALL( cudaEventSynchronize(stop) );
  CUDA_CALL( cudaEventElapsedTime(&elapsedTime,start,stop) );
  printf("[*] Copy took %f ms\n",elapsedTime); fflush(stdout);
  //Write to file
...

And here’s the kernel and the CA structure

//ca.cu
#define RULESIZE 128
#define RADIUS 3
#define NEIGHBOURHOOD (2*RADIUS+1)
#define LATSIZE 149
#define RUNTIME (2*LATSIZE)
#define MAXICS 100
#define MAXBINICS 10000
#define MAXCAS 100

#define DIST_UNIFORM 0
#define DIST_BINOMIAL 1

typedef struct CA{
  unsigned char rule[RULESIZE];
  unsigned int fitness;
  unsigned int highFitness;
  unsigned int lowFitness;
}CA;

typedef struct IC{
  unsigned char lattice[LATSIZE];
  unsigned int density;
}IC;
...
__global__ void runCAs(CA *cas, IC * ics){
  int caId = threadIdx.x + blockDim.x*blockIdx.x;
  while(caId < MAXCAS){
    int icId = threadIdx.y + blockDim.y*blockIdx.y;
    cas[caId].fitness=cas[caId].lowFitness=cas[caId].highFitness=0;
    while(icId < MAXICS){
...
      icId += blockDim.y*gridDim.y;
    }
    caId += blockDim.x*gridDim.x;
  }
}

And yes, this is a Genetic Algorithm using Cellular Automata that I’m studying.
As I said, the biggest problem is that it gives no error whatsoever, but no matter how many generations I run, the last 21 CAs always have a fitness of 808464432, which obviously means they weren’t even run, since that’s an impossible value for the size of the problem.

If you need more code, please let me know.

Thank you very much!

Your description suggests that the unexpected runtime behavior is probably not due to a CUDA issue. If I understand correctly, the application runs to completion and no errors are reported from any of the checked API calls (and you are checking the status of every single API call and kernel launch).

Here is a suggestion that may seem quaint and old-fashioned, but in my experience works very well. Add printfs to your code at key decision points that log the variables used in branch and loop control. When you run the app, the output of these printfs will create a log of runtime activity, and you will likely be able to pinpoint very quickly at what point things start going wrong (variables taking on unexpected values).

Addendum:

It seems your macro for checking errors on kernel launches may not be catching all errors. There are both configuration errors, which are reported synchronously, and runtime errors, which are reported asynchronously. Try inserting the following immediately after each kernel launch; it catches problems of both types:

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file %s in line %i : %s.\n",  \
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit (EXIT_FAILURE);                                          \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file %s in line %i : %s.\n",  \
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

Thanks njuffa, it turns out that in the end I had a race condition in my kernel.
Now it’s (kinda) fixed, but I’ll need to optimize a little more… it’s still taking a long time (1.5 sec for some very simple stuff).

Thanks for all your help.