Kernel works on GTX 280/295/480 but not on C2050: unspecified launch failure

I will do that, thanks!

Now :wacko: :wacko:

This does not work:

[codebox]

if(tid==0) printf("[0].w = %f \n",Particles[max_number_of_particles_per_cell-1 -tid].w);

[/codebox]

And this does work:

[codebox]

if(tid==0) printf("[0].w = %f \n",Particles[max_number_of_particles_per_cell-1].w);

[/codebox]

tid is an integer! I give up!

Are you sure there aren’t any unsigned int versus int conflicts anywhere? int(0-1) = -1, but unsigned int(0-1)=2^32 -1.

Hmm, I changed every int in my kernel to unsigned int and also tried the opposite; neither worked. I also printed the indices to the terminal via printf (without accessing the array), and none of them were out of bounds.

You need to be careful about judging correctness just on the basis of printf - it will implicitly cast to whichever format you specify and may mask errors.

Ok, but how can the lines in post #11 ever be a problem? Zero should be zero independent of the type casting, right?

It completely depends on precisely how tid and max_number_of_particles_per_cell are defined. Post a concise repro case that illustrates the problem, otherwise this discussion is going nowhere.

Ok, so I made a minimal sample that does not run on the Tesla cards but runs on all the others. I deleted most of the unnecessary stuff; it's still long…

[codebox]

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

#include <time.h>

#include <float.h>

#include <fenv.h>

// includes, cuda

#include <cutil.h>

#include <cuda_runtime.h>

#include <cuda.h>

#include <cutil_inline.h>

#define blocksize 128

#define number_of_gridpoints_per_cell 4

#define number_of_cells 10

#define max_number_of_particles_per_cell 10000

__global__ void presort_neighbors(float4* Particles, int* Number_of_rejected_particles, int* Number_of_particles_in_cell, int* GPU_left_escape, int* GPU_right_escape, int charge)

{

__shared__ int shared_index[4];

int tid = threadIdx.x;

int cell = blockIdx.x;

int particle_offset = blockDim.x;

if(tid==0) shared_index[0] = Number_of_particles_in_cell[cell]; //current number in this cell

if(tid==1) //get particles from left

{

  if(cell>0) shared_index[1] = Number_of_rejected_particles[cell-1];			//left cell

  else

  {

  shared_index[1] = 0;									//counter for escape in first (left) cell

  }

}

if(tid==2) //get particles from right

{

  if(cell<number_of_cells-1) shared_index[2] = Number_of_rejected_particles[cell+1];	//right cell

  else

  {

  shared_index[2] = 0; 									//counter for escape in last (right) cell

  }

}

if(tid==3)

{

if((cell==0)||(cell==number_of_cells-1))

{

  shared_index[3] = Number_of_rejected_particles[cell];

}

else

{

  shared_index[3] = 0;

}

}

__syncthreads();

if(tid < particle_offset/2)

{

if(cell>0) //get particles from left

  {

while(tid < shared_index[1])	//get particles with half of threads from the left until all rejected ones from the left are done

{

  float4 particle = Particles[(cell-1)*max_number_of_particles_per_cell + max_number_of_particles_per_cell - tid -1];

  int new_particle_cell = __float2int_rd(particle.w / ((float)(number_of_gridpoints_per_cell-1)));

  if(new_particle_cell==cell)

  {

      int new_particle_index_in_current_cell = atomicAdd(&shared_index[0],1);

      Particles[cell*max_number_of_particles_per_cell + new_particle_index_in_current_cell] = particle;

  }

  tid += particle_offset/2;

}

  }

else //if cell == 0 → count left escapes

  {

while(tid < shared_index[3])	//get particles with half of threads from the left until all rejected ones from the left are done

{

// if(tid==0) printf("[0].w = %f \n",Particles[max_number_of_particles_per_cell - 1].w); //works

  if(tid==0) printf("[0].w = %f \n",Particles[max_number_of_particles_per_cell - tid - 1].w);		//Doesn't work

  if(tid>0)

  {

      float4 particle = Particles[max_number_of_particles_per_cell -1 -tid];

      printf(".w = %f \n",particle.w);

      //new:

      int new_particle_cell = __float2int_rd(particle.w / ((float)(number_of_gridpoints_per_cell-1)));

      if(new_particle_cell < 0)

      {

	  atomicAdd(&shared_index[1],1);

      }

  }

  tid += (int)particle_offset/2;

}

  }

}

else

{

//not needed here

}

__syncthreads();

//

//not needed here

//

}

int main(int argc, char** argv)

{

cudaError_t error = cudaSetDevice(0);

if(error == cudaSuccess) printf("device set!\n");

else

{

printf("device not found!\n");

exit(1);

}

srand48(time(0)); //initialize RNG

int cell, particle, gridpoint; // loop counters

int total_number_of_particles = 50000;

int init_number_of_particles = total_number_of_particles / number_of_cells; //per cell

printf("%d particles per cell \n", init_number_of_particles);

float4* GPU_particle_array;

CUDA_SAFE_CALL(cudaMalloc((void**)&GPU_particle_array, max_number_of_particles_per_cell * number_of_cells * sizeof(float4)));

float4* CPU_particle_array;

CPU_particle_array = (float4*) malloc(max_number_of_particles_per_cell * number_of_cells * sizeof(float4));

int* GPU_number_of_rejected_particles;

CUDA_SAFE_CALL(cudaMalloc((void**)&GPU_number_of_rejected_particles, number_of_cells * sizeof(int)));

int* GPU_number_of_particles_per_cell;

CUDA_SAFE_CALL(cudaMalloc((void**)&GPU_number_of_particles_per_cell, number_of_cells * sizeof(int)));

int* CPU_number_of_rejected_particles;

CPU_number_of_rejected_particles = (int*) malloc(number_of_cells*sizeof(int));

int* CPU_number_of_particles_per_cell;

CPU_number_of_particles_per_cell = (int*) malloc(number_of_cells*sizeof(int));

int *GPU_left_particle_escape, *GPU_right_particle_escape;

CUDA_SAFE_CALL(cudaMalloc((void**)&GPU_left_particle_escape, sizeof(int)));

CUDA_SAFE_CALL(cudaMalloc((void**)&GPU_right_particle_escape, sizeof(int)));

int CPU_left_particle_escape, CPU_right_particle_escape;

for(cell=0; cell < number_of_cells; cell++)

{

for(particle=0; particle<max_number_of_particles_per_cell; particle++)

{

  CPU_particle_array[cell*max_number_of_particles_per_cell + particle].x = drand48()-0.5f;

  CPU_particle_array[cell*max_number_of_particles_per_cell + particle].y = drand48()-0.5f;

  CPU_particle_array[cell*max_number_of_particles_per_cell + particle].z = drand48()-0.5f;

  CPU_particle_array[cell*max_number_of_particles_per_cell + particle].w = (float)cell*(number_of_gridpoints_per_cell-1) + drand48()*(number_of_gridpoints_per_cell-1);

}

CPU_number_of_particles_per_cell[cell] = init_number_of_particles;

}

//put 10 "wrong" particles at the top of each cell:

for(cell=0; cell < number_of_cells; cell++)

{

for(particle=max_number_of_particles_per_cell-1; particle>max_number_of_particles_per_cell-11; particle--)

{

  CPU_particle_array[cell*max_number_of_particles_per_cell + particle].w -= 1.0f;

}

CPU_number_of_rejected_particles[cell] = 10;

}

CUDA_SAFE_CALL(cudaMemcpy(GPU_number_of_particles_per_cell, CPU_number_of_particles_per_cell, number_of_cells * sizeof(int), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(GPU_number_of_rejected_particles, CPU_number_of_rejected_particles, number_of_cells * sizeof(int), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(GPU_particle_array, CPU_particle_array, max_number_of_particles_per_cell * number_of_cells * sizeof(float4), cudaMemcpyHostToDevice));

int zero = 0;

CUDA_SAFE_CALL(cudaMemcpy(GPU_left_particle_escape, &zero, sizeof(int), cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(GPU_right_particle_escape, &zero, sizeof(int), cudaMemcpyHostToDevice));

presort_neighbors <<< number_of_cells,blocksize >>> (GPU_particle_array, GPU_number_of_rejected_particles, GPU_number_of_particles_per_cell, GPU_left_particle_escape, GPU_right_particle_escape, 0);

cutilSafeCall( cudaThreadSynchronize() );

printf("Program finished.\n");

return 0;

}

[/codebox]

As I suggested - out of bounds memory access:

avidday@cuda:~$ cuda-memcheck  ./a.out 

========= CUDA-MEMCHECK

5000 particles per cell 

philippe82.cu(199) : cudaSafeCall() Runtime API error : unspecified launch failure.

========= Invalid __global__ read of size 4

=========	 at 0x000003f0 in presort_neighbors

=========	 by thread (0,0,0) in block (0,0)

=========	 Address 0xfd002270fc is out of bounds

=========

========= ERROR SUMMARY: 1 error

(This is on a GTX 470, CUDA 3.2 RC, Ubuntu 9.04.) Fix your code.

Ok, thank you, but do you understand where the problem is?

So it's thread 0 in block 0 - that is what I figured out, too.
I wrote a "10" into all values of the array "Number_of_rejected_particles", so after the first __syncthreads() there is a 10 in shared_index[3].
Then thread 0 runs into the "else" part, and then into the if(tid==0) part.
There it reads the value of Particles[max_number_of_particles_per_cell - tid - 1].w, which is Particles[9999].w.
The Particles array is much larger than this. I don't understand where the problem is.

Edit: if I type "abs(tid)" instead of just "tid", it works, too… ?

My only suggestion is: don't do that sort of evaluation directly inside the arguments of a printf. It really isn't necessary and makes the compiler's life harder than it has to be. There are probably all sorts of templating layers and other machinery involved in getting the varargs mechanism compiled.
