same code different results on Quadro 3000M and Tesla C1060


the below code works on Tesla C1060 but does not work on my mobile workstation with a Quadro 3000M.
This is mainly what the code does:

__global__ void gpuKernel(
  BYTE *src, float *aux,        /* source, destination */
  size_t memPitch,              /* memory pitch of src (bytes) */
  size_t memPitchAux,           /* memory pitch of aux (bytes) */
  int w, int h,                 /* size of picture */
  BYTE *Rng, int sizeRng,       /* ranges */
  BYTE *Dir, int sizeDir)       /* directions */
{
  int rx = ((VEC2*)Rng)->x; // range of calculations (sizeXofwindow = 2*rx+1)
  int ry = ((VEC2*)Rng)->y; // range of calculations (sizeYofwindow = 2*ry+1)

  int i = blockIdx.x * blockDim.x + threadIdx.x; // x-coordinate of pixel = column in device memory
  int j = blockIdx.y * blockDim.y + threadIdx.y; // y-coordinate of pixel = row in device memory
  int idx = j * memPitchAux/sizeof(float) + i;

  int i0 = i-rx, i1 = i+rx; // the range of calculation for columns
  int j0 = j-ry, j1 = j+ry; // the range of calculation for rows
  float g = 0.0f, g0 = 0.0f;

  if ((i0 >= 0) && (i1 < w) && (j0 >= 0) && (j1 < h)) {
    g = 0.0f;
    for (int n = 0; n < 5; n++) {
      // computations.... (g0 and s are produced here)
      g = fmaxf(g, g0*s);
    }
    aux[idx] = g;
  } else {
    aux[idx] = 0;
  }
}

int main()
{
  checkCudaErrors(cudaMemcpy2D(data->BufGPU[0], data->memPitch,  /* to GPU (device) */
               data->BufCPU[0], data->imgPitch,                  /* from CPU (host) */
               data->imgWidth, data->imgHeight,                  /* size of data (image) */
               cudaMemcpyHostToDevice));

  dim3 dimBlock(data->dimBLX, data->dimBLY, 1);
  dim3 dimGrid(data->memPitch/dimBlock.x, data->imgHeight/dimBlock.y, 1);

  size_t memPitch = data->memPitch;
  float *auxD;
  size_t auxDPitch;
  size_t auxHPitch;

  auxHPitch = w*sizeof(float);
  float *auxH = (float *) malloc(h*auxHPitch);
  checkCudaErrors(cudaMallocPitch((void**)&auxD, &auxDPitch, auxHPitch, h));

  gpuKernel<<<dimGrid, dimBlock>>>( /* ..., */
               memPitch, auxDPitch, w, h /* , ... */ );

  checkCudaErrors(cudaMemcpy2D(auxH, auxHPitch,   // to CPU (host)
               auxD, auxDPitch,                   // from GPU (device)
               auxHPitch, h,                      // size of data (image)
               cudaMemcpyDeviceToHost));

The execution on the Quadro 3000M simply skips the kernel and outputs a blank image in a few ms.
The execution on the Tesla C1060 outputs a processed image in (say) 100 s. The weird thing is that in the last few days the execution on the C1060 has also become rather unpredictable (sometimes skipping the kernel, sometimes outputting weird numbers such as -10^12, …). I do not understand this behaviour.
Could it be a driver version problem?

Thank you in advance for helping.

ps. both machines on ubuntu 11.10

Quadro 3000M —> Cuda compilation tools, release 4.1, V0.2.1221
Tesla C1060 —> Cuda compilation tools, release 4.1, V0.2.1221

The Quadro 3000M is compute capability 2.1 while the C1060 is compute capability 1.3. Which flags are you using for compiling, and are there any functions supported by the Quadro 3000M but not by the C1060? Also check the launch configuration and the shared memory usage. The Quadro 3000M allows up to 1024 threads per block while the C1060 allows only 512.
Late edit: if I compile for arch 2.1 but run it on an arch 1.3 device, I get a bunch of NaNs. When I left the threads per block at 1024 and ran it on a 1.3 device, the kernels were not started.
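The per-device thread limit mentioned above can be checked at runtime (via `cudaGetDeviceProperties`, whose `cudaDeviceProp::maxThreadsPerBlock` field is 512 on compute capability 1.3 and 1024 on 2.x) and the block shrunk to fit. A minimal host-side sketch, with the helper name `clampBlock` purely illustrative:

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: shrink a 2-D block until it fits the device limit
// (512 threads/block on compute capability 1.3, 1024 on 2.x). The y
// dimension is halved first so that x can stay a full warp multiple.
std::pair<int, int> clampBlock(int bx, int by, int maxThreadsPerBlock) {
    while (bx * by > maxThreadsPerBlock && by > 1) by /= 2;
    while (bx * by > maxThreadsPerBlock && bx > 1) bx /= 2;
    return {bx, by};
}
```

With this kind of guard, a 32x32 block configured for the Quadro would be launched as 32x16 on the C1060 instead of failing to start.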

Why not debug the program to see where the error is?

Thank you for your reply.
I’m using a modified version of the SDK makefile to compile on both machines, where the only things I change are some directory paths (SDK, etc.).
I’ll check the number of threads and the smem and come back to you.

You may have an out-of-bounds shared memory access, which produces an error on a Fermi card but no error on a 1.3 card, and so on.

From other forums I gather that the Fermi architecture is less permissive when there is an error, like out-of-bounds accesses or insufficient resources, and it does not run at all. It is possible that there is an error which makes the program crash. As Lev suggested, an out-of-bounds access could crash the program on 2.x devices but not on 1.3 devices.
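The kind of defensive indexing that avoids such crashes can be sketched in plain C++ (the names `HIST_BINS`, `c`, and `bumpBin` are illustrative, not from the original code): an index derived from a pixel difference is range-checked before it is used to write into a table, so an out-of-range value is skipped instead of corrupting memory.

```cpp
#include <cassert>
#include <cmath>

const int HIST_BINS = 256;  // byte differences fall in 0..255

// Guarded histogram increment: an index outside the table is skipped
// instead of producing the out-of-bounds write that Fermi aborts on.
void bumpBin(float* c, float a, float b) {
    int ic = static_cast<int>(std::fabs(a - b));
    if (ic >= 0 && ic < HIST_BINS)
        c[ic] += 1.0f;
}
```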

Sorry, but yesterday I had to leave.
I ran the code with cuda-memcheck and I get lots of errors for misaligned memory accesses.

How could I solve this issue? I thought it was safe to use cudaMallocPitch(); the pitch is obtained dynamically, isn’t it?
The following is the core of the kernel computation (line 19 in my first post):

for (int js = jsB; js < swy-jsE; js++) {
    for (int is = isB; is < swx-isE; is++) {
        int ic = (int) fabsf(src[(j0+js)*memPitch + i0+is] - src[(j0+js+jsD)*memPitch + i0+is+isD]);
        c[ic] += 1.0f;
    }
}

So the issue could just be in the ic computation, right?
do you have any suggestions on how to proceed and on how to make this code safe across GPU generations?
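One pattern that stays safe across GPU generations is to treat the pitch strictly as a byte count: compute each row's base address on a `char*` and only then cast to the element type. (In the loop above `src` is a `BYTE*`, so byte arithmetic happens to be correct there; the pattern matters as soon as the buffer holds `float` or wider elements, like `aux`.) A minimal host-side sketch of the idea, with `rowPtr` a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>

// The pitch returned by cudaMallocPitch() is in BYTES. For an element type
// T, compute the row base on a char* first, then cast; adding j*pitch to a
// T* index instead is what produces misaligned accesses.
inline float* rowPtr(void* base, std::size_t pitchBytes, int j) {
    return reinterpret_cast<float*>(static_cast<char*>(base) + j * pitchBytes);
}
```

Inside a kernel the same expression works on the device pointer, e.g. `rowPtr(aux, memPitchAux, j)[i]` in place of the manual `j * memPitchAux/sizeof(float) + i` index (which silently truncates whenever the pitch is not a multiple of `sizeof(float)`).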
I did not write this code but I’m supposed to port it to a Tesla M2090 server, which has indeed a Fermi architecture too.

Thank you for your useful insights!

Misaligned memory accesses do not necessarily crash the program; they may only harm performance. You most likely get those accesses on 1.3 too.
You need to debug your program and see where the error is.

Misaligned memory accesses will produce incorrect results on compute capability 1.* devices and will result in a fatal exception on compute capability 2.* and 3.* devices.

The kernel casts BYTE* to VEC2*. VEC2* addresses must be aligned to the natural alignment of VEC2 (probably 4 or 8). It is possible to specify a packing of 1 for VEC2, which tells the compiler to generate only byte accesses, but the performance will be awful.

There are numerous other questionable memory operations in the kernel.

If the code is accessing float then the address must be a multiple of sizeof(float). If the code is accessing float2 then the address must be a multiple of sizeof(float2).
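Both rules can be made concrete in plain C++ (the VEC2 layout below is an assumption, since its definition is not shown in the thread): an alignment check tells you whether a cast like `(VEC2*)Rng` is legal at that address, and a `memcpy`-based load is an always-legal alternative when it is not.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef unsigned char BYTE;
struct VEC2 { int x, y; };   // assumed layout; natural alignment = alignof(int)

// True if p can legally be read as a T on hardware that enforces alignment.
template <typename T>
bool isAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Alignment-safe alternative to *(VEC2*)Rng: memcpy is legal regardless of
// where the pointer lands, at the cost of byte-wise loads when misaligned.
VEC2 loadVec2(const BYTE* p) {
    VEC2 v;
    std::memcpy(&v, p, sizeof(VEC2));
    return v;
}
```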

The first step is to fix all of your misaligned accesses.

“The execution on the Tesla C1060 outputs a processed image in (say) 100 s.” Sometimes. So that was luck with the misaligned accesses; it looks like I confused this with another type of misalignment (warps, etc.). Still, the variable rnd should probably be aligned, so the source of the misalignment is not in the presented source code.