same code different results on Quadro 3000M and Tesla C1060


the below code works on Tesla C1060 but does not work on my mobile workstation with a Quadro 3000M.
This is mainly what the code does:

__global__ void gpuKernel(
  BYTE *src, float *aux,        /* source, destination */
  size_t memPitch,              /* memory pitch of src (bytes) */
  size_t memPitchAux,           /* memory pitch of aux (bytes) */
  int w, int h,                 /* size of picture */
  BYTE *Rng, int sizeRng,       /* ranges */
  BYTE *Dir, int sizeDir)       /* directions */
{
  int rx = ((VEC2*)Rng)->x; // range of calculations (sizeXofwindow = 2*rx+1)
  int ry = ((VEC2*)Rng)->y; // range of calculations (sizeYofwindow = 2*ry+1)

  int i = blockIdx.x * blockDim.x + threadIdx.x; // x-coordinate of pixel = column in device memory
  int j = blockIdx.y * blockDim.y + threadIdx.y; // y-coordinate of pixel = row in device memory
  int idx = j * memPitchAux/sizeof(float) + i;

  int i0 = i-rx, i1 = i+rx; // the range of calculation for columns
  int j0 = j-ry, j1 = j+ry; // the range of calculation for rows
  float g = 0.0f, g0 = 0.0f;

  if ((i0 >= 0) && (i1 < w) && (j0 >= 0) && (j1 < h)) {
    g = 0.0f;
    for (int n = 0; n < 5; n++) {
      // computations.... (g0 and s are produced here)
      g = fmaxf(g, g0*s);
    }
    aux[idx] = g;
  } else {
    aux[idx] = 0;
  }
}

int main()
{
  checkCudaErrors(cudaMemcpy2D(data->BufGPU[0], data->memPitch,  /* to GPU (device) */
               data->BufCPU[0], data->imgPitch,                  /* from CPU (host) */
               data->imgWidth, data->imgHeight,                  /* size of data (image) */
               cudaMemcpyHostToDevice));

  dim3 dimBlock(data->dimBLX, data->dimBLY, 1);
  dim3 dimGrid(data->memPitch/dimBlock.x, data->imgHeight/dimBlock.y, 1);

  size_t memPitch = data->memPitch;
  float *auxD;
  size_t auxDPitch;
  size_t auxHPitch;

  auxHPitch = w*sizeof(float);
  float *auxH = (float *) malloc(h*auxHPitch);
  checkCudaErrors(cudaMallocPitch((void**)&auxD, &auxDPitch, auxHPitch, h));

  gpuKernel<<<dimGrid, dimBlock>>>( /* ..., */
               memPitch, auxDPitch, w, h /* , ... */ );

  checkCudaErrors(cudaMemcpy2D(auxH, auxHPitch,   // to CPU (host)
               auxD, auxDPitch,                   // from GPU (device)
               auxHPitch, h,                      // size of data (image)
               cudaMemcpyDeviceToHost));

The execution on the Quadro 3000M simply skips the kernel and outputs a blank image in a few ms.
The execution on the Tesla C1060 outputs a processed image in (say) 100 s. The weird thing is that in the last few days the execution on the C1060 has also become rather unpredictable (sometimes skipping the kernel, sometimes outputting weird numbers such as -10^12, …). I do not understand this behaviour.
Could it be a driver version problem?

Thank you in advance for helping.

ps. both machines on ubuntu 11.10

Quadro 3000M —> Cuda compilation tools, release 4.1, V0.2.1221
Tesla C1060 —> Cuda compilation tools, release 4.1, V0.2.1221

The Quadro 3000M is compute capability 2.1 while the C1060 is compute capability 1.3. Which flags are you using for compiling, and are there any functions supported by the Quadro 3000M but not by the C1060? Also check the launch configuration and the shared memory usage. The Quadro 3000M allows up to 1024 threads per block while the C1060 allows only 512.
Late edit: if I compile for arch 2.1 but run it on an arch 1.3 device, I get a bunch of NaNs. When I left the threads per block at 1024 and ran it on a 1.3 device, the kernels were not started.
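The per-device thread limit mentioned above can be checked at runtime (via `cudaGetDeviceProperties`, whose `cudaDeviceProp::maxThreadsPerBlock` field is 512 on compute capability 1.3 and 1024 on 2.x) and the block shrunk to fit. A minimal host-side sketch, with the helper name `clampBlock` purely illustrative:

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: shrink a 2-D block until it fits the device limit
// (512 threads/block on compute capability 1.3, 1024 on 2.x). The y
// dimension is halved first so that x can stay a full warp multiple.
std::pair<int, int> clampBlock(int bx, int by, int maxThreadsPerBlock) {
    while (bx * by > maxThreadsPerBlock && by > 1) by /= 2;
    while (bx * by > maxThreadsPerBlock && bx > 1) bx /= 2;
    return {bx, by};
}
```

With this kind of guard, a 32x32 block configured for the Quadro would be launched as 32x16 on the C1060 instead of failing to start.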

Why not debug the program to see where the error is?

Thank you for your reply.
I’m using a modified version of the SDK makefile to compile on both machines, where the only things I change are some directory paths (SDK, etc.).
I’ll check the number of threads and the smem and come back to you.

You may have an out-of-bounds shared memory access, which produces an error on a Fermi card but no error on a 1.3 card, and so on.

From other forums I gather that the Fermi architecture is less permissive when there is an error, like out-of-bounds accesses or insufficient resources, and it does not run at all. It is possible that there is an error which makes the program crash. As Lev suggested, an out-of-bounds access could crash the program on 2.x devices but not on 1.3 devices.
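The kind of defensive indexing that avoids such crashes can be sketched in plain C++ (the names `HIST_BINS`, `c`, and `bumpBin` are illustrative, not from the original code): an index derived from a pixel difference is range-checked before it is used to write into a table, so an out-of-range value is skipped instead of corrupting memory.

```cpp
#include <cassert>
#include <cmath>

const int HIST_BINS = 256;  // byte differences fall in 0..255

// Guarded histogram increment: an index outside the table is skipped
// instead of producing the out-of-bounds write that Fermi aborts on.
void bumpBin(float* c, float a, float b) {
    int ic = static_cast<int>(std::fabs(a - b));
    if (ic >= 0 && ic < HIST_BINS)
        c[ic] += 1.0f;
}
```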

Sorry, but yesterday I had to leave.
I ran the code with cuda-memcheck and I get lots of errors for misaligned memory accesses.

How could I solve this issue? I thought it was safe to use cudaMallocPitch(); the pitch is obtained dynamically, isn’t it?
The following is the core of the kernel computation (line 19 in my first post):

for (int js = jsB; js < swy-jsE; js++) {
    for (int is = isB; is < swx-isE; is++) {
        int ic = (int) fabsf(src[(j0+js)*memPitch + i0+is] - src[(j0+js+jsD)*memPitch + i0+is+isD]);
        c[ic] += 1.0f;
    }
}

So the issue could just be in the ic computation, right?
do you have any suggestions on how to proceed and on how to make this code safe across GPU generations?
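One pattern that stays safe across GPU generations is to treat the pitch strictly as a byte count: compute each row's base address on a `char*` and only then cast to the element type. (In the loop above `src` is a `BYTE*`, so byte arithmetic happens to be correct there; the pattern matters as soon as the buffer holds `float` or wider elements, like `aux`.) A minimal host-side sketch of the idea, with `rowPtr` a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>

// The pitch returned by cudaMallocPitch() is in BYTES. For an element type
// T, compute the row base on a char* first, then cast; adding j*pitch to a
// T* index instead is what produces misaligned accesses.
inline float* rowPtr(void* base, std::size_t pitchBytes, int j) {
    return reinterpret_cast<float*>(static_cast<char*>(base) + j * pitchBytes);
}
```

Inside a kernel the same expression works on the device pointer, e.g. `rowPtr(aux, memPitchAux, j)[i]` in place of the manual `j * memPitchAux/sizeof(float) + i` index (which silently truncates whenever the pitch is not a multiple of `sizeof(float)`).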
I did not write this code but I’m supposed to port it to a Tesla M2090 server, which has indeed a Fermi architecture too.

Thank you for your useful insights!

Misaligned memory accesses do not necessarily crash the program; they may only harm performance. You most likely get those accesses on 1.3 too.
You need to debug your program and see where the error is.

Misaligned memory accesses will produce incorrect results on compute capability 1.* devices and will result in a fatal exception on compute capability 2.* and 3.* devices.

The kernel casts BYTE* to VEC2*. VEC2* addresses must be aligned to the natural alignment of VEC2 (probably 4 or 8). It is possible to specify a packing of 1 for VEC2, which tells the compiler to generate only byte accesses, but the performance will be awful.

There are numerous other questionable memory operations in the kernel.

If the code is accessing float then the address must be a multiple of sizeof(float). If the code is accessing float2 then the address must be a multiple of sizeof(float2).
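Both rules can be made concrete in plain C++ (the VEC2 layout below is an assumption, since its definition is not shown in the thread): an alignment check tells you whether a cast like `(VEC2*)Rng` is legal at that address, and a `memcpy`-based load is an always-legal alternative when it is not.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

typedef unsigned char BYTE;
struct VEC2 { int x, y; };   // assumed layout; natural alignment = alignof(int)

// True if p can legally be read as a T on hardware that enforces alignment.
template <typename T>
bool isAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Alignment-safe alternative to *(VEC2*)Rng: memcpy is legal regardless of
// where the pointer lands, at the cost of byte-wise loads when misaligned.
VEC2 loadVec2(const BYTE* p) {
    VEC2 v;
    std::memcpy(&v, p, sizeof(VEC2));
    return v;
}
```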

The first step is to fix all of your misaligned accesses.

“The execution on the Tesla C1060 outputs a processed image in (say) 100 s.” Sometimes. So that was luck with the misaligned accesses; it looks like I confused this with another type of misalignment (warps, etc.). Still, the variable rnd should probably be aligned, so the source of the misalignment is not in the presented source code.