Checking Performance 2º round Trying to reproduce the results .....

Hello there.

I am using an 8800 GTX to test a matrix multiplication application. My first version does not use shared memory, and I cannot get the matrix multiplication result back.
The surprising thing is that I have tested my program on a Tesla C870 and it runs perfectly.

Please, this kind of thing is driving me crazy :(

Does anyone know of a bug related to this?

This is my kernel

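__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // thread index within the block
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // block index within the grid
    int bx = blockIdx.x;
    int by = blockIdx.y;

    int indexA = by * BLOCK_SIZE * wA + ty * wA;        // start of this thread's row of A
    int indexB = bx * BLOCK_SIZE + tx;                  // top of this thread's column of B
    int indexC = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx; // first element of this block's tile of C

    float aux = 0.0f;

    // the inner dimension is hard-coded to 4096 here; it must equal wA
    for (int i = 0; i < 4096; i++) {
        // walk along the row of A and down the column of B
        aux += A[indexA + i] * B[indexB + i * wB];
    }
    C[indexC + wB * ty + tx] = aux;
}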


And this is the call to the kernel

// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL(cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice) );

unsigned int size_C = WC * HC; 

unsigned int mem_size_C = sizeof(float) * size_C;

float * h_C = (float *) malloc (mem_size_C);

float* d_C;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_C, mem_size_C));   

// setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(WC / threads.x, HC / threads.y);

// create and start timer
unsigned int timer = 0;

// execute the kernel
matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);


// check if kernel execution generated an error
CUT_CHECK_ERROR("Kernel execution failed");

// copy result from device to host
CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size_C,
		    cudaMemcpyDeviceToHost) );


Are you compiling in debug mode, so that the CUDA_SAFE_CALL and CUT_SAFE_CALL macros actually check for errors? The most common cause of incorrect results from a known-working kernel is that the kernel never launches because of an error. Of course, I can’t come up with a situation where a kernel would launch on a Tesla C870 and not on an 8800 GTX, since they are the same chip! Unless you are running out of memory…
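
For anyone who wants the checks to run in every build type rather than only in debug mode, here is a minimal sketch. The CHECK_CUDA macro and the fill kernel are hypothetical names made up for this illustration (they are not part of cutil), and cudaDeviceSynchronize assumes a reasonably recent toolkit; older toolkits used cudaThreadSynchronize for the same purpose.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from cutil): checks the result in every build type.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Trivial placeholder kernel standing in for matrixMul.
__global__ void fill(float* out, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main()
{
    const int n = 1024;
    float* d_out = NULL;
    CHECK_CUDA(cudaMalloc((void**)&d_out, n * sizeof(float)));

    fill<<<(n + 255) / 256, 256>>>(d_out, 1.0f, n);
    CHECK_CUDA(cudaGetLastError());       // catches invalid launch configurations
    CHECK_CUDA(cudaDeviceSynchronize());  // catches errors raised while the kernel runs,
                                          // such as a launch timeout

    float h_first = 0.0f;
    CHECK_CUDA(cudaMemcpy(&h_first, d_out, sizeof(float), cudaMemcpyDeviceToHost));
    printf("first element: %f\n", h_first);

    CHECK_CUDA(cudaFree(d_out));
    return 0;
}
```

The synchronize call matters because kernel launches are asynchronous: without it, an execution error may only show up on a later, unrelated call.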

Thank you, Mr. Anderson, for your reply.

OK, I hadn’t compiled in debug mode. After doing so, as you suggested, I got an error in the kernel execution. This is the error:

Cuda error: Kernel execution failed in file ‘’ in line 107 : the launch timed out and was terminated.

My question now is: why can I execute the kernel perfectly on the Tesla but not on the 8800 GTX?

I have read that you had a similar problem, didn’t you? I have also read that it is related to the console time, or something like that.

Can you explain how to fix this issue?

Thank you so much

This is a FAQ; here is the short answer. See the sticky FAQ or any of 100 forum posts on this issue for the details.

  1. Use smaller matrices so that the kernel finishes in less than 5 s

  2. On Linux: kill X and run your app from the console (or over ssh)

  3. On Windows XP: run the absolute latest drivers (posted in the XP subforum) and use a second card for display

  4. On Windows Vista: you are out of luck :(
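
A variation on option 1, offered only as a sketch: because the time limit applies to each kernel launch, the full product can be split into several shorter launches, each covering a band of rows of C. The names matrixMulRows, matrixMulChunked and rowsPerLaunch are hypothetical, BLOCK_SIZE is assumed to be 16, and all dimensions (including rowsPerLaunch) are assumed to be multiples of BLOCK_SIZE.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // assumed value; the original post does not state it

// Computes one band of rows of C = A * B, starting at row rowOffset.
__global__ void matrixMulRows(float* C, const float* A, const float* B,
                              int wA, int wB, int rowOffset)
{
    int row = rowOffset + blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float aux = 0.0f;
    for (int i = 0; i < wA; ++i)
        aux += A[row * wA + i] * B[i * wB + col];
    C[row * wB + col] = aux;
}

// Issues one launch per band of rowsPerLaunch rows and waits for each to finish.
cudaError_t matrixMulChunked(float* d_C, const float* d_A, const float* d_B,
                             int hA, int wA, int wB, int rowsPerLaunch)
{
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    for (int row = 0; row < hA; row += rowsPerLaunch) {
        int rows = (hA - row < rowsPerLaunch) ? (hA - row) : rowsPerLaunch;
        dim3 grid(wB / threads.x, rows / threads.y);
        matrixMulRows<<<grid, threads>>>(d_C, d_A, d_B, wA, wB, row);
        cudaError_t err = cudaGetLastError();              // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}

int main()
{
    const int n = 64;  // small square matrices, just to exercise the code
    const size_t bytes = n * n * sizeof(float);
    std::vector<float> h_A(n * n, 1.0f), h_B(n * n, 1.0f), h_C(n * n, 0.0f);

    float *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);

    cudaError_t err = matrixMulChunked(d_C, d_A, d_B, n, n, n, BLOCK_SIZE);
    if (err != cudaSuccess) {
        fprintf(stderr, "matrixMulChunked failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);
    // With all-ones inputs, every element of C should equal n.
    printf("C[0] = %f (expected %d)\n", h_C[0], n);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

How many rows fit into one launch depends on the card and the matrix size, so rowsPerLaunch would have to be tuned by measurement.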

Yes, Mr. Anderson, you are right: it was because of the matrix size. With smaller matrices my computation takes less than five seconds.
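
To see how close a kernel comes to the limit, its run time can be measured with CUDA events. This is only a sketch: the busy kernel is a hypothetical stand-in for matrixMul, and the sizes are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; replace with the kernel being measured.
__global__ void busy(float* out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < iters; ++k)
        acc += 1.0f;
    out[i] = acc;
}

int main()
{
    const int n = 1 << 16;  // multiple of the block size used below
    float* d_out = NULL;
    cudaMalloc((void**)&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busy<<<n / 256, 256>>>(d_out, 1000);
    cudaEventRecord(stop, 0);

    // Wait until the kernel and the stop event have completed.
    cudaError_t err = cudaEventSynchronize(stop);
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```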