Freeze when running some kernels on an 8800 GTX

Hi all,

I’ve been developing an NMF implementation with CUDA on a GTX 280. Everything was working, but that card has failed, so I’m now running the code on an 8800 GTX. I know that atomic operations don’t work on G80 cards, but I have a different problem. I upload two matrices to device memory and operate on them (I split them into four pieces and multiply them with CUBLAS), and when execution reaches a kernel that does an element-wise multiplication or division, it freezes.

Here is the code of the kernel:

#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 16

__global__ void DivPunto(float* A, float* B, int alto, int ancho, float* H)
{
	int bx = blockIdx.x;
	int by = blockIdx.y;
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int posicion;

	__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
	__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

	posicion = (by*BLOCK_SIZE + ty)*ancho + bx*BLOCK_SIZE + tx;

	if((bx*BLOCK_SIZE + tx)<ancho && (by*BLOCK_SIZE + ty)<alto)
	{
		As[ty][tx] = A[posicion];
		Bs[ty][tx] = B[posicion];

		__syncthreads();

		if(Bs[ty][tx] == 0)
			Bs[ty][tx] = 0.00001;

		H[posicion] = As[ty][tx]/Bs[ty][tx];
	}
}
I launch two-dimensional blocks of 16x16 threads, and the register and shared memory usage reported by the compiler is:

ptxas info : Used 6 registers, 2096+1072 bytes smem, 20 bytes cmem[1]

This code ran flawlessly on the GTX 280, and I don’t know what is going wrong now.

Thanks in advance

(and sorry for my English)

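You need to avoid calling __syncthreads() inside a conditional path that not every thread of the block takes. The threads that skip the branch never reach the barrier, and the block can hang.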
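As a minimal sketch of that advice, assuming the barrier in the posted kernel sat inside the bounds check: the version below keeps the shared-memory staging but moves __syncthreads() out of the branch, so every thread of the block reaches it. The name DivPuntoSync and the variables fila, col and dentro are hypothetical, introduced only for this illustration.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Illustrative variant of DivPunto: same arithmetic, but the barrier is
// executed unconditionally by all threads of the block.
__global__ void DivPuntoSync(const float* A, const float* B, int alto, int ancho, float* H)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int col  = blockIdx.x * BLOCK_SIZE + tx;
    int fila = blockIdx.y * BLOCK_SIZE + ty;
    int posicion = fila * ancho + col;

    // True only for threads that map to a real matrix element. In edge
    // blocks (alto or ancho not a multiple of 16) some threads fail this.
    bool dentro = (col < ancho) && (fila < alto);

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    if (dentro) {
        As[ty][tx] = A[posicion];
        Bs[ty][tx] = B[posicion];
    }

    // Outside any divergent branch: all threads arrive here.
    __syncthreads();

    if (dentro) {
        float b = Bs[ty][tx];
        if (b == 0.0f)
            b = 0.00001f;  // same zero guard as the original kernel
        H[posicion] = As[ty][tx] / b;
    }
}
```

The bounds test is evaluated once and reused, so out-of-range threads still skip the loads and the store but no longer skip the barrier.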

Also, there is no need to use shared memory for this simple kernel. Each thread touches its own element of A and B only once, so nothing is reused between threads.
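A sketch of what the kernel might look like without shared memory, and therefore without any barrier at all. This is not the poster's final code: DivPuntoSimple and the host helper lanzarDivPunto are hypothetical names, and the helper uses cudaDeviceSynchronize() from current CUDA toolkits rather than whatever the 2.x-era code called.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 16

// Element-wise division H = A / B, reading straight from global memory.
// No shared memory and no __syncthreads(): each thread works on one element.
__global__ void DivPuntoSimple(const float* A, const float* B, int alto, int ancho, float* H)
{
    int col  = blockIdx.x * blockDim.x + threadIdx.x;
    int fila = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < ancho && fila < alto) {
        int posicion = fila * ancho + col;
        float b = B[posicion];
        if (b == 0.0f)
            b = 0.00001f;  // same zero guard as the original kernel
        H[posicion] = A[posicion] / b;
    }
}

// Hypothetical host-side launcher. dA, dB and dH are device pointers to
// alto x ancho row-major matrices. Returns the first CUDA error seen.
cudaError_t lanzarDivPunto(const float* dA, const float* dB, int alto, int ancho, float* dH)
{
    dim3 bloque(BLOCK_SIZE, BLOCK_SIZE);
    // Round up so matrices whose sizes are not multiples of 16 are covered.
    dim3 malla((ancho + BLOCK_SIZE - 1) / BLOCK_SIZE,
               (alto + BLOCK_SIZE - 1) / BLOCK_SIZE);

    DivPuntoSimple<<<malla, bloque>>>(dA, dB, alto, ancho, dH);

    cudaError_t err = cudaGetLastError();  // launch configuration errors
    if (err != cudaSuccess)
        return err;
    return cudaDeviceSynchronize();        // errors raised during execution
}
```

Checking the return value of the launcher after each call makes a failed or hung kernel easier to localize than a silent freeze later in the pipeline.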

Thank you!
Yes, the shared memory in that kernel was unnecessary, and the problem was the __syncthreads(). Now it runs again :)

In my experience, running in emulation mode generally finds these errors. Did you try that first, danieloop?