Kernel not doing anything


I’m teaching myself GPU programming with CUDA. I read the CUDA by Example book and was trying to write a program, but it seems my program does nothing when I launch the kernel, and I don’t know what I’m missing. The code looks like the following:

__constant__ int dev_width;

__constant__ float dev_pivot;

__global__ void pivotReduction(float *matrix, float *identity)
{
	int tid;

	tid = threadIdx.x + blockIdx.x * blockDim.x;

	if(tid < dev_width)
	{
		matrix[threadIdx.x] = 2;
		identity[threadIdx.x] = identity[threadIdx.x]/dev_pivot;
	}
}

int main(){

	float matrixTest[9] = {1,2,5,2,3,7,6,7,1};

	pprMatrix matrix;
	matrix.x = 3;
	matrix.y = 3;
	matrix.data = matrixTest;

	pprMatrix *identity;
	identity = createIdentity(3, 3);

	pprInverse(&matrix, identity);
}

void pprInverse(pprMatrix *matrix, pprMatrix *identity)
{
	int m;
	int n;
	int width;
	int indexI;
	//int indexJ;
	//int indexK;
	float pivot;

	m = matrix->x;
	n = matrix->y;
	width = m*n;

	float *dev_matrix, *dev_identity;

	cudaMalloc((void**)&dev_matrix, sizeof(float));
	cudaMalloc((void**)&dev_identity, sizeof(float));

	cudaMemcpyToSymbol(*(&dev_width), &(width), sizeof(int), 0, cudaMemcpyHostToDevice);

	cudaMemcpy(dev_matrix, matrix->data, sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_identity, identity->data, sizeof(float), cudaMemcpyHostToDevice);

	for (indexI = 0; indexI < m; indexI++)
	{
		pivot = ((float*)matrix->data)[indexI*matrix->y+indexI];
		cudaMemcpyToSymbol(*(&dev_pivot), &pivot, sizeof(float), 0, cudaMemcpyHostToDevice);
		pivotReduction<<<(16+width)/16,16>>>(dev_matrix, dev_identity);
	}

	cudaMemcpy(dev_matrix, matrix->data, sizeof(float), cudaMemcpyDeviceToHost);
	cudaMemcpy(dev_identity, identity->data, sizeof(float), cudaMemcpyDeviceToHost);
}

What are my errors in this little code? Thanks.

First and foremost, the biggest mistake is that you don’t check return codes.

Having said that, the next mistake I see is that you allocate and copy just one float for dev_matrix and dev_identity, but then pass them to a kernel that tries to operate on nine elements each. The resulting out-of-bounds access will then abort your kernel, which you would have noticed if you had checked return codes.
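For reference, here is a sketch of what the allocations and copies would have to look like to cover the whole 3×3 matrix, using the width = m*n already computed in pprInverse (variable names taken from the original code):

```cuda
// Allocate and copy all width (= 9) floats, not a single one.
float *dev_matrix, *dev_identity;
cudaMalloc((void**)&dev_matrix,   width * sizeof(float));
cudaMalloc((void**)&dev_identity, width * sizeof(float));
cudaMemcpy(dev_matrix,   matrix->data,   width * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_identity, identity->data, width * sizeof(float), cudaMemcpyHostToDevice);

// Note also that the copies back must run device -> host, with the
// host pointer as the destination (first argument):
cudaMemcpy(matrix->data,   dev_matrix,   width * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(identity->data, dev_identity, width * sizeof(float), cudaMemcpyDeviceToHost);
```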

Hi, thanks for the help. As I said, I’m learning CUDA, but yes, you are right, I should always check the error codes. Is there some routine in the CUDA API to check for errors, or should I do it manually?

Precisely because you are still learning, you should check for errors all the more…

Usually you define a macro like

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf("CUDA error: " #x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

and wrap all your Cuda function calls in it.
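Applied to the code above, the wrapped calls would look something like this (a sketch; note that kernel launches themselves return no error code, so they need a cudaGetLastError()/cudaDeviceSynchronize() pair afterwards):

```cuda
// Wrap every CUDA runtime call in the error-checking macro:
CUDA_CALL(cudaMalloc((void**)&dev_matrix, width * sizeof(float)));
CUDA_CALL(cudaMemcpy(dev_matrix, matrix->data, width * sizeof(float), cudaMemcpyHostToDevice));

// A kernel launch is not a function call that returns cudaError_t,
// so check for errors right after it instead:
pivotReduction<<<(16+width)/16, 16>>>(dev_matrix, dev_identity);
CUDA_CALL(cudaGetLastError());      // catches launch-configuration errors
CUDA_CALL(cudaDeviceSynchronize()); // catches errors from the kernel execution itself
```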

Ok, thanks. Well, now it is doing something. Another question: is the way I use constant memory good? I mean, I’m working with vectors whose size is always the same, so I load that size into the constant memory of the GPU instead of global memory. Is that good practice?

Using constant memory instead of global memory definitely is a good idea, as it is cached even on 1.x devices. Another option is to just pass the size as a kernel parameter. I don’t think there is a big difference performance-wise, so I usually use kernel parameters as it is more convenient (it saves me the cudaMemcpyToSymbol() call). If the parameter list gets too long, though, I put them in constant memory just as you do.
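As a sketch of the kernel-parameter variant, pivotReduction could take the width and pivot directly in its argument list, with no __constant__ variables or cudaMemcpyToSymbol() at all:

```cuda
// Size and pivot passed by value as kernel arguments
// instead of __constant__ memory.
__global__ void pivotReduction(float *matrix, float *identity, int width, float pivot)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < width)
        identity[tid] = identity[tid] / pivot;
}

// Launch: scalar arguments are copied to the device automatically.
// pivotReduction<<<(width+15)/16, 16>>>(dev_matrix, dev_identity, width, pivot);
```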

To pass the size as a kernel parameter, should I create another variable and allocate memory for it with cudaMalloc to use it as the parameter? For example:

int main(){

   int size = 20;
   int *dev_size;

   cudaMalloc((void**)&dev_size, ... );
   cudaMemcpy(dev_size, size, ..., CopyFromHostToDevice);

   kernel<<< ... >>>(size);

   kernel<<< ... >>>(dev_size);
}

Which one of the two kernel launches is the right one???

The first one (assuming the kernel takes an int as argument). Note that the cudaMemcpy() is already wrong (it should be cudaMemcpy(dev_size, &size, sizeof(int), cudaMemcpyHostToDevice);).
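Putting that together, a minimal sketch of the value-parameter version, in which no device allocation or copy for the size is needed at all (the kernel and launch configuration here are illustrative, not from the original post):

```cuda
__global__ void kernel(float *data, int size) // size arrives by value
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < size)
        data[tid] = 2.0f;
}

int main(void)
{
    int size = 20;
    float *dev_data;
    cudaMalloc((void**)&dev_data, size * sizeof(float));

    // No cudaMalloc/cudaMemcpy for 'size': the launch copies the
    // argument values into the kernel's parameter space for us.
    kernel<<<(size + 15) / 16, 16>>>(dev_data, size);

    cudaDeviceSynchronize();
    cudaFree(dev_data);
    return 0;
}
```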

Ok, so it is not necessary to use cudaMalloc for the kernel parameter; I can pass the host variable directly to the kernel. But in that case, why can I pass the value of the host variable directly as a parameter? Can we pass variables allocated in host memory?

And yes, I know the cudaMemcpy was wrong, I just wrote it quickly xD.