Kernel not doing anything

BRabbit27 · January 30, 2011, 2:59pm

Hi,

I’m learning to program GPUs with CUDA on my own. I read the CUDA by Example book and then tried to write a program of my own. The problem is that the program seems to do nothing when I launch the kernel, and I don’t know what I’m missing. The code looks like this:

__constant__ int dev_width;
__constant__ float dev_pivot;

__global__ void pivotReduction(float *matrix, float *identity)
{
	int tid;
	tid = threadIdx.x + blockIdx.x * blockDim.x;
	if(tid < dev_width)
	{
		matrix[threadIdx.x] = 2;
		identity[threadIdx.x] = identity[threadIdx.x]/dev_pivot;
	}
}

int main(){

	float matrixTest[9] = {1,2,5,2,3,7,6,7,1};
	pprMatrix matrix;
	matrix.x = 3;
	matrix.y = 3;
	matrix.data = matrixTest;

	pprMatrix *identity;
	identity = createIdentity(3, 3);

	pprInverse(&matrix, identity);

	printMatrix(&matrix);
	printMatrix(identity);
}

void pprInverse(pprMatrix *matrix, pprMatrix *identity)
{
	int m;
	int n;
	int width;
	int indexI;
	//int indexJ;
	//int indexK;
	float pivot;

	m = matrix->x;
	n = matrix->y;
	width = m*n;
	float *dev_matrix, *dev_identity;
	cudaMalloc((void**)&dev_matrix, sizeof(float));
	cudaMalloc((void**)&dev_identity, sizeof(float));
	cudaMemcpyToSymbol(*(&dev_width), &(width), sizeof(int), 0, cudaMemcpyHostToDevice);
	cudaMemcpy(dev_matrix, matrix->data, sizeof(float), cudaMemcpyHostToDevice );
	cudaMemcpy(dev_identity, identity->data, sizeof(float), cudaMemcpyHostToDevice);
	for (indexI = 0; indexI < m; indexI++)
	{
		pivot = ((float*)matrix->data)[indexI*matrix->y+indexI];
		cudaMemcpyToSymbol(*(&dev_pivot), &pivot, sizeof(float), 0, cudaMemcpyHostToDevice);
		pivotReduction<<<(16+width)/16,16>>>(dev_matrix, dev_identity);
	}
	cudaMemcpy(dev_matrix, matrix->data, sizeof(float), cudaMemcpyDeviceToHost);
	cudaMemcpy(dev_identity, identity->data, sizeof(float), cudaMemcpyDeviceToHost);

	cudaFree(dev_matrix);
	cudaFree(dev_identity);
}

What are my errors in this little piece of code? Thanks.

tera · January 30, 2011, 8:25pm

First and foremost, the biggest mistake is that you don’t check return codes.

Having said that, the next mistake I see is that you allocate and copy just one float for dev_matrix and dev_identity, but then pass them to a kernel that tries to operate on nine elements each. The resulting out-of-bounds access will then abort your kernel, which you would have noticed if you had checked return codes.
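A minimal sketch of the allocation fix described above, assuming a flat array of n floats. The kernel doubleElements and the helper doubleOnDevice are hypothetical names used only for illustration; they are not part of the original program. Besides sizing the buffers as n * sizeof(float), the sketch indexes with the global thread id and puts the host pointer first in the device-to-host copy, two further points where the posted code appears to go wrong.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each of the n elements.
__global__ void doubleElements(float *data, int n)
{
    // Global index, so that every block works on its own slice.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] *= 2.0f;
}

// Copies n floats to the device, runs the kernel, and copies them back.
// Returns the first CUDA error encountered (cudaSuccess if none).
cudaError_t doubleOnDevice(float *host, int n)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);   // the whole array, not sizeof(float)

    cudaError_t err = cudaMalloc((void**)&dev, bytes);
    if (err != cudaSuccess)
        return err;

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        doubleElements<<<(n + 15) / 16, 16>>>(dev, n);
        err = cudaGetLastError();       // reports launch failures
    }
    if (err == cudaSuccess) {
        // Destination (host) comes first, source (device) second.
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(dev);
    return err;
}
```

Called on the nine-element matrixTest array from the question, doubleOnDevice(matrixTest, 9) should return cudaSuccess and leave every element doubled.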

BRabbit27 · January 30, 2011, 8:34pm

Hi, thanks for the help. As I said, I’m learning CUDA, but yes, you are right: I should always check the error codes. Is there a routine in the CUDA API to check for errors, or should I do it manually?

tera · January 30, 2011, 8:49pm

Since you are learning, you should check for errors all the more…

Usually you define a macro like

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf("CUDA error: " #x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

and wrap all your CUDA function calls in it.
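A short usage sketch for such a macro, under the assumption that a trivial kernel (here the hypothetical fill) stands in for real work. A kernel launch has no return value to wrap, so a common pattern is to call cudaGetLastError() right after the launch and cudaDeviceSynchronize() to surface errors that only occur while the kernel runs. The helper name checkedFill is also made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf("CUDA error: " #x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

// Hypothetical kernel for illustration: writes value into n elements.
__global__ void fill(float *data, int n, float value)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] = value;
}

// Fills a host array through the GPU, reporting every failing call.
void checkedFill(float *host, int n, float value)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);

    CUDA_CALL(cudaMalloc((void**)&dev, bytes));

    fill<<<(n + 15) / 16, 16>>>(dev, n, value);
    CUDA_CALL(cudaGetLastError());        // errors from the launch itself
    CUDA_CALL(cudaDeviceSynchronize());   // errors raised during execution

    CUDA_CALL(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    CUDA_CALL(cudaFree(dev));
}
```

This version only prints a message and carries on; whether to abort or return an error code instead is a design choice left to the program.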

BRabbit27 · January 30, 2011, 11:25pm

OK, thanks. Well, now it is doing something. Another question: is the way I use constant memory good? I’m working with vectors whose size is always the same, so I load that size into the constant memory of the GPU instead of global memory. Is that good practice?

tera · January 31, 2011, 1:41am

Using constant memory instead of global memory is definitely a good idea, as it is cached even on 1.x devices. Another option is to just pass the size as a kernel parameter. I don’t think there is a big difference performance-wise, so I usually use kernel parameters as it is more convenient (it saves me the cudaMemcpyToSymbol() call). If the parameter list gets too long, though, I put them in constant memory just as you do.
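The two options can be put side by side in a small sketch. All names here (dev_n, addOneConst, addOneParam, addOneBothWays) are hypothetical and chosen for illustration. Note that the symbol is passed to cudaMemcpyToSymbol() directly, and that its last two arguments default to an offset of 0 and cudaMemcpyHostToDevice.

```cuda
#include <cuda_runtime.h>

__constant__ int dev_n;   // size kept in constant memory

// Variant 1: the kernel reads the size from constant memory.
__global__ void addOneConst(float *data)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < dev_n)
        data[tid] += 1.0f;
}

// Variant 2: the kernel receives the size as an ordinary parameter.
__global__ void addOneParam(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] += 1.0f;
}

// Runs both variants once on the same buffer, so each element gains 2.
cudaError_t addOneBothWays(float *host, int n)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&dev, bytes);
    if (err != cudaSuccess)
        return err;

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpyToSymbol(dev_n, &n, sizeof(int));   // extra upload step
    if (err == cudaSuccess) {
        int blocks = (n + 15) / 16;
        addOneConst<<<blocks, 16>>>(dev);      // size comes from dev_n
        addOneParam<<<blocks, 16>>>(dev, n);   // size passed at launch
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

The parameter variant needs no upload before the launch, which is the convenience mentioned above.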

BRabbit27 · January 31, 2011, 12:24pm

To pass the size as a kernel parameter, should I create another variable and allocate memory for it with cudaMalloc in order to use it as a parameter? For example:

int main(){
   int size = 20;
   int *dev_size;
   ...
   ...
   cudaMalloc((void**)&dev_size, ... );
   cudaMemcpy(dev_size, size, ..., CopyFromHostToDevice);
   kernel<<<M,N>>>(size);
   //or
   kernel<<<M,N>>>(dev_size);
}

Which one of the two kernel launches is the right one?

tera · January 31, 2011, 1:07pm

The first one (assuming the kernel takes an int as its argument). Note that the cudaMemcpy() call is already wrong (it should be cudaMemcpy(dev_size, &size, sizeof (int), cudaMemcpyHostToDevice);).
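A small sketch of why the first launch works: kernel arguments are copied by value to the device at launch time, so a plain int from host code can be passed as is, while only pointers that the kernel dereferences have to refer to device memory. The kernel writeSize and the helper roundTripSize are hypothetical names for this illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: size arrives by value, out must be a device pointer.
__global__ void writeSize(int *out, int size)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = size;
}

// Sends size to the GPU as a kernel argument and reads it back.
cudaError_t roundTripSize(int size, int *result)
{
    int *dev_out = NULL;
    cudaError_t err = cudaMalloc((void**)&dev_out, sizeof(int));
    if (err != cudaSuccess)
        return err;

    // size is a host variable; its value is copied along with the launch.
    writeSize<<<1, 1>>>(dev_out, size);
    err = cudaGetLastError();

    if (err == cudaSuccess)
        err = cudaMemcpy(result, dev_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_out);
    return err;
}
```

Passing a pointer to a host variable instead would compile as well, but the kernel would then dereference an address that is generally not valid on the device.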

BRabbit27 · January 31, 2011, 1:09pm

OK, so it is not necessary to use cudaMalloc for the kernel parameter; I can pass the host variable directly to the kernel. But why can I pass the value of a host variable directly as a parameter? Can we pass variables allocated in host memory?

And yes, I know the cudaMemcpy was wrong, I just wrote it quickly xD.
