Getting wrong output from CUDA kernel

Dear All,

I have written a kernel that is not working properly, and I cannot figure out why. Please help me. A summary, an example, and my code follow.

Code Summary:

Input: an Mx256 matrix and a 256x1 matrix

Output1: an Mx256 matrix containing, for each of the M rows, the elementwise minimum of that 1x256 row and the 256x1 matrix

Output2: an Mx256 matrix containing, for each row M, the elementwise squared differences between that 1x256 row and the 256x1 matrix

Example:

A=rand(50,256);

B=rand(1,256);

[H D]=HMG1(A,B)
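To show the math I expect, here is a rough CPU sketch in plain C (just for illustration; the function name and the row-major layout are my own, and my real data actually comes from MATLAB):

#include <math.h>

/* What I expect HMG1 to compute: compare each of the M rows of A
   elementwise against the 256-element vector B. H gets the minimum,
   D the squared difference. Row-major layout assumed for clarity. */
void hmg1_reference(const float *A, const float *B,
                    float *H, float *D, int M)
{
	for (int row = 0; row < M; row++) {
		for (int col = 0; col < 256; col++) {
			float a = A[row * 256 + col];
			float b = B[col];
			H[row * 256 + col] = fminf(a, b);
			float p = a - b;
			D[row * 256 + col] = p * p;
		}
	}
}

And here is my actual mex/CUDA code: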

#include <stdio.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include <math.h>
#include "mex.h"
#include "matrix.h"

__global__ void hMatchG(float *xr, float *qr, float *h, float *d, int N1)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = threadIdx.y;

	__shared__ float a;
	__shared__ float b;
	__shared__ float p;

	a=xr[i];
	b=qr[j];
	__syncthreads();

	if (i<N1 && j<256) {
		//h[i]=(a<b?a:b);
		h[i]=fminf(a,b);
		p=abs(a-b);
		d[i]=p*p;
	}
	__syncthreads();
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
	float *hr,*dr,*xr,*qr,*id1,*id2,*od1,*od2;
	size_t isize1,isize2,osize;
	int mi1,ni1,N1,mi2,ni2,N2;

	mi1=mxGetM(prhs[0]);
	ni1=mxGetN(prhs[0]);
	N1=mi1*ni1;

	plhs[0] = mxCreateDoubleMatrix(mi1, ni1, mxREAL);
	plhs[1] = mxCreateDoubleMatrix(mi1, ni1, mxREAL);
	if(plhs[0]==NULL || plhs[1]==NULL) mexErrMsgTxt("Could not create mxArray");

	mi2=mxGetM(prhs[1]);
	ni2=mxGetN(prhs[1]);
	N2=mi2*ni2;

	id1=(float *)mxGetPr(prhs[0]);
	id2=(float *)mxGetPr(prhs[1]);
	od1=(float *)mxGetPr(plhs[0]);
	od2=(float *)mxGetPr(plhs[1]);

	isize1=N1*sizeof(float);
	isize2=N2*sizeof(float);
	osize=N1*sizeof(float);

	cudaMalloc((void **) &xr, isize1);
	cudaMemcpy(xr, id1, isize1, cudaMemcpyHostToDevice);
	cudaMalloc((void **) &qr, isize2);
	cudaMemcpy(qr, id2, isize2, cudaMemcpyHostToDevice);
	cudaMalloc((void **) &hr, osize);
	cudaMalloc((void **) &dr, osize);

	int block_size=256;
	int n_blocks=N1/block_size;
	hMatchG<<<n_blocks,block_size>>>(xr,qr,hr, dr, N1);
	// cudaThreadSynchronize();

	cudaMemcpy(od1,hr, osize, cudaMemcpyDeviceToHost);
	cudaMemcpy(od2,dr, osize, cudaMemcpyDeviceToHost);
	for(int i=0;i<N1;i++)
		mexPrintf("%f %f \n",od1[i],od2[i]);

	cudaFree(xr);
	cudaFree(qr);
	cudaFree(hr);
	cudaFree(dr);
}

Please help me.

I don’t understand what your use of shared memory is supposed to achieve. If multiple threads assign a value to the same variable in shared memory, the result is undefined, regardless of whether the assignment is followed by __syncthreads().

Furthermore, I don’t understand your usage of j, which will always be zero as your block size is one-dimensional. If it were to assume different values, you would again have a race condition with undefined results.

I guess that what you want to do is

__global__ void hMatchG(float *xr, float *qr, float *h, float *d, int N1)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i<N1) {
		float a=xr[i];
		for (int j=0; j<256; j++) {
			float b=qr[j];
			//h[i]=(a<b?a:b);
			h[i]=fminf(a,b);
			float p=a-b;
			d[i]=p*p;
		}
	}
}

Thank you very much for your reply.

I used shared memory because I thought it would speed up variable access: I had heard that local variables are also stored in global memory, and that shared memory access is much faster than global memory access. Maybe I misunderstood. But when I used plain local variables, it took more time.

The value of j should be threadIdx.x; sorry, that was my mistake in the posted code.

Actually I wanted to get rid of that for loop, and yes, I am facing a race condition. Since I was only fetching the stored data, not altering it, I thought __syncthreads() would resolve the race condition. It seems I have a lot of misconceptions.

So I really don't understand the purpose of __syncthreads(). Won't it help resolve a race condition when the task is only to read from memory, not to alter it?

Another question: is there any way to skip that for loop? It is really killing my time; if the code takes more than a certain number of seconds it will be of no use, and neither will my actual research.

Thanks again.

Local array variables (like int x[10]) generally have to be placed in local memory, which physically resides in global memory, but local variables with simple types (like float) generally go into registers.
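For example (a rough, untested sketch with made-up names; whether the array actually spills depends on the compiler and on how the array is indexed):

__global__ void placementDemo(float *out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	float x = 2.0f;       // simple scalar: the compiler keeps this in a register

	float buf[10];        // array indexed with a runtime value below, so it
	                      // typically lands in local memory (which physically
	                      // resides in device memory)
	for (int k = 0; k < 10; k++)
		buf[k] = x * k;

	if (i < n)
		out[i] = buf[i % 10];   // runtime index usually prevents register placement
}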

By making a and b shared, you only have one memory location shared between all threads in the block. As a result, all the threads will write to the same location when executing:

a=xr[i];

And the final value of a will be undefined.
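To make the difference concrete, here is a minimal untested sketch (the names are just for illustration) contrasting the two placements:

__global__ void sharedVsRegister(const float *xr, float *out, int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= N) return;

	__shared__ float a_shared;   // ONE location for the whole block
	a_shared = xr[i];            // every thread writes here; which value
	                             // survives is undefined

	float a_local = xr[i];       // one copy per thread (a register)

	out[i] = a_local;            // well-defined: each thread's own value
}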

Seibert, thanks a lot for your kind reply. I thought shared memory kept a copy of the variable for each thread; so it actually keeps only one copy for all threads in a block. Thank you very much for clarifying this :-)

Would you please tell me how I can keep a running sum of the threads' results using shared memory? I read about it on some blogs, but it is not very clear to me, and it is required for my code.

I have another question: if two or more different blocks read (only read, no alteration) from the same global memory location simultaneously, will it do any harm to my code? Will the result be undefined?

Thanks in advance.

If you want to sum up thread results in a block, the best way is to do a reduction in shared memory. I would suggest you take a look at this presentation:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

which both explains the technique and shows how to optimize it. (Don't worry about getting the absolutely best solution.)
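The core pattern looks roughly like this (a simplified, untested sketch assuming a power-of-two block size of 256; the presentation covers several faster variants):

__global__ void blockSum(const float *in, float *blockResults, int N)
{
	__shared__ float partial[256];       // one slot per thread; assumes blockDim.x == 256

	int tid = threadIdx.x;
	int i = blockIdx.x * blockDim.x + tid;

	// Each thread loads one element (or 0 past the end of the array).
	partial[tid] = (i < N) ? in[i] : 0.0f;
	__syncthreads();

	// Tree reduction: halve the number of active threads each step.
	for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
		if (tid < stride)
			partial[tid] += partial[tid + stride];
		__syncthreads();
	}

	// Thread 0 now holds the block's total and writes it out.
	if (tid == 0)
		blockResults[blockIdx.x] = partial[0];
}

You then combine the per-block results on the host, or with a second kernel launch.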

There is probably a more detailed explanation in the book CUDA by Example. (I haven’t read it myself.) I see that in their example code for chapter 5, they do a parallel reduction within a block to perform a dot product. You can download the source code for the book’s examples here:

http://developer.nvidia.com/cuda-example-introduction-general-purpose-gpu-programming

and look for the file dot.cu inside of the chapter05/ folder.

Simultaneous reads to global memory are fine. All threads will get the same value. However, if you find that different threads are frequently reading the same memory location, then you might want to consider putting that data into shared memory. (This is probably the most common use of shared memory.) If you have a compute capability 2.x device, this is less important because the L2 cache will speed up reads to the same memory location.
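In your kernel, for example, every block reads the same 256 values of qr, so you could stage them in shared memory once per block and drop the loop entirely. An untested sketch, keeping your original index mapping and assuming the block size is exactly 256:

__global__ void hMatchShared(const float *xr, const float *qr,
                             float *h, float *d, int N1)
{
	__shared__ float q[256];             // one copy of qr per block

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = threadIdx.x;                 // block size assumed to be 256

	q[j] = qr[j];                        // each thread loads one element
	__syncthreads();                     // wait until all of q[] is filled

	if (i < N1) {
		float a = xr[i];                 // per-thread register copy
		float b = q[j];                  // fast shared memory read
		h[i] = fminf(a, b);
		float p = a - b;
		d[i] = p * p;
	}
}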

Again, thanks a lot Seibert for your help. I was really confused about simultaneous reads from global memory and thought that was the reason for the wrong output from my code. Now I am sure the reason is not the simultaneous reads; there must be something wrong somewhere else, and I am going to check it again.

Thanks dude :-)