Getting wrong output from CUDA kernel

Dear All,

I have written a kernel that is not working properly, and I cannot figure out why. Please help me. A summary, an example, and my code follow.

Code Summary:

Input: an Mx256 matrix and a 256x1 matrix

Output1: an Mx256 matrix containing, for each of the M rows, the elementwise minimum of that 1x256 row and the 256x1 matrix

Output2: an Mx256 matrix containing, for each row M, the elementwise squared differences between that 1x256 row and the 256x1 matrix

Example:

A=rand(50,256);

B=rand(1,256);

[H D]=HMG1(A,B)
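To show the math I expect, here is a rough CPU sketch in plain C (just for illustration; the function name and the row-major layout are my own, and my real data actually comes from MATLAB):

#include <math.h>

/* What I expect HMG1 to compute: compare each of the M rows of A
   elementwise against the 256-element vector B. H gets the minimum,
   D the squared difference. Row-major layout assumed for clarity. */
void hmg1_reference(const float *A, const float *B,
                    float *H, float *D, int M)
{
	for (int row = 0; row < M; row++) {
		for (int col = 0; col < 256; col++) {
			float a = A[row * 256 + col];
			float b = B[col];
			H[row * 256 + col] = fminf(a, b);
			float p = a - b;
			D[row * 256 + col] = p * p;
		}
	}
}

And here is my actual mex/CUDA code: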

#include <stdio.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include <math.h>
#include "mex.h"
#include "matrix.h"

__global__ void hMatchG(float *xr, float *qr, float *h, float *d, int N1)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = threadIdx.y;

	__shared__ float a;
	__shared__ float b;
	__shared__ float p;

	a=xr[i];
	b=qr[j];
	__syncthreads();

	if (i<N1 && j<256) {
		//h[i]=(a<b?a:b);
		h[i]=fminf(a,b);
		p=abs(a-b);
		d[i]=p*p;
	}
	__syncthreads();
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
	float *hr,*dr,*xr,*qr,*id1,*id2,*od1,*od2;
	size_t isize1,isize2,osize;
	int mi1,ni1,N1,mi2,ni2,N2;

	mi1=mxGetM(prhs[0]);
	ni1=mxGetN(prhs[0]);
	N1=mi1*ni1;

	plhs[0] = mxCreateDoubleMatrix(mi1, ni1, mxREAL);
	plhs[1] = mxCreateDoubleMatrix(mi1, ni1, mxREAL);
	if(plhs[0]==NULL || plhs[1]==NULL) mexErrMsgTxt("Could not create mxArray");

	mi2=mxGetM(prhs[1]);
	ni2=mxGetN(prhs[1]);
	N2=mi2*ni2;

	id1=(float *)mxGetPr(prhs[0]);
	id2=(float *)mxGetPr(prhs[1]);
	od1=(float *)mxGetPr(plhs[0]);
	od2=(float *)mxGetPr(plhs[1]);

	isize1=N1*sizeof(float);
	isize2=N2*sizeof(float);
	osize=N1*sizeof(float);

	cudaMalloc((void **) &xr, isize1);
	cudaMemcpy(xr, id1, isize1, cudaMemcpyHostToDevice);
	cudaMalloc((void **) &qr, isize2);
	cudaMemcpy(qr, id2, isize2, cudaMemcpyHostToDevice);
	cudaMalloc((void **) &hr, osize);
	cudaMalloc((void **) &dr, osize);

	int block_size=256;
	int n_blocks=N1/block_size;
	hMatchG<<<n_blocks,block_size>>>(xr,qr,hr, dr, N1);
	// cudaThreadSynchronize();

	cudaMemcpy(od1,hr, osize, cudaMemcpyDeviceToHost);
	cudaMemcpy(od2,dr, osize, cudaMemcpyDeviceToHost);
	for(int i=0;i<N1;i++)
		mexPrintf("%f %f \n",od1[i],od2[i]);

	cudaFree(xr);
	cudaFree(qr);
	cudaFree(hr);
	cudaFree(dr);
}

Please help me.

I don’t understand what your use of shared memory is supposed to achieve. If multiple threads assign a value to the same variable in shared memory, the result is undefined, regardless of whether the assignment is followed by __syncthreads().

Furthermore, I don’t understand your usage of j, which will always be zero as your block size is one-dimensional. If it were to assume different values, you would again have a race condition with undefined results.

I guess that what you want to do is

__global__ void hMatchG(float *xr, float *qr, float *h, float *d, int N1)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i<N1) {
		float a=xr[i];
		for (int j=0; j<256; j++) {
			float b=qr[j];
			//h[i]=(a<b?a:b);
			h[i]=fminf(a,b);
			float p=a-b;
			d[i]=p*p;
		}
	}
}

Thank you very much for your reply.

I used shared memory because I thought it would speed up variable access: I had heard that local variables are also stored in global memory, and that shared memory access is much faster than global memory access. Maybe I misunderstood. But when I used plain local variables, it took more time.

The value of j should be threadIdx.x; sorry, that was my mistake in the posted code.

Actually I wanted to get rid of that for loop, and yes, I am facing a race condition. Since I was only fetching the stored data, not altering it, I thought __syncthreads() would resolve the race condition. It seems I have a lot of misconceptions.

So I really don't understand the purpose of __syncthreads(). Won't it help resolve a race condition when the task is only to read from memory, not to alter it?

Another question: is there any way to skip that for loop? It is really killing my time; if the code takes more than a certain number of seconds it will be of no use, and neither will my actual research.

Thanks again.

Local array variables (like int x[10]) generally have to be placed in local memory, which physically resides in global memory, but local variables with simple types (like float) generally go into registers.
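For example (a rough, untested sketch with made-up names; whether the array actually spills depends on the compiler and on how the array is indexed):

__global__ void placementDemo(float *out, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	float x = 2.0f;       // simple scalar: the compiler keeps this in a register

	float buf[10];        // array indexed with a runtime value below, so it
	                      // typically lands in local memory (which physically
	                      // resides in device memory)
	for (int k = 0; k < 10; k++)
		buf[k] = x * k;

	if (i < n)
		out[i] = buf[i % 10];   // runtime index usually prevents register placement
}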

By making a and b shared, you only have one memory location shared between all threads in the block. As a result, all the threads will write to the same location when executing:

a=xr[i];

And the final value of a will be undefined.
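To make the difference concrete, here is a minimal untested sketch (the names are just for illustration) contrasting the two placements:

__global__ void sharedVsRegister(const float *xr, float *out, int N)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i >= N) return;

	__shared__ float a_shared;   // ONE location for the whole block
	a_shared = xr[i];            // every thread writes here; which value
	                             // survives is undefined

	float a_local = xr[i];       // one copy per thread (a register)

	out[i] = a_local;            // well-defined: each thread's own value
}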

Seibert, thanks a lot for your kind reply. I thought shared memory kept a copy of the variable for each thread; so it actually keeps only one copy for all threads in a block. Thank you very much for clarifying this :-)

Would you please tell me how I can keep a running sum of the threads' results using shared memory? I read about it on some blogs, but it is not very clear to me, and it is required for my code.

I have another question: if two or more different blocks read (only read, no alteration) from the same global memory location simultaneously, will it do any harm to my code? Will the result be undefined?

Thanks in advance.

If you want to sum up thread results in a block, the best way is to do a reduction in shared memory. I would suggest you take a look at this presentation:

http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

which both explains the technique and shows how to optimize it. (Don't worry about getting the absolutely best solution.)
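The core pattern looks roughly like this (a simplified, untested sketch assuming a power-of-two block size of 256; the presentation covers several faster variants):

__global__ void blockSum(const float *in, float *blockResults, int N)
{
	__shared__ float partial[256];       // one slot per thread; assumes blockDim.x == 256

	int tid = threadIdx.x;
	int i = blockIdx.x * blockDim.x + tid;

	// Each thread loads one element (or 0 past the end of the array).
	partial[tid] = (i < N) ? in[i] : 0.0f;
	__syncthreads();

	// Tree reduction: halve the number of active threads each step.
	for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
		if (tid < stride)
			partial[tid] += partial[tid + stride];
		__syncthreads();
	}

	// Thread 0 now holds the block's total and writes it out.
	if (tid == 0)
		blockResults[blockIdx.x] = partial[0];
}

You then combine the per-block results on the host, or with a second kernel launch.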

There is probably a more detailed explanation in the book CUDA by Example. (I haven’t read it myself.) I see that in their example code for chapter 5, they do a parallel reduction within a block to perform a dot product. You can download the source code for the book’s examples here:

http://developer.nvidia.com/cuda-example-introduction-general-purpose-gpu-programming

and look for the file dot.cu inside of the chapter05/ folder.

Simultaneous reads to global memory are fine. All threads will get the same value. However, if you find that different threads are frequently reading the same memory location, then you might want to consider putting that data into shared memory. (This is probably the most common use of shared memory.) If you have a compute capability 2.x device, this is less important because the L2 cache will speed up reads to the same memory location.
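In your kernel, for example, every block reads the same 256 values of qr, so you could stage them in shared memory once per block and drop the loop entirely. An untested sketch, keeping your original index mapping and assuming the block size is exactly 256:

__global__ void hMatchShared(const float *xr, const float *qr,
                             float *h, float *d, int N1)
{
	__shared__ float q[256];             // one copy of qr per block

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = threadIdx.x;                 // block size assumed to be 256

	q[j] = qr[j];                        // each thread loads one element
	__syncthreads();                     // wait until all of q[] is filled

	if (i < N1) {
		float a = xr[i];                 // per-thread register copy
		float b = q[j];                  // fast shared memory read
		h[i] = fminf(a, b);
		float p = a - b;
		d[i] = p * p;
	}
}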

Again, thanks a lot Seibert for your help. I was really confused about simultaneous reads from global memory and thought that was the reason for the wrong output from my code. Now I am sure the reason is not the simultaneous reads; there must be something wrong somewhere else, and I am going to check it again.

Thanks dude :-)