Garbage values / wrong output: how to use shared memory in a program

Hi all,

I have a program that multiplies square matrices using shared memory, but it gives the wrong result (see output.png attached); the result I expect is different (see expect.png).

Can someone tell me what is wrong? How can I get the correct output?

Thanks in advance.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void Shar(float *a, float *b, float *c, int n)
{
	__shared__ float aTile[4][4], bTile[4][4];
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;
	float sum = 0.0f;

	// each thread loads one element of each tile into shared memory
	aTile[threadIdx.y][threadIdx.x] = a[row * n + threadIdx.x];
	bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y * n + col];
	__syncthreads();

	// dot product of one row of aTile and one column of bTile
	for (int i = 0; i < n; i++)
	{
		sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
	}
	c[row * n + col] = sum;
}

int main()
{
	float *a_h, *b_h, *c_h, *a_d, *b_d, *c_d;
	int i, n;
	n = 4;
	size_t size = sizeof(float) * (n * n);

	// allocate host and device buffers
	a_h = (float *)malloc(size);
	b_h = (float *)malloc(size);
	c_h = (float *)malloc(size);
	cudaMalloc((void **)&a_d, size);
	cudaMalloc((void **)&b_d, size);
	cudaMalloc((void **)&c_d, size);

	// fill both input matrices with ones
	for (i = 0; i < (n * n); i++)
	{
		a_h[i] = 1;
	}
	cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

	for (i = 0; i < (n * n); i++)
	{
		b_h[i] = 1;
	}
	cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

	int blocksize = 4;
	int nblock = n / blocksize + (n % blocksize == 0 ? 0 : 1);
	int TILE_DIM = 4;
	Shar<<<nblock, blocksize, TILE_DIM>>>(a_d, b_d, c_d, n);
	cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);

	printf("\nMultiplication Of Matrix");
	for (i = 0; i < (n * n); i++)
	{
		printf("\n%f", c_h[i]);
	}

	free(a_h);
	free(b_h);
	free(c_h);
	cudaFree(a_d);
	cudaFree(b_d);
	cudaFree(c_d);
	return 0;
}

expect.png
output.png

You forgot to add:

cudaThreadSynchronize();

after the kernel execution. Without it, you might copy the arrays back before the kernel has finished operating on them.
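For reference, that call would go between the kernel launch and the copy back (in later CUDA releases cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize()):

	Shar<<<nblock, blocksize, TILE_DIM>>>(a_d, b_d, c_d, n);
	cudaThreadSynchronize();   // block the host until the kernel has finished
	cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);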

Additionally, if you use shared memory, extra care should be taken to prevent races. I have many times received unstable results because one thread was reading from shared memory before another had written anything there.
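The usual guard is __syncthreads(), a barrier that no thread in the block passes until every thread has reached it. Your kernel already follows the load / sync / compute pattern:

	// load phase: each thread writes one element of each tile
	aTile[threadIdx.y][threadIdx.x] = a[row * n + threadIdx.x];
	bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y * n + col];
	__syncthreads();   // no thread reads the tiles until every thread has written
	// compute phase: now it is safe to read elements written by other threads
	for (int i = 0; i < n; i++)
		sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];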

No, CUDA implicitly synchronizes between operations on the same stream. (If you don’t specify a stream, you are on stream 0.) You do not need cudaThreadSynchronize() for anything but benchmarking kernel runtime (and perhaps one obscure use of zero-copy memory).
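In other words, the blocking cudaMemcpy() in the posted code already waits for the kernel, because both calls are issued to the same default stream:

	// both calls go to stream 0, so they execute in issue order
	Shar<<<nblock, blocksize, TILE_DIM>>>(a_d, b_d, c_d, n);   // asynchronous launch
	cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);        // waits for the kernel, then copies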

You should check the return codes on your CUDA calls to see if your kernel is being run at all.
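A minimal sketch of that kind of checking (the CHECK macro here is just an illustration, not part of the CUDA API):

	#define CHECK(call)                                                  \
		do {                                                         \
			cudaError_t err_ = (call);                           \
			if (err_ != cudaSuccess) {                           \
				fprintf(stderr, "CUDA error at %s:%d: %s\n", \
					__FILE__, __LINE__,                  \
					cudaGetErrorString(err_));           \
				exit(EXIT_FAILURE);                          \
			}                                                    \
		} while (0)

	// in main():
	CHECK(cudaMalloc((void **)&a_d, size));
	Shar<<<nblock, blocksize, TILE_DIM>>>(a_d, b_d, c_d, n);
	CHECK(cudaGetLastError());   // reports launch failures such as a bad configuration
	CHECK(cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost));

While you are at it, look closely at the launch configuration: the kernel indexes threadIdx.y, but the scalar <<<nblock, blocksize>>> arguments launch a 4x1 block, so threadIdx.y is always 0 and only the first row of each tile ever gets written. A 2D configuration such as dim3 threads(4, 4) is probably what was intended, and the third launch argument (dynamic shared-memory bytes) is not needed here since the tiles are declared statically in the kernel.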