Efficiency problem with shared memory

Hi,

I have a problem when using shared memory.

#include <stdlib.h>   // for RAND_MAX

#define SIZE 46
#define LOOP 100

__global__ void test0()
{
	__shared__ float pro[SIZE];   // pro: shared array with 46 elements
	float sum = 0.0f;             // sum: local (register) variable
	int l = 0, v = 0;

#ifdef DEBUG1
	for (l = 0; l < SIZE; l++)
		pro[l] = 0.0f;
#endif

	for (v = 1; v < LOOP; v++)
	{
#ifdef DEBUG2
		for (l = 0; l < SIZE; l++)
			pro[l] = 0.0f;
#endif

		for (l = 0; l < SIZE; l++)
			if (sum >= RAND_MAX)
				sum += pro[l];
	}

#ifdef DEBUG3
	for (l = 0; l < SIZE; l++)
		pro[l] = 0.0f;
#endif
}

__global__ void test1()
{
	__shared__ float pro[SIZE];   // pro: shared array with 46 elements
	__shared__ float sum;         // sum: now in shared memory
	int l = 0, v = 0;

	sum = 0.0f;

#ifdef DEBUG1
	for (l = 0; l < SIZE; l++)
		pro[l] = 0.0f;
#endif

	for (v = 1; v < LOOP; v++)
	{
#ifdef DEBUG2
		for (l = 0; l < SIZE; l++)
			pro[l] = 0.0f;
#endif

		for (l = 0; l < SIZE; l++)
			if (sum >= RAND_MAX)
				sum += pro[l];
	}

#ifdef DEBUG3
	for (l = 0; l < SIZE; l++)
		pro[l] = 0.0f;
#endif
}

__global__ void test2()
{
	__shared__ float sum;           // sum: in shared memory
	__shared__ float pro[SIZE-1];   // pro: shared array with 45 elements
	int l = 0, v = 0;

	sum = 0.0f;

#ifdef DEBUG1
	for (l = 0; l < SIZE-1; l++)
		pro[l] = 0.0f;
#endif

	for (v = 1; v < LOOP; v++)
	{
#ifdef DEBUG2
		for (l = 0; l < SIZE-1; l++)
			pro[l] = 0.0f;
#endif

		for (l = 0; l < SIZE-1; l++)
			if (sum >= RAND_MAX)
				sum += pro[l];
	}

#ifdef DEBUG3
	for (l = 0; l < SIZE-1; l++)
		pro[l] = 0.0f;
#endif
}

The differences between these 3 kernels are:

    [*]test0 : sum is a local variable

    [*]test1 : sum is in shared memory

    [*]test2 : sum is in shared memory

    [*]test0 : pro is an array in shared memory with 46 elements

    [*]test1 : pro is an array in shared memory with 46 elements

    [*]test2 : pro is an array in shared memory with 45 elements

I have run this code on a Tesla S1070 (CUDA 3.0) and a Tesla C2050 (CUDA 3.1). For 10000 calls of each kernel (one block with one thread only), I obtained the following timings:

[timing results image]

If I remove the “if(sum >= RAND_MAX)” test, I obtain the following timings:

[timing results image]

I don’t understand these huge time differences.
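For reference, the host-side measurement loop is roughly the following (a simplified sketch: error checking is omitted, and NB_CALLS is just the name I use here for the 10000 launches):

#include <stdio.h>

#define NB_CALLS 10000   // number of launches per kernel

int main()
{
	cudaEvent_t start, stop;
	float ms = 0.0f;

	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	for (int i = 0; i < NB_CALLS; i++)
		test0<<<1, 1>>>();   // one block, one thread; same loop for test1 and test2
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&ms, start, stop);
	printf("test0: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return 0;
}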

Thanks.

ps : the compile line is “nvcc -O3 -Xptxas -v -o pb_mp pb_mp.cu -lcuda -lm -lpthread -lrt -lcublas -lshrutil_x86_64 -lcutil_x86_64”