I recently started using CUDA and I am having some trouble using shared memory. I hope someone here can help me figure out the problems in my code. The following code tries to calculate the time-averaged potential, which is saved in avg_potential; the array named potential holds the values calculated in each iteration. The code compiles and links, but does not run successfully.
```
// define number of threads and blocks
void avgCenterLinePotential( int arraysize, float *avgpotential, float *potential, int iteration )
{
    int numThreads = 256;                     // arraysize >= 256
    int numBlocks  = arraysize / numThreads;  // assuming arraysize % numThreads == 0
    averageCenterLinePotential<<< numBlocks, numThreads >>>( arraysize, avgpotential, potential, iteration );
    cutilCheckMsg("avgCenterLinePotential kernel execution failed");
}
```
```
// kernel
__global__
void averageCenterLinePotential( int arraysize, float *avg_potential, float *potential, int iteration )
{
    int index = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if ( index >= arraysize ) return;
    avg_potential[index] *= iteration;
    int sharedarraysize = 256;
    for ( int i = 0; i < intDivUp( arraysize, sharedarraysize ); i++ ) {
        __shared__ float sharedArray[256];
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            sharedArray[threadIdx.x] = potential[index];
        }
        __syncthreads();
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            avg_potential[index] += sharedArray[threadIdx.x];
        }
        __syncthreads();
    }
    avg_potential[index] /= iteration + 1;
}
```
First of all, please use the code environment; that makes it much more readable for the forum users.
I didn't look at your code in that much detail, but what caught my attention is that you declare the shared memory inside a loop. I'm not sure that works correctly. Normally you declare the needed shared memory once and then just use it.
So the code should look like this:
```
// kernel
__global__
void averageCenterLinePotential( int arraysize, float *avg_potential, float *potential, int iteration )
{
    __shared__ float sharedArray[256];
    int index = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if ( index >= arraysize ) return;
    avg_potential[index] *= iteration;
    int sharedarraysize = 256;
    for ( int i = 0; i < intDivUp( arraysize, sharedarraysize ); i++ ) {
        //__shared__ float sharedArray[256];
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            sharedArray[threadIdx.x] = potential[index];
        }
        __syncthreads();
        if ( i*sharedarraysize + threadIdx.x < arraysize ) {
            avg_potential[index] += sharedArray[threadIdx.x];
        }
        __syncthreads();
    }
    avg_potential[index] /= iteration + 1;
}
```
mtrng:
Thanks for reformatting the code to something readable.
Chrizh:
Your code looks a bit weird in that each thread reads the same element of potential over and over again, and in that the "average" is divided by iteration, not by the actual number of values summed up. As going through shared memory is of no use in this example anyway, I assume the code is a stripped-down version of the actual problematic code.
What are you trying to achieve, and what do you expect this code to do?
Thank you, chrizh. This is my first post here. Thank you for reformatting the code.
To Tera:
Thank you for taking the time to read my code. Basically, I want to calculate the average value of potential (which is an array, because it stores the potential at different coordinates of a Cartesian coordinate system) over the iterations, and save the averaged numbers in avg_potential, which is also an array. I worried that a sum accumulated over the iterations might exceed the maximum value of a float, so I only store the averaged numbers. At the beginning of each iteration, each averaged number is multiplied by the current iteration count (an integer) to recover the sum. Then the freshly calculated potential value is added to the sum, and the sum is divided by (current iteration + 1), so that the saved value is again the averaged potential.
Inside the kernel, each thread is supposed to calculate the averaged potential at a set of different xyz coordinates.
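In formula form, with n being the iteration counter and p_(n+1) the freshly computed potential at a given point, the per-element update I described is the standard incremental mean:

    avg_(n+1) = ( n * avg_n + p_(n+1) ) / ( n + 1 )

so the (potentially overflowing) running sum is never actually stored.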
Ah OK, now I understand the multiplication/division by iteration/(iteration+1). Note though that in CUDA, unlike on x86 CPUs (where intermediate results can be held in wider 80-bit x87 registers), variables in registers are exactly the same width as variables in memory. So dividing by iteration does not give you any extra protection against overflow; it just means that you are working with an average and not the sum.
As I understand your code, each iteration generates one new set of values potential for the potential, and updates avg_potential from its old value to extend the average over the new values as well. That, however, is already achieved by this simpler kernel:
```
// kernel
__global__
void averageCenterLinePotential( int arraysize, float *avg_potential, float *potential, int iteration )
{
    int index = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;
    if ( index >= arraysize ) return;
    avg_potential[index] = ( iteration * avg_potential[index] + potential[index] ) / ( iteration + 1 );
}
```
What I do not yet understand is what the for loop in your kernel is supposed to iterate over. What is i there?
Yes, you are right. The code you posted is a simple and good solution to my problem. However, I am trying to use shared memory instead of global memory to improve the calculation speed. Since each block can process 256 threads at the same time, I am trying to load 256 values from the potential array into shared memory at once, with each thread reading one value from global memory. i is actually the id of the 256-element block of data that each thread reads from global memory.
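To illustrate what I had in mind with the staging (a hypothetical sketch, just to show the indexing; this is not the kernel above): in chunk i, each of the 256 threads of a block copies one element of that chunk from global into shared memory.

```
// Minimal sketch of the intended staging pattern (illustration only):
// the global index is built from the chunk counter i, not from the
// thread's fixed per-element index.
__global__ void stageChunksSketch( int arraysize, const float *potential )
{
    __shared__ float sharedArray[256];
    const int sharedarraysize = 256;
    for ( int i = 0; i < ( arraysize + sharedarraysize - 1 ) / sharedarraysize; i++ ) {
        int g = i * sharedarraysize + threadIdx.x;   // this thread's element of chunk i
        if ( g < arraysize )
            sharedArray[threadIdx.x] = potential[g]; // one coalesced read per thread
        __syncthreads();
        // ... here every thread of the block can read all 256 staged values ...
        __syncthreads();
    }
}
```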
This problem cannot be accelerated by using shared memory. Provided the pointers passed to the kernel are suitably aligned (which they are if you obtained them with cudaMalloc()), the memory accesses of all threads in a warp are already coalesced into 128-byte wide memory transactions.
Shared memory is for fast communication between the threads of a block; it does not accelerate global memory access.
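To give an example of where shared memory does pay off: whenever the threads of a block need each other's values, e.g. when collapsing the 256 values handled by a block into one partial sum. A minimal sketch (hypothetical example, assuming blockDim.x == 256, a power of two, and an input size that is an exact multiple of it):

```
// Block-wide sum via shared memory: threads cooperate through the shared
// array, which is exactly the inter-thread communication shared memory is for.
__global__ void blockSum( const float *in, float *blockSums )
{
    __shared__ float s[256];
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = in[index];
    __syncthreads();
    // Tree reduction: each step halves the number of active threads.
    for ( int stride = blockDim.x / 2; stride > 0; stride >>= 1 ) {
        if ( threadIdx.x < stride )
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if ( threadIdx.x == 0 )
        blockSums[blockIdx.x] = s[0];   // one partial sum per block
}
```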