Shared memory and running time: results not reproducible

Hi all,

I’m currently working on a kernel which deals with a lot of data. These data are stored in arrays allocated through cudaMalloc().

To improve speed, and since some of the data has to be accessed many times, I copy the data used by one block into shared memory. The whole computation is then done in shared memory, and at the end of the kernel I write the results back to global memory.

The kernel takes less than a second to run, which is the expected time. But sometimes, when I compile the file again without changing anything, the same code takes more than 5 minutes to execute. I don’t understand why this happens.

I also noticed that if I don’t change the values stored in shared memory, this never happens: the program is always fast. This may not be relevant, as the compiler may simplify many things if it can see that I never change the values contained in shared memory.

Here is my kernel:

__device__ inline void evolutionComputation(int p, int nbParticlesInBlock, float* evolution)

{

	for (int n = 0; n < nbParticlesInBlock; ++n)

	{

		evolution[p] += 1.f;   //If I comment out this line, the program is always fast

	}

}

__global__ void evolutionKernel(float* p_evolution, int* p_firstParticleArray)

{

	__shared__ int lastParticleIndex, firstParticleIndex, nbParticles, cellIndex;

	extern __shared__ float s_data[];				 //Array in shared memory used to store the information of the particles of this block

	int threadIndex = threadIdx.x;					//Index of the thread inside the block

	// Thread 0 computes common data

	if (threadIndex == 0)

	{

		//Compute information about the block

		cellIndex = blockIdx.x + blockIdx.y * gridDim.x;			//Index of the current block

		lastParticleIndex = p_firstParticleArray[cellIndex+1];	  //Index of the last particle of this block

		firstParticleIndex = p_firstParticleArray[cellIndex];	   //Index of the first particle of this block

		nbParticles = lastParticleIndex - firstParticleIndex;	   //Number of particles in the cell

	}

	//Synchronize the threads to ensure that the block information is really available to all threads

	__syncthreads();

	//Compute the global index of the particle

	int particleIndex = firstParticleIndex + threadIndex;

	//Check if the particle is in this block

	if (particleIndex < lastParticleIndex)

	{

		//Create sub arrays

		float* s_evolution = s_data;

		s_evolution[threadIndex] = p_evolution[particleIndex];

		//Wait until all threads are done with the copy operations

		__syncthreads();

		//Set the evolution to 0

		s_evolution[threadIndex] = 0.f;

		//Compute forces between particles of the same block

		evolutionComputation(threadIndex, nbParticles, s_evolution);

		//Write the evolution back to global memory

		p_evolution[particleIndex] = s_evolution[threadIndex];

	}

}

The two arrays received as arguments of the kernel are declared outside any function as follows:

float* g_evolution;

int* g_firstParticleIndex;

and allocated through cudaMalloc() as usual.
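For completeness, the allocation and kernel call look roughly like this (the grid/block dimensions, nbParticlesTotal and nbCells are placeholders here, not my exact values):

//Sketch of the host-side setup; sizes and dimensions are placeholders
cudaMalloc((void**)&g_evolution, nbParticlesTotal * sizeof(float));
cudaMalloc((void**)&g_firstParticleIndex, (nbCells + 1) * sizeof(int));

//The third configuration parameter gives the size of the extern __shared__ array s_data
evolutionKernel<<<grid, block, block.x * sizeof(float)>>>(g_evolution, g_firstParticleIndex);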

Notice that there is a write to shared memory in the __global__ function as well, but that one doesn’t slow down the computation. It is only the one in the device function that does.

I suppose that I’m doing something wrong somewhere.

Thanks for your help, I’m getting really confused with this not reproducible behaviour of my code.

I am going to guess that your main problem lies in the way you are declaring your shared variables. When you have dynamically allocated shared memory variables, they all wind up existing within a single dynamically allocated block. It might be that your integer shared variables and your float shared array are actually all in the same block of memory, and that when you start writing into the array, you overwrite your integer parameters, which leads to unpredictable behaviour inside the loop of your device function (and quite possibly out-of-bounds memory writes and other nasty stuff). How to make several dynamic shared memory variables coexist is discussed in section 5.3 of the programming guide.
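For what it's worth, the pattern from that section looks roughly like this; just a sketch with made-up names and sizes, not your exact layout:

//Sketch: carving one dynamically allocated shared block into two sub-arrays.
//The layout (blockDim.x floats followed by blockDim.x ints) is only an example.
extern __shared__ char s_block[];

__global__ void partitionedSharedSketch(float* p_values, int* p_flags)
{
	float* s_values = (float*)s_block;                 //First sub-array
	int*   s_flags  = (int*)&s_values[blockDim.x];     //Second sub-array, starting right after the floats

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	s_values[threadIdx.x] = p_values[i];
	s_flags[threadIdx.x]  = p_flags[i];
	__syncthreads();

	//... work on s_values and s_flags without one overwriting the other ...
}

//The third launch configuration parameter has to cover the total of all sub-arrays, e.g.
//partitionedSharedSketch<<<grid, block, block.x * (sizeof(float) + sizeof(int))>>>(d_values, d_flags);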

Open64 (the GPU-side compiler) has very aggressive dead code optimization, so you can be sure that your device function gets optimized away during compilation when you comment out the increment statement, which makes much of the problem go away.

Thanks for your advice. I’ll have a look in the programming guide.

But I’m not quite sure that this is what happens, as I always get results which seem to be correct even if the kernel takes a huge amount of time to run.

By the way, does that mean that you are only allowed to declare one shared variable per kernel without using the “programming guide’s trick”?

No, you can declare as many as you like, but you have to be aware that when there is a dynamic allocation, they all exist in the same block of memory whose size is the sum of their sizes, so some pointer or array index offsets are required to use them correctly. I am only speculating about what happens when there are both dynamically allocated and explicitly declared shared memory variables in the same kernel; I haven’t ever written one which mixes the two (usually I would use constant memory for the coefficients or parameters).
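And just to illustrate the constant memory route: if a value is the same for every block in a given launch (coefficients, sizes and so on), something like this avoids shared memory for it altogether. The names here are made up for illustration:

//Sketch: passing a per-launch parameter through constant memory instead of shared memory
__constant__ int c_maxParticlesPerCell;

__global__ void constantParamSketch(float* p_data)
{
	//Every thread reads the constant directly; no shared copy and no __syncthreads() needed for it
	if (threadIdx.x < c_maxParticlesPerCell)
		p_data[blockIdx.x * blockDim.x + threadIdx.x] += 1.f;
}

//Host side, before the launch:
//int h_maxParticles = /* value computed on the host */;
//cudaMemcpyToSymbol(c_maxParticlesPerCell, &h_maxParticles, sizeof(int));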

So you would remove the shared qualifier from my first four variables and let each thread compute them independently, instead of having one thread doing it for the others?

I can do that. It won’t change anything. Just use more registers, I guess.
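Something like this, I suppose (replacing the shared declarations and the thread-0 block at the top of the kernel):

	//Each thread computes its own copy of the block data; no __shared__ and no __syncthreads() needed for these
	int cellIndex          = blockIdx.x + blockIdx.y * gridDim.x;    //Index of the current block
	int lastParticleIndex  = p_firstParticleArray[cellIndex + 1];    //Index of the last particle of this block
	int firstParticleIndex = p_firstParticleArray[cellIndex];        //Index of the first particle of this block
	int nbParticles        = lastParticleIndex - firstParticleIndex; //Number of particles in the cell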

Well, it didn’t change anything. Sometimes it works, sometimes it runs very slowly. I really don’t understand.

I did stress the shared memory setup was just a guess…

The only potentially expensive part of that kernel is the loop in the device function, and the only way that can tie up the kernel is if the value of nbParticlesInBlock is very large in at least one of the running threads. All I can suggest is to check that value. How big should it be?
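Something as simple as this on the host side would tell you; h_firstParticle and nbCells here are hypothetical names for a host copy of p_firstParticleArray and the number of cells:

//Host-side sanity check: the largest per-cell count bounds the loop in the device function
int maxCount = 0;
for (int c = 0; c < nbCells; ++c)
{
	int count = h_firstParticle[c + 1] - h_firstParticle[c];
	if (count > maxCount)
		maxCount = count;
}
printf("Largest nbParticles in any cell: %d\n", maxCount);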

The maximum value is 12. It may later grow to 50, but not higher.

Actually, it really looks like all the threads run one after the other rather than at the same time.

I also noticed that the program always runs slowly if the previous run has been killed by a segmentation fault or Ctrl+C.

Is it possible that after such an exit, the card is left in a kind of “bad state” in which it cannot run correctly?
I’ve read that CUDA 2.3 has improved this; is it worth upgrading? (I’m currently using CUDA 2.2.)

Thanks again !!

If I use emulation mode, it works perfectly, even faster than on the device!

Does anyone have a clue?