Shared memory and running time: results not reproducible

Hi all,

I’m currently working on a kernel which deals with a lot of data. These data are stored in arrays allocated through cudaMalloc().

To improve speed, and since some of the data has to be accessed many times, I copy the data used by one block into shared memory. The whole computation is then done in shared memory, and at the end of the kernel I write the results back to global memory.

The kernel takes less than a second to run, which is the expected time. But sometimes, when I compile the file again without changing anything, the same code takes more than 5 minutes to execute. I don’t understand why this happens.

I also noticed that if I don’t change the values stored in shared memory, this never happens: the program is always fast. This may not be relevant, as the compiler may simplify many things if it can see that I never change the values contained in shared memory.

Here is my kernel:

__device__ inline void evolutionComputation(int p, int nbParticlesInBlock, float* evolution)

{

	for (int n = 0; n < nbParticlesInBlock; ++n)

	{

		evolution[p] += 1.f;   //If I comment out this line, the program is always fast

	}

}

__global__ void evolutionKernel(float* p_evolution, int* p_firstParticleArray)

{

	__shared__ int lastParticleIndex, firstParticleIndex, nbParticles, cellIndex;

	extern __shared__ float s_data[];				 //Array in shared memory used to store the information of the particles of this block

	int threadIndex = threadIdx.x;					//Index of the thread inside the block

	// Thread 0 computes common data

	if (threadIndex == 0)

	{

		//Compute information about the block

		cellIndex = blockIdx.x + blockIdx.y * gridDim.x;			//Index of the current block

		lastParticleIndex = p_firstParticleArray[cellIndex+1];	  //Index of the last particle of this block

		firstParticleIndex = p_firstParticleArray[cellIndex];	   //Index of the first particle of this block

		nbParticles = lastParticleIndex - firstParticleIndex;	   //Number of particles in the cell

	}

	//Synchronize the threads to ensure that the block information is really available to all threads

	__syncthreads();

	//Compute the global index of the particle

	int particleIndex = firstParticleIndex + threadIndex;

	//Check if the particle is in this block

	if (particleIndex < lastParticleIndex)

	{

		//Create sub arrays

		float* s_evolution = s_data;

		s_evolution[threadIndex] = p_evolution[particleIndex];

		//Wait until all threads are done with the copy operations

		__syncthreads();

		//Set the evolution to 0

		s_evolution[threadIndex] = 0.f;

		//Compute forces between particles of the same block

		evolutionComputation(threadIndex, nbParticles, s_evolution);

		//Write the evolution back to global memory

		p_evolution[particleIndex] = s_evolution[threadIndex];

	}

}

The two arrays received as arguments of the kernel are declared outside any function as follows:

float* g_evolution;

int* g_firstParticleIndex;

and allocated through cudaMalloc() as usual.
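For completeness, the allocation and kernel call look roughly like this (the grid/block dimensions, nbParticlesTotal and nbCells are placeholders here, not my exact values):

//Sketch of the host-side setup; sizes and dimensions are placeholders
cudaMalloc((void**)&g_evolution, nbParticlesTotal * sizeof(float));
cudaMalloc((void**)&g_firstParticleIndex, (nbCells + 1) * sizeof(int));

//The third configuration parameter gives the size of the extern __shared__ array s_data
evolutionKernel<<<grid, block, block.x * sizeof(float)>>>(g_evolution, g_firstParticleIndex);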

Notice that there is a write to shared memory in the __global__ function as well, but that one doesn’t slow down the computation. It is only the one in the device function that does.

I suppose that I’m doing something wrong somewhere.

Thanks for your help, I’m getting really confused with this not reproducible behaviour of my code.

I am going to guess that your main problem lies in the way you are declaring your shared variables. When you have dynamically allocated shared memory variables, they all wind up existing within a single dynamically allocated block. It might be that your integer shared variables and your float shared array are actually all in the same block of memory, and that when you start writing into the array, you overwrite your integer parameters, which leads to unpredictable behaviour inside the loop of your device function (and quite possibly out-of-bounds memory writes and other nasty stuff). How to make several dynamic shared memory variables coexist is discussed in section 5.3 of the programming guide.
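For what it's worth, the pattern from that section looks roughly like this; just a sketch with made-up names and sizes, not your exact layout:

//Sketch: carving one dynamically allocated shared block into two sub-arrays.
//The layout (blockDim.x floats followed by blockDim.x ints) is only an example.
extern __shared__ char s_block[];

__global__ void partitionedSharedSketch(float* p_values, int* p_flags)
{
	float* s_values = (float*)s_block;                 //First sub-array
	int*   s_flags  = (int*)&s_values[blockDim.x];     //Second sub-array, starting right after the floats

	int i = blockIdx.x * blockDim.x + threadIdx.x;
	s_values[threadIdx.x] = p_values[i];
	s_flags[threadIdx.x]  = p_flags[i];
	__syncthreads();

	//... work on s_values and s_flags without one overwriting the other ...
}

//The third launch configuration parameter has to cover the total of all sub-arrays, e.g.
//partitionedSharedSketch<<<grid, block, block.x * (sizeof(float) + sizeof(int))>>>(d_values, d_flags);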

Open64 (the GPU-side compiler) has very aggressive dead code optimization, so you can be sure that your device function gets optimized away during compilation when you comment out the increment statement, which makes much of the problem go away.

Thanks for your advice. I’ll have a look in the programming guide.

But I’m not quite sure that this is what happens, as I always get results which seem to be correct even if the kernel takes a huge amount of time to run.

By the way, does that mean that you are only allowed to declare one shared variable per kernel without using the “programming guide’s trick”?

No, you can declare as many as you like, but you have to be aware that when there is a dynamic allocation, they all exist in the same block of memory whose size is the sum of their sizes, so some pointer or array index offsets are required to use them correctly. I am only speculating about what happens when there are both dynamically allocated and explicitly declared shared memory variables in the same kernel; I haven’t ever written one which mixes the two (usually I would use constant memory for the coefficients or parameters).
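And just to illustrate the constant memory route: if a value is the same for every block in a given launch (coefficients, sizes and so on), something like this avoids shared memory for it altogether. The names here are made up for illustration:

//Sketch: passing a per-launch parameter through constant memory instead of shared memory
__constant__ int c_maxParticlesPerCell;

__global__ void constantParamSketch(float* p_data)
{
	//Every thread reads the constant directly; no shared copy and no __syncthreads() needed for it
	if (threadIdx.x < c_maxParticlesPerCell)
		p_data[blockIdx.x * blockDim.x + threadIdx.x] += 1.f;
}

//Host side, before the launch:
//int h_maxParticles = /* value computed on the host */;
//cudaMemcpyToSymbol(c_maxParticlesPerCell, &h_maxParticles, sizeof(int));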

So you would remove the shared qualifier from my first four variables and let each thread compute them independently, instead of having one thread doing it for the others?

I can do that. It won’t change anything. Just use more registers, I guess.
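Something like this, I suppose (replacing the shared declarations and the thread-0 block at the top of the kernel):

	//Each thread computes its own copy of the block data; no __shared__ and no __syncthreads() needed for these
	int cellIndex          = blockIdx.x + blockIdx.y * gridDim.x;    //Index of the current block
	int lastParticleIndex  = p_firstParticleArray[cellIndex + 1];    //Index of the last particle of this block
	int firstParticleIndex = p_firstParticleArray[cellIndex];        //Index of the first particle of this block
	int nbParticles        = lastParticleIndex - firstParticleIndex; //Number of particles in the cell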

Well, it didn’t change anything. Sometimes it works, sometimes it runs very slowly. I really don’t understand.

I did stress the shared memory setup was just a guess…

The only potentially expensive part of that kernel is the loop in the device function, and the only way that can tie up the kernel is if the value of nbParticlesInBlock is very large in at least one of the running threads. All I can suggest is to check that value. How big should it be?
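Something as simple as this on the host side would tell you; h_firstParticle and nbCells here are hypothetical names for a host copy of p_firstParticleArray and the number of cells:

//Host-side sanity check: the largest per-cell count bounds the loop in the device function
int maxCount = 0;
for (int c = 0; c < nbCells; ++c)
{
	int count = h_firstParticle[c + 1] - h_firstParticle[c];
	if (count > maxCount)
		maxCount = count;
}
printf("Largest nbParticles in any cell: %d\n", maxCount);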

The maximum value is 12. It may later grow to 50, but not higher.

Actually, it really looks like all the threads run one after the other rather than at the same time.

I also noticed that the program always runs slowly if the previous run has been killed by a segmentation fault or Ctrl+C.

Is it possible that after such an exit, the card is left in a kind of “bad state” in which it cannot run correctly?
I’ve read that CUDA 2.3 has improved this; is it worth upgrading? (I’m currently using CUDA 2.2.)

Thanks again !!

If I use emulation mode, it works perfectly, even faster than on the device!

Does anyone have a clue?